diff --git a/README_zh.md b/README_zh.md
index 4c3f4412..43f7972a 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -7,54 +7,64 @@
 PTO（Parallel Tile Operation）是昇腾 CANN 定义的一套面向 tile 编程的虚拟 ISA。本仓库提供 PTO Tile 指令的实现、示例、测试与文档，帮助开发者在不同昇腾代际之间更平滑地迁移和优化算子。
 
 [![License](https://img.shields.io/badge/License-CANN%20Open%20Software%20License%202.0-blue.svg)](LICENSE)
-[![Platform](https://img.shields.io/badge/Platform-Ascend%20A2%20%7C%20A3%20%7C%20A5%20%7C%20CPU-green.svg)](#️-平台支持)
-[![Docs](https://img.shields.io/badge/Docs-文档-blue.svg)](docs/README_zh.md)
+[![Platform](https://img.shields.io/badge/Platform-Ascend%20A2%20%7C%20A3%20%7C%20A5%20%7C%20CPU-green.svg)](#平台支持)
+[![Docs](https://img.shields.io/badge/文档-blue.svg)](docs/README_zh.md)
 
-## 📰 新闻
+## 最新动态
 
-- 🎉 **2025-12-27**：PTO Tile Library 正式开源发布。
-- ✨ **2026-01-30**：新增合轴类指令、MX 指令。
-- 🚀 **2026-02-28**：新增卷积类指令、量化类指令、核间通信类指令。
-- 🔥 **2026-03-30**：支持昇腾 A5 芯片，新增异步通信指令、CostModel 性能仿真。
-- 🛠️ **2026-04-02**：本地工程链路进一步完善，补充了 pre-commit 检查、文档构建校验与 CPU-SIM 验证能力。
+- **2025-12-27**：PTO Tile Library 正式开源发布。
+- **2026-01-30**：新增归约类指令与 MX 指令。
+- **2026-02-28**：新增卷积类指令、量化类指令、核间通信类指令。
+- **2026-03-30**：支持昇腾 A5 芯片，新增异步通信指令与 CostModel 性能仿真。
+- **2026-04-02**：本地工程链路进一步完善，补充 pre-commit 检查、文档构建校验与 CPU-SIM 验证能力。
 
-## 🎯 项目定位
+## 项目定位
 
 PTO ISA 基于昇腾底层硬件与软件抽象，定义 90+ 条标准 tile 指令，用更高层的 tile 编程模型桥接不同代际之间的实现差异。它的目标不是隐藏底层能力，而是在提升抽象层级的同时保留性能调优空间。
 
 - **统一跨代 tile 抽象**：降低不同 Ascend 代际之间的迁移成本。
-- **兼顾可移植性与性能**：在固定 tile shape 下保证正确工作，同时保留 tile size、tile shape、指令顺序等调优能力。
-- **面向框架、算子与工具链**：可作为上层框架、算子实现和编译工具链的共同接口。
-- **支持持续扩展**：当前已定义 90+ 条标准操作，并持续补充实现与生态集成。
+- **兼顾可移植性与性能**：在固定 tile shape 下保证正确工作，同时保留 tile size、tile shape、指令顺序等调优空间。
+- **面向框架、算子与工具链**：作为上层框架、算子实现和编译工具链的共同接口。
+- **持续可扩展**：当前已定义 90+ 条标准操作，并持续补充实现与生态集成。
 
-除计算与数据搬运指令外，PTO ISA 还提供了面向 NPU 间数据传输与同步的**通信扩展指令集**，覆盖点对点通信、信号同步和集合通信三类能力。
+PTO 指令现已集成到以下框架中：
 
-这些通信原语延续了与计算指令一致的 tile 级抽象和跨平台设计，并可驱动昇腾上的多种数据搬移硬件引擎，帮助用户构建计算与通信深度融合的 kernel。通信 ISA 入口见 [docs/isa/comm/README_zh.md](docs/isa/comm/README_zh.md)。
+- [PyPTO](https://gitcode.com/cann/pypto/) — PTO 上层 Python 编程框架
+- [TileLang Ascend](https://github.com/tile-ai/tilelang-ascend/) — TileLang 昇腾后端
+- 更多语言与前端支持持续完善中
 
-目前，PTO 指令已集成到以下框架中：
-
-- [PyPTO](https://gitcode.com/cann/pypto/)
-- [TileLang Ascend](https://github.com/tile-ai/tilelang-ascend/)
-- 更多语言与前端持续完善中
-
-## ✨ 核心特性
+## 核心特性
 
 - **统一的 Tile ISA 抽象**：用标准 PTO 指令描述 tile 级计算与数据流。
-- **跨代迁移与性能调优兼顾**：既提升可移植性，也保留足够的底层控制能力。
-- **Auto / Manual 双模式开发路径**：先快速验证逻辑，再逐步深入优化实现。当前 Auto Mode 主要可用于 CPU 仿真。
-- **CPU Simulator 支持**：支持在 CPU 上进行功能验证与开发调试。
-- **覆盖关键编程要素**：支持 tile shape、tile mask、事件同步、固定功能单元与流水线建模。
-- **文档、测试、示例齐全**：提供 ISA 文档、开发文档、测试脚本和性能案例。
+- **跨代迁移与性能调优兼顾**：既提升可移植性，也保留充足的底层控制能力。
+- **Auto / Manual 双模式开发路径**：先快速验证逻辑，再逐步深入优化。Auto Mode 目前可用于 CPU 仿真。
+- **CPU Simulator 支持**：无需 Ascend 硬件即可在 CPU 上进行功能验证与开发调试。
+- **覆盖关键编程要素**：tile shape、tile mask、事件同步、固定功能单元与流水线建模。
+- **文档、测试、示例齐全**：ISA 文档、开发文档、测试脚本和性能案例全面覆盖。
 
-## 👥 适用人群
+## 适用人群
 
-PTO Tile Lib 主要面向以下开发者：
+PTO Tile Library 主要面向以下开发者：
 
 - 直接对接昇腾硬件的框架或编译器后端开发者
 - 需要跨平台迁移与复用实现的高性能算子开发者
-- 需要显式控制 tile、buffer 与 pipeline 的性能优化工程师
+- 需要显式控制 tile、buffer 与流水线的性能优化工程师
+
+## Tile 指令 vs Vector 指令
+
+| 判断标准 | Tile 指令（`pto.t*`） | Vector 指令（`pto.v*`） |
+|----------|----------------------|------------------------|
+| **典型用途** | 密集张量代数、矩阵乘法、逐元素运算 | 细粒度向量流水线控制、lane 级 mask |
+| **数据搬运** | `TLOAD`/`TSTORE`（隐式 tile↔UB） | `copy_gm_to_ubuf` + `vlds`/`vsts` + `copy_ubuf_to_gm` |
+| **同步方式** | `TSYNC`、`set_flag`/`wait_flag` | `set_flag`/`wait_flag`（向量流水线）、`mem_bar` |
+| **布局控制** | 通过 tile 布局参数（`RowMajor`、`ColMajor`、分形布局） | 通过 distribution mode（`NORM`、`BRC`、`DS` 等） |
+| **谓词** | 无 lane 级 mask（有效区域是粗粒度的） | 每个操作都支持完整 lane 级谓词 mask |
+| **目标可移植性** | 所有 Profile（CPU、A2/A3、A5） | A5 硬件支持；CPU/A2/A3 为仿真 |
+| **抽象层级** | 高层 tile 语义、有效区域 | 低层向量寄存器、显式 UB 暂存 |
 
-## 🚀 快速开始
+> **经验法则：** 张量运算优先使用 tile 指令。只有在需要 lane 级 mask、自定义数据布局或 tile 指令无法表达的性能微调时，才降级到向量指令。
+
+## 快速开始
 
 ### 环境准备
 
@@ -65,7 +75,7 @@ PTO Tile Lib 主要面向以下开发者：
 ### 编译与运行
 
 ```bash
-# CPU Simulator（建议第一步）
+# CPU Simulator（推荐第一步）
 python3 tests/run_cpu.py --clean --verbose
 
 # 运行 GEMM demo
@@ -85,9 +95,9 @@ python3 tests/script/run_st.py -r sim -v a3 -t tadd -g TADDTest.case_float_64x64
 
 ### 推荐样例
 
-- [Auto Mode Add 示例](demos/auto_mode/baseline/add/README_zh.md)：适合第一次了解 PTO 指令组织方式
-- [GEMM 性能示例](kernels/manual/a2a3/gemm_performance/README_zh.md)：适合理解 tile 级算子优化
-- [Flash Attention 示例](kernels/manual/common/flash_atten/README_zh.md)：适合理解复杂算子与性能调优
+- [Auto Mode Add 示例](demos/auto_mode/baseline/add/README_zh.md)：了解 PTO 指令如何组织 tile 级计算与数据搬运
+- [GEMM 性能示例](kernels/manual/a2a3/gemm_performance/README_zh.md)：理解 tile 级算子优化与性能调参
+- [Flash Attention 示例](kernels/manual/common/flash_atten/README_zh.md)：理解复杂算子与性能调优
 
 ### 推荐上手路径
 
@@ -96,30 +106,36 @@ python3 tests/script/run_st.py -r sim -v a3 -t tadd -g TADDTest.case_float_64x64
 3. 将代码移植到昇腾硬件上验证正确性并采集性能数据。参见 [msprof 工具](https://www.hiascend.com/document/detail/zh/canncommercial/850/devaids/Profiling/atlasprofiling_16_0010.html)
 4. 定位性能瓶颈（CUBE Bound / MTE Bound / Vector Bound），开始优化与调参。参见 [性能优化](docs/coding/opt_zh.md)
 
-本仓库也展示了标准 tile 操作如何通过模板参数映射到不同流水线实现：
+本仓库展示了标准 tile 操作如何通过模板参数映射到不同流水线实现：
 
 - [Tile 编程模型](docs/coding/Tile_zh.md)：理解静态 tile shape、动态 tile mask 与数据组织方式
 - [事件与同步](docs/coding/Event_zh.md)：理解 set/wait flag 与流水线同步
+- [Auto Mode](docs/auto_mode/Auto_Mode_Overview_zh.md)：编译器自动管理资源绑定与同步插入
 - [通用约定](docs/isa/conventions_zh.md)：理解 PTO 编程中的通用规则与约束
-- [PTO 指令列表](docs/isa/README_zh.md)：查看 PTO ISA 已定义的标准操作
+- [PTO 指令列表](docs/isa/README_zh.md)：按分类查看 PTO ISA 已定义的标准操作
 
-## 🗂️ 文档导航
+## 文档导航
 
 ### ISA 与编程模型
 
-- [ISA 总览](docs/README_zh.md)：PTO ISA 文档入口与阅读导航
-- [PTO 指令列表](docs/isa/README_zh.md)：按指令分类查看 PTO 标准操作
-- [Tile 编程模型](docs/coding/Tile_zh.md)：理解 tile 的形状、mask 与编程模型
-- [事件与同步](docs/coding/Event_zh.md)：理解事件记录、等待与同步方式
-- [通用约定](docs/isa/conventions_zh.md)：查看命名、约束与通用规则
+| 文档 | 内容 |
+|------|------|
+| [ISA 总览](docs/README_zh.md) | PTO ISA 文档入口与阅读导航 |
+| [PTO 指令列表](docs/isa/README_zh.md) | 按指令分类查看 PTO 标准操作 |
+| [Tile 编程模型](docs/coding/Tile_zh.md) | tile 的形状、mask 与编程模型 |
+| [事件与同步](docs/coding/Event_zh.md) | 事件记录、等待与同步方式 |
+| [通用约定](docs/isa/conventions_zh.md) | 命名、约束与通用规则 |
+| [Auto Mode](docs/auto_mode/Auto_Mode_Overview_zh.md) | 编译器驱动的资源管理与同步 |
 
 ### 开发与优化
 
-- [开发文档索引](docs/coding/README_zh.md)：查看扩展 PTO Tile Lib 的开发文档
-- [性能优化](docs/coding/opt_zh.md)：查看性能分析与调优建议
-- [文档构建说明](docs/mkdocs/README_zh.md)：查看 MkDocs 文档的本地构建方式
+| 文档 | 内容 |
+|------|------|
+| [开发文档索引](docs/coding/README_zh.md) | 扩展 PTO Tile Library 的开发文档入口 |
+| [性能优化](docs/coding/opt_zh.md) | 性能分析与调优建议 |
+| [文档构建说明](docs/mkdocs/README_zh.md) | MkDocs 文档的本地构建方式 |
 
-## 📊 示例与性能参考
+## 性能参考
 
 ### GEMM
 
@@ -137,21 +153,7 @@ python3 tests/script/run_st.py -r sim -v a3 -t tadd -g TADDTest.case_float_64x64
 
 ![Flash Attention 归一化 TFLOPS（A2/A3）](docs/figures/performance/fa_normalized_tflops_a2a3.svg)
 
-### 通信指令带宽
-
-- 参考实现：`kernels/manual/a2a3/tget_bandwidth/`
-- 详细分析与构建运行说明：[TGET / TGET_ASYNC 带宽对比示例](kernels/manual/a2a3/tget_bandwidth/README_zh.md)
-
-该示例在 Ascend A2/A3 上测量点对点远程读带宽，对比 `TGET`（同步，经 UB 中转）与 `TGET_ASYNC`（异步，经 DMA 引擎直接传输）的表现。
-
-### GEMM AllReduce 通算融合
-
-- 参考实现：`kernels/manual/a2a3/gemm_ar/`
-- 详细分析与调参说明：[高性能 GEMM AllReduce 融合算子示例](kernels/manual/a2a3/gemm_ar/README_zh.md)
-
-该示例展示了如何在同一个算子流水线中融合 PTO 通信原语与计算 kernel，实现 GEMM 与 AllReduce 的重叠执行。
-
-## 🖥️ 平台支持
+## 平台支持
 
 - Ascend A2（Ascend 910B）
 - Ascend A3（Ascend 910C）
@@ -160,56 +162,62 @@ python3 tests/script/run_st.py -r sim -v a3 -t tadd -g TADDTest.case_float_64x64
 
 更多细节请参考 [include/README_zh.md](include/README_zh.md)。
 
-## 🛣️ 路线图
-
-未来计划发布的特性：
+## 路线图
 
 | 功能 | 描述 | 范围 |
 | --- | --- | --- |
 | PTO Auto Mode | BiSheng 编译器支持：自动分配 tile buffer 并插入同步。 | 编译器 / 工具链 |
 | PTO Tile Fusion | BiSheng 编译器支持：自动融合 tile 操作。 | 编译器 / 工具链 |
-| PTO-AS | PTO ISA 的字节码（Byte Code）支持。 | 编译器 / 工具链 |
+| PTO-AS | PTO ISA 的字节码支持。 | 编译器 / 工具链 |
 | **卷积扩展** | PTO ISA 对卷积 kernel 的支持。 | ISA 扩展 |
 | **集合通信扩展** | PTO ISA 对集合通信 kernel 的支持。 | ISA 扩展 |
 | **系统调度扩展** | PTO ISA 对 SPMD/MPMD 编程的调度支持。 | ISA 扩展 |
 
-## 🗃️ 目录结构
-
-关键目录如下：
+## 目录结构
 
-```text
+```
+PTO Tile Library
 ├── include/                     # PTO 对外头文件与接口
-│   └── pto/                     # 公共类型、ISA 接口、CPU/NPU 实现
-├── kernels/                     # kernel 与算子实现
-│   ├── manual/                  # 手工优化实现与性能示例
-│   └── custom/                  # 自定义算子示例
-├── docs/                        # ISA、编程模型、快速开始与文档站点源文件
-│   ├── isa/                     # 指令参考与分类索引
-│   ├── coding/                  # 开发与性能优化文档
-│   ├── assembly/                # PTO-AS 汇编语法与规范
-│   └── mkdocs/                  # MkDocs 文档构建配置与源文件
-├── demos/                       # Auto Mode、baseline 与 torch_jit 示例
-├── tests/                       # CPU / NPU 测试、脚本与测试入口
-│   ├── cpu/                     # CPU 仿真测试
-│   ├── npu/                     # 按 SoC 拆分的 NPU 测试
-│   └── script/                  # 测试构建与运行脚本
-├── scripts/                     # 构建、安装与发布脚本
-├── cmake/                       # CMake 公共配置与打包逻辑
-├── build.sh                     # 一键构建与运行入口脚本
-└── CMakeLists.txt               # 顶层 CMake 配置
+│   └── pto/                     # 公共类型、ISA 接口、CPU/NPU 实现
+│       ├── common/              # 平台无关的 Tile 与指令基础设施
+│       ├── cpu/                 # CPU 侧仿真支持
+│       └── npu/                 # NPU 侧实现（按 SoC 代际划分）
+│           ├── a2a3/            # Ascend A2/A3 系列
+│           └── a5/              # Ascend A5 系列
+├── kernels/                     # kernel 与算子实现
+│   ├── manual/                  # 手工优化实现与性能示例
+│   │   ├── a2a3/                # A2/A3 平台 kernels（GEMM、Conv2D、TopK）
+│   │   ├── a5/                  # A5 平台 kernels（Flash Attention、MXFP4/8 Matmul）
+│   │   └── common/              # 跨平台 kernels（Flash Attention）
+│   └── custom/                  # 自定义算子脚手架
+├── demos/                       # Auto Mode、baseline 与 torch_jit 示例
+├── docs/                        # ISA、编程模型、快速开始与文档站点源文件
+│   ├── isa/                     # 指令参考与分类索引
+│   ├── coding/                  # 开发与性能优化文档
+│   ├── assembly/                # PTO-AS 汇编语法与规范
+│   ├── auto_mode/               # Auto Mode 文档
+│   └── mkdocs/                  # MkDocs 文档构建配置与源文件
+├── tests/                       # CPU / NPU 测试、脚本与测试入口
+│   ├── cpu/                     # CPU 仿真测试
+│   ├── npu/                     # 按 SoC 拆分的 NPU 测试
+│   └── script/                  # 测试构建与运行脚本
+├── scripts/                     # 构建、安装与发布脚本
+├── cmake/                       # CMake 公共配置与打包逻辑
+├── build.sh                     # 一键构建与运行入口脚本
+└── CMakeLists.txt               # 顶层 CMake 配置
 ```
 
-## ℹ️ 相关信息
+## 相关信息
 
 - [贡献指南](CONTRIBUTING_zh.md)：参与项目开发与提交流程
 - [安全与漏洞披露](SECURITY_zh.md)：安全问题反馈流程
 - [版本说明](ReleaseNote_zh.md)：版本更新与发布记录
 - [许可证](LICENSE)：CANN Open Software License Agreement Version 2.0
-- [PyPTO](https://gitcode.com/cann/pypto/)：PTO 生态中的上层编程框架
+- [PyPTO](https://gitcode.com/cann/pypto/)：PTO 生态中的上层 Python 编程框架
 - [PTOAS](https://gitcode.com/cann/PTOAS/)：面向 PTO 工作流的汇编器与编译后端
 - [pto-dsl](https://gitcode.com/cann/pto-dsl/)：面向 PTO 的 Python 前端与 JIT 工作流探索
 
-## 📬 联系我们
+## 联系我们
 
 - **问题反馈**：通过仓库 Issues 提交问题
 - **功能建议**：通过仓库 Issues 或讨论区反馈需求
diff --git a/demos/README.md b/demos/README.md
index 435c41ad..f6830096 100644
--- a/demos/README.md
+++ b/demos/README.md
@@ -2,19 +2,35 @@
 
 This directory contains demonstration examples showing how to use PTO Tile Library in different scenarios.
 
+## Choose by Task
+
+| Your goal | Start here |
+|-----------|-----------|
+| Verify algorithms quickly (no hardware needed) | CPU simulation demos — `tests/run_cpu.py --demo` |
+| Learn PTO tile programming | CPU demos — `flash_attn` or `gemm` |
+| Production NPU operators | `baseline/` — full examples with PyTorch integration |
+| Just-in-time compilation and debugging | `torch_jit/` — JIT compilation examples |
+| Auto Mode | `auto_mode/baseline/add/` — Auto Mode example |
+
 ## Directory Structure
 
 ```
 demos/
-├── baseline/         # Production PyTorch operator examples (NPU)
-│   ├── add/          # Basic element-wise addition
-│   ├── gemm_basic/   # GEMM with pipeline optimization
-│   └── flash_atten/  # Flash Attention with dynamic tiling
-├── cpu/              # CPU simulation demos (cross-platform)
+├── baseline/                    # Production-grade PyTorch operator examples (NPU)
+│   ├── add/                     # Element-wise addition
+│   ├── gemm_basic/              # GEMM with pipeline optimization
+│   ├── flash_atten/             # Flash Attention with dynamic tiling
+│   └── allgather_async/         # Asynchronous AllGather
+│
+├── auto_mode/                   # Auto Mode examples (CPU / NPU compatible)
+│   └── baseline/add/            # Auto Mode element-wise addition
+│
+├── cpu/                         # CPU simulation demos (cross-platform, no Ascend hardware)
 │   ├── gemm_demo/
 │   ├── flash_attention_demo/
 │   └── mla_attention_demo/
-└── torch_jit/        # PyTorch JIT compilation examples
+│
+└── torch_jit/                   # PyTorch JIT compilation examples
     ├── add/
     ├── gemm/
     └── flash_atten/
@@ -28,15 +44,25 @@ Production-ready examples showing how to implement custom PTO kernels and expose
 
 **Supported Platforms**: A2/A3/A5
 
-**Examples**: Element-wise addition, GEMM with double-buffering pipeline, Flash Attention with automatic tile size selection.
+**Examples**:
+- Element-wise addition — the most basic PTO operator example
+- GEMM — matrix multiplication with double-buffering pipeline
+- Flash Attention — with automatic tile size selection
+- AllGather-Async — asynchronous AllGather communication
 
 ### 2. CPU Simulation (`cpu/`)
 
 Cross-platform examples that run on CPU (x86_64/AArch64) without requiring Ascend hardware. Ideal for algorithm prototyping, learning PTO programming model, and CI/CD testing.
 
-**Examples**: Basic GEMM, Flash Attention, Multi-Latent Attention.
+**Examples**: Basic GEMM, Flash Attention, Multi-head Latent Attention (MLA)
+
+### 3. Auto Mode (`auto_mode/`)
 
-### 3. PyTorch JIT (`torch_jit/`)
+Examples showcasing PTO Auto Mode. In Auto Mode, the compiler manages tile buffer address allocation and pipeline synchronization automatically, so no manual `TASSIGN` or `set_flag`/`wait_flag` calls are needed.
+
+**Examples**: Auto Mode element-wise addition
+
+### 4. PyTorch JIT (`torch_jit/`)
 
 Examples showing on-the-fly C++ compilation and direct integration with PyTorch tensors. Useful for rapid prototyping without pre-building wheels.
 
@@ -63,6 +89,13 @@ pip install dist/*.whl
 cd test && python3 test.py
 ```
 
+### Auto Mode Example
+
+```bash
+cd demos/auto_mode/baseline/add
+# See the README inside for build and run instructions
+```
+
 ### JIT Example
 
 ```bash
@@ -74,24 +107,21 @@ python add_compile_and_run.py
 ## Prerequisites
 
 **For Baseline and JIT (NPU)**:
-- Ascend AI Processor A2/A3/A5(910B/910C/950)
+- Ascend AI Processor A2/A3/A5 (910B/910C/950)
 - CANN Toolkit 8.5.0+
 - PyTorch with `torch_npu`
 - Python 3.8+, CMake 3.16+
 
 **For CPU Demos**:
-- C++ compiler with C++23 support
+- C++ compiler with C++20 support
 - CMake 3.16+
 - Python 3.8+ (optional)
 
-## Documentation
-
-- Getting Started: [docs/getting-started.md](../docs/getting-started.md)
-- Programming Tutorial: [docs/coding/tutorial.md](../docs/coding/tutorial.md)
-- ISA Reference: [docs/isa/README.md](../docs/isa/README.md)
-
-## Related
+## Related Documents
 
-- Manual Kernels: [kernels/manual/README.md](../kernels/manual/README.md)
-- Custom Operators: [kernels/custom/README.md](../kernels/custom/README.md)
-- Test Cases: [tests/README.md](../tests/README.md)
+| Document | Content |
+|----------|---------|
+| [demos/README_zh.md](./README_zh.md) | Chinese version |
+| [docs/getting-started.md](../docs/getting-started.md) | Getting started guide |
+| [docs/coding/tutorial.md](../docs/coding/tutorial.md) | Programming tutorial |
+| [docs/isa/README.md](../docs/isa/README.md) | ISA reference |
diff --git a/demos/README_zh.md b/demos/README_zh.md
index ebdf32f0..4d18766b 100644
--- a/demos/README_zh.md
+++ b/demos/README_zh.md
@@ -1,97 +1,127 @@
-# PTO 演示示例
-
-本目录包含演示示例，展示如何在不同场景中使用 PTO Tile Library。
-
-## 目录结构
-
-```
-demos/
-├── baseline/         # 生产级 PyTorch 算子示例（NPU）
-│   ├── add/          # 基础逐元素加法
-│   ├── gemm_basic/   # 带流水线优化的 GEMM
-│   └── flash_atten/  # 带动态分块的 Flash Attention
-├── cpu/              # CPU 模拟演示（跨平台）
-│   ├── gemm_demo/
-│   ├── flash_attention_demo/
-│   └── mla_attention_demo/
-└── torch_jit/        # PyTorch JIT 编译示例
-    ├── add/
-    ├── gemm/
-    └── flash_atten/
-```
-
-## 演示类别
-
-### 1. Baseline (`baseline/`)
-
-生产级示例，展示如何实现自定义 PTO 内核并通过 `torch_npu` 将其作为 PyTorch 算子公开。包含从内核实现到 Python 集成的完整工作流程，带 CMake 构建系统和 wheel 打包。
-
-**支持平台**：A2/A3/A5
-
-**示例**：逐元素加法、带双缓冲流水线的 GEMM、带自动 tile 大小选择的 Flash Attention。
-
-### 2. CPU 模拟 (`cpu/`)
-
-在 CPU（x86_64/AArch64）上运行的跨平台示例，无需 Ascend 硬件。适用于算法原型设计、学习 PTO 编程模型和 CI/CD 测试。
-
-**示例**：基础 GEMM、Flash Attention、多潜在注意力。
-
-### 3. PyTorch JIT (`torch_jit/`)
-
-展示即时 C++ 编译和与 PyTorch 张量直接集成的示例。适用于快速原型设计，无需预先构建 wheel。
-
-**示例**：JIT 加法、JIT GEMM、带基准测试套件的 JIT Flash Attention。
-
-## 快速开始
-
-### CPU 模拟（推荐第一步）
-
-```bash
-python3 tests/run_cpu.py --demo gemm --verbose
-python3 tests/run_cpu.py --demo flash_attn --verbose
-```
-
-### NPU Baseline 示例
-
-```bash
-cd demos/baseline/add
-python -m venv virEnv && source virEnv/bin/activate
-pip install -r requirements.txt
-export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
-python3 setup.py bdist_wheel
-pip install dist/*.whl
-cd test && python3 test.py
-```
-
-### JIT 示例
-
-```bash
-export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
-cd demos/torch_jit/add
-python add_compile_and_run.py
-```
-
-## 前置要求
-
-**Baseline 和 JIT（NPU）**：
-- Ascend AI 处理器 A2/A3/A5（910B/910C/950）
-- CANN Toolkit 8.5.0+
-- 带 `torch_npu` 的 PyTorch
-- Python 3.8+、CMake 3.16+
-
-**CPU 演示**：
-- 支持 C++23 的 C++ 编译器
-- CMake 3.16+
-- Python 3.8+（可选）
-
-## 文档
-
-- 入门指南：[docs/getting-started.md](../docs/getting-started_zh.md)
-- 编程教程：[docs/coding/tutorial.md](../docs/coding/tutorial_zh.md)
-- ISA 参考：[docs/isa/README.md](../docs/isa/README_zh.md)
-
-## 相关
-
-- 手工内核：[kernels/manual/README.md](../kernels/manual/README_zh.md)
-- 自定义算子：[kernels/custom/README.md](../kernels/custom/README_zh.md)
-- 测试用例：[tests/README.md](../tests/README_zh.md)
+# PTO Demos
+
+本目录包含 PTO Tile Library 在不同场景下的演示示例。
+
+## 按任务选择
+
+| 你的目标 | 从这里开始 |
+|----------|----------|
+| 快速验证算法（无需硬件） | CPU 模拟 demo — `tests/run_cpu.py --demo` |
+| 学习 PTO tile 编程 | CPU demo — `flash_attn` 或 `gemm` |
+| 生产级 NPU 算子 | `baseline/` — 带 PyTorch 集成的完整示例 |
+| 即时编译与调试 | `torch_jit/` — JIT 编译示例 |
+| Auto Mode | `auto_mode/baseline/add/` — Auto Mode 示例 |
+
+## 目录结构
+
+```
+demos/
+├── baseline/                    # 生产级 PyTorch 算子示例（NPU）
+│   ├── add/                     # 逐元素加法
+│   ├── gemm_basic/              # GEMM（含流水线优化）
+│   ├── flash_atten/             # Flash Attention（含动态分块）
+│   └── allgather_async/         # 异步 AllGather
+│
+├── auto_mode/                   # Auto Mode 示例（CPU / NPU 均可）
+│   └── baseline/add/            # Auto Mode 逐元素加法
+│
+├── cpu/                         # CPU 模拟 demo（跨平台，无需 Ascend 硬件）
+│   ├── gemm_demo/
+│   ├── flash_attention_demo/
+│   └── mla_attention_demo/
+│
+└── torch_jit/                   # PyTorch JIT 编译示例
+    ├── add/
+    ├── gemm/
+    └── flash_atten/
+```
+
+## 示例类别
+
+### 1. Baseline（`baseline/`）
+
+生产级示例，展示如何实现自定义 PTO kernel 并通过 `torch_npu` 将其作为 PyTorch 算子公开。包含从 kernel 实现到 Python 集成的完整工作流程，带 CMake 构建系统和 wheel 打包。
+
+**支持平台**：A2/A3/A5
+
+**示例**：
+- 逐元素加法 — 最基础的 PTO 算子示例
+- GEMM — 带双缓冲流水线的矩阵乘法
+- Flash Attention — 带自动 tile 大小选择的 Flash Attention
+- AllGather-Async — 异步 AllGather 通信
+
+### 2. CPU 模拟（`cpu/`）
+
+在 CPU（x86_64/AArch64）上运行的跨平台示例，无需 Ascend 硬件。适用于算法原型设计、学习 PTO 编程模型和 CI/CD 测试。
+
+**示例**：基础 GEMM、Flash Attention、多头潜在注意力（MLA）
+
+### 3. Auto Mode（`auto_mode/`）
+
+展示 PTO Auto Mode 的示例。Auto Mode 下编译器自动管理 tile buffer 地址分配与流水线同步，无需手动编写 `TASSIGN` 和 `set_flag`/`wait_flag`。
+
+**示例**：Auto Mode 逐元素加法
+
+### 4. PyTorch JIT（`torch_jit/`）
+
+展示即时 C++ 编译和与 PyTorch 张量直接集成的示例。适用于快速原型设计，无需预先构建 wheel。
+
+**示例**：JIT 加法、JIT GEMM、带基准测试套件的 JIT Flash Attention
+
+## 快速开始
+
+### CPU 模拟（推荐第一步）
+
+```bash
+python3 tests/run_cpu.py --demo gemm --verbose
+python3 tests/run_cpu.py --demo flash_attn --verbose
+```
+
+### NPU Baseline 示例
+
+```bash
+cd demos/baseline/add
+python -m venv virEnv && source virEnv/bin/activate
+pip install -r requirements.txt
+export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
+python3 setup.py bdist_wheel
+pip install dist/*.whl
+cd test && python3 test.py
+```
+
+### Auto Mode 示例
+
+```bash
+cd demos/auto_mode/baseline/add
+# See the README_zh.md inside for build and run instructions
+```
+
+### JIT 示例
+
+```bash
+export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
+cd demos/torch_jit/add
+python add_compile_and_run.py
+```
+
+## 前置要求
+
+**Baseline 和 JIT（NPU）**：
+- Ascend AI 处理器 A2/A3/A5（910B/910C/950）
+- CANN Toolkit 8.5.0+
+- 带 `torch_npu` 的 PyTorch
+- Python 3.8+、CMake 3.16+
+
+**CPU 演示**：
+- 支持 C++20 的 C++ 编译器
+- CMake 3.16+
+- Python 3.8+（可选）
+
+## 相关文档
+
+| 文档 | 内容 |
+|------|------|
+| [demos/README.md](./README.md) | 英文版入口 |
+| [docs/getting-started_zh.md](../docs/getting-started_zh.md) | 入门指南 |
+| [docs/coding/tutorial_zh.md](../docs/coding/tutorial_zh.md) | 编程教程 |
+| [docs/isa/README_zh.md](../docs/isa/README_zh.md) | ISA 参考 |
diff --git a/docs/README.md b/docs/README.md
index a7f2f461..22fadfff 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,81 +1,283 @@
+# PTO ISA Documentation
+
 <p align="center">
-  <img src="figures/pto_logo.svg" alt="PTO Tile Lib" width="200" />
+  <img src="figures/pto_logo.svg" alt="PTO ISA" width="200" />
 </p>
 
-# PTO ISA Documentation Guide
+**PTO ISA** (Parallel Tile Operation Instruction Set Architecture) defines a stable, machine-independent instruction set for Huawei Ascend NPUs. It sits between high-level frontends (C/C++, Python, TileLang, PyPTO) and target-specific backends, providing one versioned instruction language across Ascend generations.
+
+> **Documentation version:** PTO ISA 1.0
+> **Applicable targets:** CPU Simulator · A2/A3 (Ascend 910B/910C) · A5 (Ascend 950)
+
+---
+
+## Quick Navigation
+
+Use this page as a **reading guide**, not a table of contents. The manual is organized into five layers — start at the layer that matches your goal.
+
+### Five-Layer Structure
+
+| Layer | Contents | Audience |
+|-------|----------|----------|
+| **1. Foundations** | Introduction, programming model, machine model | Everyone — start here |
+| **2. Syntax and Semantics** | Assembly model, operands, types, memory model | Kernel authors, compiler developers |
+| **3. Instruction Surface** | Instruction-set overview and contracts | All users |
+| **4. Reference Manual** | Tile, vector, scalar, and communication reference | Performance engineers, kernel authors |
+| **5. Appendices** | Format guidelines, diagnostics, glossary, portability | Everyone |
+
+### By Instruction Set
+
+| Instruction Set | Prefix | Role | Count | Reference |
+|----------------|--------|------|-------|-----------|
+| **Tile** | `pto.t*` | Tile-oriented compute, data movement, layout transforms, synchronization | ~120 ops | [Tile reference](isa/tile/README.md) |
+| **Vector** | `pto.v*` | Low-level vector micro-instructions, per-lane masking, pipeline control | ~99 ops | [Vector reference](isa/vector/README.md) |
+| **Scalar & Control** | `pto.*` | Configuration, control flow, DMA setup, predicate operations | ~60 ops | [Scalar reference](isa/scalar/README.md) |
+| **Communication** | `pto.*` | Multi-NPU collective operations and runtime support | ~24 ops | [Communication reference](isa/other/README.md) |
+
+### By Task
+
+| What you're doing | Start here |
+|-------------------|------------|
+| Understanding PTO's place in the stack | [What is PTO ISA?](isa/introduction/what-is-pto-visa.md) |
+| Writing a matrix multiplication kernel | [Tile → Matrix ops](isa/tile/matrix-and-matrix-vector.md) |
+| Optimizing elementwise operations | [Tile → Elementwise ops](isa/tile/elementwise-tile-tile.md) |
+| Implementing a convolution kernel | [Tile → img2col](isa/tile/ops/layout-and-rearrangement/timg2col.md) |
+| Setting up data movement (GM ↔ tile) | [Tile memory ops](isa/tile/memory-and-data-movement.md) |
+| Hand-tuning vector kernels | [Vector instructions](isa/vector/README.md) |
+| Using per-lane masking and predicates | [Vector → Predicate ops](isa/vector/predicate-and-materialization.md) |
+| Implementing collective communication | [Communication instructions](isa/other/README.md) |
+| Sorting, quantization, or histogram ops | [Irregular ops](isa/tile/irregular-and-complex.md) |
+| Letting the compiler manage synchronization | [Auto vs Manual mode](isa/programming-model/auto-vs-manual.md) |
+| Managing pipeline synchronization manually | [Synchronization model](isa/machine-model/ordering-and-synchronization.md) |
+| Checking which types/features are on A5 vs A2/A3 | [Target profiles](isa/machine-model/execution-agents.md) |
+| Reading a per-instruction page for the first time | [Format of instruction descriptions](isa/reference/format-of-instruction-descriptions.md) |
+
+---
+
+## Get Started
+
+New to PTO? Follow this path:
+
+1. **[What is PTO ISA?](isa/introduction/what-is-pto-visa.md)** — Core concepts, design rationale, and where PTO fits in the software stack
+2. **[Programming Model: Tiles and Valid Regions](isa/programming-model/tiles-and-valid-regions.md)** — The tile abstraction that makes PTO tile-first
+3. **[Machine Model: Execution Agents and Profiles](isa/machine-model/execution-agents.md)** — Execution hierarchy, pipelines, target profiles, and synchronization
+4. **[Instruction Set Overview](isa/instruction-surfaces/README.md)** — High-level map of all four instruction sets and when to use each
+5. **[Per-Instruction Reference](isa/README.md)** — Complete catalog organized by category
+
+---
+
+## What is PTO ISA?
+
+PTO ISA is the stable instruction language for Ascend NPU software. It abstracts away hardware differences across A2/A3/A5 generations while preserving enough control for performance tuning.
+
+```
+Source Languages
+(C/C++, Python, TileLang, PyPTO, code generators)
+        │
+        ▼
+   PTO instructions (.pto text)
+        │
+        ├──► ptoas ──► C++ ──► bisheng ──► binary   (Flow A: via C++ intermediate)
+        │
+        └──► ptoas ────────────────────► binary        (Flow B: direct assemble)
+
+Targets: CPU simulation / A2/A3 (Ascend 910B / 910C) / A5 (Ascend 950 PR / 950 DT)
+```
+
+### Tile vs Vector: When To Use Which?
+
+| Criteria | Tile Instructions (`pto.t*`) | Vector Instructions (`pto.v*`) |
+|----------|-------------------------------|--------------------------------|
+| **Typical use** | Dense tensor algebra, matmul, elementwise operations | Fine-grained vector-pipe control, per-lane masking |
+| **Data movement** | `TLOAD`/`TSTORE` (implicit tile↔UB) | `copy_gm_to_ubuf` + `vlds`/`vsts` + `copy_ubuf_to_gm` |
+| **Synchronization** | `TSYNC`, `set_flag`/`wait_flag` | `set_flag`/`wait_flag` on vector pipe, `mem_bar` |
+| **Layout control** | Via tile layout parameters (`RowMajor`, `ColMajor`, fractal) | Via distribution mode (`NORM`, `BRC`, `DS`, etc.) |
+| **Predication** | No per-lane masking (valid region is coarse-grained) | Full per-lane predicate mask on every operation |
+| **Target portability** | All profiles (CPU, A2/A3, A5) | A5 hardware; emulated on CPU/A2/A3 |
+| **Abstraction level** | High-level tile semantics, valid regions | Low-level vector registers, explicit UB staging |
+
+> **Rule of thumb:** Start with tile instructions for tensor operations. Drop to vector instructions only when you need per-lane masking, custom data layouts, or micro-optimization that tile instructions cannot express.
+
+---
+
+## Core Concepts
+
+Understanding these concepts is essential before reading per-instruction pages.
+
+### Tile
+
+A **tile** is a bounded multi-dimensional array fragment with architecturally visible shape, layout, and valid-region metadata. Tiles are the primary programming objects in PTO.
+
+```cpp
+Tile<Vec, float, 16, 16> a;  // 16×16 f32 tile in vector tile buffer (UB)
+Tile<Left, f16, 64, 64> b;   // 64×64 f16 left operand (L0A)
+Tile<Acc, i32, 128, 128> c;  // 128×128 i32 accumulator (L0C)
+```
+
+[Learn more →](isa/programming-model/tiles-and-valid-regions.md)
+
+### Valid Region
+
+The **valid region** `(Rv, Cv)` is the subset of a tile's declared shape that contains meaningful data. Operations iterate over the destination tile's valid region; source tiles with smaller valid regions yield implementation-defined values outside their valid region.
+
+### TileType (Location Intent)
+
+The **TileType** determines which hardware buffer backs a tile:
+
+| TileType | Hardware Buffer | Capacity | Typical Use |
+|----------|----------------|----------|-------------|
+| `Vec` | Unified Buffer (UB) | 256 KB | General elementwise operations |
+| `Left` | L0A | 64 KB | Matmul A operand |
+| `Right` | L0B | 64 KB | Matmul B operand |
+| `Acc` | L0C | 256 KB | Matmul accumulator/output |
+| `Mat` | L1 | 512 KB | 2D matrix operands |
+
+### GlobalTensor
+
+A **GlobalTensor** is a view of off-chip device memory (`__gm__` address space). All data movement between GM and tile buffers happens through explicit `TLOAD`/`TSTORE` or DMA operations.
 
-This page is the main documentation entry for PTO Tile Lib. It helps readers locate documents by topic instead of navigating directories one by one.
+[Learn more →](isa/programming-model/globaltensor-and-data-movement.md)
 
-The PTO documentation mainly covers the following areas:
+### Auto vs Manual Mode
 
-- ISA fundamentals and an overall reading path
-- Instruction indexes and per-instruction reference pages
-- PTO assembly syntax and the PTO-AS specification
-- Tile programming model, event synchronization, and performance tuning
-- Getting started, test execution, and documentation build instructions
+| Mode | Resource Binding | Synchronization | Data Movement | Who manages it? |
+|------|-----------------|-----------------|---------------|-----------------|
+| **Auto** | Compiler inserts `TASSIGN` | Compiler inserts `TSYNC` | Compiler inserts `TLOAD`/`TSTORE` | Compiler |
+| **Manual** | Author writes `TASSIGN` explicitly | Author writes `TSYNC` explicitly | Author writes `TLOAD`/`TSTORE` explicitly | You |
 
-## Recommended Reading Path
+[Auto vs Manual →](isa/programming-model/auto-vs-manual.md)
 
-If you are new to PTO Tile Lib, we recommend reading in the following order:
+### Target Profiles
 
-1. [Getting Started](getting-started.md): set up the environment and run the CPU simulator first
-2. [ISA Overview](PTOISA.md): build an overall understanding of the PTO ISA
-3. [PTO Instruction List](isa/README.md): browse the standard operations by category
-4. [Tile Programming Model](coding/Tile.md): understand tile shape, tile mask, and data organization
-5. [Events and Synchronization](coding/Event.md): understand set/wait flag usage and pipeline synchronization
-6. [Performance Optimization](coding/opt.md): understand common bottlenecks and tuning directions
+PTO ISA is instantiated by concrete **target profiles** that narrow the accepted subset for a specific backend.
 
-## Documentation Categories
+| Feature | CPU Simulator | A2/A3 Profile | A5 Profile |
+|---------|:--------------:|:--------------:|:----------:|
+| Tile instructions (`pto.t*`) | Full | Full | Full |
+| Vector instructions (`pto.v*`) | Emulated | Emulated | Full hardware |
+| Matmul / CUBE ops | Software fallback | Hardware | Hardware |
+| Vector width (f32 / f16, bf16 / i8) | Configurable | 64 / 128 / 256 | 64 / 128 / 256 |
+| FP8 types (`f8e4m3`, `f8e5m2`) | — | — | Supported |
+| Fractal layouts (NZ/ZN/FR/RN) | Simulated | Simulated | Full |
+| Block-scoped collective comm | — | Supported | Supported |
 
-### 1. ISA and Instruction Reference
+---
 
-- [Virtual ISA Manual Entry](PTO-Virtual-ISA-Manual.md): top-level entry for the PTO ISA manual
-- [ISA Overview](PTOISA.md): background, goals, and overall structure of the PTO ISA
-- [PTO Instruction List](isa/README.md): index of PTO standard operations organized by category
-- [General Conventions](isa/conventions.md): common naming rules, constraints, and usage conventions
+## Instruction-Set Navigation Map
 
-### 2. PTO Assembly and Representation
+PTO groups its instructions into four named instruction sets. Each set has a **contract page** (shared rules) and **per-op pages** (individual instructions).
 
-- [PTO Assembly Index](assembly/README.md): entry for PTO-AS documentation
-- [PTO Assembly Syntax (PTO-AS)](assembly/PTO-AS.md): PTO assembly syntax and specification
+### Tile Instruction Set — `pto.t*`
 
-### 3. Programming Model and Development Notes
+```
+Tile Instruction Set
+├── Sync and Config             → tassign, tsync, tsettf32mode, tsetfmatrix, tset_img2col_*, tsubview, tget_scale_addr
+├── Elementwise Tile-Tile       → tadd, tsub, tmul, tdiv, tmin, tmax, tcmp, tcvt, tsel, tlog, trecip, texp, tsqrt, trsqrt, trem, tfmod, tabs, tand, tor, txor, tnot, tneg, tprelu, taddc, tsubc, tshl, tshr
+├── Tile-Scalar and Immediate   → tadds, tsubs, tmuls, tdivs, tminmaxs, tcmps, tsels, texpands, tfmods, trems, tands, tors, txors, tshls, tshrs, tlrelu, taddsc, tsubsc
+├── Reduce and Expand           → trowsum, tcolsum, trowprod, tcolprod, tcolmax, tcolmin, trowmax, trowmin, tcolargmax, tcolargmin, trowargmax, trowargmin
+│                               → trowexpand, trowexpandadd, trowexpanddiv, trowexpandmul, trowexpandsub, trowexpandmax, trowexpandmin, trowexpandexpdif
+│                               → tcolexpand, tcolexpandadd, tcolexpanddiv, tcolexpandmul, tcolexpandsub, tcolexpandmax, tcolexpandmin, tcolexpandexpdif
+├── Memory and Data Movement    → tload, tprefetch, tstore, tstore_fp, mgather, mscatter
+├── Matrix and Matrix-Vector    → tgemv, tgemv_mx, tgemv_acc, tgemv_bias, tmatmul, tmatmul_mx, tmatmul_acc, tmatmul_bias
+├── Layout and Rearrangement    → tmov, tmov_fp, ttrans, textract, textract_fp, tinsert, tinsert_fp, timg2col, tfillpad, tfillpad_inplace, tfillpad_expand, treshape
+└── Irregular and Complex       → tprint, tmrgsort, tsort32, tgather, tgatherb, tscatter, tci, ttri, tpartadd, tpartmul, tpartmax, tpartmin, tquant
+```
 
-- [Development Documentation Index](coding/README.md): entry for developer-facing PTO Tile Lib documentation
-- [Tile Programming Model](coding/Tile.md): tile shape, tile mask, and data layout
-- [Events and Synchronization](coding/Event.md): event recording, waiting, and synchronization behavior
-- [Performance Optimization](coding/opt.md): performance analysis and tuning guidance
+[Tile instruction set contract →](isa/instruction-families/tile-families.md)
 
-### 4. Getting Started, Testing, and Documentation Build
+### Vector Instruction Set — `pto.v*`
+
+```
+Vector Instruction Set
+├── Vector Load Store           → vlds, vldas, vldus, vldx2, vsld, vsldb, vgather2, vgatherb, vgather2_bc
+│                               → vsts, vstx2, vsst, vsstb, vscatter, vsta, vstas, vstar, vstu, vstus, vstur
+├── Predicate and Materialization → vbr, vdup
+├── Unary Vector Ops            → vabs, vneg, vexp, vln, vsqrt, vrsqrt, vrec, vrelu, vnot, vbcnt, vcls, vmov
+├── Binary Vector Ops            → vadd, vsub, vmul, vdiv, vmax, vmin, vand, vor, vxor, vshl, vshr, vaddc, vsubc
+├── Vector-Scalar Ops           → vadds, vsubs, vmuls, vmaxs, vmins, vands, vors, vxors, vshls, vshrs, vlrelu, vaddcs, vsubcs
+├── Conversion Ops               → vci, vcvt, vtrc
+├── Reduction Ops               → vcadd, vcmax, vcmin, vcgadd, vcgmax, vcgmin, vcpadd
+├── Compare and Select          → vcmp, vcmps, vsel, vselr, vselrv2
+├── Data Rearrangement          → vintlv, vdintlv, vslide, vshift, vsqz, vusqz, vperm, vpack, vsunpack, vzunpack, vintlvv2, vdintlvv2
+└── SFU and DSA                 → vprelu, vexpdiff, vaddrelu, vsubrelu, vaxpy, vaddreluconv, vmulconv, vmull, vmula, vtranspose, vsort32, vbitsort, vmrgsort
+```
 
-- [Getting Started](getting-started.md): environment setup and CPU / NPU execution guide
-- [Test Guide](../tests/README.md): test entry points, scripts, and common commands
-- [Documentation Build Guide](mkdocs/README.md): how to build the docs locally with MkDocs
+[Vector instruction set contract →](isa/instruction-families/vector-families.md)
 
-### 5. Other Related Documents
+### Scalar and Control Instruction Set — `pto.*`
 
-- [Machine Documentation](machine/README.md): abstract machine model and related notes
+```
+Scalar and Control Instruction Set
+├── Control and Configuration   → nop, barrier, yield; tsettf32mode, tsethf32mode, tsetfmatrix
+├── Pipeline Sync              → set_flag, wait_flag, wait_flag_dev, pipe_barrier, mem_bar, get_buf, rls_buf, set_cross_core, set_intra_block, wait_intra_core
+├── DMA Copy                   → set_loop_size_outtoub, set_loop1/2_stride_outtoub
+│                               → set_loop_size_ubtoout, set_loop1/2_stride_ubtoout
+│                               → copy_gm_to_ubuf, copy_ubuf_to_gm, copy_ubuf_to_ubuf
+├── Predicate Load Store        → pld, plds, pldi, psts, pst, psti, pstu
+├── Predicate Generation        → pset_b8/b16/b32, pge_b8/b16/b32, plt_b8/b16/b32
+│                               → pand, por, pxor, pnot, psel, ppack, punpack
+│                               → pdintlv_b8, pintlv_b16
+├── Shared Arithmetic           → Scalar arithmetic ops shared across instruction sets
+├── Shared SCF                 → Scalar structured control flow
+└── Micro-Instructions          → BlockDim queries, pointer ops, vector scope, alignment state
+    Micro-instruction summary: isa/vector/micro-instruction-summary.md
+```
 
-## Directory Structure
+[Scalar instruction set contract →](isa/instruction-families/scalar-and-control-families.md)
 
-Key entries are listed below:
+### Communication Instruction Set — `pto.*`
 
-```text
-├── isa/                        # PTO instruction reference and category indexes
-├── assembly/                   # PTO assembly syntax and PTO-AS specification
-├── coding/                     # Programming model, development, and optimization docs
-├── auto_mode/                  # Auto Mode related documents
-├── machine/                    # Abstract machine model documents
-├── mkdocs/                     # Documentation site build config and scripts
-├── figures/                    # Images and diagram assets used in docs
-├── README*                     # Documentation entry pages
-├── PTOISA*                     # ISA overview documents
-└── getting-started*            # Getting started guides
 ```
+Communication Instruction Set
+├── Collective Ops              → tbroadcast, tget, tget_async, tput, tput_async
+│                               → tscatter, tgather, treduce, ttest, twait, tnotify
+└── Non-ISA Supporting Ops      → talias, taxpy, tconcat, tdequant, tfree, thistogram
+                                → tpack, tpop, tpush, trandom
+```
+
+[Communication instruction set contract →](isa/instruction-families/other-families.md)
+
+---
+
+## Compilation Flows
+
+### Flow A: High-Level Compile (ptoas → C++ → bisheng → binary)
+
+High-level frontends emit `.pto` text files. `ptoas` parses and validates them, then lowers them to C++ code that calls the `pto-isa` C++ library. A backend compiler (bisheng) then compiles that C++ into the final binary.
+
+**Who uses this:** Compiler developers, library authors, high-level framework integrators. The `.pto` format is portable and cacheable.
+
+### Flow B: Direct Assemble (ptoas → binary)
+
+`ptoas` assembles `.pto` input directly into the target binary, bypassing the C++ intermediate step.
+
+**Who uses this:** Performance engineers who need direct control over the final instruction stream, or toolchains that embed `ptoas` as a pure assembler.
+
+[Learn more about the compilation flows →](isa/introduction/what-is-pto-visa.md#two-compilation-flows)
+
+---
+
+## Key References
+
+| Reference | What it covers |
+|-----------|---------------|
+| **[PTO-AS Specification](assembly/PTO-AS.md)** | Assembly syntax and grammar for `.pto` text files |
+| **[Tile Programming Model](coding/Tile.md)** | Tile shape, tile mask, and data organization |
+| **[Events and Synchronization](coding/Event.md)** | set/wait flag and pipeline synchronization |
+| **[Performance Optimization](coding/opt.md)** | Bottleneck analysis and tuning guidance |
+| **[Auto Mode Overview](auto_mode/Auto_Mode_Overview.md)** | Compiler-driven resource management and synchronization |
+| **[Micro-Instruction Summary](isa/vector/micro-instruction-summary.md)** | Scalar micro-instructions: BlockDim, pointer ops, vector scope |
+| **[Portability and Target Profiles](isa/reference/portability-and-target-profiles.md)** | Which features exist on which target |
+| **[Glossary](isa/reference/glossary.md)** | Terminology reference |
+| **[Source of Truth](isa/reference/source-of-truth.md)** | Which files define authoritative semantics |
+| **[Build the Docs](mkdocs/README.md)** | Generate this site locally |
+
+---
+
+## Contributing
+
+This documentation is generated from the canonical PTO ISA specification at [github.com/PTO-ISA/pto-isa](https://github.com/PTO-ISA/pto-isa). Report issues and submit changes there.
 
-## Related Entry Points
+---
 
-- [Root README](../README.md): project overview, quick start, and repository entry page
-- [kernels Directory Guide](../kernels/README.md): kernel and operator implementation entry point
-- [include Directory Guide](../include/README.md): headers and public interface overview
-- [tests Directory Guide](../tests/README.md): testing and execution entry point
+*PTO ISA is part of the Ascend software stack. Copyright © Huawei Technologies Co., Ltd.*
diff --git a/docs/README_zh.md b/docs/README_zh.md
index 5802cf0e..6b547fb8 100644
--- a/docs/README_zh.md
+++ b/docs/README_zh.md
@@ -1,81 +1,284 @@
+# PTO ISA 文档导航
+
 <p align="center">
-  <img src="figures/pto_logo.svg" alt="PTO Tile Lib" width="200" />
+  <img src="figures/pto_logo.svg" alt="PTO ISA" width="200" />
 </p>
 
-# PTO ISA 文档导航
+**PTO ISA**（Parallel Tile Operation Instruction Set Architecture，并行 Tile 操作指令集架构）是昇腾 NPU 的稳定、跨代际的机器无关指令集。它位于高层前端（C/C++、Python、TileLang、PyPTO）与目标特定后端之间，在昇腾各代际（A2/A3/A5）间提供统一、版本化的指令语言。
+
+> **文档版本：** PTO ISA 1.0
+> **适用目标：** CPU 模拟器 · A2/A3（Ascend 910B/910C） · A5（Ascend 950）
+
+---
+
+## 快速导航
+
+使用本页面作为**阅读指南**，而非目录索引。手册按五个逻辑层次组织——从与你的目标匹配的层次开始阅读。
+
+### 五层结构
+
+| 层次 | 内容 | 受众 |
+|------|------|------|
+| **1. 基础** | 引言、编程模型、机器模型 | 所有人——从此开始 |
+| **2. 语法与语义** | 汇编模型、操作数、类型系统、内存模型 | 内核作者、编译器开发者 |
+| **3. 指令集概述** | 指令集总览与指令族契约 | 所有用户 |
+| **4. 参考手册** | Tile、Vector、Scalar、通信参考 | 性能工程师、内核作者 |
+| **5. 附录** | 格式指南、诊断、术语表、可移植性 | 所有人 |
+
+### 按指令集导航
+
+| 指令集 | 前缀 | 职责 | 数量 | 参考 |
+|--------|------|------|------|------|
+| **Tile** | `pto.t*` | Tile 级计算、数据搬运、布局变换、同步 | ~120 条 | [Tile 参考](isa/tile/README_zh.md) |
+| **Vector** | `pto.v*` | 向量微指令、lane 级 mask、流水线控制 | ~99 条 | [Vector 参考](isa/vector/README_zh.md) |
+| **标量与控制** | `pto.*` | 配置、控制流、DMA 设置、谓词操作 | ~60 条 | [Scalar 参考](isa/scalar/README_zh.md) |
+| **通信** | `pto.*` | 多 NPU 集体通信与运行时支撑 | ~24 条 | [通信参考](isa/other/README_zh.md) |
+
+### 按任务导航
+
+| 你的任务 | 从这里开始 |
+|----------|----------|
+| 理解 PTO 在软件栈中的位置 | [PTO ISA 是什么？](isa/introduction/what-is-pto-visa_zh.md) |
+| 编写矩阵乘法 kernel | [Tile → 矩阵运算](isa/tile/matrix-and-matrix-vector_zh.md) |
+| 优化逐元素运算 | [Tile → 逐元素](isa/tile/elementwise-tile-tile_zh.md) |
+| 实现卷积 kernel | [Tile → img2col](isa/tile/ops/layout-and-rearrangement/timg2col_zh.md) |
+| 设置数据搬运（GM ↔ tile） | [Tile 内存操作](isa/tile/memory-and-data-movement_zh.md) |
+| 手写向量 kernel | [Vector 指令](isa/vector/README_zh.md) |
+| 使用 lane 级 mask 与谓词 | [Vector → 谓词操作](isa/vector/predicate-and-materialization_zh.md) |
+| 实现多 NPU 集体通信 | [通信指令](isa/other/README_zh.md) |
+| 排序、量化或直方图操作 | [非常规操作](isa/tile/irregular-and-complex_zh.md) |
+| 让编译器管理同步 | [Auto vs Manual 模式](isa/programming-model/auto-vs-manual_zh.md) |
+| 手动管理流水线同步 | [同步模型](isa/machine-model/ordering-and-synchronization_zh.md) |
+| 查询 A5 vs A2/A3 支持的数据类型/特性 | [目标 Profile](isa/machine-model/execution-agents_zh.md) |
+| 首次阅读单条指令页面 | [指令描述格式](isa/reference/format-of-instruction-descriptions_zh.md) |
+
+---
+
+## 新手上路
+
+初次接触 PTO？按以下路径阅读：
+
+1. **[PTO ISA 是什么？](isa/introduction/what-is-pto-visa_zh.md)** — 核心概念、设计理念、PTO 在软件栈中的位置
+2. **[编程模型：Tile 与有效区域](isa/programming-model/tiles-and-valid-regions_zh.md)** — Tile 抽象——PTO 的核心编程对象
+3. **[机器模型：执行代理与 Profile](isa/machine-model/execution-agents_zh.md)** — 执行层次、流水线、目标 Profile 与同步
+4. **[指令集概述](isa/instruction-surfaces/README_zh.md)** — 四大指令集总览及选用指南
+5. **[逐指令参考](isa/README_zh.md)** — 按类别组织的完整指令目录
+
+---
+
+## 什么是 PTO ISA？
+
+PTO ISA 是昇腾 NPU 软件栈的稳定指令语言。它抽象了 A2/A3/A5 各代际间的硬件差异，同时保留了充足的性能调优控制能力。
+
+```
+高层语言
+（C/C++、Python、TileLang、PyPTO、代码生成器）
+        │
+        ▼
+   PTO 指令（.pto 文本）
+        │
+        ├──► ptoas ──► C++ ──► bisheng ──► 二进制   （Flow A：经 C++ 中间层）
+        │
+        └──► ptoas ──────────────────────► 二进制        （Flow B：直接汇编）
+
+目标平台：CPU 模拟器 / A2A3（Ascend 910B / 910C）/ A5（Ascend 950）
+```
+
+### Tile 指令 vs Vector 指令：何时选哪个？
+
+| 判断标准 | Tile 指令（`pto.t*`） | Vector 指令（`pto.v*`） |
+|----------|----------------------|------------------------|
+| **典型用途** | 密集张量代数、矩阵乘法、逐元素运算 | 细粒度向量流水线控制、lane 级 mask |
+| **数据搬运** | `TLOAD`/`TSTORE`（隐式 tile↔UB） | `copy_gm_to_ubuf` + `vlds`/`vsts` + `copy_ubuf_to_gm` |
+| **同步方式** | `TSYNC`、`set_flag`/`wait_flag` | `set_flag`/`wait_flag`（向量流水线）、`mem_bar` |
+| **布局控制** | 通过 tile 布局参数（`RowMajor`、`ColMajor`、分形布局） | 通过 distribution mode（`NORM`、`BRC`、`DS` 等） |
+| **谓词** | 无 lane 级 mask（有效区域是粗粒度的） | 每个操作都支持完整 lane 级谓词 mask |
+| **目标可移植性** | 所有 Profile（CPU、A2/A3、A5） | A5 硬件支持；CPU/A2/A3 为仿真 |
+| **抽象层级** | 高层 tile 语义、有效区域 | 低层向量寄存器、显式 UB 暂存 |
+
+> **经验法则：** 张量运算优先使用 tile 指令。只有在需要 lane 级 mask、自定义数据布局或 tile 指令无法表达的性能微调时，才降级到向量指令。
+
+---
+
+## 核心概念
+
+阅读逐指令页面之前，以下概念必不可少。
+
+### Tile
+
+**Tile** 是带有架构可见 shape、layout 和有效区域元数据的受限多维数组片段。Tile 是 PTO 中的主要编程对象。
+
+```cpp
+Tile<Vec, float, 16, 16> a;  // 16×16 f32 tile，位于向量 tile buffer（UB）
+Tile<Left, f16, 64, 64> b;   // 64×64 f16 左操作数（L0A）
+Tile<Acc, i32, 128, 128> c; // 128×128 i32 累加器（L0C）
+```
+
+[了解更多 →](isa/programming-model/tiles-and-valid-regions_zh.md)
+
+### 有效区域（Valid Region）
 
-这里是 PTO Tile Lib 的文档入口页，用于帮助读者按主题快速定位文档，而不是逐个目录查找。
+**有效区域** `(Rv, Cv)` 是 tile 声明形状中含有有效数据的子集。操作在目标 tile 的有效区域内迭代；源 tile 有效区域外的值为实现定义。
 
-PTO 相关文档主要覆盖以下几类内容：
+### TileType（位置意图）
 
-- ISA 基础概念与整体阅读路径
-- 指令索引与逐条指令参考
-- PTO 汇编语法与 PTO-AS 规范
-- Tile 编程模型、事件同步与性能优化
-- 快速开始、测试运行与文档构建说明
+**TileType** 决定 tile 由哪种硬件 buffer 支撑：
 
-## 建议阅读路径
+| TileType | 硬件 Buffer | 容量 | 典型用途 |
+|----------|------------|------|----------|
+| `Vec` | Unified Buffer（UB） | 256 KB | 通用逐元素运算 |
+| `Left` | L0A | 64 KB | 矩阵乘法 A 操作数 |
+| `Right` | L0B | 64 KB | 矩阵乘法 B 操作数 |
+| `Acc` | L0C | 256 KB | 矩阵乘法累加器/输出 |
+| `Mat` | L1 | 512 KB | 2D 矩阵操作数 |
 
-如果您第一次接触 PTO Tile Lib，建议按以下顺序阅读：
+### GlobalTensor
 
-1. [快速开始指南](getting-started_zh.md)：先完成环境准备并运行 CPU Simulator
-2. [ISA 总览](PTOISA_zh.md)：建立对 PTO ISA 的整体认识
-3. [PTO 指令列表](isa/README_zh.md)：按类别浏览已定义的标准操作
-4. [Tile 编程模型](coding/Tile_zh.md)：理解 tile shape、tile mask 与数据组织方式
-5. [事件与同步](coding/Event_zh.md)：理解 set/wait flag 与流水线同步
-6. [性能优化](coding/opt_zh.md)：理解常见瓶颈与调优方向
+**GlobalTensor** 是片外设备内存（`__gm__` 地址空间）的视图。GM 与 tile buffer 之间的所有数据搬运均通过显式的 `TLOAD`/`TSTORE` 或 DMA 操作完成。
 
-## 文档分类
+[了解更多 →](isa/programming-model/globaltensor-and-data-movement_zh.md)
 
-### 1. ISA 与指令参考
+### Auto vs Manual 模式
 
-- [虚拟 ISA 手册入口](PTO-Virtual-ISA-Manual_zh.md)：PTO ISA 手册总入口
-- [ISA 总览](PTOISA_zh.md)：介绍 PTO ISA 的背景、目标与整体结构
-- [PTO 指令列表](isa/README_zh.md)：按类别组织的 PTO 标准操作索引
-- [通用约定](isa/conventions_zh.md)：命名、约束、使用规范等通用规则
+| 模式 | 资源绑定 | 同步 | 数据搬运 | 管理方 |
+|------|---------|------|----------|--------|
+| **Auto** | 编译器插入 `TASSIGN` | 编译器插入 `TSYNC` | 编译器插入 `TLOAD`/`TSTORE` | 编译器 |
+| **Manual** | 作者显式写 `TASSIGN` | 作者显式写 `TSYNC` | 作者显式写 `TLOAD`/`TSTORE` | 你 |
 
-### 2. PTO 汇编与表示形式
+[Auto vs Manual →](isa/programming-model/auto-vs-manual_zh.md)
 
-- [PTO 汇编索引](assembly/README_zh.md)：PTO-AS 文档入口
-- [PTO 汇编语法（PTO-AS）](assembly/PTO-AS_zh.md)：PTO 汇编语法与规范说明
+### 目标 Profile
 
-### 3. 编程模型与开发文档
+PTO ISA 由具体的**目标 Profile** 实例化，为特定后端限定可接受的子集。
 
-- [开发文档索引](coding/README_zh.md)：扩展 PTO Tile Lib 的开发文档入口
-- [Tile 编程模型](coding/Tile_zh.md)：介绍 tile shape、tile mask 与数据布局
-- [事件与同步](coding/Event_zh.md)：介绍事件记录、等待与同步机制
-- [性能优化](coding/opt_zh.md)：介绍性能分析与调优建议
+| 特性 | CPU 模拟器 | A2/A3 Profile | A5 Profile |
+|------|:---------:|:-------------:|:----------:|
+| Tile 指令（`pto.t*`） | 完整 | 完整 | 完整 |
+| Vector 指令（`pto.v*`） | 仿真 | 仿真 | 硬件完整支持 |
+| 矩阵乘法 / CUBE 运算 | 软件回退 | 硬件 | 硬件 |
+| 向量宽度（f32 / f16,bf16 / i8） | 可配置 | 64 / 128 / 256 | 64 / 128 / 256 |
+| FP8 类型（`f8e4m3`、`f8e5m2`） | — | — | 支持 |
+| 分形布局（NZ/ZN/FR/RN） | 仿真 | 仿真 | 完整 |
+| 分块级集体通信 | — | 支持 | 支持 |
 
-### 4. 入门、测试与文档构建
+---
 
-- [快速开始指南](getting-started_zh.md)：环境准备、CPU / NPU 运行说明
-- [测试说明](../tests/README_zh.md)：测试入口、测试脚本与常用命令
-- [文档构建说明](mkdocs/README_zh.md)：MkDocs 文档本地构建说明
+## 指令集导航地图
 
-### 5. 其他相关文档
+PTO 将其指令分为四个命名指令集。每个指令集有**契约页面**（共享规则）和**逐指令页面**（单条指令说明）。
 
-- [Machine 文档](machine/README_zh.md)：抽象机器模型与相关说明
+### Tile 指令集 — `pto.t*`
 
-## 目录结构
+```
+Tile 指令集
+├── 同步与配置             → tassign、tsync、tsettf32mode、tsetfmatrix、tset_img2col_*、tsubview、tget_scale_addr
+├── 逐元素 Tile-Tile       → tadd、tsub、tmul、tdiv、tmin、tmax、tcmp、tcvt、tsel、tlog、trecip、texp、tsqrt、trsqrt、trem、tfmod、tabs、tand、tor、txor、tnot、tneg、tprelu、taddc、tsubc、tshl、tshr
+├── Tile-标量与立即数       → tadds、tsubs、tmuls、tdivs、tminmaxs、tcmps、tsels、texpands、tfmods、trems、tands、tors、txors、tshls、tshrs、tlrelu、taddsc、tsubsc
+├── 归约与扩展             → trowsum、tcolsum、trowprod、tcolprod、tcolmax、tcolmin、trowmax、trowmin、tcolargmax、tcolargmin、trowargmax、trowargmin
+│                             → trowexpand、trowexpandadd、trowexpanddiv、trowexpandmul、trowexpandsub、trowexpandmax、trowexpandmin、trowexpandexpdif
+│                             → tcolexpand、tcolexpandadd、tcolexpanddiv、tcolexpandmul、tcolexpandsub、tcolexpandmax、tcolexpandmin、tcolexpandexpdif
+├── 内存与数据搬运         → tload、tprefetch、tstore、tstore_fp、mgather、mscatter
+├── 矩阵与矩阵-向量         → tgemv、tgemv_mx、tgemv_acc、tgemv_bias、tmatmul、tmatmul_mx、tmatmul_acc、tmatmul_bias
+├── 布局与重排             → tmov、tmov_fp、ttrans、textract、textract_fp、tinsert、tinsert_fp、timg2col、tfillpad、tfillpad_inplace、tfillpad_expand、treshape
+└── 非常规与复杂操作       → tprint、tmrgsort、tsort32、tgather、tgatherb、tscatter、tci、ttri、tpartadd、tpartmul、tpartmax、tpartmin、tquant
+```
 
-关键目录如下：
+[Tile 指令族契约 →](isa/instruction-families/tile-families_zh.md)
+
+### Vector 指令集 — `pto.v*`
+
+```
+Vector 指令集
+├── 向量加载存储             → vlds、vldas、vldus、vldx2、vsld、vsldb、vgather2、vgatherb、vgather2_bc
+│                             → vsts、vstx2、vsst、vsstb、vscatter、vsta、vstas、vstar、vstu、vstus、vstur
+├── 谓词与物化              → vbr、vdup
+├── 一元向量运算            → vabs、vneg、vexp、vln、vsqrt、vrsqrt、vrec、vrelu、vnot、vbcnt、vcls、vmov
+├── 二元向量运算            → vadd、vsub、vmul、vdiv、vmax、vmin、vand、vor、vxor、vshl、vshr、vaddc、vsubc
+├── 向量-标量运算           → vadds、vsubs、vmuls、vmaxs、vmins、vands、vors、vxors、vshls、vshrs、vlrelu、vaddcs、vsubcs
+├── 类型转换                → vci、vcvt、vtrc
+├── 归约指令                → vcadd、vcmax、vcmin、vcgadd、vcgmax、vcgmin、vcpadd
+├── 比较与选择              → vcmp、vcmps、vsel、vselr、vselrv2
+├── 数据重排                → vintlv、vdintlv、vslide、vshift、vsqz、vusqz、vperm、vpack、vsunpack、vzunpack、vintlvv2、vdintlvv2
+└── SFU 与 DSA             → vprelu、vexpdiff、vaddrelu、vsubrelu、vaxpy、vaddreluconv、vmulconv、vmull、vmula、vtranspose、vsort32、vbitsort、vmrgsort
+```
 
-```text
-├── isa/                        # PTO 指令参考与分类索引
-├── assembly/                   # PTO 汇编语法与 PTO-AS 规范
-├── coding/                     # 编程模型、开发与性能优化文档
-├── auto_mode/                  # Auto Mode 相关文档
-├── machine/                    # 抽象机器模型相关文档
-├── mkdocs/                     # 文档站点构建配置与脚本
-├── figures/                    # 文档中使用的图片与图示资源
-├── README*                     # 文档入口页
-├── PTOISA*                     # ISA 总览文档
-└── getting-started*            # 快速开始指南
+[Vector 指令族契约 →](isa/instruction-families/vector-families_zh.md)
+
+### 标量与控制指令集 — `pto.*`
+
+```
+标量与控制指令集
+├── 控制与配置              → nop、barrier、yield；tsettf32mode、tsethf32mode、tsetfmatrix
+├── 流水线同步             → set_flag、wait_flag、wait_flag_dev、pipe_barrier、mem_bar、get_buf、rls_buf
+│                             → set_cross_core、set_intra_block、wait_intra_core
+├── DMA 拷贝               → set_loop_size_outtoub、set_loop1/2_stride_outtoub
+│                             → set_loop_size_ubtoout、set_loop1/2_stride_ubtoout
+│                             → copy_gm_to_ubuf、copy_ubuf_to_gm、copy_ubuf_to_ubuf
+├── 谓词加载存储            → pld、plds、pldi、psts、pst、psti、pstu
+├── 谓词生成                → pset_b8/b16/b32、pge_b8/b16/b32、plt_b8/b16/b32
+│                             → pand、por、pxor、pnot、psel、ppack、punpack
+│                             → pdintlv_b8、pintlv_b16
+├── 共享标量算术            → 跨指令集共享的标量算术运算
+├── 共享结构化控制流        → 标量结构化控制流
+└── 微指令                  → BlockDim 查询、指针操作、向量作用域、对齐状态
+    微指令汇总：isa/vector/micro-instruction-summary.md
+```
+
+[标量指令族契约 →](isa/instruction-families/scalar-and-control-families_zh.md)
+
+### 通信指令集 — `pto.*`
+
+```
+通信指令集
+├── 集体操作                → tbroadcast、tget、tget_async、tput、tput_async
+│                             → tscatter、tgather、treduce、ttest、twait、tnotify
+└── 非 ISA 支撑操作          → talias、taxpy、tconcat、tdequant、tfree、thistogram
+                              → tpack、tpop、tpush、trandom
 ```
 
-## 相关入口
+[通信指令族契约 →](isa/instruction-families/other-families_zh.md)
+
+---
+
+## 编译流程
+
+### Flow A：高层编译（ptoas → C++ → bisheng → 二进制）
+
+高层前端发出 `.pto` 文本文件。`ptoas` 解析、验证并降级为调用 `pto-isa` C++ 库的 C++ 代码。后端编译器（bisheng）再生成最终二进制。
+
+**适用人群：** 编译器开发者、库作者、高层框架集成者。`.pto` 格式可移植、可缓存。
+
+### Flow B：直接汇编（ptoas → 二进制）
+
+`ptoas` 直接汇编为目标二进制，跳过 C++ 中间步骤。
+
+**适用人群：** 需要直接控制最终指令流的性能工程师，或将 `ptoas` 作为纯汇编器使用的工具链。
+
+[了解更多编译流程 →](isa/introduction/what-is-pto-visa_zh.md#two-compilation-flows)
+
+---
+
+## 关键参考
+
+| 参考资料 | 内容 |
+|----------|------|
+| **[PTO-AS 规范](assembly/PTO-AS_zh.md)** | `.pto` 文本文件的汇编语法与文法 |
+| **[Tile 编程模型](coding/Tile_zh.md)** | Tile shape、mask 与数据组织 |
+| **[事件与同步](coding/Event_zh.md)** | set/wait flag 与流水线同步 |
+| **[性能优化](coding/opt_zh.md)** | 瓶颈分析与调优指导 |
+| **[Auto Mode 概述](auto_mode/Auto_Mode_Overview_zh.md)** | 编译器驱动的资源管理与同步插入 |
+| **[微指令汇总](isa/vector/micro-instruction-summary.md)** | 标量微指令：BlockDim、指针操作、向量作用域 |
+| **[可移植性与目标 Profile](isa/reference/portability-and-target-profiles_zh.md)** | 各目标支持哪些特性 |
+| **[术语表](isa/reference/glossary_zh.md)** | 术语定义参考 |
+| **[规范来源](isa/reference/source-of-truth_zh.md)** | 哪些文件定义权威语义 |
+| **[构建文档](mkdocs/README_zh.md)** | 本地生成文档站点 |
+
+---
+
+## 参与贡献
+
+本文档源自 [github.com/PTO-ISA/pto-isa](https://github.com/PTO-ISA/pto-isa) 的权威 PTO ISA 规范。通过仓库 Issues 反馈问题，通过 Pull Request 提交更改。
+
+---
 
-- [根目录 README_zh](../README_zh.md)：项目总览、快速开始与仓库入口
-- [kernels 目录说明](../kernels/README_zh.md)：kernel 与算子实现入口
-- [include 目录说明](../include/README_zh.md)：头文件与接口说明
-- [tests 目录说明](../tests/README_zh.md)：测试与运行入口
+*PTO ISA 是昇腾软件栈的一部分。版权所有 © Huawei Technologies Co., Ltd.*
diff --git a/docs/assembly/README_zh.md b/docs/assembly/README_zh.md
index 492028ad..f8562a60 100644
--- a/docs/assembly/README_zh.md
+++ b/docs/assembly/README_zh.md
@@ -2,74 +2,56 @@
 
 这里是 PTO AS 文档的主入口页，用于帮助读者按主题快速定位汇编相关文档，而不是逐个文件查找。
 
-PTO AS 文档主要覆盖以下几类内容：
+## 按任务选择
 
-- PTO-AS 语法、文法与文本表示形式
-- ISA 级 tile 操作与辅助 AS 构造
-- 从 MLIR 复用的标量算术与控制流操作
-- 汇编相关约定与配套参考资料
-
-## 建议阅读路径
-
-如果您第一次接触 PTO-AS，建议按以下顺序阅读：
-
-1. [PTO-AS 规范](PTO-AS_zh.md)：先理解文本格式、语法与 directives
-2. [PTO AS 操作参考](README_zh.md)：建立对操作分类及链接入口的整体认识
-3. [PTO-AS 约定](conventions_zh.md)：理解命名与文档编写约定
-4. 各类操作文档：按任务需要继续阅读对应分类页面
+| 你的需求 | 从这里开始 |
+|----------|----------|
+| 理解 PTO-AS 语法与文法 | [PTO-AS 规范](PTO-AS_zh.md) |
+| 了解操作分类与链接入口 | [PTO AS 操作参考](README_zh.md) |
+| 理解命名与文档编写约定 | [PTO-AS 约定](conventions_zh.md) |
+| 按类别查找操作 | 见下方文档分类 |
 
 ## 文档分类
 
 ### 1. PTO-AS 语法与核心规范
 
-- [PTO-AS 规范](PTO-AS_zh.md)：文本格式、SSA 风格命名、directives 与文法概览
-- [PTO-AS 约定](conventions_zh.md)：汇编语法约定与相关文档规则
-- `PTO-AS.bnf`：PTO-AS 的 BNF 形式文法定义
+| 文档 | 内容 |
+|------|------|
+| [PTO-AS 规范](PTO-AS_zh.md) | 文本格式、SSA 风格命名、directives 与文法概览 |
+| [PTO-AS 约定](conventions_zh.md) | 汇编语法约定与相关文档规则 |
+| `PTO-AS.bnf` | PTO-AS 的 BNF 形式文法定义 |
 
 ### 2. PTO Tile 操作分类
 
-- [逐元素操作](elementwise-ops_zh.md)：tile-tile 逐元素操作
-- [Tile-标量操作](tile-scalar-ops_zh.md)：tile 与标量之间的算术、比较与激活操作
-- [轴归约和扩展](axis-ops_zh.md)：行/列归约与广播式扩展操作
-- [内存操作](memory-ops_zh.md)：GM 与 tile 之间的数据搬运操作
-- [矩阵乘法](matrix-ops_zh.md)：GEMM 与 GEMV 相关操作
-- [数据移动和布局](data-movement-ops_zh.md)：提取、插入、转置、reshape 与 padding 操作
-- [复杂操作](complex-ops_zh.md)：排序、gather/scatter、随机数、量化与工具类操作
-- [手动资源绑定](manual-binding-ops_zh.md)：赋值与硬件/资源配置类操作
+| 文档 | 内容 |
+|------|------|
+| [逐元素操作](elementwise-ops_zh.md) | tile-tile 逐元素操作 |
+| [Tile-标量操作](tile-scalar-ops_zh.md) | tile 与标量之间的算术、比较与激活操作 |
+| [轴归约和扩展](axis-ops_zh.md) | 行/列归约与广播式扩展操作 |
+| [内存操作](memory-ops_zh.md) | GM 与 tile 之间的数据搬运操作 |
+| [矩阵乘法](matrix-ops_zh.md) | GEMM 与 GEMV 相关操作 |
+| [数据移动和布局](data-movement-ops_zh.md) | 提取、插入、转置、reshape 与 padding 操作 |
+| [复杂操作](complex-ops_zh.md) | 排序、gather/scatter、随机数、量化与工具类操作 |
+| [手动资源绑定](manual-binding-ops_zh.md) | 赋值与硬件/资源配置类操作 |
 
 ### 3. 辅助 AS 与 MLIR 派生操作
 
-- [辅助函数](nonisa-ops_zh.md)：张量视图、tile 分配、索引与同步辅助构造
-- [标量算术操作](scalar-arith-ops_zh.md)：来自 MLIR `arith` 的标量算术操作
-- [控制流操作](control-flow-ops_zh.md)：来自 MLIR `scf` 的结构化控制流操作
+| 文档 | 内容 |
+|------|------|
+| [辅助函数](nonisa-ops_zh.md) | 张量视图、tile 分配、索引与同步辅助构造 |
+| [标量算术操作](scalar-arith-ops_zh.md) | 来自 MLIR `arith` 的标量算术操作 |
+| [控制流操作](control-flow-ops_zh.md) | 来自 MLIR `scf` 的结构化控制流操作 |
 
 ### 4. 相关参考
 
-- [ISA 指令参考](../isa/README_zh.md)：逐条指令的规范语义
-- [docs 文档入口](../README_zh.md)：返回 PTO Tile Lib 文档总导航页
-
-## 目录结构
-
-关键条目如下：
-
-```text
-├── PTO-AS*                     # PTO-AS 语法与规范文档
-├── conventions*                # 汇编约定文档
-├── elementwise-ops*            # 逐元素操作参考
-├── tile-scalar-ops*            # Tile-标量操作参考
-├── axis-ops*                   # 轴归约与扩展参考
-├── memory-ops*                 # 内存操作参考
-├── matrix-ops*                 # 矩阵乘法参考
-├── data-movement-ops*          # 数据移动与布局参考
-├── complex-ops*                # 复杂操作参考
-├── manual-binding-ops*         # 手动资源绑定参考
-├── scalar-arith-ops*           # 标量算术参考
-├── control-flow-ops*           # 控制流参考
-└── nonisa-ops*                 # 辅助 AS 构造参考
-```
+| 文档 | 内容 |
+|------|------|
+| [ISA 指令参考](../isa/README_zh.md) | 逐条指令的规范语义 |
+| [docs 文档入口](../README_zh.md) | PTO Tile Library 文档总导航页 |
+| [Machine 文档](../machine/README_zh.md) | 抽象执行模型 |
 
 ## 相关入口
 
-- [ISA 指令参考](../isa/README_zh.md)：查看 PTO 指令的规范语义
-- [docs 文档入口](../README_zh.md)：返回文档总导航页
-- [Machine 文档](../machine/README_zh.md)：了解抽象执行模型
+- [ISA 指令参考](../isa/README_zh.md)
+- [docs 文档入口](../README_zh.md)
+- [Machine 文档](../machine/README_zh.md)
diff --git a/docs/auto_mode/README.md b/docs/auto_mode/README.md
index 6ff6140f..3ecfcb90 100644
--- a/docs/auto_mode/README.md
+++ b/docs/auto_mode/README.md
@@ -1,105 +1,51 @@
 # PTO AUTO Mode
 
-## What is PTO AUTO
-
-PTO AUTO is a programming mode for PTO that provides two major benefits:
-
-* Simpify developing efficient PTO code while providing kernel developers with the mechanisms that are necessary to implement their optimizations.
-* Compatibility across different generations of the Ascend architecture.
-
-More specifically, in PTO AUTO, the kernel developer does not need to explicitly specify tile memory addresses or synchronization between different pipes. Instead the PTO AUTO compiler automatically allocates optimal memory addressess for the tiles in different chip buffers. Moreover, the compiler automatically synchronizes the PTO tile operations in order to maximize parallelism among different pipes. Finally, the kernel developer does not need to be concerned with the minor differences between various generations of the Ascend architecture (particulary in terms of the way Cube and Vector computations are coordinated).
-
-Note: auto mode currently only supports the compiler `-O2` option.
-
-## Simple Example
-
-A simple example, elementwise multiplication demonstrates the key differences between the PTO AUTO and manual modes:
-
-### TMUL Manual Mode
-
-```cpp
-template <typename T, int kGRows_, int kGCols_, int kTRows_, int kTCols_>
-__global__ AICORE void runTMul(__gm__ T __out__ *out, __gm__ T __in__ *src0, __gm__ T __in__ *src1)
-{
-    using DynShapeDim5 = Shape<1, 1, 1, kGRows_, kGCols_>;
-    using DynStridDim5 = Stride<1, 1, 1, kGCols_, 1>;
-    using GlobalData = GlobalTensor<T, DynShapeDim5, DynStridDim5>;
-    using TileData = Tile<TileType::Vec, T, kTRows_, kTCols_, BLayout::RowMajor, -1, -1>;
-    TileData src0Tile(kGRows_, kGCols_);
-    TileData src1Tile(kGRows_, kGCols_);
-    TileData dstTile(kGRows_, kGCols_);
+This directory documents PTO AUTO Mode: what it is, the rules it imposes on kernel and library developers, and worked code examples.
 
-    TASSIGN(src0Tile, 0x0 + 0x400 * block_idx);
-    TASSIGN(src1Tile, 0x4000 + 0x400 * block_idx);
-    TASSIGN(dstTile, 0x8000 + 0x400 * block_idx);
+## Choose by Task
 
-    int offset = (block_idx / 4) * (64 * 16) + (block_idx % 4) * 16;
-    GlobalData src0Global(src0 + offset);
-    GlobalData src1Global(src1 + offset);
-    GlobalData dstGlobal(out + offset);
+| Your need | Start here |
+|-----------|-----------|
+| What is Auto Mode | [Auto Mode Overview](Auto_Mode_Overview.md) |
+| Kernel development rules and limitations | [Kernel Developer Rules](Kernel_Developer_Rules_And_Limitations.md) |
+| Library development rules and limitations | [Library Developer Rules](Library_Developer_Rules_And_Limitations.md) |
+| Code examples | [Examples](Examples.md) |
 
-    TLOAD(src0Tile, src0Global);
-    TLOAD(src1Tile, src1Global);
-    set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
-    wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
-    TMUL(dstTile, src0Tile, src1Tile);
-    set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
-    wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
-    TSTORE(dstGlobal, dstTile);
-
-    out = dstGlobal.data();
-}
-```
-
-### TMUL AUTO Mode
-
-```cpp
-template <typename T, int kGRows_, int kGCols_, int kTRows_, int kTCols_>
-__global__ AICORE void runTMul(__gm__ T __out__ *out, __gm__ T __in__ *src0, __gm__ T __in__ *src1)
-{
-    using DynShapeDim5 = Shape<1, 1, 1, kGRows_, kGCols_>;
-    using DynStridDim5 = Stride<1, 1, 1, kGCols_, 1>;
-    using GlobalData = GlobalTensor<T, DynShapeDim5, DynStridDim5>;
-    using TileData = Tile<TileType::Vec, T, kTRows_, kTCols_, BLayout::RowMajor, -1, -1>;
-
-    TileData src0Tile(kGRows_, kGCols_);
-    TileData src1Tile(kGRows_, kGCols_);
-    TileData dstTile(kGRows_, kGCols_);
-
-    int offset = (block_idx / 4) * (64 * 16) + (block_idx % 4) * 16;
-    GlobalData src0Global(src0 + offset);
-    GlobalData src1Global(src1 + offset);
-    GlobalData dstGlobal(out + offset);
-
-    TLOAD(src0Tile, src0Global);
-    TLOAD(src1Tile, src1Global);
-    TMUL(dstTile, src0Tile, src1Tile);
-    TSTORE(dstGlobal, dstTile);
-
-    out = dstGlobal.data();
-}
-```
-
-## PTO AUTO Compiler Features
+## What is PTO AUTO
 
-### Cross-Architecture Compatibility
+PTO AUTO is a programming mode that provides two major benefits:
 
-PTO AUTO Compiler ensures a single source PTO program can be compiled for different Ascend architecture generations without requiring any source-level modifications while maintaining performance.
+1. **Simplifies development** of efficient PTO code while still giving kernel developers the mechanisms they need to implement their own optimizations.
+2. **Ensures compatibility** across different generations of the Ascend architecture.
 
-### Automatic Synchronization
+In AUTO mode, kernel developers **do not need to** manually assign tile memory (`TASSIGN`) or manage synchronization between pipes (`set_flag`/`wait_flag`). The compiler handles these automatically while maintaining good performance.
 
-In manual mode, user would normally have to keep track of the asynchronous nature of the hardware by using PTO's [`event model`](../coding/Event.md) at precise code locations in order to ensure both functional correctness and high performance in execution. This might be tedious and error prone.
+## Auto vs Manual Mode Comparison
 
-Auto mode compilation will allow users to avoid having to use the event model to synchronize their code. The compiler will automatically determine the locations to insert synchronization under the hood - ensuring functional correctness and competitive performance.
+| Aspect | Auto Mode | Manual Mode |
+|--------|-----------|-------------|
+| Tile address allocation | Handled by the compiler | Explicit `TASSIGN` by the kernel author |
+| Synchronization management | Handled by the compiler | Explicit `set_flag`/`wait_flag` by the kernel author |
+| Data movement | Compiler-managed `TLOAD`/`TSTORE` | Explicit `TLOAD`/`TSTORE` by the kernel author |
+| Performance | Close to hand-tuned | Highest (hand-tuned) |
+| Development difficulty | Low | High |
+| Cross-generation compatibility | Best | Requires per-generation tuning |
 
-### Tile Memory Allocation
+> Note: Auto mode currently supports only the compiler's `-O2` option.
 
-In the default mode of PTO compilation, after instantiating `Tile` variables, we would need to complement them with a `TASSIGN` instruction to manually assign a dedicated buffer address that it operates on. However in auto mode, this is not required anymore. By simply instantiating the `Tile` variable the compiler will automatically allocate the buffer addresses under the hood for the user.
+## Document Index
 
-## PTO AUTO Documents
+| Document | Content |
+|----------|---------|
+| [Auto Mode Overview](Auto_Mode_Overview.md) | Core concepts, compiler features, comparison with Manual mode |
+| [Kernel Developer Rules](Kernel_Developer_Rules_And_Limitations.md) | Programming rules and limitations for kernel developers in Auto Mode |
+| [Library Developer Rules](Library_Developer_Rules_And_Limitations.md) | Programming rules and limitations for library developers in Auto Mode |
+| [Examples](Examples.md) | Auto Mode code examples |
 
-More detailed documentations of the PTO AUTO programming and compilations are organized into the following documents.
+## Related Docs
 
-* [PTO_AUTO_kernel_developer_rules_and_limitations](Kernel_Developer_Rules_And_Limitations_zh.md)
-* [PTO_AUTO_library_developer_rules_and_limitations](Library_Developer_Rules_And_Limitations.md)
-* [PTO AUTO Code Examples](Examples.md)
+| Document | Content |
+|----------|---------|
+| [docs/README.md](../README.md) | Documentation hub |
+| [docs/coding/README.md](../coding/README.md) | Programming model docs entry |
+| [docs/isa/README.md](../isa/README.md) | ISA instruction reference |
diff --git a/docs/auto_mode/README_zh.md b/docs/auto_mode/README_zh.md
index 474e4e90..81ad0f9e 100644
--- a/docs/auto_mode/README_zh.md
+++ b/docs/auto_mode/README_zh.md
@@ -1,106 +1,51 @@
-# PTO AUTO模式
+# PTO AUTO 文档
 
-## auto模式是什么
+本目录包含 PTO AUTO 模式的详细文档，帮助开发者理解并使用 Auto Mode 进行 PTO 编程。
 
-PTO AUTO是一种新的编程模式，主要提供以下两点优势：
+## 按任务选择
 
-* 降低kernel的开发难度的同时能使开发者实现必要的优化
-* 确保跨代兼容不同的昇腾硬件架构
+| 你的需求 | 从这里开始 |
+|----------|----------|
+| 什么是 Auto Mode | [Auto Mode 概述](Auto_Mode_Overview_zh.md) |
+| Kernel 开发规范与限制 | [Kernel Developer Rules](Kernel_Developer_Rules_And_Limitations_zh.md) |
+| Library 开发规范与限制 | [Library Developer Rules](Library_Developer_Rules_And_Limitations_zh.md) |
+| 代码示例 | [Examples](Examples_zh.md) |
 
-更具体来说，在AUTO模式下，kernel开发者不用手动为tile分配内存，也不用亲自手写不同pipe间的同步。作为替代，PTO AUTO编译器会帮助kernel开发者在不同buffer上分配内存。而且，编译器也会在PTO指令之间自动插入同步，最大化pipe之间的流水线并行。最后，kernel开发者也不用关心不同昇腾硬件架构之间的区别（尤其是关于CUBE和VECTOR交流和同步的机制）。
+## Auto Mode 是什么
 
-注意：auto模式目前仅支持编译器`-O2`选项。
+PTO AUTO 是一种新的编程模式，主要提供以下两点优势：
 
-## 简单示例
+1. **降低开发难度**的同时使开发者实现必要的优化。
+2. **确保跨代兼容**不同的昇腾硬件架构。
 
-一个简单的示例：逐元素的乘法。这展示了最关键的auto模式与manual模式的区别：
+在 AUTO 模式下，kernel 开发者**无需手动**为 tile 分配内存（`TASSIGN`），也**无需手动**管理不同 pipe 间的同步（`set_flag`/`wait_flag`）。编译器自动完成这些工作，同时保持良好的性能。
 
-### TMUL manual模式
+## Auto vs Manual 模式对比
 
-```cpp
-template <typename T, int kGRows_, int kGCols_, int kTRows_, int kTCols_>
-__global__ AICORE void runTMul(__gm__ T __out__ *out, __gm__ T __in__ *src0, __gm__ T __in__ *src1)
-{
-    using DynShapeDim5 = Shape<1, 1, 1, kGRows_, kGCols_>;
-    using DynStridDim5 = Stride<1, 1, 1, kGCols_, 1>;
-    using GlobalData = GlobalTensor<T, DynShapeDim5, DynStridDim5>;
-    using TileData = Tile<TileType::Vec, T, kTRows_, kTCols_, BLayout::RowMajor, -1, -1>;
-    TileData src0Tile(kGRows_, kGCols_);
-    TileData src1Tile(kGRows_, kGCols_);
-    TileData dstTile(kGRows_, kGCols_);
+| 方面 | Auto Mode | Manual Mode |
+|------|-----------|-------------|
+| Tile 地址分配 | 编译器自动 | 作者显式 `TASSIGN` |
+| 同步管理 | 编译器自动 | 作者显式 `set_flag`/`wait_flag` |
+| 数据搬运 | 编译器自动 `TLOAD`/`TSTORE` | 作者显式 `TLOAD`/`TSTORE` |
+| 性能 | 接近手工调优 | 最高性能 |
+| 开发难度 | 低 | 高 |
+| 跨代兼容性 | 最好 | 需要针对不同代际调整 |
 
-    TASSIGN(src0Tile, 0x0 + 0x400 * block_idx);
-    TASSIGN(src1Tile, 0x4000 + 0x400 * block_idx);
-    TASSIGN(dstTile, 0x8000 + 0x400 * block_idx);
+> 注意：auto 模式目前仅支持编译器 `-O2` 选项。
 
-    int offset = (block_idx / 4) * (64 * 16) + (block_idx % 4) * 16;
-    GlobalData src0Global(src0 + offset);
-    GlobalData src1Global(src1 + offset);
-    GlobalData dstGlobal(out + offset);
+## 文档列表
 
-    TLOAD(src0Tile, src0Global);
-    TLOAD(src1Tile, src1Global);
-    set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
-    wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
-    TMUL(dstTile, src0Tile, src1Tile);
-    set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
-    wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
-    TSTORE(dstGlobal, dstTile);
+| 文档 | 内容 |
+|------|------|
+| [Auto Mode 概述](Auto_Mode_Overview_zh.md) | Auto Mode 核心概念、编译器特性、与 Manual 模式对比 |
+| [Kernel Developer Rules](Kernel_Developer_Rules_And_Limitations_zh.md) | kernel 开发者在 Auto Mode 下的编程规范与限制 |
+| [Library Developer Rules](Library_Developer_Rules_And_Limitations_zh.md) | 库开发者在 Auto Mode 下的编程规范与限制 |
+| [Examples](Examples_zh.md) | Auto Mode 代码示例 |
 
-    out = dstGlobal.data();
-}
-```
+## 相关文档
 
-### TMUL AUTO模式
-
-```cpp
-template <typename T, int kGRows_, int kGCols_, int kTRows_, int kTCols_>
-__global__ AICORE void runTMul(__gm__ T __out__ *out, __gm__ T __in__ *src0, __gm__ T __in__ *src1)
-{
-    using DynShapeDim5 = Shape<1, 1, 1, kGRows_, kGCols_>;
-    using DynStridDim5 = Stride<1, 1, 1, kGCols_, 1>;
-    using GlobalData = GlobalTensor<T, DynShapeDim5, DynStridDim5>;
-    using TileData = Tile<TileType::Vec, T, kTRows_, kTCols_, BLayout::RowMajor, -1, -1>;
-
-    TileData src0Tile(kGRows_, kGCols_);
-    TileData src1Tile(kGRows_, kGCols_);
-    TileData dstTile(kGRows_, kGCols_);
-
-    int offset = (block_idx / 4) * (64 * 16) + (block_idx % 4) * 16;
-    GlobalData src0Global(src0 + offset);
-    GlobalData src1Global(src1 + offset);
-    GlobalData dstGlobal(out + offset);
-
-    TLOAD(src0Tile, src0Global);
-    TLOAD(src1Tile, src1Global);
-    TMUL(dstTile, src0Tile, src1Tile);
-    TSTORE(dstGlobal, dstTile);
-
-    out = dstGlobal.data();
-}
-```
-
-## PTO AUTO编译器特性
-
-### 不同架构之间的兼容性
-
-PTO AUTO编译器确保同一个PTO源码程序能在不同昇腾架构上编译和运行，同时确保性能。
-
-### 自动同步
-
-在manual模式下，程序员需要使用PTO的Event机制，精准地在必要的源码位置上手动调用同步指令来确保正确的结果和性能。这非常繁琐且容易出错。
-
-Auto模式编译器让程序员避免了这个麻烦。编译器会自动在需要的位置插入同步，确保正确的结果和较好的性能。
-
-### Tile内存分配
-
-在manual模式下，当用户声定义一个Tile变量后，需要显式调用`TASSIGN`来为这个Tile分配在指定内存空间里的内存地址。
-在auto模式下，用户不需要手动分配内存，只需要定义Tile变量即可；编译器会自动为所有Tile在正确的buffer上分配内存地址。
-
-## PTO AUTO文档
-
-更多PTO AUTO的详细文档如下：
-
-* [PTO_AUTO_kernel_developer_rules_and_limitations](Kernel_Developer_Rules_And_Limitations_zh.md)
-* [PTO_AUTO_library_developer_rules_and_limitations](Library_Developer_Rules_And_Limitations_zh.md)
-* [PTO AUTO Code Examples](Examples_zh.md)
+| 文档 | 内容 |
+|------|------|
+| [docs/README_zh.md](../README_zh.md) | 文档总入口 |
+| [docs/coding/README_zh.md](../coding/README_zh.md) | 编程模型文档入口 |
+| [docs/isa/README_zh.md](../isa/README_zh.md) | ISA 指令参考 |
diff --git a/docs/coding/README.md b/docs/coding/README.md
index 31d772af..fb51684c 100644
--- a/docs/coding/README.md
+++ b/docs/coding/README.md
@@ -1,20 +1,69 @@
 # docs/coding/
 
-This directory describes the **PTO Tile Lib programming model as seen from C++** (Tiles, GlobalTensor, events, scalar parameters) and provides guidance for extending the library.
+This directory describes the **PTO Tile Library programming model as seen from C++** (Tiles, GlobalTensor, events, scalar parameters) and provides guidance for extending the library.
 
-If you are looking for the *ISA reference*, start from [docs/isa/README.md](../isa/README.md).
+If you are looking for the **ISA reference**, start from [docs/isa/README.md](../isa/README.md).
 
-## Documents
+## Choose by Task
 
-- [High-level model and PTO-Auto/PTO-Manual](ProgrammingModel.md)
-- [Hands-on tutorial (write your first kernels)](tutorial.md)
-- [More tutorial examples](tutorials/README.md)
-- [Debugging and assertion lookup](debug.md)
-- [Tile abstraction and layout/valid-region rules](Tile.md)
-- [Global memory tensors (shape/stride/layout)](GlobalTensor.md)
-- [Events and synchronization model](Event.md)
-- [Scalar values, type mnemonics, and enums](Scalar.md)
+| Your goal | Start here |
+|-----------|-----------|
+| First time learning PTO | [Hands-on tutorial](tutorial.md) |
+| Understanding Tile abstraction and valid regions | [Tile programming model](Tile.md) |
+| Understanding global memory tensors | [GlobalTensor](GlobalTensor.md) |
+| Understanding events and synchronization | [Events and synchronization](Event.md) |
+| Understanding Auto Mode | [Auto Mode overview](../auto_mode/Auto_Mode_Overview.md) |
+| Understanding the compilation pipeline | [Compilation process](compilation-process.md) |
+| Finding performance bottlenecks | [Performance optimization](opt.md) |
+| Understanding operator fusion | [Operator fusion](operator-fusion.md) |
+| Debugging and error handling | [Debugging guide](debug.md), [Error codes](error-codes.md) |
+| Multi-core programming | [Multi-core programming](multi-core-programming.md) |
+| Memory optimization | [Memory optimization](memory-optimization.md) |
+| PTO vs other frameworks | [PTO comparison](pto-comparison.md) |
 
-## Related
+## Document Index
 
-- [PTO abstract machine model](../machine/README.md)
+### Foundations
+
+- [Hands-on tutorial (write your first kernel)](tutorial.md) — Step-by-step guide to your first PTO kernel
+- [More tutorial examples](tutorials/README.md) — Additional getting-started examples
+- [Tile abstraction and layout/valid-region rules](Tile.md) — Tile model in depth
+- [Global memory tensors (shape/stride/layout)](GlobalTensor.md) — GM tensor types
+- [Events and synchronization model](Event.md) — Event recording, waiting, and synchronization
+- [Scalar values, type mnemonics, and enums](Scalar.md) — Scalar parameters and type aliases
+- [Auto Mode overview](../auto_mode/Auto_Mode_Overview.md) — Compiler-driven resource and sync management
+
+### Build and Compilation
+
+- [Compilation process](compilation-process.md) — Full pipeline from source to binary
+- [CPU Simulator](cpu_sim.md) — Running PTO code on CPU
+
+### Debugging and Error Handling
+
+- [Debugging and assertion lookup](debug.md) — Debugging strategies
+- [Error codes](error-codes.md) — Error code reference
+
+### Advanced Topics
+
+- [Performance optimization](opt.md) — Performance analysis and tuning guidance
+- [Performance best practices](performance-best-practices.md) — Best practices and performance tips
+- [Operator fusion](operator-fusion.md) — Tensor fusion techniques
+- [Memory optimization](memory-optimization.md) — Memory optimization strategies
+- [Pipeline parallelism](pipeline-parallel.md) — Pipeline-parallel programming
+- [Multi-core programming](multi-core-programming.md) — Multi-core programming model
+- [Version compatibility](version-compatibility.md) — Compatibility and migration
+- [Framework integration](framework-integration.md) — Integration with PyTorch and other frameworks
+
+### Reference
+
+- [PTO comparison with other frameworks](pto-comparison.md) — PTO vs TVM, CUTLASS, etc.
+- [References](references.md) — Bibliography and further reading
+- [ConvTile](ConvTile.md) — Conv2D tile optimization
+
+## Related Docs
+
+| Document | Content |
+|----------|---------|
+| [PTO abstract machine model](../machine/README.md) | Abstract execution model |
+| [docs/README.md](../README.md) | Documentation hub |
+| [docs/isa/README.md](../isa/README.md) | ISA instruction reference |
diff --git a/docs/coding/README_zh.md b/docs/coding/README_zh.md
index 55bdc4c5..e0e32892 100644
--- a/docs/coding/README_zh.md
+++ b/docs/coding/README_zh.md
@@ -1,20 +1,69 @@
 # docs/coding/
 
-本目录描述 **从 C++ 角度看到的 PTO Tile Lib 编程模型**（Tiles、GlobalTensor、事件、标量参数），并提供扩展库的指导。
+本目录从 **C++ 编程视角**描述 PTO Tile Library 的编程模型（Tile、GlobalTensor、事件、标量参数），并提供扩展库的指导。
 
-如果你在寻找 *ISA 参考*，请从 [docs/isa/README_zh.md](../isa/README_zh.md) 开始。
+如果你在寻找 **ISA 参考**，请从 [docs/isa/README_zh.md](../isa/README_zh.md) 开始。
 
-## 文档列表
+## 按任务选择
 
-- [高层模型与 PTO-Auto / PTO-Manual](ProgrammingModel_zh.md)
-- [上手教程（编写第一个 kernel）](tutorial_zh.md)
-- [更多教程示例](tutorials/README_zh.md)
-- [调试与断言查找](debug_zh.md)
-- [Tile 抽象与布局/有效区域规则](Tile_zh.md)
-- [全局内存张量（shape/stride/layout）](GlobalTensor_zh.md)
-- [事件与同步模型](Event_zh.md)
-- [标量值、类型助记符与枚举](Scalar_zh.md)
+| 你的目标 | 从这里开始 |
+|----------|----------|
+| 学习 PTO 编程（第一次） | [上手教程](tutorial_zh.md) |
+| 理解 Tile 抽象与有效区域 | [Tile 编程模型](Tile_zh.md) |
+| 理解全局内存张量 | [GlobalTensor](GlobalTensor_zh.md) |
+| 理解事件与同步 | [事件与同步](Event_zh.md) |
+| 理解 Auto Mode | [Auto Mode 概述](../auto_mode/Auto_Mode_Overview_zh.md) |
+| 理解编译流程 | [编译流程](compilation-process_zh.md) |
+| 定位性能瓶颈 | [性能优化](opt_zh.md) |
+| 理解算子融合 | [算子融合](operator-fusion_zh.md) |
+| 调试与错误处理 | [调试指南](debug_zh.md)、[错误码](error-codes_zh.md) |
+| 多核编程 | [多核编程](multi-core-programming_zh.md) |
+| 内存优化 | [内存优化](memory-optimization_zh.md) |
+| PTO 与其他框架对比 | [PTO 对比](pto-comparison_zh.md) |
+
+## 文档索引
+
+### 基础
+
+- [上手教程（编写第一个 kernel）](tutorial_zh.md) — 一步一步写出你的第一个 PTO kernel
+- [更多教程示例](tutorials/README_zh.md) — 更多上手示例
+- [Tile 编程模型](Tile_zh.md) — Tile 抽象与布局/有效区域规则
+- [GlobalTensor](GlobalTensor_zh.md) — 全局内存张量（shape/stride/layout）
+- [事件与同步](Event_zh.md) — 事件与同步模型
+- [标量值、类型与枚举](Scalar_zh.md) — 标量参数、类型助记符与枚举
+- [Auto Mode 概述](../auto_mode/Auto_Mode_Overview_zh.md) — 编译器自动管理资源与同步
+
+### 编译与构建
+
+- [编译流程](compilation-process_zh.md) — PTO 程序从源码到二进制的完整流程
+- [CPU Simulator](cpu_sim.md) — 如何在 CPU 上运行 PTO 代码
+
+### 调试与错误处理
+
+- [调试指南](debug_zh.md) — 调试与断言查找
+- [错误码](error-codes_zh.md) — 错误码说明
+
+### 高级话题
+
+- [性能优化](opt_zh.md) — 性能分析与调优建议
+- [性能最佳实践](performance-best-practices_zh.md) — 最佳实践与性能要点
+- [算子融合](operator-fusion_zh.md) — 张量融合技术
+- [内存优化](memory-optimization_zh.md) — 内存优化策略
+- [流水线并行](pipeline-parallel_zh.md) — 流水线并行编程
+- [多核编程](multi-core-programming_zh.md) — 多核编程模型
+- [版本兼容性](version-compatibility_zh.md) — 版本兼容性与迁移
+- [框架集成](framework-integration_zh.md) — 与 PyTorch 等框架集成
+
+### 参考
+
+- [PTO 对比其他框架](pto-comparison_zh.md) — PTO 与 TVM、CUTLASS 等的对比
+- [参考资料](references_zh.md) — 参考资料汇总
+- [ConvTile](ConvTile.md) — Conv2D tile 优化
 
 ## 相关文档
 
-- [PTO 抽象机器模型](../machine/README_zh.md)
+| 文档 | 内容 |
+|------|------|
+| [PTO 抽象机器模型](../machine/README_zh.md) | 抽象执行模型 |
+| [docs/README_zh.md](../README_zh.md) | 文档总入口 |
+| [docs/isa/README_zh.md](../isa/README_zh.md) | ISA 指令参考 |
diff --git a/docs/coding/tutorials/README_zh.md b/docs/coding/tutorials/README_zh.md
index 83511595..aa2d919e 100644
--- a/docs/coding/tutorials/README_zh.md
+++ b/docs/coding/tutorials/README_zh.md
@@ -1,9 +1,27 @@
-# PTO 教程（更多示例）
+# PTO 教程（更多示例）
 
-本目录收集更长、更偏实战的示例讲解，用于补充 `docs/coding/tutorial_zh.md`。
+本目录收集更长、更偏实战的示例讲解，用于补充 `docs/coding/tutorial_zh.md`。
 
-## 内容
+## 按任务选择
 
-- 向量加法：分块、边界 mask、以及 ping-pong（双缓冲）概念结构：`docs/coding/tutorials/vec-add_zh.md`
-- 行 softmax 模式（attention 的基础组件）：`docs/coding/tutorials/row-softmax_zh.md`
-- GEMM 模式与常见 tile 类型/布局：`docs/coding/tutorials/gemm_zh.md`
+| 示例 | 说明 | 难度 |
+|------|------|------|
+| [向量加法 + ping-pong](./vec-add_zh.md) | 向量加法、分块、边界 mask、双缓冲流水线 | 入门 |
+| [行 softmax](./row-softmax_zh.md) | attention 的基础组件，行级归一化模式 | 进阶 |
+| [GEMM](./gemm_zh.md) | 矩阵乘模式与常见 tile 类型 / 布局 | 进阶 |
+
+## 文档
+
+| 文档 | 说明 |
+|------|------|
+| [向量加法 + ping-pong](./vec-add_zh.md) | 向量加法：分块、边界 mask、以及 ping-pong（双缓冲）概念结构 |
+| [行 softmax](./row-softmax_zh.md) | 行 softmax 模式（attention 的基础组件） |
+| [GEMM](./gemm_zh.md) | GEMM 模式与常见 tile 类型/布局 |
+
+## 相关文档
+
+| 文档 | 内容 |
+|------|------|
+| [docs/coding/tutorial_zh.md](../tutorial_zh.md) | 上手指南 |
+| [docs/coding/Tile_zh.md](../Tile_zh.md) | Tile 编程模型 |
+| [docs/coding/opt_zh.md](../opt_zh.md) | 性能优化 |
diff --git a/docs/isa/README.md b/docs/isa/README.md
index 8b436c46..f751b4c1 100644
--- a/docs/isa/README.md
+++ b/docs/isa/README.md
@@ -1,98 +1,54 @@
-<p align="center">
-  <img src="../figures/pto_logo.svg" alt="PTO Tile Lib" width="180" />
-</p>
-
-# PTO ISA Manual And Reference
-
-This directory is the canonical PTO ISA tree. It combines the architecture manual, the instruction set guides, the instruction set contracts, and the exact instruction-reference groupings in one place.
-
-## Textual Assembly Inside PTO ISA
-
-This tree is the canonical PTO ISA manual. Textual assembly spelling belongs to the PTO ISA syntax instruction set, not to a second parallel architecture manual.
-
-- PTO ISA defines architecture-visible semantics, legality, state, ordering, target-profile boundaries, and the visible behavior of `pto.t*`, `pto.v*`, `pto.*`, and other operations.
-- PTO-AS is the assembler-facing spelling used to write those operations and operands. It is part of how PTO ISA is expressed, not a separate ISA with different semantics.
-
-If the question is "what does this legal PTO program mean across CPU, A2/A3, and A5?", stay in this tree. If the question is "what is the operand shape or textual spelling of this operation?", use the syntax-and-operands pages in this same tree.
-
-## Start Here
-
-## Axis Reduce / Expand
-- [TROWSUM](TROWSUM.md) - Reduce each row by summing across columns.
-- [TROWPROD](TROWPROD.md) - Reduce each row by multiplying across columns.
-- [TCOLSUM](TCOLSUM.md) - Reduce each column by summing across rows.
-- [TCOLPROD](TCOLPROD.md) - Reduce each column by multiplying across rows.
-- [TCOLMAX](TCOLMAX.md) - Reduce each column by taking the maximum across rows.
-- [TROWMAX](TROWMAX.md) - Reduce each row by taking the maximum across columns.
-- [TROWMIN](TROWMIN.md) - Reduce each row by taking the minimum across columns.
-- [TROWARGMAX](TROWARGMAX.md) - Get the column index of the maximum element for each row.
-- [TROWARGMIN](TROWARGMIN.md) - Get the column index of the minimum element for each row.
-- [TCOLARGMAX](TCOLARGMAX.md) - Get the row index of the maximum element for each column.
-- [TCOLARGMIN](TCOLARGMIN.md) - Get the row index of the minimum element for each column.
-- [TROWEXPAND](TROWEXPAND.md) - Broadcast the first element of each source row across the destination row.
-- [TROWEXPANDDIV](TROWEXPANDDIV.md) - Row-wise broadcast divide: divide each row of `src0` by a per-row scalar vector `src1`.
-- [TROWEXPANDMUL](TROWEXPANDMUL.md) - Row-wise broadcast multiply: multiply each row of `src0` by a per-row scalar vector `src1`.
-- [TROWEXPANDSUB](TROWEXPANDSUB.md) - Row-wise broadcast subtract: subtract a per-row scalar vector `src1` from each row of `src0`.
-- [TROWEXPANDADD](TROWEXPANDADD.md) - Row-wise broadcast add: add a per-row scalar vector.
-- [TROWEXPANDMAX](TROWEXPANDMAX.md) - Row-wise broadcast max with a per-row scalar vector.
-- [TROWEXPANDMIN](TROWEXPANDMIN.md) - Row-wise broadcast min with a per-row scalar vector.
-- [TROWEXPANDEXPDIF](TROWEXPANDEXPDIF.md) - Row-wise exp-diff: compute exp(src0 - src1) with per-row scalars.
-- [TCOLMIN](TCOLMIN.md) - Reduce each column by taking the minimum across rows.
-- [TCOLEXPAND](TCOLEXPAND.md) - Broadcast the first element of each source column across the destination column.
-- [TCOLEXPANDDIV](TCOLEXPANDDIV.md) - Column-wise broadcast divide: divide each column by a per-column scalar vector.
-- [TCOLEXPANDMUL](TCOLEXPANDMUL.md) - Column-wise broadcast multiply: multiply each column by a per-column scalar vector.
-- [TCOLEXPANDADD](TCOLEXPANDADD.md) - Column-wise broadcast add with per-column scalar vector.
-- [TCOLEXPANDMAX](TCOLEXPANDMAX.md) - Column-wise broadcast max with per-column scalar vector.
-- [TCOLEXPANDMIN](TCOLEXPANDMIN.md) - Column-wise broadcast min with per-column scalar vector.
-- [TCOLEXPANDSUB](TCOLEXPANDSUB.md) - Column-wise broadcast subtract: subtract a per-column scalar vector from each column.
-- [TCOLEXPANDEXPDIF](TCOLEXPANDEXPDIF.md) - Column-wise exp-diff: compute exp(src0 - src1) with per-column scalars.
-
-## Model Layers
-
-Reading order matches the manual chapter map: programming and machine models, then syntax and state, then memory, then opcode reference.
-
-- [Programming model](programming-model/tiles-and-valid-regions.md)
-- [Machine model](machine-model/execution-agents.md)
-- [Syntax and operands](syntax-and-operands/assembly-model.md)
-- [Type system](state-and-types/type-system.md)
-- [Location intent and legality](state-and-types/location-intent-and-legality.md)
-- [Memory model](memory-model/consistency-baseline.md)
-
-## Complex
-- [TPRINT](TPRINT.md) - Debug/print elements from a tile (implementation-defined).
-- [TMRGSORT](TMRGSORT.md) - Merge sort for multiple sorted lists (implementation-defined element format and layout).
-- [TSORT32](TSORT32.md) - Sort each 32-element block of `src` together with the corresponding indices from `idx`, and write the sorted value-index pairs into `dst`.
-- [TGATHER](TGATHER.md) - Gather/select elements using either an index tile or a compile-time mask pattern.
-- [TCI](TCI.md) - Generate a contiguous integer sequence into a destination tile.
-- [TTRI](TTRI.md) - Generate a triangular (lower/upper) mask tile.
-- [TRANDOM](TRANDOM.md) - Generates random numbers in the destination tile using a counter-based cipher algorithm.
-- [TPARTADD](TPARTADD.md) - Partial elementwise add with implementation-defined handling of mismatched valid regions.
-- [TPARTMUL](TPARTMUL.md) - Partial elementwise multiply with implementation-defined handling of mismatched valid regions.
-- [TPARTMAX](TPARTMAX.md) - Partial elementwise max with implementation-defined handling of mismatched valid regions.
-- [TPARTMIN](TPARTMIN.md) - Partial elementwise min with implementation-defined handling of mismatched valid regions.
-- [TGATHERB](TGATHERB.md) - Gather elements using byte offsets.
-- [TSCATTER](TSCATTER.md) - Scatter rows of a source tile into a destination tile using per-element row indices.
-- [TQUANT](TQUANT.md) - Quantize a tile (e.g. FP32 to FP8) producing exponent/scaling/max outputs.
-
-- [Instruction overview](instruction-surfaces/README.md)
-- [Instruction set contracts](instruction-families/README.md)
-- [Format of instruction descriptions](reference/format-of-instruction-descriptions.md)
-- [Tile instruction reference](tile/README.md)
-- [Vector instruction reference](vector/README.md)
-- [Scalar and control reference](scalar/README.md)
-- [Other and communication reference](other/README.md)
-- [Common conventions](conventions.md)
-
-## Supporting Reference
-
-- [Reference notes](reference/README.md) (glossary, diagnostics, portability, source of truth)
-
-## Compatibility Wrappers
-
-The grouped instruction set trees under `tile/`, `vector/`, `scalar/`, and `other/` are the canonical PTO ISA paths.
-
-Some older root-level tile pages such as `TADD.md`, `TLOAD.md`, and `TMATMUL.md` now remain only as compatibility wrappers so existing links do not break immediately. New PTO ISA documentation should link to the grouped instruction set paths, especially the standalone per-op pages under:
-
-- `docs/isa/tile/ops/`
-- `docs/isa/vector/ops/`
-- `docs/isa/scalar/ops/`
+# docs/isa/
+
+This directory is the top-level index for the PTO ISA manual. It provides guided navigation through all instruction references and conceptual documentation.
+
+## How This Manual Is Organized
+
+The manual is organized into five logical layers:
+
+| Layer | Contents | Audience |
+|-------|----------|----------|
+| **1. Foundations** | Introduction, programming model, machine model | Everyone — start here |
+| **2. Syntax and Semantics** | Assembly model, operands, types, memory model | Compiler developers, kernel authors |
+| **3. Instruction Surface** | Instruction-set overview and contracts | All users |
+| **4. Reference Manual** | Tile, vector, scalar, and communication reference | Performance engineers, kernel authors |
+| **5. Appendices** | Format guidelines, diagnostics, glossary, portability | Everyone |
+
+## Key Entry Points
+
+| Document | Content |
+|----------|---------|
+| [What is PTO ISA?](./introduction/what-is-pto-visa.md) | Core concepts and where PTO fits in the software stack |
+| [Tiles and Valid Regions](./programming-model/tiles-and-valid-regions.md) | The tile abstraction — PTO's primary programming object |
+| [Execution Agents and Profiles](./machine-model/execution-agents.md) | Execution hierarchy, pipelines, target profiles |
+| [Instruction Surfaces Overview](./instruction-surfaces/README.md) | Map of all four instruction sets and when to use each |
+| [Per-Instruction Reference](./tile/README.md) | Complete catalog organized by category |
+| [Format of instruction descriptions](./reference/format-of-instruction-descriptions.md) | How to read each per-op page |
+
+## Four Instruction Sets
+
+| Instruction Set | Prefix | Operations | Reference |
+|----------------|--------|------------|-----------|
+| **Tile** | `pto.t*` | ~120 | [Tile reference](./tile/README.md) |
+| **Vector** | `pto.v*` | ~99 | [Vector reference](./vector/README.md) |
+| **Scalar & Control** | `pto.*` | ~60 | [Scalar reference](./scalar/README.md) |
+| **Communication** | `pto.*` | ~24 | [Communication reference](./other/README.md) |
+
+## By Task
+
+| What you're doing | Start here |
+|-------------------|------------|
+| Writing a matrix multiplication kernel | [Tile → Matrix ops](./tile/matrix-and-matrix-vector.md) |
+| Optimizing elementwise operations | [Tile → Elementwise ops](./tile/elementwise-tile-tile.md) |
+| Setting up data movement (GM ↔ tile) | [Tile memory ops](./tile/memory-and-data-movement.md) |
+| Hand-tuning vector kernels | [Vector instructions](./vector/README.md) |
+| Using per-lane masking and predicates | [Vector → Predicate ops](./vector/predicate-and-materialization.md) |
+| Implementing collective communication | [Communication instructions](./other/README.md) |
+| Understanding Auto vs Manual mode | [Auto vs Manual](./programming-model/auto-vs-manual.md) |
+| Checking target profile support | [Execution agents](./machine-model/execution-agents.md) |
+
+## See Also
+
+- [docs/README.md](../README.md) — Documentation hub
+- [docs/coding/README.md](../coding/README.md) — Programming model docs
+- [PTO-AS specification](../assembly/PTO-AS.md) — Assembly syntax and grammar
diff --git a/docs/isa/TALIAS_zh.md b/docs/isa/TALIAS_zh.md
index e7cde27a..df548b0c 100644
--- a/docs/isa/TALIAS_zh.md
+++ b/docs/isa/TALIAS_zh.md
@@ -1,38 +1,36 @@
-# TALIAS
+# pto.talias
 
-## 指令示意图
+`pto.talias` 属于[布局与重排指令](./tile/layout-and-rearrangement_zh.md)集。
 
-![TALIAS tile operation](../figures/isa/TALIAS.svg)
+## 概述
 
-## 简介
+`TALIAS` 创建一个与源 Tile 共享底层存储的别名视图。它不复制数据，只改变后续代码观察这块存储的逻辑方式；通过任一别名写入的数据都会对共享该存储的其他别名立即可见。
 
-`TALIAS` 创建一个与源 Tile 共享底层存储的别名视图。它不复制数据，只改变后续代码观察这块存储的逻辑方式。
+## 机制
 
-这类操作通常用于：
+### 数学语义
 
-- 在同一块 tile buffer 上建立新的 shape 或切片视图
-- 把同一份数据交给不同的后续操作，以不同逻辑边界读取
-- 避免为了得到子视图或重解释视图而额外搬运数据
-
-## 语义
+```math
+\mathrm{dst} \equiv \mathrm{src} \quad \text{其中 } \mathrm{storage}(\mathrm{dst}) = \mathrm{storage}(\mathrm{src})
+```
 
-`TALIAS` 的结果与源 Tile 指向同一块底层存储。通过任一别名写入的数据，都会对共享该存储的其他别名立即可见。
+别名视图共享同一底层 buffer，但可具有不同的 shape、stride 或 layout 解释方式。
 
-它本身不定义新的数值变换；后续行为仍由消费该别名的具体指令决定，并且默认只在目标 valid region 内具有架构意义。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS 形式见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.talias ...
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.talias ins(...) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -40,17 +38,61 @@ pto.talias ins(...) outs(%dst : !pto.tile_buf<...>)
 
 声明于 `include/pto/common/pto_instr.hpp`。
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | 源 Tile，提供底层存储 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 与源共享底层存储的别名视图 |
+
+## 副作用
+
+通过任一别名写入的数据，对共享该存储的其他别名立即可见。
+
 ## 约束
 
-- `TALIAS` 不会修复原始 Tile 的非法 shape、layout 或 location intent。
-- alias 后得到的 Tile 仍需满足后续消费者指令的合法性要求。
-- 若多个别名在没有额外同步的情况下并发读写，共享存储的可见顺序由后续使用这些别名的指令负责建立。
+- `TALIAS` 不会修复原始 Tile 的非法 shape、layout 或 location intent
+- alias 后得到的 Tile 仍需满足后续消费者指令的合法性要求
+- 若多个别名在没有额外同步的情况下并发读写，共享存储的可见顺序由后续使用这些别名的指令负责建立
+
+## 异常与非法情形
+
+- 若源 Tile 本身未绑定有效存储，行为未定义
+- 若别名视图被后续指令以不兼容的 shape 或 layout 使用，结果未定义
 
 ## 示例
 
-`TALIAS` 常与子视图、局部重排和“同存储多视图”模式配合使用。更具体的用法见相关的 tile 搬运与布局页。
+### C++
+
+`TALIAS` 常与子视图、局部重排和"同存储多视图"模式配合使用。
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_alias() {
+  using TileT = Tile<TileType::Vec, float, 32, 32>;
+  TileT src;
+  // 创建别名视图
+  auto alias_view = TALIAS(src);
+  // alias_view 与 src 共享底层存储
+}
+```
+
+### PTO-AS
+
+```mlir
+// 创建别名视图
+%dst = pto.talias %src : (!pto.tile<...>) -> !pto.tile<...>
+```
 
 ## 相关页面
 
-- [布局与重排指令集](./tile/layout-and-rearrangement_zh.md)
+- 指令集总览：[布局与重排指令](./tile/layout-and-rearrangement_zh.md)
 - [Tile 与有效区域](./programming-model/tiles-and-valid-regions_zh.md)
diff --git a/docs/isa/TASSIGN_zh.md b/docs/isa/TASSIGN_zh.md
index c5395285..552e07be 100644
--- a/docs/isa/TASSIGN_zh.md
+++ b/docs/isa/TASSIGN_zh.md
@@ -1,24 +1,18 @@
-﻿# TASSIGN
+# pto.tassign
 
-## 指令示意图
+`pto.tassign` 属于[同步与配置](./tile/sync-and-config_zh.md)指令集。
 
-![TASSIGN tile operation](../figures/isa/TASSIGN.svg)
+## 概述
 
-## 简介
+将 Tile 对象绑定到实现定义的片上地址（手动放置）。TASSIGN 通常在将 SSA tile 映射到物理存储时由缓冲化/降级引入。
 
-将 Tile 对象绑定到实现定义的片上地址（手动放置）。
+## 机制
 
-## 数学语义
+该指令将 Tile 或 GlobalTensor 绑定到指定地址。对于 Tile，它设置片上缓冲区地址；对于 GlobalTensor，它绑定主机内存指针。编译时地址重载执行静态边界检查以确保地址有效性。
 
-不适用。
+## 语法
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-`TASSIGN` 通常在将 SSA tile 映射到物理存储时由缓冲化/降级引入。
-
-同步形式：
+### PTO-AS
 
 ```text
 tassign %tile, %addr : !pto.tile<...>, index
@@ -26,13 +20,13 @@ tassign %tile, %addr : !pto.tile<...>, index
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tassign %tile, %addr : !pto.tile<...>, dtype
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
 ```
 
@@ -56,22 +50,21 @@ template <std::size_t Addr, typename T>
 PTO_INST void TASSIGN(T& obj);
 ```
 
-将 `obj` 绑定到片上地址 `Addr`。由于 `Addr` 是非类型模板参数，编译器通过 `static_assert`
-执行以下**编译时**检查：
+将 `obj` 绑定到片上地址 `Addr`。由于 `Addr` 是非类型模板参数，编译器通过 `static_assert` 执行编译时检查：
 
 | 检查项 | 条件 | 断言 ID | 错误信息 |
-|--------|------|---------|----------|
-| 内存空间存在 | `capacity > 0` | SA-0351 | 当前架构不支持该内存空间。 |
-| Tile 可放入内存 | `tile_size <= capacity` | SA-0352 | Tile 存储大小超出内存空间容量。 |
-| 地址未越界 | `Addr + tile_size <= capacity` | SA-0353 | addr + tile_size 超出内存空间容量（越界）。 |
-| 地址对齐 | `Addr % alignment == 0` | SA-0354 | addr 未按目标内存空间要求对齐。 |
+| --- | --- | --- | --- |
+| 内存空间存在 | `capacity > 0` | SA-0351 | 当前架构不支持该内存空间。 |
+| Tile 可放入内存 | `tile_size <= capacity` | SA-0352 | Tile 存储大小超出内存空间容量。 |
+| 地址未越界 | `Addr + tile_size <= capacity` | SA-0353 | addr + tile_size 超出内存空间容量（越界）。 |
+| 地址对齐 | `Addr % alignment == 0` | SA-0354 | addr 未按目标内存空间要求对齐。 |
 
 修复建议请参阅 `docs/coding/debug.md`（修复方案 `FIX-A12`）。
 
-内存空间、容量和对齐由 Tile 的 `TileType`（即 `Loc` 模板参数）自动确定：
+内存空间、容量和对齐由 Tile 的 `TileType` 自动确定：
 
 | TileType | 内存空间 | 容量 (A2A3) | 容量 (A5) | 容量 (Kirin9030) | 容量 (KirinX90) | 对齐 |
-|----------|----------|-------------|-----------|------------------|-----------------|------|
+| --- | --- | --- | --- | --- | --- | --- |
 | Vec | UB | 192 KB | 256 KB | 128 KB | 128 KB | 32 B |
 | Mat | L1 | 512 KB | 512 KB | 512 KB | 1024 KB | 32 B |
 | Left | L0A | 64 KB | 64 KB | 32 KB | 64 KB | 32 B |
@@ -86,15 +79,43 @@ PTO_INST void TASSIGN(T& obj);
 
 **注意：** 该重载仅适用于 `Tile` 和 `ConvTile` 类型。对于 `GlobalTensor`，请使用 `TASSIGN(obj, pointer)`（形式 1）。
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `obj` | 输入/输出 | Tile 或 GlobalTensor 对象 |
+| `addr` | 输入 | 片上地址（形式 1）或编译时常量地址（形式 2） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | void | 该指令为原地操作 |
+
+## 副作用
+
+该指令修改 Tile 的内部存储地址绑定。
+
 ## 约束
 
-- **实现检查**:
-    - 如果 `obj` 是 Tile：
-    - 在手动模式下（未定义 `__PTO_AUTO__` 时），`addr` 必须是整数类型，并被重新解释为 tile 的存储地址。
-    - 在自动模式下（定义了 `__PTO_AUTO__` 时），`TASSIGN(tile, addr)` 是空操作。
-    - 如果 `obj` 是 `GlobalTensor`：
-    - `addr` 必须是指针类型。
-    - 指向的元素类型必须匹配 `GlobalTensor::DType`。
+- 实现检查：
+    - 如果 `obj` 是 Tile：
+        - 在手动模式下（未定义 `__PTO_AUTO__` 时），`addr` 必须是整数类型，并被重新解释为 tile 的存储地址。
+        - 在自动模式下（定义了 `__PTO_AUTO__` 时），`TASSIGN(tile, addr)` 是空操作。
+    - 如果 `obj` 是 `GlobalTensor`：
+        - `addr` 必须是指针类型。
+        - 指向的元素类型必须匹配 `GlobalTensor::DType`。
+
+## 异常与非法情形
+
+- 形式 2 的编译时检查失败会触发 `static_assert`，断言 ID 为 SA-0351、SA-0352、SA-0353 或 SA-0354。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 运行时地址形式 | 支持 | 支持 | 支持 |
+| 编译时地址形式 | 支持 | 支持 | 支持 |
 
 ## 示例
 
@@ -173,29 +194,21 @@ void example_pingpong() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 pto.tassign %tile, %addr : !pto.tile<...>, dtype
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
 pto.tassign %tile, %addr : !pto.tile<...>, dtype
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 tassign %tile, %addr : !pto.tile<...>, index
 # AS Level 2 (DPS)
 pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
 ```
+
+## 相关页面
+
+- 指令集总览：[同步与配置](./tile/sync-and-config_zh.md)
diff --git a/docs/isa/TAXPY_zh.md b/docs/isa/TAXPY_zh.md
index abc19743..1ceacd46 100644
--- a/docs/isa/TAXPY_zh.md
+++ b/docs/isa/TAXPY_zh.md
@@ -1,41 +1,91 @@
-# TAXPY
+# pto.taxpy
 
-## 指令示意图
+`pto.taxpy` 属于[矩阵与矩阵向量指令](./tile/matrix-and-matrix-vector_zh.md)集。
 
-![TAXPY tile operation](../figures/isa/TAXPY.svg)
+## 概述
 
-## 简介
+AXPY 风格融合更新：将 Tile 乘以标量并累加到目标 Tile。TAXPY 计算 `dst = dst + alpha * src`，其中 alpha 为标量，src 为源 Tile。
 
-AXPY 风格融合更新：将 Tile 乘以标量并累加到目标 Tile。
+## 机制
 
-## 数学语义
+### 数学语义
 
-语义随指令而变化。 除非另有说明，行为都按目标 valid region 定义。
+对有效区域中的每个元素 `(i, j)`：
 
-## 汇编语法
+$$ \mathrm{dst}_{i,j} = \mathrm{dst}_{i,j} + \alpha \cdot \mathrm{src}_{i,j} $$
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+其中 $\alpha$ 为标量系数。行为按目标 valid region 定义。
+
+## 语法
+
+### PTO-AS
+
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
-%dst = pto.taxpy ...
+```mlir
+%dst = pto.taxpy %src, %alpha : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
-pto.taxpy ins(...) outs(%dst : !pto.tile_buf<...>)
+```mlir
+pto.taxpy ins(%src, %alpha : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`.
+声明于 `include/pto/common/pto_instr.hpp`。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | 源 Tile |
+| `alpha` | 标量 | 缩放系数 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile（累加器） | 原地更新：$\mathrm{dst} = \mathrm{dst} + \alpha \cdot \mathrm{src}$ |
+
+## 副作用
+
+`dst` 原地被修改。
 
 ## 约束
 
-数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准
+- `dst` 和 `src` 的 valid region 必须兼容
+- 标量 `alpha` 的类型必须与 tile 元素类型匹配或可隐式转换
+
+## 异常与非法情形
+
+- 若 `dst` 和 `src` 的 valid region 不匹配，行为未定义
+- 若 `alpha` 类型不兼容，编译失败
 
 ## 示例
 
-具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+### C++
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT dst, src;
+  float alpha = 2.5f;
+  TAXPY(dst, src, alpha);
+}
+```
+
+### PTO-AS
+
+```text
+# 自动模式
+%dst = pto.taxpy %src, %alpha : (!pto.tile<f32, 16, 16>, f32) -> !pto.tile<f32, 16, 16>
+```
diff --git a/docs/isa/TCMP_zh.md b/docs/isa/TCMP_zh.md
index fa239a17..15b6f60b 100644
--- a/docs/isa/TCMP_zh.md
+++ b/docs/isa/TCMP_zh.md
@@ -1,26 +1,24 @@
-﻿# TCMP
-
-## 指令示意图
+# pto.tcmp
 
 ![TCMP tile operation](../figures/isa/TCMP.svg)
 
-## 简介
+`pto.tcmp` 属于[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)指令集。
 
-比较两个 Tile 并写入一个打包的谓词掩码。
+## 概述
 
-## 数学语义
+比较两个 tile 并将结果写成打包谓词 tile。迭代域由目标 tile 的 valid region 决定。
 
-概念上，对于有效区域中的每个元素 `(i, j)`，定义一个谓词：
+## 机制
 
-$$ p_{i,j} = \left(\mathrm{src0}_{i,j}\ \mathrm{cmpMode}\ \mathrm{src1}_{i,j}\right) $$
+从语义上看，对目标 tile 的 valid region 中每个 `(i, j)`，先定义一个谓词：
 
-谓词掩码使用实现定义的打包布局存储在 `dst` 中。
+$$ p_{i,j} = \left(\mathrm{src0}_{i,j}\ \mathrm{cmpMode}\ \mathrm{src1}_{i,j}\right) $$
 
-## 汇编语法
+随后把这些谓词位按目标定义的 packed layout 写入 `dst`。也就是说，`dst` 不是"每个 lane 一个布尔数"的朴素 tile，而是某种目标定义的压缩谓词表示。程序不能假设谓词 tile 的具体编码。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
@@ -28,50 +26,87 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
-%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```mlir
+%dst = pto.tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
-pto.tcmp ins(%src0, %src1{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```mlir
+pto.tcmp ins(%src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile_buf<...>, !pto.tile_buf<...>)
+         outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp` 和 `include/pto/common/type.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCMP(TileDataDst &dst, TileDataSrc &src0, TileDataSrc &src1, CmpMode cmpMode, WaitEvents &... events);
+PTO_INST RecordEvent TCMP(TileDataDst &dst, TileDataSrc &src0, TileDataSrc &src1,
+                          CmpMode cmpMode, WaitEvents &... events);
 ```
 
+### 比较模式
+
+| 模式 | 含义 |
+| --- | --- |
+| `EQ` | 等于 |
+| `NE` | 不等于 |
+| `LT` | 小于 |
+| `LE` | 小于等于 |
+| `GT` | 大于 |
+| `GE` | 大于等于 |
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src0` | 左 tile | 在 `dst` valid region 上逐坐标参与比较 |
+| `%src1` | 右 tile | 在 `dst` valid region 上逐坐标参与比较 |
+| `%dst` | 谓词 tile | 保存打包后的比较结果 |
+| `cmpMode` | 比较谓词 | 选择 EQ / NE / LT / LE / GT / GE |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | 打包后的谓词结果 tile，具体编码由目标 profile 定义 |
+
+## 副作用
+
+除产生谓词 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - 输入类型必须是以下之一：`int32_t`、`half`、`float`。
-    - 输出类型必须是 `uint8_t`。
-    - `src0/src1/dst` tile 位置必须是 `TileType::Vec`。
-    - 静态有效边界：`TileDataSrc::ValidRow <= TileDataSrc::Rows` 且 `TileDataSrc::ValidCol <= TileDataSrc::Cols`。
-    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
-    - 注意：`src1` 的形状/有效性在此实现中不通过显式运行时断言进行验证。
-    - 对于 `TileDataSrc::DType == int32_t`，实现使用 `EQ` 比较路径，无论 `cmpMode` 如何。
-- **实现检查 (A5)**:
-    - 输入类型必须是以下之一：`uint32_t`、`int32_t`、`uint16_t`、`int16_t`、`uint8_t`、`int8_t`、`float`、`half`。
-    - 输出类型必须是 `uint32_t`。
-    - 已实现（参见 `include/pto/npu/a5/TCmp.hpp`）。
-    - A5 实现使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域，并将打包的谓词掩码写入 `dst`（目标定义的打包方式）。
-- **掩码编码**:
-    - 掩码 tile 被解释为目标定义布局中的打包谓词位。
+- 迭代域是 `dst.GetValidRow() × dst.GetValidCol()`。
+- `src0` 的 validRow / validCol 必须与 `dst` 一致。
+- `src1` 的 shape / valid 在某些实现里不会做完整运行时校验，因此域外读值属于 implementation-defined。
+- 程序不能假设谓词 tile 的具体编码。
+
+## 异常与非法情形
+
+- 假设谓词 tile 是"一位一元素"的普通展开布尔 tile，会导致行为不可预期。
+- 对 `dst` 使用不符合目标定义的谓词输出 dtype，会被 verifier 或后端拒绝。
+
+## Target-Profile 限制
+
+| 检查项 | A2/A3 | A5 |
+| --- | :---: | :---: |
+| 支持输入类型 | `int32_t`、`half`、`float` | `uint32_t`、`int32_t`、`uint16_t`、`int16_t`、`uint8_t`、`int8_t`、`float`、`half` |
+| 输出谓词 dtype | `uint8_t` | `uint32_t` |
+| tile 位置 | `TileType::Vec` | `TileType::Vec` |
+| 布局 | RowMajor | RowMajor |
+| `src0` valid == `dst` valid | Required | Required |
+| `src1` validity 校验 | Not fully verified | Not fully verified |
+
+对 A2A3，若输入类型是 `int32_t`，实现始终走 `EQ` 比较路径，与 `cmpMode` 的取值无关；A5 则支持完整比较模式。
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example_auto() {
@@ -83,11 +118,10 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example_manual() {
@@ -102,29 +136,25 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
+# 自动模式
+%dst = pto.tcmp %src0, %src1 {cmpMode = #pto.cmp<GT>} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 
-### 手动模式
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcmp %src0, %src1 {cmpMode = #pto.cmp<GT>} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
+# PTO 汇编形式
 %dst = tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
-pto.tcmp ins(%src0, %src1{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+pto.tcmp ins(%src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile_buf<...>, !pto.tile_buf<...>)
+         outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)
+- 规范页：[pto.tcmp](./tile/ops/elementwise-tile-tile/tcmp_zh.md)
diff --git a/docs/isa/TCOLARGMAX_zh.md b/docs/isa/TCOLARGMAX_zh.md
index 4d90318f..599eac89 100644
--- a/docs/isa/TCOLARGMAX_zh.md
+++ b/docs/isa/TCOLARGMAX_zh.md
@@ -1,103 +1,96 @@
-# TCOLARGMAX
+# pto.tcolargmax
 
-## 指令示意图
+`pto.tcolargmax` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TCOLARGMAX tile operation](../figures/isa/TCOLARGMAX.svg)
-
-## 简介
+## 概述
 
 获取每列最大值对应行索引。
 
-## 数学语义
+## 机制
 
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
 
 $$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmax}} \; \mathrm{src}_{i,j} $$
 
-## 汇编语法
+输出 tile 中每个元素为源 tile 对应列中最大值的行索引（0-based）。`tmp` 操作数用于存储中间结果（当前行索引、argmax 索引、当前最大值元素）。若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+A2A3 实现中 `tmp` 始终被使用；A5 实现中 `tmp` 保留但不使用。
 
-同步形式：
+## 语法
 
-```text
-%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+### PTO-AS
 
-### IR Level 1（SSA）
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
-```text
+### AS Level 1（SSA）
+
+```mlir
 %dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### IR Level 2（DPS）
+### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`:
+声明于 `include/pto/common/pto_instr.hpp`：
 
 ```cpp
 template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+PTO_INST RecordEvent TCOLARGMAX(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
 
-### 通用约束或检查
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile；支持的源元素类型因 target profile 而异（A2A3：`half`、`float`、`uint16_t`、`uint32_t`；A5 另支持 8 位、16 位、32 位整数类型） |
+| `%dst` | 目标 tile | 接收 argmax 索引结果，目标元素类型为 `uint32_t` 或 `int32_t` |
+| `%tmp` | 临时 tile | A2A3 必需（存储中间结果）；A5 保留但不使用 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
 
-- `dst` 和 `src` 必须为 `TileType::Vec`。
-- 由于已检查到的辅助检查仅要求 `SLayout::NoneBox`，因此 `src` 可使用 ND 或 DN 的非分形布局。
-- `dst` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- 支持的目标元素类型：`uint32_t`、`int32_t`。
-- 编译时检查：`TileDataIn::ValidCol == 1 || TileDataIn::ValidCol == -1`。
-- 运行时检查：
-    - `src.GetValidRow() != 0`
-    - `src.GetValidCol() != 0`
-    - `dst.GetValidRow() == 1`
-    - `src.GetValidCol() == dst.GetValidCol()`
+## 预期输出
 
-### A2A3 实现检查
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<1, C>` | 每列最大值的行索引（0-based） |
 
-- 支持的源元素类型：`half`、`float`、`uint16_t`、`uint32_t`。
-- `tmp` 的元素类型必须与 `src` 一致。
-- 在已检查到的 A2A3 实现路径中，`tmp` 用作索引跟踪和当前比较值的临时存储。
+## 副作用
 
-### A5 实现检查
+除产生目标 tile 外，没有额外架构副作用。
 
-- 支持的源元素宽度为 8 位、16 位或 32 位，因此已检查到的实现覆盖 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`float`。
-- 在已检查到的 A5 实现路径中，接口仍接收 `tmp`，但 `TCOLARGMAX_IMPL` 实际并不使用它。
+## 约束
 
-### A2A3 `tmp` 临时 Tile 相关说明
+- `dst` 和 `src` 必须为 `TileType::Vec`。
+- 由于辅助检查仅要求 `SLayout::NoneBox`，`src` 可使用 ND 或 DN 的非分形布局。
+- `dst` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
+- 目标元素类型：`uint32_t`、`int32_t`。
+- 编译时检查：`TileDataIn::ValidCol == 1 || TileDataIn::ValidCol == -1`。
+- 运行时检查：`src.GetValidRow() != 0`、`src.GetValidCol() != 0`、`dst.GetValidRow() == 1`、`src.GetValidCol() == dst.GetValidCol()`。
+- A2A3：`tmp` 元素类型必须与 `src` 一致，`tmp` 用作索引跟踪和当前比较值的临时存储。
+- A5：`tmp` 保留但不使用，A5 使用基于向量寄存器的计算方式。
 
-* A2A3 实现中 `tmp` **始终被使用**，作为中间结果的临时存储空间（当前行索引、argmax 索引、当前最大值元素）。
-* `tmp` Tile 的数据类型必须与 `src` 的数据类型一致。
-* `tmp` Tile 在单行内被划分为三个区域：
-  - 区域 0（`[0, tmpGapEles)`）：当前行索引计数器（每行递增）。
-  - 区域 1（`[tmpGapEles, 2 * tmpGapEles)`）：当前最大值元素，用于比较。
-  - 区域 2（`[2 * tmpGapEles, 3 * tmpGapEles)`）：argmax 索引结果（最终转换后写入 `dst`）。
-* `tmpGapEles` 的确定方式：
-  - 当 `srcValidCol >= elemPerRpt` 时：`tmpGapEles = elemPerRpt`。
-  - 当 `srcValidCol < elemPerRpt` 时：`tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`。
-* 当 `src` 较小时，可直接将 `tmp` Tile 大小设为与 `src` 相同；也可按以下公式根据 `src` 的 `validCol` 算出 `tmp` Tile 所需 stride：
+## 异常与非法情形
 
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
 
-### A5 `tmp` 临时 Tile 相关说明
+## Target-Profile 限制
 
-* A5 实现中 `tmp` 临时 Tile **不使用**。A5 使用基于向量寄存器的计算方式（`__VEC_SCOPE__`），不需要临时 Tile 存储。
-* `tmp` 在 C++ 内建接口签名中保留，仅为了与 A2A3 的 API 兼容。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 源 `float` | Simulated | Supported | Supported |
+| 源 `half` | Simulated | Supported | Supported |
+| 源 `uint16_t` / `uint32_t` | — | Supported | Supported |
+| 源 `int8_t` / `uint8_t` / `int16_t` / `int32_t` | — | — | Supported |
+| 目标 `uint32_t` / `int32_t` | Simulated | Supported | Supported |
+| `tmp` 临时 tile | — | Required | Unused (API compat) |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -115,7 +108,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -136,40 +129,23 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
+# AS Level 2 (DPS)
 pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
-</task_progress>
-- [x] Explore existing docs/isa for documentation style and format
-- [x] Read tcolargmax and tcolargmin A2A3 implementation in include/
-- [x] Read tcolargmax and tcolargmin A5 implementation in include/
-- [x] Read test cases for tcolargmax and tcolargmin
-- [x] Understand A2A3 vs A5 differences and tmp handling
-- [x] Write tcolargmax English documentation (docs/isa/TCOLARGMAX.md)
-- [x] Write tcolargmax Chinese documentation (docs/isa/TCOLARGMAX_zh.md)
-- [ ] Verify documentation completeness and accuracy
-</task_progress>
-</write_to_file>
+
+## 相关页面
+
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)
diff --git a/docs/isa/TCOLARGMIN_zh.md b/docs/isa/TCOLARGMIN_zh.md
index 612b5d79..258464e9 100644
--- a/docs/isa/TCOLARGMIN_zh.md
+++ b/docs/isa/TCOLARGMIN_zh.md
@@ -1,103 +1,96 @@
-# TCOLARGMIN
+# pto.tcolargmin
 
-## 指令示意图
+`pto.tcolargmin` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TCOLARGMIN tile operation](../figures/isa/TCOLARGMIN.svg)
-
-## 简介
+## 概述
 
 获取每列最小值对应行索引。
 
-## 数学语义
+## 机制
 
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
 
 $$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
 
-## 汇编语法
+输出 tile 中每个元素为源 tile 对应列中最小值的行索引（0-based）。`tmp` 操作数用于存储中间结果（当前行索引、argmin 索引、当前最小值元素）。若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+A2A3 实现中 `tmp` 始终被使用；A5 实现中 `tmp` 保留但不使用。
 
-同步形式：
+## 语法
 
-```text
-%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+### PTO-AS
 
-### IR Level 1（SSA）
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
-```text
+### AS Level 1（SSA）
+
+```mlir
 %dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### IR Level 2（DPS）
+### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`:
+声明于 `include/pto/common/pto_instr.hpp`：
 
 ```cpp
 template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+PTO_INST RecordEvent TCOLARGMIN(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
 
-### 通用约束或检查
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile；支持的源元素类型因 target profile 而异（A2A3：`half`、`float`、`uint16_t`、`uint32_t`；A5 另支持 8 位、16 位、32 位整数类型） |
+| `%dst` | 目标 tile | 接收 argmin 索引结果，目标元素类型为 `uint32_t` 或 `int32_t` |
+| `%tmp` | 临时 tile | A2A3 必需（存储中间结果）；A5 保留但不使用 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
 
-- `dst` 和 `src` 必须为 `TileType::Vec`。
-- 由于已检查到的辅助检查仅要求 `SLayout::NoneBox`，因此 `src` 可使用 ND 或 DN 的非分形布局。
-- `dst` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- 支持的目标元素类型：`uint32_t`、`int32_t`。
-- 编译时检查：`TileDataIn::ValidCol == 1 || TileDataIn::ValidCol == -1`。
-- 运行时检查：
-    - `src.GetValidRow() != 0`
-    - `src.GetValidCol() != 0`
-    - `dst.GetValidRow() == 1`
-    - `src.GetValidCol() == dst.GetValidCol()`
+## 预期输出
 
-### A2A3 实现检查
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<1, C>` | 每列最小值的行索引（0-based） |
 
-- 支持的源元素类型：`half`、`float`、`uint16_t`、`uint32_t`。
-- `tmp` 的元素类型必须与 `src` 一致。
-- 在已检查到的 A2A3 实现路径中，`tmp` 用作索引跟踪和当前比较值的临时存储。
+## 副作用
 
-### A5 实现检查
+除产生目标 tile 外，没有额外架构副作用。
 
-- 支持的源元素宽度为 8 位、16 位或 32 位，因此已检查到的实现覆盖 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`float`。
-- 在已检查到的 A5 实现路径中，接口仍接收 `tmp`，但 `TCOLARGMIN_IMPL` 实际并不使用它。
+## 约束
 
-### A2A3 `tmp` 临时 Tile 相关说明
+- `dst` 和 `src` 必须为 `TileType::Vec`。
+- 由于辅助检查仅要求 `SLayout::NoneBox`，`src` 可使用 ND 或 DN 的非分形布局。
+- `dst` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
+- 目标元素类型：`uint32_t`、`int32_t`。
+- 编译时检查：`TileDataIn::ValidCol == 1 || TileDataIn::ValidCol == -1`。
+- 运行时检查：`src.GetValidRow() != 0`、`src.GetValidCol() != 0`、`dst.GetValidRow() == 1`、`src.GetValidCol() == dst.GetValidCol()`。
+- A2A3：`tmp` 元素类型必须与 `src` 一致，`tmp` 用作索引跟踪和当前比较值的临时存储。
+- A5：`tmp` 保留但不使用，A5 使用基于向量寄存器的计算方式。
 
-* A2A3 实现中 `tmp` **始终被使用**，作为中间结果的临时存储空间（当前行索引、argmin 索引、当前最小值元素）。
-* `tmp` Tile 的数据类型必须与 `src` 的数据类型一致。
-* `tmp` Tile 在单行内被划分为三个区域：
-  - 区域 0（`[0, tmpGapEles)`）：当前行索引计数器（每行递增）。
-  - 区域 1（`[tmpGapEles, 2 * tmpGapEles)`）：当前最小值元素，用于比较。
-  - 区域 2（`[2 * tmpGapEles, 3 * tmpGapEles)`）：argmin 索引结果（最终转换后写入 `dst`）。
-* `tmpGapEles` 的确定方式：
-  - 当 `srcValidCol >= elemPerRpt` 时：`tmpGapEles = elemPerRpt`。
-  - 当 `srcValidCol < elemPerRpt` 时：`tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`。
-* 当 `src` 较小时，可直接将 `tmp` Tile 大小设为与 `src` 相同；也可按以下公式根据 `src` 的 `validCol` 算出 `tmp` Tile 所需 stride：
+## 异常与非法情形
 
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
 
-### A5 `tmp` 临时 Tile 相关说明
+## Target-Profile 限制
 
-* A5 实现中 `tmp` 临时 Tile **不使用**。A5 使用基于向量寄存器的计算方式（`__VEC_SCOPE__`），不需要临时 Tile 存储。
-* `tmp` 在 C++ 内建接口签名中保留，仅为了与 A2A3 的 API 兼容。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 源 `float` | Simulated | Supported | Supported |
+| 源 `half` | Simulated | Supported | Supported |
+| 源 `uint16_t` / `uint32_t` | — | Supported | Supported |
+| 源 `int8_t` / `uint8_t` / `int16_t` / `int32_t` | — | — | Supported |
+| 目标 `uint32_t` / `int32_t` | Simulated | Supported | Supported |
+| `tmp` 临时 tile | — | Required | Unused (API compat) |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -115,7 +108,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -136,34 +129,23 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
+# AS Level 2 (DPS)
 pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
-</task_progress>
-- [x] Write tcolargmin English documentation (docs/isa/TCOLARGMIN.md)
-- [x] Write tcolargmin Chinese documentation (docs/isa/TCOLARGMIN_zh.md)
-</task_progress>
-</write_to_file>
+
+## 相关页面
+
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)
diff --git a/docs/isa/TCOLMAX_zh.md b/docs/isa/TCOLMAX_zh.md
index daf6e76b..353b16d6 100644
--- a/docs/isa/TCOLMAX_zh.md
+++ b/docs/isa/TCOLMAX_zh.md
@@ -1,38 +1,34 @@
-﻿# TCOLMAX
+# pto.tcolmax
 
-## 指令示意图
+`pto.tcolmax` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TCOLMAX tile operation](../figures/isa/TCOLMAX.svg)
-
-## 简介
+## 概述
 
 通过取行间最大值来归约每一列。
 
-## 数学语义
+## 机制
 
 设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
 
 $$ \mathrm{dst}_{0,j} = \max_{0 \le i < R} \mathrm{src}_{i,j} $$
 
-## 汇编语法
+迭代域由 `src` 的 valid region 决定。若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
-```text
-%dst = tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tcolmax ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -45,28 +41,49 @@ template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
 PTO_INST RecordEvent TCOLMAX(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收按列取最大值结果 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<1, C>` | 每列的最大元素值 |
+
+## 副作用
 
-### 通用约束或检查
+除产生目标 tile 外，没有额外架构副作用。
+
+## 约束
 
 - `dst` 和 `src` 必须为 `TileType::Vec`。
 - `dst` 和 `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
 - `dst` 和 `src` 的元素类型必须一致。
-- 运行时检查：
-    - `src.GetValidCol() == dst.GetValidCol()`
-- 若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
+- 运行时检查：`src.GetValidCol() == dst.GetValidCol()`。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`half`、`float`、`int16_t`、`int32_t`。
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`half`、`float`、`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`bfloat16_t`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| `int16_t` / `int32_t` | — | Supported | Supported |
+| `int8_t` / `uint8_t` | — | — | Supported |
+| `uint16_t` / `uint32_t` | — | — | Supported |
+| `bfloat16_t` | — | — | Supported |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -82,7 +99,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -100,29 +117,23 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = tcolmax %src : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tcolmax ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)
diff --git a/docs/isa/TCOLMIN_zh.md b/docs/isa/TCOLMIN_zh.md
index f9089695..e0fa7a3a 100644
--- a/docs/isa/TCOLMIN_zh.md
+++ b/docs/isa/TCOLMIN_zh.md
@@ -1,38 +1,34 @@
-﻿# TCOLMIN
+# pto.tcolmin
 
-## 指令示意图
+`pto.tcolmin` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TCOLMIN tile operation](../figures/isa/TCOLMIN.svg)
-
-## 简介
+## 概述
 
 通过取行间最小值来归约每一列。
 
-## 数学语义
+## 机制
 
 设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
 
 $$ \mathrm{dst}_{0,j} = \min_{0 \le i < R} \mathrm{src}_{i,j} $$
 
-## 汇编语法
+迭代域由 `src` 的 valid region 决定。若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
-```text
-%dst = tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tcolmin ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -45,28 +41,49 @@ template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
 PTO_INST RecordEvent TCOLMIN(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收按列取最小值结果 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<1, C>` | 每列的最小元素值 |
+
+## 副作用
 
-### 通用约束或检查
+除产生目标 tile 外，没有额外架构副作用。
+
+## 约束
 
 - `dst` 和 `src` 必须为 `TileType::Vec`。
 - `dst` 和 `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
 - `dst` 和 `src` 的元素类型必须一致。
-- 运行时检查：
-    - `src.GetValidCol() == dst.GetValidCol()`
-- 若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
+- 运行时检查：`src.GetValidCol() == dst.GetValidCol()`。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`half`、`float`、`int16_t`、`int32_t`。
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`half`、`float`、`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`bfloat16_t`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| `int16_t` / `int32_t` | — | Supported | Supported |
+| `int8_t` / `uint8_t` | — | — | Supported |
+| `uint16_t` / `uint32_t` | — | — | Supported |
+| `bfloat16_t` | — | — | Supported |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -82,7 +99,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -100,29 +117,23 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = tcolmin %src : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tcolmin ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)
diff --git a/docs/isa/TCOLPROD_zh.md b/docs/isa/TCOLPROD_zh.md
index 9c706e65..b00b5bab 100644
--- a/docs/isa/TCOLPROD_zh.md
+++ b/docs/isa/TCOLPROD_zh.md
@@ -1,38 +1,34 @@
-﻿# TCOLPROD
+# pto.tcolprod
 
-## 指令示意图
+`pto.tcolprod` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TCOLPROD tile operation](../figures/isa/TCOLPROD.svg)
-
-## 简介
+## 概述
 
 通过跨行乘积来归约每一列。
 
-## 数学语义
+## 机制
 
 设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
 
 $$ \mathrm{dst}_{0,j} = \prod_{i=0}^{R-1} \mathrm{src}_{i,j} $$
 
-## 汇编语法
+迭代域由 `src` 的 valid region 决定。若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
-```text
-%dst = tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tcolprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -45,28 +41,48 @@ template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
 PTO_INST RecordEvent TCOLPROD(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收按列乘积结果 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<1, C>` | 每列的 `R` 个元素乘积 |
+
+## 副作用
 
-### 通用约束或检查
+除产生目标 tile 外，没有额外架构副作用。
+
+## 约束
 
 - `dst` 和 `src` 必须为 `TileType::Vec`。
 - `dst` 和 `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
 - `dst` 和 `src` 的元素类型必须一致。
-- 运行时检查：
-    - `src.GetValidCol() == dst.GetValidCol()`
-- 若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
+- 运行时检查：`src.GetValidCol() == dst.GetValidCol()`。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`half`、`float`、`int16_t`、`int32_t`。
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`half`、`float`、`bfloat16_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| `bfloat16_t` | — | — | Supported |
+| `int16_t` / `int32_t` | — | Supported | Supported |
+| `uint16_t` / `uint32_t` | — | — | Supported |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -82,7 +98,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -100,29 +116,23 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = tcolprod %src : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tcolprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)
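下面是一个脱离 PTO 运行时的标量参考模型，用来说明上文 `pto.tcolprod` 的按列乘积语义，以及有效区域为空时直接返回的行为。这只是一个示意草图：`ColProdRef` 及其参数都是本示例假设的名称，并非 PTO 库接口；示例假设数据按行主序存放，实际指令的计算顺序与精度以后端实现为准。

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 标量参考模型（示意）：src 按行主序存放，cols 为物理列数，
// valid_row / valid_col 对应 src.GetValidRow() / src.GetValidCol()。
// 返回值对应 dst 的第 0 行，长度为 valid_col。
template <typename T>
std::vector<T> ColProdRef(const std::vector<T> &src, std::size_t cols,
                          std::size_t valid_row, std::size_t valid_col) {
  // 有效区域为空时直接返回；这里用空向量表示没有写入任何结果（示例约定）。
  if (valid_row == 0 || valid_col == 0) {
    return {};
  }
  assert(valid_col <= cols);
  assert(src.size() >= (valid_row - 1) * cols + valid_col);

  std::vector<T> dst(valid_col, T{1});  // 以乘法单位元初始化
  for (std::size_t i = 0; i < valid_row; ++i) {
    for (std::size_t j = 0; j < valid_col; ++j) {
      dst[j] *= src[i * cols + j];  // dst(0, j) = prod_i src(i, j)
    }
  }
  return dst;
}
```

例如，对物理形状 2x4、有效区域 2x3 的输入调用 `ColProdRef(src, 4, 2, 3)`，只会归约前 3 列，第 4 列的填充数据不参与计算。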
diff --git a/docs/isa/TCOLSUM_zh.md b/docs/isa/TCOLSUM_zh.md
index d9dbce4c..58f464d5 100644
--- a/docs/isa/TCOLSUM_zh.md
+++ b/docs/isa/TCOLSUM_zh.md
@@ -1,42 +1,35 @@
-﻿# TCOLSUM
+﻿# pto.tcolsum
 
-## 指令示意图
+`pto.tcolsum` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TCOLSUM tile operation](../figures/isa/TCOLSUM.svg)
+## 概述
 
-## 简介
+通过对行求和来归约每一列，`isBinary` 选择实现路径（二叉树累加 vs. 顺序累加）。
 
-通过对行求和来归约每一列。
-
-## 数学语义
+## 机制
 
 设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
 
 $$ \mathrm{dst}_{0,j} = \sum_{i=0}^{R-1} \mathrm{src}_{i,j} $$
 
-`isBinary` 选择实现路径（二叉树累加 vs. 顺序累加）。
-
-## 汇编语法
+迭代域由 `src` 的 valid region 决定，`isBinary = true` 时使用 `tmp` 做二叉树累加，`isBinary = false` 时直接在 `dst` 上做顺序累加。若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
-```text
-%dst = tcolsum %src {isBinary = false} : !pto.tile<...> -> !pto.tile<...>
-```
-降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
 %dst = pto.tcolsum %src, %tmp {isBinary = false} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 pto.tcolsum ins(%src, %tmp {isBinary = false} : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
@@ -53,37 +46,54 @@ template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typen
 PTO_INST RecordEvent TCOLSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, bool isBinary, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收按列求和结果 |
+| `%tmp` | 临时 tile | 仅 `isBinary = true` 时使用，做二叉树累加 |
+| `isBinary` | 配置参数 | `true`：二叉树累加；`false`：顺序累加（默认 false） |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<1, C>` | 每列的 `R` 个元素求和 |
+
+## 副作用
 
-### 通用约束或检查
+除产生目标 tile 外，没有额外架构副作用。
+
+## 约束
 
 - `dst` 和 `src` 必须为 `TileType::Vec`。
 - `dst` 和 `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
 - `dst` 和 `src` 的元素类型必须一致。
-- 运行时检查：
-    - `src.GetValidCol() == dst.GetValidCol()`
-    - `src.GetValidRow() != 0`
-    - `src.GetValidCol() != 0`
-    - `src.GetValidCol()` 必须不大于按 `src` 元素计的 `tmp` 行跨度
-- `isBinary` 选择已检查到的后端路径：
-    - `true`：使用 `tmp` 做二叉树累加
-    - `false`：直接在 `dst` 上做顺序累加
+- 运行时检查：`src.GetValidCol() == dst.GetValidCol()`、`src.GetValidRow() != 0`、`src.GetValidCol() != 0`。
+- `src.GetValidCol()` 必须不大于按 `src` 元素计的 `tmp` 行跨度。
+- A2/A3：`tmp` 必须为 `TileType::Vec`，使用标准 ND 布局，元素类型与 `src` 和 `dst` 一致。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`half`、`float`、`int16_t`、`int32_t`。
-- `tmp` 必须为 `TileType::Vec`，且使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `tmp` 的元素类型必须与 `src` 和 `dst` 一致。
-- 若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- A5 共享列归约检查允许的元素类型为：`half`、`float`、`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`bfloat16_t`。
-- 已检查到的 A5 `TCOLSUM` 路径中，`tmp` 仍只用于二叉累加路径；`TCOLSUM_IMPL` 中没有额外显式加入 `tmp` 的编译期类型/布局断言。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| `int16_t` | — | Supported | Supported |
+| `int32_t` | — | Supported | Supported |
+| `int8_t` / `uint8_t` | — | — | Supported |
+| `uint16_t` / `uint32_t` | — | — | Supported |
+| `bfloat16_t` | — | — | Supported |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -101,7 +111,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -122,29 +132,23 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = tcolsum %src {isBinary = false} : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)
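为了对照上文 `pto.tcolsum` 中 `isBinary` 的两条路径，下面给出一个标量参考模型的草图：顺序累加直接写入 `dst`，二叉树累加先把有效区域拷入 `tmp` 再逐层折叠。`ColSumRef` 及其参数是本示例假设的名称，折叠顺序也只是一种可能的二叉树划分，并不代表后端实际采用的顺序。

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 标量参考模型（示意）：src 按行主序存放，cols 为物理列数。
template <typename T>
std::vector<T> ColSumRef(const std::vector<T> &src, std::size_t cols,
                         std::size_t valid_row, std::size_t valid_col,
                         bool is_binary) {
  if (valid_row == 0 || valid_col == 0) {
    return {};  // 有效区域为空：示例中以空向量表示直接返回
  }
  assert(valid_col <= cols);
  assert(src.size() >= (valid_row - 1) * cols + valid_col);

  if (!is_binary) {
    // 顺序累加：逐行加到 dst 上。
    std::vector<T> dst(valid_col, T{0});
    for (std::size_t i = 0; i < valid_row; ++i) {
      for (std::size_t j = 0; j < valid_col; ++j) {
        dst[j] += src[i * cols + j];
      }
    }
    return dst;
  }

  // 二叉树累加：tmp 的行跨度取 valid_col，满足 valid_col 不大于 tmp 行跨度。
  std::vector<T> tmp(valid_row * valid_col);
  for (std::size_t i = 0; i < valid_row; ++i) {
    for (std::size_t j = 0; j < valid_col; ++j) {
      tmp[i * valid_col + j] = src[i * cols + j];
    }
  }
  // 每一层把后半部分的行加到前半部分，行数约减半，直到只剩第 0 行。
  for (std::size_t n = valid_row; n > 1;) {
    const std::size_t stride = (n + 1) / 2;
    for (std::size_t i = 0; i + stride < n; ++i) {
      for (std::size_t j = 0; j < valid_col; ++j) {
        tmp[i * valid_col + j] += tmp[(i + stride) * valid_col + j];
      }
    }
    n = stride;
  }
  return std::vector<T>(tmp.begin(),
                        tmp.begin() + static_cast<std::ptrdiff_t>(valid_col));
}
```

对整数类型，两条路径的结果相同；对浮点类型，由于累加顺序不同，两条路径的舍入结果可能略有差异。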
diff --git a/docs/isa/TCONCAT_zh.md b/docs/isa/TCONCAT_zh.md
index 6a8232ac..ee91c1b8 100644
--- a/docs/isa/TCONCAT_zh.md
+++ b/docs/isa/TCONCAT_zh.md
@@ -1,41 +1,90 @@
-# TCONCAT
+# pto.tconcat
 
-## 指令示意图
+`pto.tconcat` 属于[布局与重排](./tile/layout-and-rearrangement_zh.md)指令集。
 
-![TCONCAT tile operation](../figures/isa/TCONCAT.svg)
+## 概述
 
-## 简介
+沿列维度将两个源 Tile 拼接到目标 Tile 中，形成更宽的 Tile。
 
-沿列维将两个源 Tile 拼接到目标 Tile。
+## 机制
 
-## 数学语义
+语义随具体指令变体而变化。除非另有说明，行为都按目标 valid region 定义。
 
-语义随指令而变化。 除非另有说明，行为都按目标 valid region 定义。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
 PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tconcat ...
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tconcat ins(...) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`.
+声明于 `include/pto/common/pto_instr.hpp`。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| 源 Tile 1 | 输入 | 待拼接的第一个 Tile |
+| 源 Tile 2 | 输入 | 待拼接的第二个 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile_buf<...>` | 拼接后的目标 Tile |
+
+## 副作用
+
+拼接操作会产生目标 Tile，无额外架构副作用。
 
 ## 约束
 
-数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+
+## 异常与非法情形
+
+- 不支持的 Tile 形状/布局组合会被 verifier 或后端拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 拼接操作 | Simulated | Supported | Supported |
 
 ## 示例
 
+### C++ 自动模式
+
 具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### C++ 手动模式
+
+具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### PTO-AS
+
+```text
+%dst = pto.tconcat %src1, %src2 : !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.tconcat ins(%src1, %src2 : ...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- 指令集总览：[布局与重排](./tile/layout-and-rearrangement_zh.md)
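上文只说明 `pto.tconcat` 沿列维度拼接两个源 Tile，没有给出逐元素的定义。下面的草图按一种直观的理解写出参考模型：`dst` 每一行先放 `src0` 的列，再放 `src1` 的列。这里的函数名 `ConcatColsRef`、两个源 Tile 行数相同的前提，以及行主序存放，都是本示例的假设；实际的形状与布局限制以后端合法性检查为准。

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 沿列维度拼接的标量参考模型（示意，假设两个源 tile 行数相同、行主序存放）。
template <typename T>
std::vector<T> ConcatColsRef(const std::vector<T> &src0, std::size_t cols0,
                             const std::vector<T> &src1, std::size_t cols1,
                             std::size_t rows) {
  assert(src0.size() == rows * cols0);
  assert(src1.size() == rows * cols1);

  const std::size_t dst_cols = cols0 + cols1;  // 目标 tile 更宽
  std::vector<T> dst(rows * dst_cols);
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols0; ++j) {
      dst[i * dst_cols + j] = src0[i * cols0 + j];          // 左半部分来自 src0
    }
    for (std::size_t j = 0; j < cols1; ++j) {
      dst[i * dst_cols + cols0 + j] = src1[i * cols1 + j];  // 右半部分来自 src1
    }
  }
  return dst;
}
```

例如，2x2 的 `{1, 2, 3, 4}` 与 2x1 的 `{5, 6}` 拼接后得到 2x3 的 `{1, 2, 5, 3, 4, 6}`。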
diff --git a/docs/isa/TDEQUANT_zh.md b/docs/isa/TDEQUANT_zh.md
index 2b4373f7..8f598fb0 100644
--- a/docs/isa/TDEQUANT_zh.md
+++ b/docs/isa/TDEQUANT_zh.md
@@ -1,41 +1,91 @@
-# TDEQUANT
+# pto.tdequant
 
-## 指令示意图
+`pto.tdequant` 属于[不规则与复杂](./tile/irregular-and-complex_zh.md)指令集。
 
-![TDEQUANT tile operation](../figures/isa/TDEQUANT.svg)
-
-## 简介
+## 概述
 
 使用 scale 与 offset Tile 将整数量化 Tile 反量化为浮点 Tile。
 
-## 数学语义
+## 机制
+
+语义随具体指令变体而变化。除非另有说明，行为都按目标 valid region 定义。
 
-语义随指令而变化。 除非另有说明，行为都按目标 valid region 定义。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
 PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tdequant ...
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tdequant ins(...) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`.
+声明于 `include/pto/common/pto_instr.hpp`。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| 量化 Tile | 输入 | 待反量化的整数量化 Tile |
+| Scale Tile | 输入 | 反量化使用的缩放系数 |
+| Offset Tile | 输入 | 反量化使用的偏移量（可选） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile_buf<f32>` | 反量化后的浮点 Tile |
+
+## 副作用
+
+反量化操作会产生目标浮点 Tile，无额外架构副作用。
 
 ## 约束
 
-数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+
+## 异常与非法情形
+
+- 不支持的量化格式或数据类型组合会被 verifier 或后端拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 反量化操作 | Simulated | Supported | Supported |
 
 ## 示例
 
+### C++ 自动模式
+
+具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### C++ 手动模式
+
 具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### PTO-AS
+
+```text
+%dst = pto.tdequant %quant_tile, %scale_tile, %offset_tile : ...
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.tdequant ins(%quant, %scale : ...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂](./tile/irregular-and-complex_zh.md)
diff --git a/docs/isa/TDIVS_zh.md b/docs/isa/TDIVS_zh.md
index ef120aec..58856deb 100644
--- a/docs/isa/TDIVS_zh.md
+++ b/docs/isa/TDIVS_zh.md
@@ -1,159 +1,171 @@
-﻿# TDIVS
-
-## 指令示意图
-
-![TDIVS tile operation](../figures/isa/TDIVS.svg)
-
-## 简介
-
-与标量的逐元素除法（Tile/标量 或 标量/Tile）。
-
-## 数学语义
-
-对有效区域内的每个元素 `(i, j)`：
-
-- Tile/标量形式：
-
-  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{src}_{i,j}}{\mathrm{scalar}} $$
-
-- 标量/Tile 形式：
-
-  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{scalar}}{\mathrm{src}_{i,j}} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-Tile/标量形式：
-
-```text
-%dst = tdivs %src, %scalar : !pto.tile<...>, f32
-```
-
-标量/Tile 形式：
-
-```text
-%dst = tdivs %scalar, %src : f32, !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-%dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TDIVS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar,
-                           WaitEvents &... events);
-
-template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TDIVS(TileDataDst &dst, typename TileDataDst::DType scalar, TileDataSrc &src0,
-                           WaitEvents &... events)
-```
-
-`PrecisionType`可指定以下值：
-
-* `DivAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
-* `DivAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
-
-## 约束
-
-- **实现检查 (A2A3)**（两个重载）:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **实现检查 (A5)**（两个重载）:
-    - `TileData::DType` 必须是以下之一：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **除零**:
-    - 行为由目标定义；在 A5 上，Tile/标量形式映射到乘以倒数，并对 `scalar == 0` 使用 `1/0 -> +inf`。dst.GetValidRow()`且`src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域.
-- **除零**:
-    - 行为由目标定义；在 A5 上，tile/标量形式映射到乘以倒数，并对 `scalar == 0` 使用 `1/0 -> +inf`。
-- **高精度算法**
-    - 仅在A5上有效，`PrecisionType`选项A3上将被忽略。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TDIVS(dst, src, 2.0f);
-  TDIVS<DivAlgorithm::HIGH_PRECISION>(dst, src, 2.0f);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TDIVS(dst, 2.0f, src);
-  TDIVS<DivAlgorithm::HIGH_PRECISION>(dst, 2.0f, src);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
+﻿# pto.tdivs
+
+![TDIVS tile operation](../figures/isa/TDIVS.svg)
+
+`pto.tdivs` 属于[Tile-标量与立即数](./tile/tile-scalar-and-immediate_zh.md)指令集。
+
+## 概述
+
+带标量的逐元素除法，支持 tile / scalar 和 scalar / tile 两种方向，标量广播到 tile 有效区域的所有元素。
+
+## 机制
+
+对目标 tile 的 valid region 中每个 `(i, j)`：
+
+- tile / scalar 形式：
+
+  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{src}_{i,j}}{\mathrm{scalar}} $$
+
+- scalar / tile 形式：
+
+  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{scalar}}{\mathrm{src}_{i,j}} $$
+
+除零行为由目标 profile 定义。在 A5 上，tile / scalar 形式映射到"乘以倒数"的实现路径，并对 `scalar == 0` 使用 `1/0 -> +inf`。
+
+## 语法
+
+### PTO-AS
+
+tile / scalar 形式：
+
+```text
+%dst = tdivs %src, %scalar : !pto.tile<...>, f32
+```
+
+scalar / tile 形式：
+
+```text
+%dst = tdivs %scalar, %src : f32, !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```mlir
+%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+```cpp
+template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
+          typename... WaitEvents>
+PTO_INST RecordEvent TDIVS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar,
+                           WaitEvents &... events);
+
+template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
+          typename... WaitEvents>
+PTO_INST RecordEvent TDIVS(TileDataDst &dst, typename TileDataDst::DType scalar, TileDataSrc &src0,
+                           WaitEvents &... events);
+```
+
+`PrecisionType` 可选：
+
+- `DivAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
+- `DivAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | tile / scalar 形式中为被除数，scalar / tile 形式中为除数 |
+| `%scalar` | 标量 | 广播到所有元素的标量值 |
+| `%dst` | 目标 tile | 接收逐元素除法结果 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | `dst` valid region 内的每个元素都等于对应形式的除法结果 |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
+## 约束
+
+- 操作迭代域由 `dst.GetValidRow()` / `dst.GetValidCol()` 决定。
+- 除零行为由目标 profile 定义。
+- `HIGH_PRECISION` 只在 A5 可用，A3 上该选项会被忽略。
+
+## 异常与非法情形
+
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `int32_t` / `int16_t` | Simulated | Supported | Supported |
+| `uint32_t` / `uint16_t` | Simulated | No | Supported |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| `int8_t` / `uint8_t` | Simulated | No | Supported |
+| 布局 | Any | RowMajor only | RowMajor only |
+
+A2/A3 支持：`int32_t`、`int16_t`、`half`、`float`；A5 额外支持 `uint32_t`、`uint16_t`、`int8_t`、`uint8_t`。
+
+## 示例
+
+### C++ 自动模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TDIVS(dst, src, 2.0f);
+  TDIVS<DivAlgorithm::HIGH_PRECISION>(dst, src, 2.0f);
+}
+```
+
+### C++ 手动模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TDIVS(dst, 2.0f, src);
+  TDIVS<DivAlgorithm::HIGH_PRECISION>(dst, 2.0f, src);
+}
+```
+
+### PTO-AS
+
+```text
+# tile / scalar 自动模式
+%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+
+# scalar / tile 自动模式
+%dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
+
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+
+# PTO 汇编形式
+%dst = tdivs %src, %scalar : !pto.tile<...>, f32
+%dst = tdivs %scalar, %src : f32, !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- 指令集总览：[Tile-标量与立即数](./tile/tile-scalar-and-immediate_zh.md)
+- 规范页：[pto.tdivs](./tile/ops/tile-scalar-and-immediate/tdivs_zh.md)
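下面用一个标量参考模型的草图对照上文 `pto.tdivs` 的两种方向，以及 A5 上 tile / scalar 形式所说的"乘以倒数"路径。`TileShape`、`DivTileByScalarRef`、`DivScalarByTileRef` 和 `MulByReciprocalRef` 都是本示例假设的名称，并非 PTO 接口；有效区域之外的元素保持不变也是本示例的约定，实际行为以后端实现为准。

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 行主序 tile 的最小描述（示意用，非 PTO 类型）。
struct TileShape {
  std::size_t cols;       // 物理列数
  std::size_t valid_row;  // 对应 dst.GetValidRow()
  std::size_t valid_col;  // 对应 dst.GetValidCol()
};

// tile / scalar：dst(i, j) = src(i, j) / scalar
inline void DivTileByScalarRef(std::vector<float> &dst, const std::vector<float> &src,
                               float scalar, const TileShape &s) {
  assert(dst.size() == src.size() && s.valid_col <= s.cols);
  assert(dst.size() >= s.valid_row * s.cols);
  for (std::size_t i = 0; i < s.valid_row; ++i)
    for (std::size_t j = 0; j < s.valid_col; ++j)
      dst[i * s.cols + j] = src[i * s.cols + j] / scalar;
}

// scalar / tile：dst(i, j) = scalar / src(i, j)
inline void DivScalarByTileRef(std::vector<float> &dst, float scalar,
                               const std::vector<float> &src, const TileShape &s) {
  assert(dst.size() == src.size() && s.valid_col <= s.cols);
  assert(dst.size() >= s.valid_row * s.cols);
  for (std::size_t i = 0; i < s.valid_row; ++i)
    for (std::size_t j = 0; j < s.valid_col; ++j)
      dst[i * s.cols + j] = scalar / src[i * s.cols + j];
}

// "乘以倒数"路径：先求 1 / scalar，再逐元素相乘。
// 在 IEEE 754 平台上，scalar == 0 时倒数为 +inf。
inline void MulByReciprocalRef(std::vector<float> &dst, const std::vector<float> &src,
                               float scalar, const TileShape &s) {
  assert(dst.size() == src.size() && s.valid_col <= s.cols);
  assert(dst.size() >= s.valid_row * s.cols);
  const float inv = 1.0f / scalar;
  for (std::size_t i = 0; i < s.valid_row; ++i)
    for (std::size_t j = 0; j < s.valid_col; ++j)
      dst[i * s.cols + j] = src[i * s.cols + j] * inv;
}
```

当 `scalar` 的倒数可以精确表示时（例如 2 的幂），"乘以倒数"与直接相除的结果一致；其他情况下两者可能相差一个舍入误差，这也是 `DivAlgorithm::HIGH_PRECISION` 这类选项存在的背景之一。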
diff --git a/docs/isa/TDIV_zh.md b/docs/isa/TDIV_zh.md
index 7d0b14ca..1506c01b 100644
--- a/docs/isa/TDIV_zh.md
+++ b/docs/isa/TDIV_zh.md
@@ -1,24 +1,24 @@
-﻿# TDIV
-
-## 指令示意图
+﻿# pto.tdiv
 
 ![TDIV tile operation](../figures/isa/TDIV.svg)
 
-## 简介
+`pto.tdiv` 属于[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)指令集。
+
+## 概述
 
-两个 Tile 的逐元素除法。
+对两个 tile 做逐元素除法，结果写入目标 tile。迭代域由目标 tile 的 valid region 决定。
 
-## 数学语义
+## 机制
 
-对每个元素 `(i, j)` 在有效区域内：
+对目标 tile 的 valid region 中每个 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{i,j}} $$
 
-## 汇编语法
+除零行为由目标 profile 定义。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tdiv %src0, %src1 : !pto.tile<...>
@@ -26,59 +26,78 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
-pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```mlir
+pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>)
+         outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc0,
           typename TileDataSrc1, typename... WaitEvents>
 PTO_INST RecordEvent TDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
-`PrecisionType`可指定以下值：
+`PrecisionType` 可选：
+
+- `DivAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
+- `DivAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
 
-* `DivAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
-* `DivAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src0` | 左 tile | 被除数 tile，在 `dst` valid region 上逐坐标读取 |
+| `%src1` | 右 tile | 除数 tile，在 `dst` valid region 上逐坐标读取 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | `dst` valid region 内的每个元素都等于 `src0 / src1` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
 
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `TileData::DType` 必须是以下之一： `half`, `float`.
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界： `TileData::ValidRow <= TileData::Rows`且`TileData::ValidCol <= TileData::Cols`.
-    - 运行时： `src0`, `src1`且`dst` tiles 应具有相同的 `validRow/validCol`.
-- **实现检查 (A5)**:
-    - `TileData::DType` 必须是以下之一： `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`.
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界： `TileData::ValidRow <= TileData::Rows`且`TileData::ValidCol <= TileData::Cols`.
-    - 运行时： `src0`, `src1`且`dst` tiles 应具有相同的 `validRow/validCol`.
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域;.
-- **除零**:
-    - 行为由目标定义。
-- **高精度算法**
-    - 仅在A5上有效，`PrecisionType`选项A3上将被忽略。
+- 操作迭代域由 `dst.GetValidRow()` / `dst.GetValidCol()` 决定。
+- 除零行为由目标 profile 定义。
+- `HIGH_PRECISION` 选项只在 A5 有效，A3 上会被忽略。
+
+## 异常与非法情形
+
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| `int32_t` | Simulated | No | Supported |
+| `uint32_t` | Simulated | No | Supported |
+| `int16_t` | Simulated | No | Supported |
+| `uint16_t` | Simulated | No | Supported |
+| 布局 | Any | RowMajor only | RowMajor only |
+
+A2/A3 与 A5 均要求行主序布局；A2/A3 仅支持 `half` 与 `float`，A5 额外支持 `int32_t`、`uint32_t`、`int16_t`、`uint16_t`。
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example_auto() {
@@ -89,11 +108,10 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example_manual() {
@@ -107,29 +125,24 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
 
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = tdiv %src0, %src1 : !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)
+- 规范页：[pto.tdiv](./tile/ops/elementwise-tile-tile/tdiv_zh.md)
diff --git a/docs/isa/TEXPANDS_zh.md b/docs/isa/TEXPANDS_zh.md
index 9a61e790..33998c89 100644
--- a/docs/isa/TEXPANDS_zh.md
+++ b/docs/isa/TEXPANDS_zh.md
@@ -1,24 +1,22 @@
-﻿# TEXPANDS
+﻿# pto.texpands
 
-## 指令示意图
+`pto.texpands` 属于[Tile-标量与立即数](./tile/tile-scalar-and-immediate_zh.md)指令集。
 
-![TEXPANDS tile operation](../figures/isa/TEXPANDS.svg)
+## 概述
 
-## 简介
+将标量广播到目标 Tile 中所有有效位置。
 
-将标量广播到目标 Tile 中。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{scalar} $$
 
-## 汇编语法
+对于向量 Tile，迭代域由 `dst.GetValidRow()` / `dst.GetValidCol()` 决定；对于 Mat Tile，迭代域由 `TileData::Rows` / `TileData::Cols` 决定。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = texpands %scalar : f32, !pto.tile<...>
@@ -26,54 +24,70 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.texpands %scalar : dtype -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TEXPANDS(TileData &dst, typename TileData::DType scalar, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%scalar` | 标量 | 广播到目标 tile 的值 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | 有效区域内所有元素等于标量值 |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - 对于Tile位置是向量（`TileData::Loc == TileType::Vec`）:
-    - `TileData::DType` 必须是以下之一：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
-    - 静态有效边界： `TileData::ValidRow <= TileData::Rows`且`TileData::ValidCol <= TileData::Cols`.
-    - 对于Tile位置是Mat（`TileData::Loc == TileType::Mat`）:
-    - `TileData::DType` 必须是以下之一：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
-    - 有效边界：`TileData::Rows * TileData::Cols * sizeof(T) / 32` 必须在`[1, 32767]`范围内。
-- **实现检查 (A5)**:
-    - 对于Tile位置是向量（`TileData::Loc == TileType::Vec`）:
-    - 静态有效边界： `TileData::ValidRow <= TileData::Rows`且`TileData::ValidCol <= TileData::Cols`.
-    - `TileData::DType` 必须是以下之一： `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
-    - 对于Tile位置是Mat（`TileData::Loc == TileType::Mat`）:
-    - `TileData::DType` 必须是以下之一： `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
-    - 对于`TileDataDst::layout == pto::Layout::NC1HWC0 || TileDataDst::layout == pto::Layout::FRACTAL_Z`:
-      - `TileData::shape0 * TileData::shape1 * TileData::shape2 * TileData::shape3` 必须在`[1, 32767]`范围内。
-    - 对于`TileDataDst::layout == pto::Layout::NDC1HWC0 || TileDataDst::layout == pto::Layout::FRACTAL_Z_3D`:
-      - `TileData::shape0 * TileData::shape1 * TileData::shape2 * TileData::shape3 * TileData::shape4` 必须在`[1, 32767]`范围内。
-- **有效区域**:
-    - 对于Tile位置是向量（`TileData::Loc == TileType::Vec`）:
-    - 该操作在 `dst.GetValidRow()` / `dst.GetValidCol()` 上填充 `dst`。
-    - 对于Tile位置是Mat（`TileData::Loc == TileType::Mat`）:
-    - 对于Tile，该操作在 `TileData::Rows` / `TileData::Cols` 上填充 `dst`。
-    - 对于convTile，该操作在`ConvTileData`的`shape`内填充`dst`。
+- Tile 位置可以是向量或 Mat。
+- A2/A3 向量与 Mat 支持：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
+- A5 向量与 Mat 支持：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`。
+- A2/A3 Mat 要求：`TileData::Rows * TileData::Cols * sizeof(T) / 32` 在 `[1, 32767]` 范围内。
+- 向量 Tile 静态有效边界（A2/A3 与 A5）：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+- A5 Mat 约束因布局而异：`NC1HWC0` / `FRACTAL_Z` 要求 `shape0 * shape1 * shape2 * shape3` 在 `[1, 32767]` 范围内；`NDC1HWC0` / `FRACTAL_Z_3D` 要求 `shape0` 至 `shape4` 的乘积在 `[1, 32767]` 范围内。
+
+## 异常与非法情形
+
+- 不支持的元素类型会被 verifier 拒绝。
+- 所选 target profile 不支持的形状/布局约束会被后端拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| `bf16` | Simulated | Supported | No |
+| `i32 / u32` | Simulated | Supported | Supported |
+| `i16 / u16` | Simulated | Supported | Supported |
+| `i8 / u8` | Simulated | Supported | Supported |
+| Vec Layout | Any | Any | RowMajor |
+| Mat Layout | Any | Supported | Supported |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -87,7 +101,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -102,29 +116,18 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+%dst = texpands %scalar : f32, !pto.tile<...>
 ```
 
-### 手动模式
+### AS Level 2（DPS）
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+```mlir
+pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = texpands %scalar : f32, !pto.tile<...>
-# AS Level 2 (DPS)
-pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[Tile-标量与立即数](./tile/tile-scalar-and-immediate_zh.md)
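上文约束中 A2/A3 Mat Tile 的大小要求 `TileData::Rows * TileData::Cols * sizeof(T) / 32` 落在 `[1, 32767]` 内。下面的草图把这条规则写成一个编译期检查，并给出几个算例。`MatExpandSizeOk` 是本示例假设的辅助函数，并非 PTO 接口；示例还假设其中的除法按整数除法（向下取整）理解，实际检查以后端实现为准。

```cpp
#include <cstddef>
#include <cstdint>

// 检查 A2/A3 Mat Tile 的大小约束（示意）：
// Rows * Cols * sizeof(T) / 32 必须落在 [1, 32767]。
// 假设：除法为整数除法，即以 32 字节为单位向下取整。
template <typename T>
constexpr bool MatExpandSizeOk(std::size_t rows, std::size_t cols) {
  const std::size_t blocks = rows * cols * sizeof(T) / 32;
  return blocks >= 1 && blocks <= 32767;
}

// 16x16 float：1024 字节 / 32 = 32，满足约束。
static_assert(MatExpandSizeOk<float>(16, 16), "16x16 float should pass");
// 1024x1024 float：4 MiB / 32 = 131072，超过上限 32767。
static_assert(!MatExpandSizeOk<float>(1024, 1024), "1024x1024 float should fail");
// 1x4 int8_t：4 字节 / 32 = 0，低于下限 1。
static_assert(!MatExpandSizeOk<std::int8_t>(1, 4), "1x4 int8_t should fail");
```

按这一理解，上限 32767 对应约 1 MiB（32767 × 32 字节）的 Tile，下限 1 则要求 Tile 至少占满 32 字节。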
diff --git a/docs/isa/TEXP_zh.md b/docs/isa/TEXP_zh.md
index 5a4dcb18..848ce5cf 100644
--- a/docs/isa/TEXP_zh.md
+++ b/docs/isa/TEXP_zh.md
@@ -1,127 +1,140 @@
-﻿# TEXP
-
-## 指令示意图
-
-![TEXP tile operation](../figures/isa/TEXP.svg)
-
-## 简介
-
-逐元素指数运算。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \exp(\mathrm{src}_{i,j}) $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = texp %src : !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <auto PrecisionType = ExpAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TEXP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-`PrecisionType`可指定以下值：
-
-* `ExpAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
-* `ExpAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
-
-
-## 约束
-
-- **实现检查 (NPU)**:
-    - `TileData::DType` 必须是以下之一：`float` 或 `half`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`);
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **高精度算法**
-    - 仅在A5上有效，`PrecisionType`选项A3上将被忽略。
-
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TEXP(dst, src);
-  TEXP<ExpAlgorithm::HIGH_PRECISION>(dst, src);  // A5 Only
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TEXP(dst, src);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = texp %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+﻿# pto.texp
+
+![TEXP tile operation](../figures/isa/TEXP.svg)
+
+`pto.texp` 属于[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)指令集。
+
+## 概述
+
+对 tile 做逐元素指数运算，结果写入目标 tile。迭代域由目标 tile 的 valid region 决定。
+
+## 机制
+
+对目标 tile 的 valid region 中每个 `(i, j)`：
+
+$$ \mathrm{dst}_{i,j} = \exp(\mathrm{src}_{i,j}) $$
+
+它是 tile 路径上的一元超越函数，常见于 softmax、归一化和指数域变换。
+
+## 语法
+
+### PTO-AS
+
+```text
+%dst = texp %src : !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```mlir
+%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+```cpp
+template <auto PrecisionType = ExpAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
+          typename... WaitEvents>
+PTO_INST RecordEvent TEXP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+`PrecisionType` 可选：
+
+- `ExpAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
+- `ExpAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收逐元素指数结果 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | `dst` valid region 内的每个元素都等于 `exp(src)` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
+## 约束
+
+- 支持类型当前是 `float` / `half`。
+- tile 必须是行主序向量 tile。
+- 操作迭代域由 `dst.GetValidRow()` / `dst.GetValidCol()` 决定。
+- 高精度算法只在 A5 有效；在 A3 上 `PrecisionType` 选项将被忽略。
+
+## 异常与非法情形
+
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| 布局 | Any | RowMajor only | RowMajor only |
+
+## 示例
+
+### C++ 自动模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TEXP(dst, src);
+  TEXP<ExpAlgorithm::HIGH_PRECISION>(dst, src);  // A5 Only
+}
+```
+
+### C++ 手动模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TEXP(dst, src);
+}
+```
+
+### PTO-AS
+
+```text
+# 自动模式
+%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
+
+# 手动模式
+pto.tassign %src, @tile(0x1000)
+pto.tassign %dst, @tile(0x2000)
+%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
+
+# PTO 汇编形式
+%dst = texp %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- 指令集总览：[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)
+- 规范页：[pto.texp](./tile/ops/elementwise-tile-tile/texp_zh.md)
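`pto.texp` 页面“机制”一节的逐元素语义，可以用下面的标量参考模型来示意。这只是一个便于理解的草图：`RefTile` 和 `ref_texp` 是为说明而假设的名称，并非 PTO 库的类型或接口；草图中 valid region 之外的 `dst` 元素不会被写入，真实后端对该区域的处理以各自实现为准。

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// 假设性的参考模型：用行主序 std::vector 模拟一个 tile，并非 PTO 库的 Tile 类型。
struct RefTile {
  std::size_t rows, cols;            // 静态形状
  std::size_t validRows, validCols;  // valid region 的大小
  std::vector<float> data;           // 行主序存储，长度为 rows * cols

  RefTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc, float init = 0.0f)
      : rows(r), cols(c), validRows(vr), validCols(vc), data(r * c, init) {
    assert(vr <= r && vc <= c);  // 有效边界不能超过静态形状
  }
  float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// 以 dst 的 valid region 作为迭代域，逐元素计算 exp。
inline void ref_texp(RefTile &dst, const RefTile &src) {
  assert(dst.validRows <= src.rows && dst.validCols <= src.cols);
  for (std::size_t i = 0; i < dst.validRows; ++i) {
    for (std::size_t j = 0; j < dst.validCols; ++j) {
      dst.at(i, j) = std::exp(src.at(i, j));
    }
  }
}
```

这样的参考模型适合在主机侧对照 CPU 仿真或 NPU 结果做抽样比对；它不区分 `ExpAlgorithm::DEFAULT` 与 `ExpAlgorithm::HIGH_PRECISION`。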
diff --git a/docs/isa/TEXTRACT_FP_zh.md b/docs/isa/TEXTRACT_FP_zh.md
index 648131b2..cc66c964 100644
--- a/docs/isa/TEXTRACT_FP_zh.md
+++ b/docs/isa/TEXTRACT_FP_zh.md
@@ -1,30 +1,30 @@
-﻿# TEXTRACT_FP
+# pto.textract_fp
 
-## 指令示意图
+`pto.textract_fp` 属于[布局与重排](./tile/ops/layout-and-rearrangement/textract-fp_zh.md)指令集。
 
-![TEXTRACT_FP tile operation](../figures/isa/TEXTRACT_FP.svg)
-
-## 简介
+## 概述
 
-带 fp/缩放 Tile 的提取（向量量化参数）。
+带 fp/缩放 Tile 的提取：在提取子 Tile 时由 `fp` Tile 提供向量量化参数。
 
-## 数学语义
+## 机制
 
 除非另有说明，语义在有效区域上定义，目标相关的行为标记为实现定义。
 
-## 汇编语法
+## 语法
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -38,37 +38,89 @@ template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluP
 PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | - | 源 Tile |
+| `fp` | - | FP 缩放 Tile |
+| `indexRow` | - | 行偏移量 |
+| `indexCol` | - | 列偏移量 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | - | 带 FP 缩放的提取结果 Tile |
+
+## 副作用
+
+提取操作可能触发目标特定的缩放或量化行为。
+
 ## 约束
 
 类型/布局/位置/形状的合法性取决于后端；将实现特定的说明视为该后端的规范。
 
-## 示例
+## 异常与非法情形
 
-参见 `docs/isa/` 和 `docs/coding/tutorials/` 中的相关示例。
+- 当输入 tile 类型或布局不被目标支持时行为未定义。
+- 当偏移量超出有效范围时行为未定义。
 
-## 汇编示例（ASM）
+## Target-Profile 限制
 
-### 自动模式
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| FP 提取 | ✓ | ✓ | ✓ |
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+## 示例
+
+### C++ 自动模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Mat, float, 16, 16>;
+  using DstT = TileLeft<float, 16, 16>;
+  using FpT = TileScale<float, 16, 2>;
+  SrcT src;
+  DstT dst;
+  FpT fp;
+  TEXTRACT_FP(dst, src, fp, /*indexRow=*/0, /*indexCol=*/0);
+}
 ```
 
-### 手动模式
+### C++ 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Mat, float, 16, 16>;
+  using DstT = TileLeft<float, 16, 16>;
+  using FpT = TileScale<float, 16, 2>;
+  SrcT src;
+  DstT dst;
+  FpT fp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(fp, 0x3000);
+  TEXTRACT_FP(dst, src, fp, /*indexRow=*/0, /*indexCol=*/0);
+}
 ```
 
-### PTO 汇编形式
+### PTO-AS
 
-```text
+```mlir
 %dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
+
+![TEXTRACT_FP tile operation](../figures/isa/TEXTRACT_FP.svg)
+
+## 相关页面
+
+- 指令集总览：[布局与重排](./tile/ops/layout-and-rearrangement/textract-fp_zh.md)
diff --git a/docs/isa/TEXTRACT_zh.md b/docs/isa/TEXTRACT_zh.md
index 14266890..71b0ba0a 100644
--- a/docs/isa/TEXTRACT_zh.md
+++ b/docs/isa/TEXTRACT_zh.md
@@ -1,24 +1,20 @@
 # pto.textract
 
-旧路径兼容入口。规范页见 [pto.textract](./tile/ops/layout-and-rearrangement/textract_zh.md)。
+`pto.textract` 属于[布局与重排](./tile/ops/layout-and-rearrangement/textract_zh.md)指令集。
 
-![TEXTRACT tile operation](../figures/isa/TEXTRACT.svg)
-
-## 简介
+## 概述
 
-从较大的源 Tile 中提取较小的子 Tile。
+从较大的源 Tile 中提取较小的子 Tile：概念上以 `(indexRow, indexCol)` 为起点，从 `src` 复制一个较小窗口到 `dst`。确切的映射取决于 tile 布局。
 
-## 数学语义
-
-概念上从较大的 `src` Tile 中，以 `(indexRow, indexCol)` 为起点复制一个较小窗口到 `dst`。确切的映射取决于 tile 布局。
+## 机制
 
 设 `R = dst.GetValidRow()` 和 `C = dst.GetValidCol()`。对于 `0 <= i < R` 和 `0 <= j < C`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{indexRow}+i,\; \mathrm{indexCol}+j} $$
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 同步形式：
 
@@ -28,13 +24,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -58,6 +54,25 @@ template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluP
 PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | - | 源 Tile |
+| `indexRow` | - | 行偏移量 |
+| `indexCol` | - | 列偏移量 |
+| `fp` | - | FP 缩放 Tile（TEXTRACT_FP 形式） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | - | 提取后的子 Tile |
+
+## 副作用
+
+提取操作不会产生额外的副作用。
+
 ## 约束
 
 ### 通用约束或检查
@@ -87,9 +102,24 @@ PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData
 - 目标支持 `TileType::Mat -> TileType::Left/Right/Scale`、`TileType::Acc -> TileType::Mat`（含 relu、标量量化、向量量化形式），以及特定的 `TileType::Vec -> TileType::Mat` 提取路径。
 - 向量量化形式额外要求提供 `FpTileData` 缩放操作数，对应 `TEXTRACT_FP(...)` 接口。
 
+## 异常与非法情形
+
+- 当 `DstTileData::DType` 与 `SrcTileData::DType` 不匹配时行为未定义。
+- 当 `indexRow + DstTileData::Rows > SrcTileData::Rows` 时行为未定义。
+- 当 `indexCol + DstTileData::Cols > SrcTileData::Cols` 时行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| int8_t 提取 | ✓ | ✓ | ✓ |
+| float 提取 | ✓ | ✓ | ✓ |
+| bfloat16 提取 | ✓ | ✓ | ✓ |
+| 向量量化形式 | ✗ | ✗ | ✓ |
+
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -105,7 +135,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -123,31 +153,20 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
 ```
 
-### 手动模式
+AS Level 2 (DPS)：
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```mlir
+pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
-### PTO 汇编形式
+![TEXTRACT tile operation](../figures/isa/TEXTRACT.svg)
 
-```text
-%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
+## 相关页面
 
-新的 PTO ISA 文档应直接链接到分组后的指令集路径。
+- 指令集总览：[布局与重排](./tile/ops/layout-and-rearrangement/textract_zh.md)
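`pto.textract` 页面“机制”一节给出的窗口复制公式，可以写成下面的标量参考模型。这是一个仅用于说明语义的草图：`RefTile` 与 `ref_textract` 是假设的名称，不是 PTO 库接口；草图假设行主序布局，而真实的映射取决于 tile 布局。越界检查沿用该页“异常与非法情形”中列出的条件。

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 假设性的参考模型：行主序存储的 tile，并非 PTO 库的 Tile 类型。
struct RefTile {
  std::size_t rows, cols;            // 静态形状
  std::size_t validRows, validCols;  // valid region 的大小
  std::vector<float> data;           // 长度为 rows * cols

  RefTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc)
      : rows(r), cols(c), validRows(vr), validCols(vc), data(r * c, 0.0f) {
    assert(vr <= r && vc <= c);
  }
  float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// dst(i, j) = src(indexRow + i, indexCol + j)，其中 0 <= i < R、0 <= j < C。
inline void ref_textract(RefTile &dst, const RefTile &src,
                         std::size_t indexRow, std::size_t indexCol) {
  // 窗口必须完全落在 src 的静态形状之内。
  assert(indexRow + dst.rows <= src.rows);
  assert(indexCol + dst.cols <= src.cols);
  const std::size_t R = dst.validRows;
  const std::size_t C = dst.validCols;
  for (std::size_t i = 0; i < R; ++i) {
    for (std::size_t j = 0; j < C; ++j) {
      dst.at(i, j) = src.at(indexRow + i, indexCol + j);
    }
  }
}
```

例如，从一个 4x4 的 `src` 中以 `(1, 2)` 为起点提取 2x2 窗口时，`dst(0, 0)` 取自 `src(1, 2)`，`dst(1, 1)` 取自 `src(2, 3)`。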
diff --git a/docs/isa/TFMOD_zh.md b/docs/isa/TFMOD_zh.md
index fbc25dfe..59389a59 100644
--- a/docs/isa/TFMOD_zh.md
+++ b/docs/isa/TFMOD_zh.md
@@ -1,24 +1,24 @@
-﻿# TFMOD
-
-## 指令示意图
+# pto.tfmod
 
 ![TFMOD tile operation](../figures/isa/TFMOD.svg)
 
-## 简介
+`pto.tfmod` 属于[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)指令集。
+
+## 概述
 
-两个 Tile 的逐元素余数，余数符号与被除数相同。
+对两个 tile 做逐元素 `fmod` 运算，结果写入目标 tile。迭代域由目标 tile 的 valid region 决定。
 
-## 数学语义
+## 机制
 
-对每个元素 `(i, j)` 在有效区域内：
+对目标 tile 的 valid region 中每个 `(i, j)`：
 
-$$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j})$$
+$$ \mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) $$
 
-## 汇编语法
+它表示浮点余数语义，余数符号与被除数（`src0`）相同。常用于周期折返、相位归一化和浮点余数路径。除零行为由目标 profile 定义。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tfmod %src0, %src1 : !pto.tile<...>
@@ -26,35 +26,68 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tfmod %src0, %src1 : !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
 PTO_INST RecordEvent TFMOD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src0` | 左 tile | 被除数 tile，在 `dst` valid region 上逐坐标读取 |
+| `%src1` | 右 tile | 除数 tile，在 `dst` valid region 上逐坐标读取 |
+| `%dst` | 目标 tile | 接收逐元素 fmod 结果 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | `dst` valid region 内的每个元素都等于 `fmod(src0, src1)` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- 该操作在 `dst.GetValidRow()` / `dst.GetValidCol()` 上迭代。
-- 除零行为由目标定义；CPU 模拟器在调试构建中会断言。
+- 迭代域由 `dst.GetValidRow()` / `dst.GetValidCol()` 决定。
+- 除零行为由目标 profile 定义；CPU 模拟器在调试构建下会断言。
+
+## 异常与非法情形
+
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `int32_t` | Simulated | Supported | Supported |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| 布局 | Any | RowMajor only | RowMajor only |
+
+`pto.tfmod` 在 CPU 仿真、A2/A3 和 A5 上保留相同的 PTO 可见语义，但具体支持子集仍取决于 profile。
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example() {
@@ -64,29 +97,40 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
+### C++ 手动模式
 
-### 自动模式
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
+void example_manual() {
+  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
+  TileT a, b, dst;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(dst, 0x3000);
+  TFMOD(dst, a, b);
+}
 ```
 
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 自动模式
 %dst = pto.tfmod %src0, %src1 : !pto.tile<...>
-```
 
-### PTO 汇编形式
+# 手动模式
+pto.tassign %src0, @tile(0x1000)
+pto.tassign %src1, @tile(0x2000)
+%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
 
-```text
+# PTO 汇编形式
 %dst = tfmod %src0, %src1 : !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)
+- 规范页：[pto.tfmod](./tile/ops/elementwise-tile-tile/tfmod_zh.md)
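`pto.tfmod` 页面指出余数符号与被除数（`src0`）相同，这一点可以用标准库的 `std::fmod` 做一个主机侧的参考草图来验证。`ref_tfmod` 及其参数是为说明而假设的，不是 PTO 库接口；草图假设三个缓冲区都是行主序、每行 `cols` 个元素，并且不处理除零情形（该行为由目标 profile 定义）。

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// 假设性的参考函数：在 validRows x validCols 的迭代域上逐元素计算 fmod。
inline void ref_tfmod(std::vector<float> &dst, const std::vector<float> &src0,
                      const std::vector<float> &src1, std::size_t cols,
                      std::size_t validRows, std::size_t validCols) {
  assert(validCols <= cols);
  assert(dst.size() >= validRows * cols);
  assert(src0.size() >= validRows * cols);
  assert(src1.size() >= validRows * cols);
  for (std::size_t i = 0; i < validRows; ++i) {
    for (std::size_t j = 0; j < validCols; ++j) {
      const std::size_t idx = i * cols + j;
      // std::fmod 的结果符号与被除数 src0 相同。
      dst[idx] = std::fmod(src0[idx], src1[idx]);
    }
  }
}
```

以 `5.5` 和 `2.0` 的四种符号组合为例：被除数为正时结果为 `1.5`，被除数为负时结果为 `-1.5`，与除数的符号无关。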
diff --git a/docs/isa/TFREE_zh.md b/docs/isa/TFREE_zh.md
index 30cfbca7..8e1592dc 100644
--- a/docs/isa/TFREE_zh.md
+++ b/docs/isa/TFREE_zh.md
@@ -1,41 +1,91 @@
-# TFREE
+# pto.tfree
 
-## 指令示意图
+`pto.tfree` 属于[同步与配置](./tile/sync-and-config_zh.md)指令集。
 
-![TFREE tile operation](../figures/isa/TFREE.svg)
-
-## 简介
+## 概述
 
 将当前占用的 pipe 或 FIFO 槽位释放回生产者。
 
-## 数学语义
+## 机制
+
+语义随具体指令变体而变化。除非另有说明，行为都按目标 valid region 定义。
 
-语义随指令而变化。 除非另有说明，行为都按目标 valid region 定义。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
 PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tfree ...
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tfree ins(...) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`.
+声明于 `include/pto/common/pto_instr.hpp`。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| 源 Tile | 输入 | 待释放槽位中的 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | - | 释放操作无返回值，仅将槽位归还生产者 |
+
+## 副作用
+
+释放 pipe 或 FIFO 槽位，将其归还给生产者供后续使用。
 
 ## 约束
 
-数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 只能释放当前核/线程持有的槽位。
+
+## 异常与非法情形
+
+- 释放未持有的槽位属于未定义行为。
+- 在 pipe 或 FIFO 未正确初始化的情形下执行会被后端拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 释放操作 | Simulated | Supported | Supported |
 
 ## 示例
 
+### C++ 自动模式
+
+具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### C++ 手动模式
+
 具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### PTO-AS
+
+```text
+%result = pto.tfree %tile : !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.tfree ins(%tile : ...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- 指令集总览：[同步与配置](./tile/sync-and-config_zh.md)
diff --git a/docs/isa/TGEMV_MX_zh.md b/docs/isa/TGEMV_MX_zh.md
index b3544e3d..b01fac8a 100644
--- a/docs/isa/TGEMV_MX_zh.md
+++ b/docs/isa/TGEMV_MX_zh.md
@@ -1,26 +1,22 @@
-﻿# TGEMV_MX
+# pto.tgemv.mx
 
-## 指令示意图
+`pto.tgemv.mx` 属于[矩阵与矩阵-向量运算](./tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md)指令集。
 
-![TGEMV_MX tile operation](../figures/isa/TGEMV_MX.svg)
-
-## 简介
+## 概述
 
-带缩放 Tile 的 GEMV 变体，支持混合精度/量化矩阵向量计算。
+带缩放 Tile 的 GEMV 变体，支持混合精度/量化矩阵向量计算。缩放 tile 参与实现定义的混合精度重建/缩放，输出对应于目标定义的 mx GEMV 语义。
 
-## 数学语义
+## 机制
 
 概念上（基础 GEMV 路径）：
 
-$$
-\mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j}
-$$
+$$ \mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
 
 对于 `TGEMV_MX`，缩放 tile 参与实现定义的混合精度重建/缩放。架构约定是输出对应于目标定义的 mx GEMV 语义。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 示意形式：
 
@@ -30,14 +26,14 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
-pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
+```mlir
+pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%acc : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
@@ -72,42 +68,94 @@ PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale
 
 附加重载支持累加/偏置变体和 `AccPhase` 选择。
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | Left | 左操作数向量 Tile |
+| `aScaleMatrix` | LeftScale | 左操作数缩放 Tile |
+| `bMatrix` | Right | 右操作数矩阵 Tile |
+| `bScaleMatrix` | RightScale | 右操作数缩放 Tile |
+| `biasData` | Bias | 偏置 Tile（偏置形式可选） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` / `cOutMatrix` | Acc | MX GEMV 结果 Tile |
+
+## 副作用
+
+MX 混合精度操作可能触发目标特定的缩放、反量化或溢出处理行为。
+
 ## 约束
 
 - 使用后端特定的 mx 合法性检查，用于数据类型、tile 位置、分形/布局组合以及缩放格式。
 - 缩放 tile 兼容性和累加器提升由目标后端的实现定义。
 - 为了可移植性，请根据目标实现约束验证确切的 `(A, B, scaleA, scaleB, C)` 类型元组和 tile 布局。
 
+## 异常与非法情形
+
+- 当缩放 tile 类型或布局不符合目标要求时行为未定义。
+- 当数据类型组合不符合目标支持的 mx 规格时行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| MX GEMV | ✓ | ✗ | ✓ |
+| MX GEMV 累加 | ✓ | ✗ | ✓ |
+| MX GEMV 偏置 | ✓ | ✗ | ✓ |
+
 ## 示例
 
-实际使用模式请参见：
+### C++ 自动模式
 
+实际使用模式请参见：
 - `docs/isa/TMATMUL_MX.md`
 - `docs/isa/TGEMV.md`
 
-## 汇编示例（ASM）
+### C++ 手动模式
 
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<float8_e5m2_t, 1, 64>;
+  using B = TileRight<float8_e5m2_t, 64, 32>;
+  using ScaleA = TileLeftScale<float8_e8m0_t, 1, 2>;
+  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
+  using C = TileAcc<float, 1, 32>;
+  A a;
+  B b;
+  ScaleA scaleA;
+  ScaleB scaleB;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(scaleA, GetScaleAddr(a.data()));
+  TASSIGN(scaleB, GetScaleAddr(b.data()));
+  TASSIGN(c, 0x3000);
+  TGEMV_MX(c, a, scaleA, b, scaleB);
+}
 ```
 
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%acc = tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+AS Level 2 (DPS)：
 
-```text
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
+```mlir
+pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%acc : !pto.tile_buf<...>)
 ```
+
+![TGEMV_MX tile operation](../figures/isa/TGEMV_MX.svg)
+
+## 相关页面
+
+- 指令集总览：[矩阵与矩阵-向量运算](./tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md)
diff --git a/docs/isa/THISTOGRAM_zh.md b/docs/isa/THISTOGRAM_zh.md
index 6fd1c648..70ab70a2 100644
--- a/docs/isa/THISTOGRAM_zh.md
+++ b/docs/isa/THISTOGRAM_zh.md
@@ -1,41 +1,93 @@
-# THISTOGRAM
+# pto.thistogram
 
-## 指令示意图
+`pto.thistogram` 属于[不规则与复杂](./tile/irregular-and-complex_zh.md)指令集。
 
-![THISTOGRAM tile operation](../figures/isa/THISTOGRAM.svg)
-
-## 简介
+## 概述
 
 使用索引 Tile 从源值中累计直方图 bin 计数。
 
-## 数学语义
+## 机制
+
+语义随具体指令变体而变化。除非另有说明，行为都按目标 valid region 定义。
 
-语义随指令而变化。 除非另有说明，行为都按目标 valid region 定义。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
 PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.thistogram ...
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.thistogram ins(...) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`.
+声明于 `include/pto/common/pto_instr.hpp`。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| 源值 Tile | 输入 | 待累计的源值 Tile |
+| 索引 Tile | 输入 | 指定每个值应落入的直方图 bin |
+| 直方图 Tile | 输入/输出 | 累计计数的直方图 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile_buf<u32>` | 更新后的直方图 Tile |
+
+## 副作用
+
+直方图 Tile 中的 bin 计数会被原地更新。
 
 ## 约束
 
-数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 索引值必须在直方图 bin 范围内。
+
+## 异常与非法情形
+
+- 索引越界会被 verifier 或运行时检测并拒绝。
+- 不支持的元素类型或布局会被后端拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 直方图操作 | Simulated | Supported | Supported |
 
 ## 示例
 
+### C++ 自动模式
+
+具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### C++ 手动模式
+
 具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### PTO-AS
+
+```text
+%dst = pto.thistogram %values, %indices, %histogram : ...
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.thistogram ins(%values, %indices : ...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂](./tile/irregular-and-complex_zh.md)
diff --git a/docs/isa/TIMG2COL_zh.md b/docs/isa/TIMG2COL_zh.md
index 2fe24146..9ca57725 100644
--- a/docs/isa/TIMG2COL_zh.md
+++ b/docs/isa/TIMG2COL_zh.md
@@ -1,30 +1,30 @@
-# TIMG2COL
+# pto.timg2col
 
-## 指令示意图
+`pto.timg2col` 属于[布局与重排](./tile/ops/layout-and-rearrangement/timg2col_zh.md)指令集。
 
-![TIMG2COL tile operation](../figures/isa/TIMG2COL.svg)
+## 概述
 
-## 简介
+用于类卷积工作负载的图像到列（img2col）变换：将图像 tile 重新排列为适合 GEMM 操作的列格式。
 
-用于类卷积工作负载的图像到列变换。
+## 机制
 
-## 数学语义
+除非另有说明，语义在有效区域上定义，目标相关的行为标记为实现定义。
 
-除非另有说明, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+## 语法
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -37,10 +37,80 @@ PTO_INST RecordEvent TIMG2COL(TileData &dst, ConvTileData &src, uint16_t posM =
                               WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | Conv | 输入卷积图像 Tile |
+| `posM` | - | M 维度偏移量 |
+| `posK` | - | K 维度偏移量 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | - | img2col 变换后的 Tile |
+
+## 副作用
+
+img2col 变换可能产生填充值或目标特定的填充模式。
+
 ## 约束
 
-- This instruction is target/implementation-specific. See `include/pto/npu/*/TImg2col.hpp` for the supported tile types/layouts and config fields.
+- 此指令是目标/实现特定的。
+- 参见 `include/pto/npu/*/TImg2col.hpp` 了解支持的 tile 类型/布局和配置字段。
+
+## 异常与非法情形
+
+- 当输入 tile 类型或布局不被目标支持时行为未定义。
+- 当偏移量超出有效范围时行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| img2col 变换 | ✓ | ✓ | ✓ |
 
 ## 示例
 
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+### C++ 自动模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  ConvTileData src;  // 占位类型：替换为目标后端支持的卷积输入 Tile 类型
+  TileData dst;      // 占位类型：替换为目标后端支持的目标 Tile 类型
+  TIMG2COL(dst, src);
+}
+```
+
+### C++ 手动模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  ConvTileData src;  // 占位类型：替换为目标后端支持的卷积输入 Tile 类型
+  TileData dst;      // 占位类型：替换为目标后端支持的目标 Tile 类型
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TIMG2COL(dst, src);
+}
+```
+
+### PTO-AS
+
+```mlir
+%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+![TIMG2COL tile operation](../figures/isa/TIMG2COL.svg)
+
+## 相关页面
+
+- 指令集总览：[布局与重排](./tile/ops/layout-and-rearrangement/timg2col_zh.md)
diff --git a/docs/isa/TINSERT_FP_zh.md b/docs/isa/TINSERT_FP_zh.md
index 7010594a..02f72b51 100644
--- a/docs/isa/TINSERT_FP_zh.md
+++ b/docs/isa/TINSERT_FP_zh.md
@@ -1,74 +1,121 @@
-﻿# TINSERT_FP
+# pto.tinsert_fp
 
-## 指令示意图
+`pto.tinsert_fp` 属于[不规则与复杂](./tile/irregular-and-complex_zh.md)指令集。
 
-![TINSERT_FP tile operation](../figures/isa/TINSERT_FP.svg)
+## 概述
 
-## 简介
+带 fp/缩放 Tile 的插入操作，用于向量量化参数。除非另有说明，语义在有效区域上定义，目标相关的行为标记为实现定义。
 
-带 fp/缩放 Tile 的插入（向量量化参数）。
+## 机制
 
-## 数学语义
+概念上，将 `src` 写入 `dst` 中以 `(idxrow, idxcol)` 为起点的窗口，`fp` Tile 提供向量量化/缩放参数；确切的映射与缩放行为由目标实现定义。
 
-除非另有说明，语义在有效区域上定义，目标相关的行为标记为实现定义。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
           typename... WaitEvents>
 PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 源 Tile | 源数据 |
+| `fp` | 源 Tile | fp/缩放 Tile |
+| `idxrow` | 标量 | 行索引 |
+| `idxcol` | 标量 | 列索引 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 插入操作后的目标 Tile |
+
+## 副作用
+
+无。
+
 ## 约束
 
-类型/布局/位置/形状的合法性取决于后端；将实现特定的说明视为该后端的规范。
+- 类型/布局/位置/形状的合法性取决于后端；将实现特定的说明视为该后端的规范。
+
+## 异常与非法情形
+
+- 类型、布局或索引不满足目标后端要求时，行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-参见 `docs/isa/` 和 `docs/coding/tutorials/` 中的相关示例。
+### C++ 自动模式
+
+```cpp
+#include <pto/pto-inst.hpp>
 
-## 汇编示例（ASM）
+using namespace pto;
 
-### 自动模式
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, fp, dst;
+  uint16_t idxrow = 0, idxcol = 0;
+  TINSERT_FP(dst, src, fp, idxrow, idxcol);
+}
+```
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+### C++ 手动模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, fp, dst;
+  uint16_t idxrow = 0, idxcol = 0;
+  TASSIGN(src, 0x1000);
+  TASSIGN(fp, 0x2000);
+  TASSIGN(dst, 0x3000);
+  TINSERT_FP(dst, src, fp, idxrow, idxcol);
+}
 ```
 
-### 手动模式
+### PTO-AS
 
 ```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+
 # 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# pto.tassign %src, @tile(0x1000)
+# pto.tassign %fp, @tile(0x2000)
+# pto.tassign %dst, @tile(0x3000)
 %dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[不规则与复杂](./tile/irregular-and-complex_zh.md)
diff --git a/docs/isa/TLOG_zh.md b/docs/isa/TLOG_zh.md
index b712bc42..a2359594 100644
--- a/docs/isa/TLOG_zh.md
+++ b/docs/isa/TLOG_zh.md
@@ -1,24 +1,24 @@
-﻿# TLOG
-
-## 指令示意图
+# pto.tlog
 
 ![TLOG tile operation](../figures/isa/TLOG.svg)
 
-## 简介
+`pto.tlog` 属于[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)指令集。
+
+## 概述
 
-Tile 的逐元素自然对数。
+对 tile 做逐元素自然对数，结果写入目标 tile。迭代域由目标 tile 的 valid region 决定。
 
-## 数学语义
+## 机制
 
-对每个元素 `(i, j)` 在有效区域内：
+对目标 tile 的 valid region 中每个 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \log(\mathrm{src}_{i,j}) $$
 
-## 汇编语法
+它是 tile 路径上的一元对数操作，用于归一化、损失计算前处理和指数域反变换。对 `log(<=0)` 的域外情况，行为由目标 profile 定义。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tlog %src : !pto.tile<...>
@@ -26,51 +26,73 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <auto PrecisionType = LogAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
           typename... WaitEvents>
 PTO_INST RecordEvent TLOG(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
 ```
 
-`PrecisionType`可指定以下值：
+`PrecisionType` 可选：
+
+- `LogAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
+- `LogAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收逐元素对数结果 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
 
-* `LogAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
-* `LogAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | `dst` valid region 内的每个元素都等于 `log(src)` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
 
 ## 约束
 
-- **实现检查 (NPU)**:
-    - `TileData::DType` 必须是以下之一：`float` 或 `half`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`);
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域.
-- **域 / NaN**:
-    - 域行为（例如，`log(<=0)`）由目标定义。
-- **高精度算法**
-    - 仅在A5上有效，`PrecisionType`选项A3上将被忽略。
+- 支持类型当前是 `float` / `half`。
+- tile 必须是行主序向量 tile。
+- 操作迭代域由 `dst.GetValidRow()` / `dst.GetValidCol()` 决定。
+- 对 `log(<=0)` 的域外情况，行为由目标 profile 定义。
+- 高精度算法只在 A5 有效；在 A3 上 `PrecisionType` 选项将被忽略。
+
+## 异常与非法情形
+
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| 布局 | Any | RowMajor only | RowMajor only |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example() {
@@ -81,29 +103,39 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
+### C++ 手动模式
 
-### 自动模式
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TLOG(dst, src);
+}
 ```
 
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 自动模式
 %dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
-```
 
-### PTO 汇编形式
+# 手动模式
+pto.tassign %src, @tile(0x1000)
+pto.tassign %dst, @tile(0x2000)
+%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
 
-```text
+# PTO 汇编形式
 %dst = tlog %src : !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)
+- 规范页：[pto.tlog](./tile/ops/elementwise-tile-tile/tlog_zh.md)
diff --git a/docs/isa/TMATMUL_ACC_zh.md b/docs/isa/TMATMUL_ACC_zh.md
index 75590992..31b48f57 100644
--- a/docs/isa/TMATMUL_ACC_zh.md
+++ b/docs/isa/TMATMUL_ACC_zh.md
@@ -1,17 +1,14 @@
-﻿# TMATMUL_ACC
+﻿# pto.tmatmul.acc
 
-## 指令示意图
+`pto.tmatmul.acc` 属于[矩阵与矩阵-向量运算](./tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md)指令集。
 
-![TMATMUL_ACC tile operation](../figures/isa/TMATMUL_ACC.svg)
-
-## 简介
+## 概述
 
-带累加器输入的矩阵乘法（融合累加）。
+带累加器输入的矩阵乘法（融合累加），在有效域 `0 <= i < M`、`0 <= j < N` 上计算 $\mathrm{C1}_{i,j} = \mathrm{C0}_{i,j} + \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j}$，其中 `M = aMatrix.GetValidRow()`、`K = aMatrix.GetValidCol()`、`N = bMatrix.GetValidCol()`。
 
-## 数学语义
+## 机制
 
 设：
-
 - `M = aMatrix.GetValidRow()`
 - `K = aMatrix.GetValidCol()`
 - `N = bMatrix.GetValidCol()`
@@ -20,9 +17,9 @@
 
 $$ \mathrm{C1}_{i,j} = \mathrm{C0}_{i,j} + \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 同步形式：
 
@@ -32,13 +29,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
 ```
 
@@ -58,16 +55,45 @@ template <AccPhase Phase = AccPhase::Unspecified, typename TileRes, typename Til
 PTO_INST RecordEvent TMATMUL_ACC(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `cInMatrix` | Acc | 输入累加器 Tile |
+| `aMatrix` | Left | 左操作数矩阵 Tile |
+| `bMatrix` | Right | 右操作数矩阵 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cOutMatrix` | Acc | 累加后的 GEMM 结果 Tile |
+
+## 副作用
+
+融合累加操作可能触发目标特定的溢出处理或舍入行为。
+
 ## 约束
 
 - 所有来自 `TMATMUL` 的约束都适用于 `(cOutMatrix, aMatrix, bMatrix)` 三元组。
-- **实现说明 (A2A3/A5)**:
+- 实现说明 (A2A3/A5):
     - `TMATMUL_ACC_IMPL` 使用 `aMatrix.GetValidRow()`、`aMatrix.GetValidCol()` 和 `bMatrix.GetValidCol()` 作为 `m/k/n`。
     - `cInMatrix` 在当前实现中不通过显式断言进行验证（目标定义的行为）。
 
+## 异常与非法情形
+
+- 当 `cInMatrix` 与 `cOutMatrix` 类型不匹配时行为未定义。
+- 当 `m/k/n` 超出目标允许范围时行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 融合累加 | ✓ | ✓ | ✓ |
+
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -85,7 +111,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -107,29 +133,20 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%acc1 = tmatmul.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### 手动模式
+AS Level 2 (DPS)：
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```mlir
+pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
 ```
 
-### PTO 汇编形式
+![TMATMUL_ACC tile operation](../figures/isa/TMATMUL_ACC.svg)
 
-```text
-%acc1 = tmatmul.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
+## 相关页面
+
+- Spec page: [pto.tmatmul.acc](./tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md)
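The fused-accumulate formula documented for `pto.tmatmul.acc` above can be sketched as a scalar reference model, which may be handy when checking results from the CPU simulator. `RefMat` and `ref_matmul_acc` are hypothetical helper names, not PTO APIs. The model accumulates in `float` and ignores target-specific rounding or overflow handling, which the page describes as target-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix used only by this reference model (not a PTO type).
struct RefMat {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // rows * cols elements
};

// cOut[i][j] = cIn[i][j] + sum_p a[i][p] * b[p][j] over 0 <= i < m, 0 <= j < n.
// m, k, n play the role of aMatrix.GetValidRow(), aMatrix.GetValidCol(),
// and bMatrix.GetValidCol().
void ref_matmul_acc(RefMat &cOut, const RefMat &cIn, const RefMat &a, const RefMat &b,
                    std::size_t m, std::size_t k, std::size_t n) {
    assert(m <= a.rows && k <= a.cols && k <= b.rows && n <= b.cols);
    assert(m <= cIn.rows && n <= cIn.cols && m <= cOut.rows && n <= cOut.cols);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = cIn.data[i * cIn.cols + j];  // start from the incoming accumulator
            for (std::size_t p = 0; p < k; ++p) {
                acc += a.data[i * a.cols + p] * b.data[p * b.cols + j];
            }
            cOut.data[i * cOut.cols + j] = acc;
        }
    }
}
```

Passing a zero-filled `cIn` reduces this model to the plain `pto.tmatmul` formula.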
diff --git a/docs/isa/TMATMUL_BIAS_zh.md b/docs/isa/TMATMUL_BIAS_zh.md
index 3eb19d81..accbadc7 100644
--- a/docs/isa/TMATMUL_BIAS_zh.md
+++ b/docs/isa/TMATMUL_BIAS_zh.md
@@ -1,17 +1,14 @@
-﻿# TMATMUL_BIAS
+﻿# pto.tmatmul.bias
 
-## 指令示意图
+`pto.tmatmul.bias` 属于[矩阵与矩阵-向量运算](./tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md)指令集。
 
-![TMATMUL_BIAS tile operation](../figures/isa/TMATMUL_BIAS.svg)
-
-## 简介
+## 概述
 
-带偏置加法的矩阵乘法。
+带偏置加法的矩阵乘法，在有效域 `0 <= i < M`、`0 <= j < N` 上计算 $\mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} + \mathrm{Bias}_{0,j}$，偏置广播行为由实现定义。
 
-## 数学语义
+## 机制
 
 设：
-
 - `M = aMatrix.GetValidRow()`
 - `K = aMatrix.GetValidCol()`
 - `N = bMatrix.GetValidCol()`
@@ -22,9 +19,9 @@ $$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} +
 
 偏置广播行为由实现定义。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 同步形式：
 
@@ -34,13 +31,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
@@ -57,19 +54,49 @@ template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRigh
 PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | Left | 左操作数矩阵 Tile |
+| `bMatrix` | Right | 右操作数矩阵 Tile |
+| `biasData` | Bias | 偏置 Tile（一行，广播到输出） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` | Acc | 带偏置的 GEMM 结果 Tile |
+
+## 副作用
+
+偏置加法可能触发目标特定的溢出处理或舍入行为。
+
 ## 约束
 
 - 所有来自 `TMATMUL` 的约束都适用于 `(cMatrix, aMatrix, bMatrix)` 三元组。
-- **偏置约束 (A2A3)**:
+- 偏置约束 (A2A3):
     - `TileBias::DType` 必须匹配 `TileRes::DType`。
     - `TileBias::Loc == TileType::Bias` 且 `TileBias::Rows == 1`。
-- **偏置约束 (A5)**:
+- 偏置约束 (A5):
     - `TileBias::DType` 必须匹配 `TileRes::DType`。
     - `TileBias::Loc == TileType::Bias`、`TileBias::Rows == 1` 且 `TileBias::isRowMajor`。
 
+## 异常与非法情形
+
+- 当 `TileBias::DType` 与 `TileRes::DType` 不匹配时行为未定义。
+- 当 `TileBias::Rows != 1` 时行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 偏置加法 | ✓ | ✓ | ✓ |
+| 偏置行主序 | ✗ | ✗ | ✓ |
+
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -89,7 +116,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -113,29 +140,20 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%acc = tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### 手动模式
+AS Level 2 (DPS)：
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```mlir
+pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
-### PTO 汇编形式
+![TMATMUL_BIAS tile operation](../figures/isa/TMATMUL_BIAS.svg)
 
-```text
-%acc = tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
+## 相关页面
+
+- Spec page: [pto.tmatmul.bias](./tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md)
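The bias formula documented for `pto.tmatmul.bias` above reads `Bias` at row 0 and column `j`, so one plausible interpretation is a single bias row added to every output row. The sketch below models exactly that and nothing more. `ref_matmul_bias` is a hypothetical helper, and because the page states that broadcast behavior is implementation-defined, real targets may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major inputs: a is m x k, b is k x n, bias holds one row of n values.
// Returns c (m x n) with c[i][j] = sum_p a[i][p] * b[p][j] + bias[j].
std::vector<float> ref_matmul_bias(const std::vector<float> &a, const std::vector<float> &b,
                                   const std::vector<float> &bias,
                                   std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n && bias.size() == n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum + bias[j];  // the same bias row is reused for every i
        }
    }
    return c;
}
```

The `bias.size() == n` assertion mirrors the documented `TileBias::Rows == 1` constraint in this simplified setting.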
diff --git a/docs/isa/TMATMUL_MX_zh.md b/docs/isa/TMATMUL_MX_zh.md
index a4da1f49..45c97873 100644
--- a/docs/isa/TMATMUL_MX_zh.md
+++ b/docs/isa/TMATMUL_MX_zh.md
@@ -1,30 +1,27 @@
-﻿# TMATMUL_MX
+﻿# pto.tmatmul.mx
 
-## 指令示意图
+`pto.tmatmul.mx` 属于[矩阵与矩阵-向量运算](./tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md)指令集。
 
-![TMATMUL_MX tile operation](../figures/isa/TMATMUL_MX.svg)
-
-## 简介
+## 概述
 
-带额外缩放 Tile 的矩阵乘法 (GEMM)，用于支持目标上的混合精度/量化矩阵乘法。
+带额外缩放 Tile 的矩阵乘法 (GEMM)，用于支持目标上的混合精度/量化矩阵乘法。`aScaleMatrix` / `bScaleMatrix` 配置实现定义的混合精度行为，缩放 tile 的确切作用以及任何反量化/量化语义由目标定义。
 
-## 数学语义
+## 机制
 
 设：
-
 - `M = aMatrix.GetValidRow()`
 - `K = aMatrix.GetValidCol()`
 - `N = bMatrix.GetValidCol()`
 
-概念上，结果对应于有效矩阵乘法域（`0 <= i < M`，`0 <= j < N`）上的矩阵乘法，缩放 tile `aScaleMatrix` / `bScaleMatrix` 配置实现定义的混合精度行为：
+概念上，结果对应于有效矩阵乘法域（`0 <= i < M`，`0 <= j < N`）上的矩阵乘法：
 
 $$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
 
 `aScaleMatrix` / `bScaleMatrix` 的确切作用（以及任何反量化/量化语义）由目标定义。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 同步形式（概念性）：
 
@@ -36,24 +33,18 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
--> !pto.tile<...>
-%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
-%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
+```mlir
+%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
-pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-outs(%c :  !pto.tile_buf<...>)
-pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```mlir
+pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
@@ -86,17 +77,50 @@ template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeft
 PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | Left | 左操作数矩阵 Tile |
+| `aScaleMatrix` | LeftScale | 左操作数缩放 Tile |
+| `bMatrix` | Right | 右操作数矩阵 Tile |
+| `bScaleMatrix` | RightScale | 右操作数缩放 Tile |
+| `biasData` | Bias | 偏置 Tile（偏置形式可选） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` / `cOutMatrix` | Acc | MX GEMM 结果 Tile |
+
+## 副作用
+
+MX 混合精度操作可能触发目标特定的缩放、反量化或溢出处理行为。
+
 ## 约束
 
-- **实现检查 (A5)**:
+- 实现检查 (A5):
     - `m/k/n` 取自 `aMatrix.GetValidRow()`、`aMatrix.GetValidCol()`、`bMatrix.GetValidCol()`。
     - 静态合法性检查通过 `CheckMadMxValid<...>()`（类型、形状、分形和缩放 tile 合法性）。
-- **偏置形式**:
+- 偏置形式:
     - `TileBias::DType` 必须是 `float` 且 `TileBias::Loc == TileType::Bias`，`TileBias::Rows == 1`（A5 通过 `static_assert` 检查）。
 
+## 异常与非法情形
+
+- 当缩放 tile 类型或布局不符合目标要求时行为未定义。
+- 当 `m/k/n` 超出目标允许范围时行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| MX 基本形式 | ✓ | ✗ | ✓ |
+| MX 累加形式 | ✓ | ✗ | ✓ |
+| MX 偏置形式 | ✓ | ✗ | ✓ |
+
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -120,7 +144,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -150,29 +174,20 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
 %c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
 ```
 
-### 手动模式
+AS Level 2 (DPS)：
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
+```mlir
+pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
-### PTO 汇编形式
+![TMATMUL_MX tile operation](../figures/isa/TMATMUL_MX.svg)
 
-```text
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
-# AS Level 2 (DPS)
-pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-```
+## 相关页面
+
+- Spec page: [pto.tmatmul.mx](./tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md)
diff --git a/docs/isa/TMATMUL_zh.md b/docs/isa/TMATMUL_zh.md
index 4dcfa8b2..5764b66c 100644
--- a/docs/isa/TMATMUL_zh.md
+++ b/docs/isa/TMATMUL_zh.md
@@ -1,17 +1,14 @@
-﻿# TMATMUL
+﻿# pto.tmatmul
 
-## 指令示意图
+`pto.tmatmul` 属于[矩阵与矩阵-向量运算](./tile/ops/matrix-and-matrix-vector/tmatmul_zh.md)指令集。
 
-![TMATMUL tile operation](../figures/isa/TMATMUL.svg)
-
-## 简介
+## 概述
 
-矩阵乘法 (GEMM)，生成累加器/输出 Tile。
+矩阵乘法 (GEMM)，在有效矩阵乘法域 `0 <= i < M`、`0 <= j < N` 上计算 $\mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j}$，其中 `M = aMatrix.GetValidRow()`、`K = aMatrix.GetValidCol()`、`N = bMatrix.GetValidCol()`，生成累加器/输出 Tile。精确的累加器行为和数据类型提升由目标/实现定义。
 
-## 数学语义
+## 机制
 
 设：
-
 - `M = aMatrix.GetValidRow()`
 - `K = aMatrix.GetValidCol()`
 - `N = bMatrix.GetValidCol()`
@@ -22,9 +19,9 @@ $$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $
 
 精确的累加器行为和数据类型提升由目标/实现定义。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 同步形式：
 
@@ -34,13 +31,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
@@ -56,31 +53,62 @@ template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRigh
 PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | Left | 左操作数矩阵 Tile |
+| `bMatrix` | Right | 右操作数矩阵 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` | Acc | GEMM 结果 Tile |
+
+## 副作用
+
+GEMM 累加操作可能触发目标特定的溢出处理或舍入行为。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
+- 实现检查 (A2A3):
     - 支持的 `(CType, AType, BType)` 三元组：
-    - `(int32_t, int8_t, int8_t)`
-    - `(float, half, half)`
-    - `(float, float, float)`
-    - `(float, bfloat16_t, bfloat16_t)`
+        - `(int32_t, int8_t, int8_t)`
+        - `(float, half, half)`
+        - `(float, float, float)`
+        - `(float, bfloat16_t, bfloat16_t)`
     - 静态形状约束：`TileLeft::Rows == TileRes::Rows`、`TileLeft::Cols == TileRight::Rows`、`TileRight::Cols == TileRes::Cols`。
     - Tile 位置：`TileLeft::Loc == Left`、`TileRight::Loc == Right`、`TileRes::Loc == Acc`。
     - 运行时：`m/k/n`（取自 `aMatrix.GetValidRow()`、`aMatrix.GetValidCol()`、`bMatrix.GetValidCol()`）必须在 `[1, 4095]` 范围内。
-- **实现检查 (A5)**:
+- 实现检查 (A5):
     - 累加器类型必须是 `int32_t` 或 `float`。
     - 如果是 `int32_t`：`AType == int8_t` 且 `BType == int8_t`。
     - 如果是 `float`：支持 `half/bfloat16_t/float` 和选定的 fp8 对（目标定义）。
     - 静态形状约束：`TileLeft::Rows == TileRes::Rows`、`TileLeft::Cols == TileRight::Rows`、`TileRight::Cols == TileRes::Cols`。
     - 强制执行分形/布局约束：
-    - Left：`Loc == Left`、`!isRowMajor`、`SFractal == RowMajor`
-    - Right：`Loc == Right`、`isRowMajor`、`SFractal == ColMajor`
-    - Acc：`Loc == Acc`、`!isRowMajor`、`SFractal == RowMajor`
+        - Left：`Loc == Left`、`!isRowMajor`、`SFractal == RowMajor`
+        - Right：`Loc == Right`、`isRowMajor`、`SFractal == ColMajor`
+        - Acc：`Loc == Acc`、`!isRowMajor`、`SFractal == RowMajor`
     - 运行时：`m/k/n`（取自 `aMatrix.GetValidRow()`、`aMatrix.GetValidCol()`、`bMatrix.GetValidCol()`）必须在 `[1, 4095]` 范围内。
 
+## 异常与非法情形
+
+- 当 `m/k/n` 超出 `[1, 4095]` 范围时行为未定义。
+- 当数据类型组合不符合目标支持的三元组时行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| int8_t GEMM | ✓ | ✓ | ✓ |
+| float GEMM | ✓ | ✓ | ✓ |
+| bfloat16 GEMM | ✓ | ✓ | ✓ |
+| 分形布局约束 | ✗ | ✗ | ✓ |
+
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -98,7 +126,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -119,29 +147,20 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%acc = tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### 手动模式
+AS Level 2 (DPS)：
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```mlir
+pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
-### PTO 汇编形式
+![TMATMUL tile operation](../figures/isa/TMATMUL.svg)
 
-```text
-%acc = tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
+## 相关页面
+
+- Spec page: [pto.tmatmul](./tile/ops/matrix-and-matrix-vector/tmatmul_zh.md)
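The A2A3 constraints listed for `pto.tmatmul` above (supported type triples and the `[1, 4095]` range for `m/k/n`) can be expressed as compile-time predicates. This is a sketch only: `HalfTag`, `BFloat16Tag`, `is_a2a3_triple`, and `mkn_in_range` are hypothetical names, and the tag structs stand in for `half` and `bfloat16_t`, which are not standard C++ types. The actual checks live inside the PTO implementation and may be structured differently.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for the half and bfloat16_t element types.
struct HalfTag {};
struct BFloat16Tag {};

// True when (C, A, B) is one of the accumulator/left/right triples listed for A2A3:
// (int32_t, int8_t, int8_t), (float, half, half), (float, float, float),
// (float, bfloat16_t, bfloat16_t).
template <typename C, typename A, typename B>
constexpr bool is_a2a3_triple() {
    return (std::is_same<C, std::int32_t>::value && std::is_same<A, std::int8_t>::value &&
            std::is_same<B, std::int8_t>::value) ||
           (std::is_same<C, float>::value && std::is_same<A, B>::value &&
            (std::is_same<A, HalfTag>::value || std::is_same<A, float>::value ||
             std::is_same<A, BFloat16Tag>::value));
}

// Runtime range documented for each of m, k, n.
constexpr bool dim_in_range(unsigned v) { return v >= 1u && v <= 4095u; }

constexpr bool mkn_in_range(unsigned m, unsigned k, unsigned n) {
    return dim_in_range(m) && dim_in_range(k) && dim_in_range(n);
}
```

A caller could guard a kernel launch with `mkn_in_range(m, k, n)`, since the page says out-of-range values lead to undefined behavior.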
diff --git a/docs/isa/TMOV_FP_zh.md b/docs/isa/TMOV_FP_zh.md
index 78e13673..20890e3a 100644
--- a/docs/isa/TMOV_FP_zh.md
+++ b/docs/isa/TMOV_FP_zh.md
@@ -1,24 +1,18 @@
-﻿# TMOV_FP
+﻿# pto.tmov.fp
 
-## 指令示意图
+`pto.tmov.fp` 属于[布局与重排](./tile/layout-and-rearrangement_zh.md)指令集。
 
-![TMOV_FP tile operation](../figures/isa/TMOV_FP.svg)
+## 概述
 
-## 简介
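+Moves/converts an accumulator tile into a destination tile, using a scaling (`fp`) tile as the vector quantization parameters. Conceptually, each element is converted with an implementation-defined quantization/dequantization configuration derived from `fp`: $\mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right)$.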
 
-使用缩放 (`fp`) Tile 作为向量量化参数，将累加器 Tile 移动/转换到目标 Tile。
+## 机制
 
-## 数学语义
+该指令将累加器 Tile 中的数据转换后写入目标 Tile，转换参数由 `fp` Tile 提供。`fp` Tile 包含实现定义的量化参数，用于控制转换行为。
 
-概念上使用从 `fp` 派生的实现定义的量化/反量化配置转换每个元素：
+## 语法
 
-$$ \mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
@@ -26,13 +20,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -46,19 +40,50 @@ template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluP
 PTO_INST RecordEvent TMOV_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| dst | 输出 | 目标 Tile |
+| src | 输入 | 源累加器 Tile |
+| fp | 输入 | 浮点量化参数 Tile |
+| events | 可选 | 等待事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| dst | DstTileData& | 量化转换后的目标 Tile |
+| 事件 | RecordEvent | 同步事件 |
+
+## 副作用
+
+该指令使用 `fp` Tile 中的量化参数进行数据转换。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - fp 路径仅支持累加器转换，并通过 `TMOV_IMPL(dst, src, fp)` 中的内部编译时检查进行验证。
-    - `FpTileData::Loc` 必须是 `TileType::Scaling`（`static_assert`）。
-- **实现检查 (A5)**:
-    - 通过 `CheckTMovAccValid(...)` 和 `TMOV_IMPL(dst, src, fp)` 中的相关编译时检查进行验证。
-    - `FpTileData::Loc` 必须是 `TileType::Scaling`（`static_assert`）。
-    - 目标位置取决于目标（fp 路径支持 `Vec` 或 `Mat`）。
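+- Implementation checks (A2A3):
+    - The fp path supports accumulator conversion only and is validated by internal compile-time checks in `TMOV_IMPL(dst, src, fp)`.
+    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
+- Implementation checks (A5):
+    - Validated by `CheckTMovAccValid(...)` and the related compile-time checks in `TMOV_IMPL(dst, src, fp)`.
+    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
+    - The destination location depends on the target (the fp path supports `Vec` or `Mat`).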
+
+## 异常与非法情形
+
+- 未指定
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| Acc -> Vec (fp) | - | 支持 | 支持 |
+| Acc -> Mat (fp) | - | 支持 | 支持 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -77,7 +102,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -99,29 +124,22 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
 %dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[布局与重排](./tile/layout-and-rearrangement_zh.md)
+- 相关指令：[pto.tmov](./TMOV_zh.md)
diff --git a/docs/isa/TMOV_zh.md b/docs/isa/TMOV_zh.md
index 448878be..fd9874b9 100644
--- a/docs/isa/TMOV_zh.md
+++ b/docs/isa/TMOV_zh.md
@@ -1,30 +1,18 @@
 # pto.tmov
 
-旧路径兼容入口。规范页见 [pto.tmov](./tile/ops/layout-and-rearrangement/tmov_zh.md)。
+`pto.tmov` 属于[布局与重排](./tile/layout-and-rearrangement_zh.md)指令集。
 
-![TMOV tile operation](../figures/isa/TMOV.svg)
+## 概述
 
-## 简介
+在 Tile 之间移动/复制数据，可选通过模板参数和重载选择实现定义的转换模式。TMOV 用于 Vec -> Vec 移动、Mat -> Left/Right/Bias/Scaling/Scale（微缩放）移动（取决于目标）、以及 Acc -> Mat/Vec 移动（取决于目标）。
 
-在 Tile 之间移动/复制，可选通过模板参数和重载选择实现定义的转换模式。
+## 机制
 
-`TMOV` 用于：
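+Conceptually, copies or converts elements from `src` to `dst` over the valid region. The exact conversion depends on the selected mode and the target. For a pure copy: $\mathrm{dst}_{i,j} = \mathrm{src}_{i,j}$. The overloads cover relu pre-processing (`ReluPreMode`), accumulator-to-vector modes (`AccToVecMode`), and vector or scalar pre-quantization.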
 
-- Vec -> Vec 移动
-- Mat -> Left/Right/Bias/Scaling/Scale（微缩放）移动（取决于目标）
-- Acc -> Mat/Vec 移动（取决于目标）
+## 语法
 
-## 数学语义
-
-概念上在有效区域上将元素从 `src` 复制或转换到 `dst`。确切的转换取决于所选模式和目标。
-
-对于纯复制情况：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 PTO AS 设计建议将 `TMOV` 拆分为一系列操作：
 
@@ -39,13 +27,13 @@ PTO AS 设计建议将 `TMOV` 拆分为一系列操作：
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -77,77 +65,110 @@ template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPre
 PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| dst | 输出 | 目标 Tile |
+| src | 输入 | 源 Tile |
+| fp | 可选 | 浮点量化 Tile（向量量化形式） |
+| preQuantScalar | 可选 | 标量量化参数 |
+| events | 可选 | 等待事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| dst | DstTileData& | 转换后的目标 Tile |
+| 事件 | RecordEvent | 同步事件 |
+
+## 副作用
+
+该指令可能修改目标 Tile 的内容，并在特定模式下触发硬件数据转换路径。
+
 ## 约束
 
 ### 通用约束或检查
 
-- `TMOV` 包含以下重载族：
-    - 普通移动：`TMOV(dst, src)`
-    - relu 形式：`TMOV<..., reluMode>(dst, src)`
-    - 累加器到向量形式：`TMOV<..., mode, reluMode>(dst, src)`
-    - 向量量化形式：`TMOV<..., FpTileData, mode, reluMode>(dst, src, fp)`
-    - 标量量化形式：`TMOV<..., reluMode>(dst, src, preQuantScalar)` 和 `TMOV<..., mode, reluMode>(dst, src, preQuantScalar)`
-- `reluMode` 取值为 `ReluPreMode::{NoRelu, NormalRelu}`。
-- `mode` 取值为 `AccToVecMode::{SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN}`。
+- `TMOV` covers the following overload families:
+    - Plain move: `TMOV(dst, src)`
+    - relu form: `TMOV<..., reluMode>(dst, src)`
+    - Accumulator-to-vector form: `TMOV<..., mode, reluMode>(dst, src)`
+    - Vector-quantization form: `TMOV<..., FpTileData, mode, reluMode>(dst, src, fp)`
+    - Scalar-quantization form: `TMOV<..., reluMode>(dst, src, preQuantScalar)` and `TMOV<..., mode, reluMode>(dst, src, preQuantScalar)`
+- `reluMode` takes values from `ReluPreMode::{NoRelu, NormalRelu}`.
+- `mode` takes values from `AccToVecMode::{SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN}`.
 
 ### A2A3 实现检查
 
-- 形状必须匹配：`SrcTileData::Rows == DstTileData::Rows` 且 `SrcTileData::Cols == DstTileData::Cols`。
+- Shapes must match: `SrcTileData::Rows == DstTileData::Rows` and `SrcTileData::Cols == DstTileData::Cols`.
 - 支持的 Tile 类型对在编译期限制为：
-    - `TileType::Mat -> TileType::Left/Right/Bias/Scaling`
-    - `TileType::Vec -> TileType::Vec`
-    - `TileType::Acc -> TileType::Mat`
-- 对于 `TileType::Mat -> TileType::Bias`：
-    - 支持的源/目标 dtype 对为 `int32_t -> int32_t`、`float -> float`、`half -> float`
-    - 源行数必须为 `1`
-    - `SrcTileData::Cols * sizeof(SrcType)` 必须按 `64` 字节对齐
-- 对于 `TileType::Mat -> TileType::Scaling`：
-    - 目标 dtype 必须与源 dtype 相同，且必须为 `uint64_t`
-    - 源行数必须为 `1`
-    - `SrcTileData::Cols * sizeof(SrcType)` 必须按 `128` 字节对齐
-- 对于 `TileType::Acc -> TileType::Mat`：
-    - 额外执行 `CheckTMovAccToMat<...>` 编译期检查
-    - 普通/relu 形式使用 `GetCastPreQuantMode<SrcDType, DstDType>()` 推导的 cast pre-quant 模式
-    - 标量量化形式使用 `GetScalarPreQuantMode<SrcDType, DstDType>()`
-    - 向量量化形式要求提供 `FpTileData` 操作数，且 `FpTileData::Loc == TileType::Scaling`，并使用 `GetVectorPreQuantMode<SrcDType, DstDType>()`
+    - `TileType::Mat -> TileType::Left/Right/Bias/Scaling`
+    - `TileType::Vec -> TileType::Vec`
+    - `TileType::Acc -> TileType::Mat`
+- For `TileType::Mat -> TileType::Bias`:
+    - Supported source/destination dtype pairs are `int32_t -> int32_t`, `float -> float`, and `half -> float`.
+    - The source must have exactly `1` row.
+    - `SrcTileData::Cols * sizeof(SrcType)` must be aligned to `64` bytes.
+- For `TileType::Mat -> TileType::Scaling`:
+    - The destination dtype must equal the source dtype, and both must be `uint64_t`.
+    - The source must have exactly `1` row.
+    - `SrcTileData::Cols * sizeof(SrcType)` must be aligned to `128` bytes.
+- For `TileType::Acc -> TileType::Mat`:
+    - `CheckTMovAccToMat<...>` compile-time checks are applied in addition.
+    - The plain/relu forms use the cast pre-quant mode derived by `GetCastPreQuantMode<SrcDType, DstDType>()`.
+    - The scalar-quantization form uses `GetScalarPreQuantMode<SrcDType, DstDType>()`.
+    - The vector-quantization form requires an `FpTileData` operand with `FpTileData::Loc == TileType::Scaling` and uses `GetVectorPreQuantMode<SrcDType, DstDType>()`.
 
 ### A5 实现检查
 
-- `CommonCheck()` 要求：
+- `CommonCheck()` requires:
     - 目标/源 dtype 必须相同
-    - 支持的元素类型为 `int8_t`、`hifloat8_t`、`float8_e5m2_t`、`float8_e4m3_t`、`half`、`bfloat16_t`、`float`、`float4_e2m1x2_t`、`float4_e1m2x2_t`
+    - Supported element types are `int8_t`, `hifloat8_t`, `float8_e5m2_t`, `float8_e4m3_t`, `half`, `bfloat16_t`, `float`, `float4_e2m1x2_t`, and `float4_e1m2x2_t`.
     - 源布局必须满足以下之一：
-        - `(SrcTileData::SFractal == SLayout::ColMajor && SrcTileData::isRowMajor)`
-        - `(SrcTileData::SFractal == SLayout::RowMajor && !SrcTileData::isRowMajor)`
-        - `SrcTileData::isRowMajor`
-- `CommonCheckMX()` 用于 MX 路径时要求源/目标 dtype 相同，并支持 `float8_e8m0_t`。
+        - `(SrcTileData::SFractal == SLayout::ColMajor && SrcTileData::isRowMajor)`
+        - `(SrcTileData::SFractal == SLayout::RowMajor && !SrcTileData::isRowMajor)`
+        - `SrcTileData::isRowMajor`
+- On the MX path, `CommonCheckMX()` requires identical source/destination dtypes and supports `float8_e8m0_t`.
 - 支持的路径包括：
-    - `TileType::Mat -> TileType::Left/Right/Bias/Scaling/ScaleLeft/ScaleRight`
-    - `TileType::Vec -> TileType::Vec/TileType::Mat`
-    - `TileType::Acc -> TileType::Vec/TileType::Mat`
-    - A5 实现中处理的特定 `ND -> ZZ` 及相关内部路径变体
-- 对于 `TileType::Mat -> TileType::Bias`：
-    - 支持的 dtype 对为 `int32_t -> int32_t`、`float -> float`、`half -> float`、`bfloat16_t -> float`
-    - 源行数必须为 `1`
-    - `DstTileData::Cols * sizeof(DstType)` 必须按 `64` 字节对齐
-    - bias table 占用 `DstTileData::Cols * sizeof(DstType)` 不得超过 `4096` 字节
-- 对于 `TileType::Mat -> TileType::Scaling`：
-    - 源行数必须为 `1`
-    - `DstTileData::Cols * sizeof(DstType)` 必须按 `128` 字节对齐
-    - fixpipe buffer 占用 `DstTileData::Cols * sizeof(DstType)` 不得超过 `4096` 字节
-- 对于 `TileType::Acc -> TileType::Vec`：
-    - `mode` 用于选择 `SingleModeVec0`、`SingleModeVec1`、`DualModeSplitM` 或 `DualModeSplitN`
-    - 双目标模式要求 `QuantMode_t::NoQuant`
-    - 双目标模式不支持 `nz2dn` 路径
-    - 目标 stride 必须非零，且 `dstStride * sizeof(dstType)` 必须是 `32` 字节的整数倍
-- 对于 `TileType::Acc -> TileType::Mat`：
-    - 目标 stride 必须非零，且 `dstStride * sizeof(dstType)` 必须是 `32` 字节的整数倍
+    - `TileType::Mat -> TileType::Left/Right/Bias/Scaling/ScaleLeft/ScaleRight`
+    - `TileType::Vec -> TileType::Vec/TileType::Mat`
+    - `TileType::Acc -> TileType::Vec/TileType::Mat`
+    - Specific `ND -> ZZ` and related internal path variants handled by the A5 implementation
+- For `TileType::Mat -> TileType::Bias`:
+    - Supported dtype pairs are `int32_t -> int32_t`, `float -> float`, `half -> float`, and `bfloat16_t -> float`.
+    - The source must have exactly `1` row.
+    - `DstTileData::Cols * sizeof(DstType)` must be aligned to `64` bytes.
+    - The bias table footprint `DstTileData::Cols * sizeof(DstType)` must not exceed `4096` bytes.
+- For `TileType::Mat -> TileType::Scaling`:
+    - The source must have exactly `1` row.
+    - `DstTileData::Cols * sizeof(DstType)` must be aligned to `128` bytes.
+    - The fixpipe buffer footprint `DstTileData::Cols * sizeof(DstType)` must not exceed `4096` bytes.
+- For `TileType::Acc -> TileType::Vec`:
+    - `mode` selects `SingleModeVec0`, `SingleModeVec1`, `DualModeSplitM`, or `DualModeSplitN`.
+    - Dual-destination modes require `QuantMode_t::NoQuant`.
+    - Dual-destination modes do not support the `nz2dn` path.
+    - The destination stride must be non-zero, and `dstStride * sizeof(dstType)` must be a multiple of `32` bytes.
+- For `TileType::Acc -> TileType::Mat`:
+    - The destination stride must be non-zero, and `dstStride * sizeof(dstType)` must be a multiple of `32` bytes.
     - 支持通过对应重载启用 relu/标量量化/向量量化形式
 
+## 异常与非法情形
+
+- 未指定
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| Mat -> Left/Right/Bias/Scaling | - | 支持 | 支持 |
+| Vec -> Vec | - | 支持 | 支持 |
+| Acc -> Mat | - | 支持 | 支持 |
+| Acc -> Vec | - | 支持 | 支持 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -161,7 +182,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -179,31 +200,21 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-```
 
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
 %dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-```
 
-### PTO 汇编形式
-
-```text
+# PTO 汇编形式
 %dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
-新的 PTO ISA 文档应直接链接到分组后的指令集路径。
+## 相关页面
+
+- 指令集总览：[布局与重排](./tile/layout-and-rearrangement_zh.md)
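The A5 size and alignment rules listed for `pto.tmov` above are simple arithmetic, so they can be checked ahead of time with small predicates. The sketch below restates three of them. The function names are hypothetical and do not exist in the PTO headers, where the real checks are compile-time assertions inside the implementation.

```cpp
#include <cstddef>

// A5 Mat -> Bias: one source row; the destination row size is a multiple of
// 64 bytes and fits in the 4096-byte bias table.
constexpr bool a5_mat_to_bias_ok(std::size_t srcRows, std::size_t dstCols,
                                 std::size_t dstElemBytes) {
    return srcRows == 1 && (dstCols * dstElemBytes) % 64 == 0 &&
           dstCols * dstElemBytes <= 4096;
}

// A5 Mat -> Scaling: one source row; the destination row size is a multiple of
// 128 bytes and fits in the 4096-byte fixpipe buffer.
constexpr bool a5_mat_to_scaling_ok(std::size_t srcRows, std::size_t dstCols,
                                    std::size_t dstElemBytes) {
    return srcRows == 1 && (dstCols * dstElemBytes) % 128 == 0 &&
           dstCols * dstElemBytes <= 4096;
}

// A5 Acc -> Vec / Acc -> Mat: the destination stride is non-zero and its byte
// size is a multiple of 32.
constexpr bool a5_acc_dst_stride_ok(std::size_t dstStride, std::size_t dstElemBytes) {
    return dstStride != 0 && (dstStride * dstElemBytes) % 32 == 0;
}
```

For example, a `float` bias row of 16 columns occupies 64 bytes and passes the first predicate, while 8 columns (32 bytes) does not.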
diff --git a/docs/isa/TPACK_zh.md b/docs/isa/TPACK_zh.md
index 3211e59c..24a157c9 100644
--- a/docs/isa/TPACK_zh.md
+++ b/docs/isa/TPACK_zh.md
@@ -1,41 +1,89 @@
-# TPACK
+# pto.tpack
 
-## 指令示意图
+`pto.tpack` 属于[不规则与复杂](./tile/irregular-and-complex_zh.md)指令集。
 
-![TPACK tile operation](../figures/isa/TPACK.svg)
-
-## 简介
+## 概述
 
 将 Tile 元素打包或转换为更窄的目标表示。
 
-## 数学语义
+## 机制
+
+语义随具体指令变体而变化。除非另有说明，行为都按目标 valid region 定义。
 
-语义随指令而变化。 除非另有说明，行为都按目标 valid region 定义。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
 PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpack ...
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpack ins(...) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`.
+声明于 `include/pto/common/pto_instr.hpp`。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| 源 Tile | 输入 | 待打包的源 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile_buf<...>` | 打包后的目标 Tile |
+
+## 副作用
+
+除产生目标 Tile 外，没有额外架构副作用。
 
 ## 约束
 
-数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+
+## 异常与非法情形
+
+- 源 Tile 与目标 Tile 类型不兼容时会被 verifier 或后端拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 打包操作 | Simulated | Supported | Supported |
 
 ## 示例
 
+### C++ 自动模式
+
+具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### C++ 手动模式
+
 具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### PTO-AS
+
+```text
+%dst = pto.tpack %src : ...
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.tpack ins(%src : ...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂](./tile/irregular-and-complex_zh.md)
diff --git a/docs/isa/TPARTADD_zh.md b/docs/isa/TPARTADD_zh.md
index 1e7ccc49..b6933c2b 100644
--- a/docs/isa/TPARTADD_zh.md
+++ b/docs/isa/TPARTADD_zh.md
@@ -1,14 +1,12 @@
-﻿# TPARTADD
+﻿# pto.tpartadd
 
-## 指令示意图
+`pto.tpartadd` 属于[不规则及复杂运算](./tile/irregular-and-complex_zh.md)指令集。
 
-![TPARTADD tile operation](../figures/isa/TPARTADD.svg)
+## 概述
 
-## 简介
+在目标有效区域内执行逐元素加法。若某个位置上 `src0` 和 `src1` 都有效，则结果为两者之和；若只有一个输入在该位置有效，则结果直接取该输入的值。
 
-在目标有效区域内执行逐元素加法。若某个位置上 `src0` 和 `src1` 都有效，则结果为两者之和；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
-
-## 数学语义
+## 机制
 
 对目标有效区域内的每个元素 `(i, j)`：
 
@@ -21,11 +19,9 @@ $$
 \end{cases}
 $$
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
@@ -33,13 +29,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -52,31 +48,49 @@ template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, ty
 PTO_INST RecordEvent TPARTADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src0` | 输入 | 源 tile 0 |
+| `src1` | 输入 | 源 tile 1 |
 
-### 通用约束或检查
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 部分加法结果，元素类型与输入一致 |
+
+## 副作用
+
+除写入 `dst` 的有效区域外，没有额外架构副作用。
+
+## 约束
 
-- `dst`、`src0` 和 `src1` 的元素类型必须一致。
-- 目标有效区域定义结果的计算范围。
+- `dst`、`src0` 和 `src1` 的元素类型必须一致
+- 目标有效区域定义结果的计算范围
 - 对目标有效区域内的每个元素：
-    - 若两个输入都有效，则执行该指令对应的逐元素运算；
-    - 若只有一个输入有效，则结果直接取该输入的值。
-- 若 `dst` 的有效区域为零，指令直接返回。
-- 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
-- 上述范围之外的有效区域组合，其行为均由具体实现定义。
+    - 若两个输入都有效，则执行逐元素加法
+    - 若只有一个输入有效，则结果直接取该输入的值
+- 若 `dst` 的有效区域为零，指令直接返回
+- 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`
+- 上述范围之外的有效区域组合，其行为均由具体实现定义
+- A2/A3：`dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
+- 运行时检查失败时，行为由具体实现定义
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持的元素类型 | 全部 | `int32_t`、`int16_t`、`half`、`float` | `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t` |
+| 行主序要求 | 无 | 是 | 无 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -90,7 +104,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -107,29 +121,13 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 自动模式
 %dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[不规则及复杂运算](./tile/irregular-and-complex_zh.md)
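Reviewer sketch for `TPARTADD_zh.md`: the case analysis in the 机制 section can be modeled on the host in plain C++. This is a minimal reference model under stated assumptions, not the PTO implementation: `HostTile` and `part_add_ref` are hypothetical names introduced here, valid regions are assumed to be top-left-anchored rectangles, and positions where neither input is valid are simply left untouched because the page calls that case implementation-defined.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a tile (not a PTO type): dense row-major
// storage plus a valid region assumed to be anchored at the top-left corner.
struct HostTile {
    std::size_t rows, cols;            // physical shape
    std::size_t validRows, validCols;  // valid region extents
    std::vector<float> data;           // rows * cols elements, row-major

    HostTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc, float fill = 0.0f)
        : rows(r), cols(c), validRows(vr), validCols(vc), data(r * c, fill) {}

    bool valid(std::size_t i, std::size_t j) const { return i < validRows && j < validCols; }
    float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Reference model of the documented semantics over dst's valid region:
//   both inputs valid at (i, j) -> sum
//   exactly one input valid     -> copy that input
//   neither valid               -> dst left unchanged (assumption of this sketch)
inline void part_add_ref(HostTile &dst, const HostTile &src0, const HostTile &src1) {
    if (dst.validRows == 0 || dst.validCols == 0) {
        return;  // empty destination valid region: the instruction returns immediately
    }
    for (std::size_t i = 0; i < dst.validRows; ++i) {
        for (std::size_t j = 0; j < dst.validCols; ++j) {
            const bool v0 = src0.valid(i, j);
            const bool v1 = src1.valid(i, j);
            if (v0 && v1) {
                dst.at(i, j) = src0.at(i, j) + src1.at(i, j);
            } else if (v0) {
                dst.at(i, j) = src0.at(i, j);
            } else if (v1) {
                dst.at(i, j) = src1.at(i, j);
            }
        }
    }
}
```

A model like this may be handy as a golden reference when comparing CPU-SIM output against device output for partially valid inputs.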
diff --git a/docs/isa/TPARTMAX_zh.md b/docs/isa/TPARTMAX_zh.md
index 95087a22..b9ec9a86 100644
--- a/docs/isa/TPARTMAX_zh.md
+++ b/docs/isa/TPARTMAX_zh.md
@@ -1,29 +1,29 @@
-﻿# TPARTMAX
+﻿# pto.tpartmax
 
-## 指令示意图
+`pto.tpartmax` 属于[不规则与复杂指令](./tile/irregular-and-complex_zh.md)集。
 
-![TPARTMAX tile operation](../figures/isa/TPARTMAX.svg)
-
-## 简介
+## 概述
 
 在目标有效区域内执行逐元素最大值选择。若某个位置上 `src0` 和 `src1` 都有效，则结果为 `max(src0, src1)`；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
 
-## 数学语义
+## 机制
 
 对目标有效区域内的每个元素 `(i, j)`：
 
 $$
 \mathrm{dst}_{i,j} =
 \begin{cases}
-\max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\\\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\\\
+\max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\
+\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\
 \mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
 \end{cases}
 $$
 
-## 汇编语法
+## 语法
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -33,28 +33,41 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
 PTO_INST RecordEvent TPARTMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src0` | 源 Tile | 第一个输入 Tile |
+| `src1` | 源 Tile | 第二个输入 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 逐元素最大值选择后的目标 Tile |
+
+## 副作用
 
-### 通用约束或检查
+无。
+
+## 约束
 
 - `dst`、`src0` 和 `src1` 的元素类型必须一致。
 - 目标有效区域定义结果的计算范围。
@@ -65,18 +78,20 @@ PTO_INST RecordEvent TPARTMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1
 - 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
 - 上述范围之外的有效区域组合，其行为均由具体实现定义。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
+- 未定义。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持的元素类型 | - | `int32_t`、`int16_t`、`half`、`float` | `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float` |
+| 布局要求 | - | `isRowMajor` | - |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -90,7 +105,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -107,29 +122,19 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
 # 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
 # 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# pto.tassign %src0, @tile(0x1000)
+# pto.tassign %src1, @tile(0x2000)
+# pto.tassign %dst, @tile(0x3000)
 %dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[不规则与复杂指令](./tile/irregular-and-complex_zh.md)
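Reviewer sketch for `TPARTMAX_zh.md`: the 约束 section states that at least one source tile's valid region must match `dst` exactly, and the other must not exceed `dst` in either dimension. The check below is one way to express that rule on the host. `ValidRegion` and `is_supported_partial_pattern` are hypothetical names for illustration only; the real legality check lives in the backend and may test more than this.

```cpp
#include <cstddef>

// Hypothetical description of a valid region by its row/column extents.
struct ValidRegion {
    std::size_t rows;
    std::size_t cols;
};

inline bool same_region(const ValidRegion &a, const ValidRegion &b) {
    return a.rows == b.rows && a.cols == b.cols;
}

inline bool fits_within(const ValidRegion &inner, const ValidRegion &outer) {
    return inner.rows <= outer.rows && inner.cols <= outer.cols;
}

// True when the combination matches the documented supported pattern:
// one source equals dst exactly, the other fits inside dst in both dimensions.
inline bool is_supported_partial_pattern(const ValidRegion &dst,
                                         const ValidRegion &src0,
                                         const ValidRegion &src1) {
    return (same_region(src0, dst) && fits_within(src1, dst)) ||
           (same_region(src1, dst) && fits_within(src0, dst));
}
```

Combinations for which this returns `false` fall under the "implementation-defined" clause of the page, so a caller would presumably want to avoid them rather than rely on any particular result.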
diff --git a/docs/isa/TPARTMIN_zh.md b/docs/isa/TPARTMIN_zh.md
index c548133e..73aefadf 100644
--- a/docs/isa/TPARTMIN_zh.md
+++ b/docs/isa/TPARTMIN_zh.md
@@ -1,29 +1,29 @@
-﻿# TPARTMIN
+﻿# pto.tpartmin
 
-## 指令示意图
+`pto.tpartmin` 属于[不规则与复杂指令](./tile/irregular-and-complex_zh.md)集。
 
-![TPARTMIN tile operation](../figures/isa/TPARTMIN.svg)
-
-## 简介
+## 概述
 
 在目标有效区域内执行逐元素最小值选择。若某个位置上 `src0` 和 `src1` 都有效，则结果为 `min(src0, src1)`；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
 
-## 数学语义
+## 机制
 
 对目标有效区域内的每个元素 `(i, j)`：
 
 $$
 \mathrm{dst}_{i,j} =
 \begin{cases}
-\min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\\\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\\\
+\min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\
+\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\
 \mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
 \end{cases}
 $$
 
-## 汇编语法
+## 语法
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -33,28 +33,41 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
 PTO_INST RecordEvent TPARTMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src0` | 源 Tile | 第一个输入 Tile |
+| `src1` | 源 Tile | 第二个输入 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 逐元素最小值选择后的目标 Tile |
+
+## 副作用
 
-### 通用约束或检查
+无。
+
+## 约束
 
 - `dst`、`src0` 和 `src1` 的元素类型必须一致。
 - 目标有效区域定义结果的计算范围。
@@ -65,18 +78,20 @@ PTO_INST RecordEvent TPARTMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1
 - 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
 - 上述范围之外的有效区域组合，其行为均由具体实现定义。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
+- 未定义。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持的元素类型 | - | `int32_t`、`int16_t`、`half`、`float` | `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float` |
+| 布局要求 | - | `isRowMajor` | - |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -90,7 +105,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -107,29 +122,19 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
 # 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
 # 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# pto.tassign %src0, @tile(0x1000)
+# pto.tassign %src1, @tile(0x2000)
+# pto.tassign %dst, @tile(0x3000)
 %dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[不规则与复杂指令](./tile/irregular-and-complex_zh.md)
diff --git a/docs/isa/TPARTMUL_zh.md b/docs/isa/TPARTMUL_zh.md
index af778780..1e54a1a2 100644
--- a/docs/isa/TPARTMUL_zh.md
+++ b/docs/isa/TPARTMUL_zh.md
@@ -1,30 +1,29 @@
 # pto.tpartmul
 
-旧路径兼容入口。规范页见 [pto.tpartmul](./tile/ops/irregular-and-complex/tpartmul_zh.md)。
+`pto.tpartmul` 属于[不规则与复杂指令](./tile/irregular-and-complex_zh.md)集。
 
-- 指令集：[不规则与复杂指令集](./tile/irregular-and-complex_zh.md)
-- 规范页：[pto.tpartmul](./tile/ops/irregular-and-complex/tpartmul_zh.md)
-
-## 简介
+## 概述
 
 在目标有效区域内执行逐元素乘法。若某个位置上 `src0` 和 `src1` 都有效，则结果为两者之积；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
 
-## 数学语义
+## 机制
 
 对目标有效区域内的每个元素 `(i, j)`：
 
 $$
 \mathrm{dst}_{i,j} =
 \begin{cases}
-\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\\\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\\\
+\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\
+\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\
 \mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
 \end{cases}
 $$
 
-## 汇编语法
+## 语法
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -34,28 +33,41 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
 PTO_INST RecordEvent TPARTMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src0` | 源 Tile | 第一个输入 Tile |
+| `src1` | 源 Tile | 第二个输入 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 逐元素乘法后的目标 Tile |
+
+## 副作用
 
-### 通用约束或检查
+无。
+
+## 约束
 
 - `dst`、`src0` 和 `src1` 的元素类型必须一致。
 - 目标有效区域定义结果的计算范围。
@@ -66,18 +78,20 @@ PTO_INST RecordEvent TPARTMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1
 - 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
 - 上述范围之外的有效区域组合，其行为均由具体实现定义。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
+- 未定义。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持的元素类型 | - | `int32_t`、`int16_t`、`half`、`float` | `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t` |
+| 布局要求 | - | `isRowMajor` | - |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -90,7 +104,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -106,29 +120,19 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
 # 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
 # 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# pto.tassign %src0, @tile(0x1000)
+# pto.tassign %src1, @tile(0x2000)
+# pto.tassign %dst, @tile(0x3000)
 %dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[不规则与复杂指令](./tile/irregular-and-complex_zh.md)
diff --git a/docs/isa/TPOP_zh.md b/docs/isa/TPOP_zh.md
index 5582ef2f..f3797e65 100644
--- a/docs/isa/TPOP_zh.md
+++ b/docs/isa/TPOP_zh.md
@@ -1,41 +1,91 @@
-# TPOP
+# pto.tpop
 
-## 指令示意图
+`pto.tpop` 属于[同步与配置](./tile/sync-and-config_zh.md)指令集。
 
-![TPOP tile operation](../figures/isa/TPOP.svg)
-
-## 简介
+## 概述
 
 从 pipe 或 FIFO 的消费者端弹出一个 Tile。
 
-## 数学语义
+## 机制
+
+语义随具体指令变体而变化。除非另有说明，行为都按目标 valid region 定义。
 
-语义随指令而变化。 除非另有说明，行为都按目标 valid region 定义。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
 PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpop ...
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpop ins(...) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`.
+声明于 `include/pto/common/pto_instr.hpp`。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| Pipe/FIFO 引用 | 输入 | 待弹出 Tile 的 pipe 或 FIFO |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile_buf<...>` | 从消费者端弹出的 Tile |
+
+## 副作用
+
+从 pipe 或 FIFO 中移除一个 Tile，该 Tile 所有权转移给消费者。
 
 ## 约束
 
-数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 只能在 pipe 或 FIFO 有可用 Tile 时执行。
+
+## 异常与非法情形
+
+- 在 pipe 或 FIFO 为空时执行属于未定义行为。
+- 消费者无权访问该 pipe 或 FIFO 时会被拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 弹出操作 | Simulated | Supported | Supported |
 
 ## 示例
 
+### C++ 自动模式
+
+具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### C++ 手动模式
+
 具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### PTO-AS
+
+```text
+%tile = pto.tpop %pipe : !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.tpop ins(%pipe : ...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- 指令集总览：[同步与配置](./tile/sync-and-config_zh.md)
diff --git a/docs/isa/TPRINT_zh.md b/docs/isa/TPRINT_zh.md
index f6132a5d..fea1c078 100644
--- a/docs/isa/TPRINT_zh.md
+++ b/docs/isa/TPRINT_zh.md
@@ -1,172 +1,180 @@
-﻿# TPRINT
-
-## 指令示意图
-
-![TPRINT tile operation](../figures/isa/TPRINT.svg)
-
-## 简介
-
-调试/打印 Tile 中的元素（实现定义）。
-
-从设备代码直接打印 Tile 或 GlobalTensor 的内容以用于调试目的。
-
-`TPRINT` 指令输出存储在 Tile 或 GlobalTensor 中的数据的逻辑视图。它支持常见的数据类型（例如 `float`、`half`、`int8`、`uint32`）和多种内存布局（GlobalTensor 的 `ND`、`DN`、`NZ`；片上缓冲区的向量 tiles）。
-
-> **重要**:
-> - 此指令**仅用于开发和调试**。
-> - 它会产生**显著的运行时开销**，**不得在生产 kernel 中使用**。
-> - 如果输出超过内部打印缓冲区，可能会被**截断**。可以通过在编译选项中添加`-DCCEBlockMaxSize=16384`来修改打印缓冲区，默认为16KB。
-> - **需要 CCE 编译选项 `-D_DEBUG --cce-enable-print`**（参见 [行为](#behavior)）。
-
-## 数学语义
-
-除非另有说明，语义在有效区域上定义，目标相关的行为标记为实现定义。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-```text
-tprint %src : !pto.tile<...> | !pto.global<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-```cpp
-// 适用于打印GlobalTensor或Vec类型Tile
-template <PrintFormat Format = PrintFormat::Width8_Precision4, typename TileData>
-PTO_INST void TPRINT(TileData &src);
-
-// 适用于打印Acc类型Tile和Mat类型Tile(Mat打印仅适用于A3，A5暂不支持)
-template <PrintFormat Format = PrintFormat::Width8_Precision4, typename TileData, typename GlobalData>
-PTO_INTERNAL void TPRINT(TileData &src, GlobalData &tmp);
-```
-
-### PrintFormat 枚举
-声明于 `include/pto/common/type.hpp`：
-```cpp
-enum class PrintFormat : uint8_t
-{
-    Width8_Precision4 = 0,  // 打印宽度8，精度4
-    Width8_Precision2 = 1,  // 打印宽度8，精度2
-    Width10_Precision6 = 2, // 打印宽度10，精度6
-};
-```
-
-### 支持的 T 类型
-- **Tile**：TileType必须是`Vec`、`Acc`、`Mat(仅A3支持)`，并具有支持的元素类型。
-- **GlobalTensor**：必须使用布局 `ND`、`DN` 或 `NZ`，并具有支持的元素类型。
-
-## 约束
-
-- **支持的元素类型**:
-    - 浮点数：`float`、`half`
-    - 有符号整数：`int8_t`、`int16_t`、`int32_t`
-    - 无符号整数：`uint8_t`、`uint16_t`、`uint32_t`
-- **对于 GlobalTensor**：布局必须是 `Layout::ND`、`Layout::DN` 或 `Layout::NZ` 之一。
-- **对于 临时空间**：打印`TileType`为`Mat`或`Acc`的Tile时需要传入gm上的临时空间，临时空间不得小于`TileData::Numel * sizeof(T)`。
-- A5暂不支持`TileType`为`Mat`的Tile打印。
-- **回显信息**: `TileType`为`Mat`时，布局将按照`Layout::ND`进行打印，其他布局可能会导致信息错位。
-
-## 行为
-
-- **强制编译标志**:
-
-  在 A2/A3/A5 设备上，`TPRINT` 使用 `cce::printf` 通过设备到主机的调试通道输出。**必须启用 CCE 选项 `-D_DEBUG --cce-enable-print`**。
-
-- **缓冲区限制**:
-
-  `cce::printf` 的内部打印缓冲区大小有限。如果输出超过此缓冲区，可能会出现类似 `"Warning: out of bound! try best to print"` 的警告消息，并且**只会打印部分数据**。
-
-- **同步**:
-
-  自动插入 `pipe_barrier(PIPE_ALL)` 以确保所有先前的操作完成且数据一致。
-
-- **格式化**:
-
-    - 浮点数值：根据 `PrintFormat` 模板参数确定打印格式：
-      - `PrintFormat::Width8_Precision4`: `%8.4f`（默认）
-      - `PrintFormat::Width8_Precision2`: `%8.2f`
-      - `PrintFormat::Width10_Precision6`: `%10.6f`
-    - 整数值：根据 `PrintFormat` 模板参数确定打印格式：
-      - `PrintFormat::Width8_Precision4` 或 `PrintFormat::Width8_Precision2`: `%8d`
-      - `PrintFormat::Width10_Precision6`: `%10d`
-    - 对于 `GlobalTensor`，由于数据大小和缓冲区限制，仅打印其逻辑形状（由 `Shape` 定义）内的元素。
-    - 对于 `Tile`，无效区域（超出 `validRows`/`validCols`）仍会被打印，但在指定部分有效性时用 `|` 分隔符标记。
-
-## 示例
-
-### Print a Tile
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-PTO_INTERNAL void DebugTile(__gm__ float *src) {
-  using ValidSrcShape = TileShape2D<float, 16, 16>;
-  using NDSrcShape = BaseShape2D<float, 32, 32>;
-  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
-  GlobalDataSrc srcGlobal(src);
-
-  using srcTileData = Tile<TileType::Vec, float, 16, 16>;
-  srcTileData srcTile;
-  TASSIGN(srcTile, 0x0);
-
-  TLOAD(srcTile, srcGlobal);
-  TPRINT(srcTile);
-}
-```
-
-### Print a GlobalTensor
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-PTO_INTERNAL void DebugGlobalTensor(__gm__ float *src) {
-  using ValidSrcShape = TileShape2D<float, 16, 16>;
-  using NDSrcShape = BaseShape2D<float, 32, 32>;
-  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
-  GlobalDataSrc srcGlobal(src);
-
-  TPRINT(srcGlobal);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### PTO 汇编形式
-
-```text
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-# AS Level 2 (DPS)
-pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
-```
+﻿# pto.tprint
+
+`pto.tprint` 属于[不规则与复杂操作](./tile/irregular-and-complex_zh.md)指令集。
+
+## 概述
+
+调试/打印 Tile 或 GlobalTensor 中的元素（实现定义）。`TPRINT` 指令输出存储在 Tile 或 GlobalTensor 中的数据的逻辑视图。它支持常见的数据类型（例如 `float`、`half`、`int8`、`uint32`）和多种内存布局（GlobalTensor 的 `ND`、`DN`、`NZ`；片上缓冲区的向量 tiles）。
+
+> **重要**:
+> - 此指令**仅用于开发和调试**
+> - 它会产生**显著的运行时开销**，**不得在生产 kernel 中使用**
+> - 如果输出超过内部打印缓冲区，可能会被**截断**。可以通过在编译选项中添加`-DCCEBlockMaxSize=16384`来修改打印缓冲区，默认为16KB
+> - **需要 CCE 编译选项 `-D_DEBUG --cce-enable-print`**（参见副作用部分）
+
+## 机制
+
+该指令读取 Tile 或 GlobalTensor 中的数据并格式化输出。除非另有说明，语义在有效区域上定义，目标相关的行为标记为实现定义。
+
+## 语法
+
+### PTO-AS
+
+```text
+tprint %src : !pto.tile<...> | !pto.global<...>
+```
+
+### AS Level 1（SSA）
+
+```mlir
+pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+// 适用于打印GlobalTensor或Vec类型Tile
+template <PrintFormat Format = PrintFormat::Width8_Precision4, typename TileData>
+PTO_INST void TPRINT(TileData &src);
+
+// 适用于打印Acc类型Tile和Mat类型Tile(Mat打印仅适用于A3，A5暂不支持)
+template <PrintFormat Format = PrintFormat::Width8_Precision4, typename TileData, typename GlobalData>
+PTO_INTERNAL void TPRINT(TileData &src, GlobalData &tmp);
+```
+
+### PrintFormat 枚举
+
+声明于 `include/pto/common/type.hpp`：
+
+```cpp
+enum class PrintFormat : uint8_t
+{
+    Width8_Precision4 = 0,  // 打印宽度8，精度4
+    Width8_Precision2 = 1,  // 打印宽度8，精度2
+    Width10_Precision6 = 2, // 打印宽度10，精度6
+};
+```
+
+### 支持的 T 类型
+
+- **Tile**：TileType 必须是 `Vec`、`Acc` 或 `Mat`（`Mat` 仅 A3 支持），并具有支持的元素类型
+- **GlobalTensor**：必须使用布局 `ND`、`DN` 或 `NZ`，并具有支持的元素类型
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| src | 输入 | Tile 或 GlobalTensor |
+| tmp | 可选 | 临时 GlobalTensor 空间（打印 Mat/Acc Tile 时需要） |
+| Format | 模板参数 | 打印格式（默认 Width8_Precision4） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | void | 直接输出到调试通道 |
+
+## 副作用
+
+- **强制编译标志**: 在 A2/A3/A5 设备上，`TPRINT` 使用 `cce::printf` 通过设备到主机的调试通道输出。**必须启用 CCE 选项 `-D_DEBUG --cce-enable-print`**
+- **缓冲区限制**: `cce::printf` 的内部打印缓冲区大小有限。如果输出超过此缓冲区，可能会出现类似 `"Warning: out of bound! try best to print"` 的警告消息，并且**只会打印部分数据**
+- **同步**: 自动插入 `pipe_barrier(PIPE_ALL)` 以确保所有先前的操作完成且数据一致
+- **格式化**:
+    - 浮点数值：根据 `PrintFormat` 模板参数确定打印格式：
+      - `PrintFormat::Width8_Precision4`: `%8.4f`（默认）
+      - `PrintFormat::Width8_Precision2`: `%8.2f`
+      - `PrintFormat::Width10_Precision6`: `%10.6f`
+    - 整数值：根据 `PrintFormat` 模板参数确定打印格式：
+      - `PrintFormat::Width8_Precision4` 或 `PrintFormat::Width8_Precision2`: `%8d`
+      - `PrintFormat::Width10_Precision6`: `%10d`
+    - 对于 `GlobalTensor`，由于数据大小和缓冲区限制，仅打印其逻辑形状（由 `Shape` 定义）内的元素
+    - 对于 `Tile`，无效区域（超出 `validRows`/`validCols`）仍会被打印，但在指定部分有效性时用 `|` 分隔符标记
+
+## 约束
+
+- 支持的元素类型:
+    - 浮点数：float、half
+    - 有符号整数：int8_t、int16_t、int32_t
+    - 无符号整数：uint8_t、uint16_t、uint32_t
+- 对于 GlobalTensor：布局必须是 `Layout::ND`、`Layout::DN` 或 `Layout::NZ` 之一
+- 对于临时空间：打印 TileType 为 Mat 或 Acc 的 Tile 时需要传入 gm 上的临时空间，临时空间不得小于 TileData::Numel * sizeof(T)
+- A5 暂不支持 TileType 为 Mat 的 Tile 打印
+- 回显信息: TileType 为 Mat 时，布局将按照 `Layout::ND` 进行打印，其他布局可能会导致信息错位
+
+## 异常与非法情形
+
+- 输出超出打印缓冲区时显示警告并截断
+- 编译时未启用 `-D_DEBUG --cce-enable-print` 时指令不可用
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| Vec Tile 打印 | 支持 | 支持 | 支持 |
+| Acc Tile 打印 | 支持 | 支持 | 支持 |
+| Mat Tile 打印 | 支持 | 仅 A3 | 不支持 |
+| GlobalTensor 打印 | 支持 | 支持 | 支持 |
+
+## 示例
+
+### C++ 自动模式 - 打印 Tile
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+PTO_INTERNAL void DebugTile(__gm__ float *src) {
+  using ValidSrcShape = TileShape2D<float, 16, 16>;
+  using NDSrcShape = BaseShape2D<float, 32, 32>;
+  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
+  GlobalDataSrc srcGlobal(src);
+
+  using srcTileData = Tile<TileType::Vec, float, 16, 16>;
+  srcTileData srcTile;
+  TASSIGN(srcTile, 0x0);
+
+  TLOAD(srcTile, srcGlobal);
+  TPRINT(srcTile);
+}
+```
+
+### C++ 自动模式 - 打印 GlobalTensor
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+PTO_INTERNAL void DebugGlobalTensor(__gm__ float *src) {
+  using ValidSrcShape = TileShape2D<float, 16, 16>;
+  using NDSrcShape = BaseShape2D<float, 32, 32>;
+  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
+  GlobalDataSrc srcGlobal(src);
+
+  TPRINT(srcGlobal);
+}
+```
+
+### PTO-AS
+
+```text
+# 自动模式
+pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
+
+# 手动模式
+pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
+
+# PTO 汇编形式
+pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
+# AS Level 2 (DPS)
+pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
+```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂操作](./tile/irregular-and-complex_zh.md)
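Reviewer sketch for `TPRINT_zh.md`: the 格式化 bullets map each `PrintFormat` value to a `printf` conversion, and the 约束 section gives a lower bound for the temporary GM buffer. The host-side snippet below restates both so the mapping can be checked in isolation. The enum is a local mirror of the one shown in the page so the sketch compiles without PTO headers; `float_spec`, `int_spec`, `format_float`, `format_int` and `min_tmp_bytes` are hypothetical helper names, not PTO APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Local mirror of the enum from the page (assumption: values match type.hpp).
enum class PrintFormat : std::uint8_t {
    Width8_Precision4 = 0,
    Width8_Precision2 = 1,
    Width10_Precision6 = 2,
};

// Conversion used for floating-point elements, per the documented table.
inline const char *float_spec(PrintFormat f) {
    switch (f) {
        case PrintFormat::Width8_Precision4: return "%8.4f";
        case PrintFormat::Width8_Precision2: return "%8.2f";
        case PrintFormat::Width10_Precision6: return "%10.6f";
    }
    return "%8.4f";  // fall back to the documented default
}

// Conversion used for integer elements: only the width depends on the format.
inline const char *int_spec(PrintFormat f) {
    return f == PrintFormat::Width10_Precision6 ? "%10d" : "%8d";
}

inline std::string format_float(PrintFormat f, float v) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), float_spec(f), static_cast<double>(v));
    return std::string(buf);
}

inline std::string format_int(PrintFormat f, int v) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), int_spec(f), v);
    return std::string(buf);
}

// Documented lower bound for the scratch buffer when printing Mat/Acc tiles:
// TileData::Numel * sizeof(T).
inline std::size_t min_tmp_bytes(std::size_t numel, std::size_t elemSize) {
    return numel * elemSize;
}
```

For a 16 x 16 `float` tile, `min_tmp_bytes(16 * 16, sizeof(float))` gives 1024 bytes, which is the smallest scratch size the page allows.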
diff --git a/docs/isa/TPUSH_zh.md b/docs/isa/TPUSH_zh.md
index bd0d01b9..5deb21e5 100644
--- a/docs/isa/TPUSH_zh.md
+++ b/docs/isa/TPUSH_zh.md
@@ -1,41 +1,92 @@
-# TPUSH
+# pto.tpush
 
-## 指令示意图
+`pto.tpush` 属于[同步与配置](./tile/sync-and-config_zh.md)指令集。
 
-![TPUSH tile operation](../figures/isa/TPUSH.svg)
-
-## 简介
+## 概述
 
 将 Tile 推入 pipe 或 FIFO 的生产者端。
 
-## 数学语义
+## 机制
+
+语义随具体指令变体而变化。除非另有说明，行为都按目标 valid region 定义。
 
-语义随指令而变化。 除非另有说明，行为都按目标 valid region 定义。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
 PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpush ...
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpush ins(...) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`.
+声明于 `include/pto/common/pto_instr.hpp`。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| 源 Tile | 输入 | 待推入 pipe 或 FIFO 的 Tile |
+| Pipe/FIFO 引用 | 输入 | 目标 pipe 或 FIFO |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | - | 推送操作无返回值，Tile 所有权由生产者移交给 pipe 或 FIFO |
+
+## 副作用
+
+将 Tile 从生产者端移入 pipe 或 FIFO，Tile 所有权由生产者移交给 pipe 或 FIFO，待消费者弹出后再转移给消费者。
 
 ## 约束
 
-数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 数据类型、layout、location 和 shape 的进一步限制以对应 backend 的合法性检查为准。
+- 只能在 pipe 或 FIFO 有可用槽位时执行。
+
+## 异常与非法情形
+
+- 在 pipe 或 FIFO 满时执行属于未定义行为。
+- 生产者无权访问该 pipe 或 FIFO 时会被拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 推送操作 | Simulated | Supported | Supported |
 
 ## 示例
 
+### C++ 自动模式
+
+具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### C++ 手动模式
+
 具体的 Auto / Manual 使用方式见 `docs/isa/` 下的相关指令页。
+
+### PTO-AS
+
+```text
+pto.tpush %tile, %pipe : !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```mlir
+pto.tpush ins(%tile, %pipe : ...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- 指令集总览：[同步与配置](./tile/sync-and-config_zh.md)
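Reviewer sketch for `TPUSH_zh.md` / `TPOP_zh.md`: both pages state a precondition (a free slot for push, an available Tile for pop) and call violations undefined behavior. The host-side model below makes those preconditions explicit by returning `false` instead. `BoundedTileFifo` is a hypothetical class for illustration, strict first-in-first-out ordering is an assumption, and nothing here reflects how the hardware pipe is implemented.

```cpp
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical host-side model of a pipe/FIFO with a fixed number of slots.
template <typename Tile>
class BoundedTileFifo {
public:
    explicit BoundedTileFifo(std::size_t capacity) : capacity_(capacity) {}

    // Producer side: succeeds only when a free slot exists.
    // On success the FIFO takes ownership of the tile.
    bool try_push(Tile tile) {
        if (slots_.size() >= capacity_) {
            return false;  // full: the documented precondition is not met
        }
        slots_.push_back(std::move(tile));
        return true;
    }

    // Consumer side: succeeds only when a tile is available.
    // On success ownership moves to the caller through `out`.
    bool try_pop(Tile &out) {
        if (slots_.empty()) {
            return false;  // empty: the documented precondition is not met
        }
        out = std::move(slots_.front());
        slots_.pop_front();
        return true;
    }

    std::size_t size() const { return slots_.size(); }

private:
    std::size_t capacity_;
    std::deque<Tile> slots_;
};
```

In a test harness, a guard of this kind can flag a push-on-full or pop-on-empty sequence before it reaches a target where the result is undefined.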
diff --git a/docs/isa/TRANDOM_zh.md b/docs/isa/TRANDOM_zh.md
index 2b83597b..2b7d3ce2 100644
--- a/docs/isa/TRANDOM_zh.md
+++ b/docs/isa/TRANDOM_zh.md
@@ -1,46 +1,42 @@
-# TRANDOM
+# pto.trandom
 
+`pto.trandom` 属于[不规则与复杂指令](./tile/irregular-and-complex_zh.md)集。
 
-## Tile Operation Diagram
+## 概述
 
-![TRANDOM tile operation](../figures/isa/TRANDOM.svg)
+使用基于计数器的密码算法在目标 Tile 中生成伪随机数。该指令实现了一个基于计数器的随机数生成器，对于有效区域中的每个元素，它基于密钥和计数器状态，使用可配置轮数的密码类变换生成伪随机值。算法使用 128 位状态（4 × 32 位计数器）、64 位密钥（2 × 32 位字），以及类似 ChaCha 的四分之一轮操作。
 
-## 简介
+## 机制
 
-使用基于计数器的密码算法在目标 Tile 中生成随机数。
+### 数学语义
 
-## 数学解释
+对有效区域中的每个元素 `(i, j)`：
 
-该指令实现了一个基于计数器的随机数生成器。对于有效区域中的每个元素，它基于密钥和计数器状态，使用可配置轮数的密码类变换生成伪随机值。
+$$ \mathrm{dst}_{i,j} = \mathrm{CipherRound}^R\left(\mathrm{counter}_{i,j},\ \mathrm{key}\right) $$
 
-该算法使用：
-- 128 位状态（4 × 32 位计数器）
-- 64 位密钥（2 × 32 位字）
-- 类似 ChaCha 的四分之一轮操作，使用向量指令
+其中 $R$ 为轮数（默认 10 轮，可选 7 轮），使用类似 ChaCha 的四分之一轮操作进行密码学变换。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
+### PTO-AS
 
 ```text
 trandom %dst, %key, %counter : !pto.tile<...>
 ```
 
-### AS Level 1 (SSA)
+### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### AS Level 2 (DPS)
+### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
-## C++ 内置函数
+## C++ 内建接口
 
 声明于 `include/pto/npu/a5/TRandom.hpp`：
 
@@ -49,19 +45,48 @@ template <uint16_t Rounds = 10, typename DstTile>
 PTO_INST void TRANDOM_IMPL(DstTile &dst, TRandomKey &key, TRandomCounter &counter);
 ```
 
-## 约束条件
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `key` | 输入 | 64 位密钥（2 × 32 位字），包含 `key0` 和 `key1` |
+| `counter` | 输入 | 128 位计数器状态（4 × 32 位），每次调用后递增 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 生成的伪随机数，仅在有效区域内有定义 |
+
+## 副作用
+
+该操作通过密码学变换更新内部计数器状态。
+
+## 约束
+
+- A5 实现检查：
+    - `DstTile::DType` 必须为 `int32_t` 或 `uint32_t`
+    - Tile 布局必须为行主序（`DstTile::isRowMajor`）
+    - `Rounds` 必须为 7 或 10（默认为 10）
+    - `key` 和 `counter` 不能为空
+- 有效区域：
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域
+
+## 异常与非法情形
 
-- **实现检查（A5）**：
-    - `DstTile::DType` 必须为以下类型之一：`int32_t`、`uint32_t`。
-    - Tile 布局必须为行主序（`DstTile::isRowMajor`）。
-    - `Rounds` 必须为 7 或 10（默认为 10）。
-    - `key` 和 `counter` 不能为空。
-- **有效区域**：
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- 若 `key` 或 `counter` 为空，行为未定义
+- 若 `DstTile::DType` 不是 `int32_t` 或 `uint32_t`，编译失败
+- 若 `Rounds` 不是 7 或 10，编译失败
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 否 | 是 |
 
 ## 示例
 
-### Auto 模式
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -77,7 +102,7 @@ void example_auto() {
 }
 ```
 
-### Manual 模式
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -94,28 +119,17 @@ void example_manual() {
 }
 ```
 
-## 汇编形式示例
-
-### Auto 模式
+### PTO-AS
 
 ```text
-# Auto 模式：编译器/运行时管理的布局和调度。
+# 自动模式：编译器/运行时管理的布局和调度
 %dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual 模式
 
-```text
-# Manual 模式：在发出指令之前显式绑定资源。
-# Tile 操作数可选：
+# 手动模式：在发出指令之前显式绑定资源
 # pto.tassign %arg0, @tile(0x3000)
 %dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-trandom %dst, %key, %counter : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[不规则与复杂指令](./tile/irregular-and-complex_zh.md)
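
The `TRANDOM` page above describes a ChaCha-like quarter-round applied for $R$ rounds. As a rough illustration of that building block only, the sketch below implements the standard ChaCha quarter-round from RFC 8439 in plain C++. The names `rotl32`, `quarterRound`, and `mixState` are hypothetical, and the page does not say how `key` is mixed into the state, so no key schedule is modeled and this should not be read as the `TRANDOM` algorithm itself.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Rotate a 32-bit word left by n bits (0 < n < 32).
inline std::uint32_t rotl32(std::uint32_t x, int n) {
  assert(n > 0 && n < 32);
  return (x << n) | (x >> (32 - n));
}

// Standard ChaCha quarter-round (RFC 8439, section 2.1) on four 32-bit words.
inline void quarterRound(std::uint32_t &a, std::uint32_t &b, std::uint32_t &c, std::uint32_t &d) {
  a += b; d ^= a; d = rotl32(d, 16);
  c += d; b ^= c; b = rotl32(b, 12);
  a += b; d ^= a; d = rotl32(d, 8);
  c += d; b ^= c; b = rotl32(b, 7);
}

// Hypothetical helper: apply `rounds` quarter-rounds to a 4 x 32-bit state,
// matching the 128-bit counter width named in the page. Key mixing is omitted
// because the page does not specify it.
inline std::array<std::uint32_t, 4> mixState(std::array<std::uint32_t, 4> s, int rounds) {
  for (int r = 0; r < rounds; ++r) {
    quarterRound(s[0], s[1], s[2], s[3]);
  }
  return s;
}
```

Because the quarter-round is invertible, two different counter values can never map to the same mixed state, which is the property a counter-based generator relies on.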
diff --git a/docs/isa/TRECIP_zh.md b/docs/isa/TRECIP_zh.md
index 7ca039a0..a0f95e6c 100644
--- a/docs/isa/TRECIP_zh.md
+++ b/docs/isa/TRECIP_zh.md
@@ -1,24 +1,24 @@
-﻿# TRECIP
-
-## 指令示意图
+﻿# pto.trecip
 
 ![TRECIP tile operation](../figures/isa/TRECIP.svg)
 
-## 简介
+`pto.trecip` 属于[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)指令集。
+
+## 概述
 
-Tile 的逐元素倒数。
+Computes the elementwise reciprocal of a tile and writes the result to the destination tile. The iteration domain is the destination tile's valid region.
 
-## 数学语义
+## 机制
 
-对每个元素 `(i, j)` 在有效区域内：
+对目标 tile 的 valid region 中每个 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \frac{1}{\mathrm{src}_{i,j}} $$
 
-## 汇编语法
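+Use it in place of an explicit division when the result will later be multiplied with other tiles. Division-by-zero behavior is target-defined.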
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = trecip %src : !pto.tile<...>
@@ -26,52 +26,73 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trecip ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <auto PrecisionType = RecipAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
           typename... WaitEvents>
 PTO_INST RecordEvent TRECIP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
 ```
 
-`PrecisionType`可指定以下值：
+`PrecisionType` 可选：
+
+- `RecipAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
+- `RecipAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收逐元素倒数结果 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
 
-* `RecipAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
-* `RecipAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
+## Expected output
+
+| Result | Type | Description |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | Each element in the valid region of `dst` equals the reciprocal of the corresponding `src` element |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
 
 ## 约束
 
-- **实现检查 (NPU)**:
-    - `TileData::DType` 必须是以下之一：`float` 或 `half`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`);
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-    - A3 的 TRECIP 指令不支持将源 Tile 和目标 Tile 设置为相同的内存。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **域 / NaN**:
-    - 除零行为由目标定义；CPU 模拟器在调试构建中会断言。
-- **高精度算法**
-    - 仅在A5上有效，`PrecisionType`选项A3上将被忽略。
+- 迭代域由 `dst.GetValidRow()` / `dst.GetValidCol()` 决定。
+- 除零行为由目标定义；CPU 模拟器在调试构建下会断言。
+- 高精度算法只在 A5 有效。
+- A3 不支持源 tile 与目标 tile 绑定到同一片内存。
+
+## 异常与非法情形
+
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| 布局 | Any | RowMajor only | RowMajor only |
+| 同址操作 | No | No | No |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example() {
@@ -82,29 +103,39 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
+### C++ 手动模式
 
-### 自动模式
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TRECIP(dst, src);
+}
 ```
 
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 自动模式
 %dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
-```
 
-### PTO 汇编形式
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
 
-```text
+# PTO 汇编形式
 %dst = trecip %src : !pto.tile<...>
 # AS Level 2 (DPS)
 pto.trecip ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)
+- 规范页：[pto.trecip](./tile/ops/elementwise-tile-tile/trecip_zh.md)
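
The `pto.trecip` page above defines `dst[i][j] = 1 / src[i][j]` over the destination's valid region. A minimal host-side sketch of that reference semantics follows, assuming a row-major `float` buffer. The function name `recipRef` and the flat `std::vector` storage are illustrative choices and are not part of the PTO API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics only: dst[i][j] = 1 / src[i][j] for every (i, j) in the
// valid region of a row-major buffer with `cols` elements per row.
// Elements outside the valid region are left untouched.
inline void recipRef(std::vector<float> &dst, const std::vector<float> &src,
                     std::size_t cols, std::size_t validRow, std::size_t validCol) {
  assert(validCol <= cols);
  assert(dst.size() >= validRow * cols && src.size() >= validRow * cols);
  for (std::size_t i = 0; i < validRow; ++i) {
    for (std::size_t j = 0; j < validCol; ++j) {
      dst[i * cols + j] = 1.0f / src[i * cols + j];
    }
  }
}
```

A later `a * recip(b)` can then stand in for `a / b`, which is the substitution the page suggests when the reciprocal is reused across several multiplications.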
diff --git a/docs/isa/TROWARGMAX_zh.md b/docs/isa/TROWARGMAX_zh.md
index 517f2fb6..431322f7 100644
--- a/docs/isa/TROWARGMAX_zh.md
+++ b/docs/isa/TROWARGMAX_zh.md
@@ -1,41 +1,35 @@
 # pto.trowargmax
 
-旧路径兼容入口。规范页见 [pto.trowargmax](./tile/ops/reduce-and-expand/trowargmax_zh.md)。
+`pto.trowargmax` 属于[行归约](./tile/ops/reduce-and-expand/trowargmax_zh.md)指令集。
 
-![TROWARGMAX tile operation](../figures/isa/TROWARGMAX.svg)
+## 概述
 
-## 简介
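+Returns the column index of each row's maximum. For every row of the source tile, the instruction computes the column index of the largest element and writes it to the first position of the matching row in the destination tile.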
 
-获取每行最大值对应列索引，或同时获取每行最大值及其对应列索引。
-
-## 数学语义
+## 机制
 
 设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= i < R`：
 
 $$ \mathrm{dst}_{i,0} = \underset{0 \le j < C}{\operatorname{argmax}} \; \mathrm{src}_{i,j} $$
 
-$$ \mathrm{dstval}_{i,0} = \max_{0 \le j < C} \mathrm{src}_{i,j} $$
-
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
+### PTO-AS
 
 ```text
 %dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
 ```
 Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
 
-### IR Level 1（SSA）
+### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### IR Level 2（DPS）
+### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -43,73 +37,63 @@ pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%ds
 
 声明于 `include/pto/common/pto_instr.hpp`:
 
-仅输出索引：
-
 ```cpp
 template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
 PTO_INST RecordEvent TROWARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
 ```
 
-同时输出值和索引：
+## 输入
 
-```cpp
-template <typename TileDataOutVal, typename TileDataOutIdx, typename TileDataIn, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWARGMAX(TileDataOutVal &dstVal, TileDataOutIdx &dstIdx, TileDataIn &src, TileDataTmp &tmp,
-                                WaitEvents &... events)
-```
+| Operand | Role | Description |
+| --- | --- | --- |
+| `src` | Input | Source tile; element type `half` or `float` |
+| `tmp` | Scratch | Temporary tile used only on A3, and only when `srcValidCol > ElementPerRepeat` |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 每行最大值的列索引，`uint32_t` 或 `int32_t` 类型 |
+
+## 副作用
+
+指令执行完成后，`dst` 有效行数与 `src` 相同，有效列数为 1。
 
 ## 约束
 
-### 通用约束或检查
-
-- 支持的源元素类型：`half`、`float`。
-- `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- 仅输出索引时：
-    -`dst` 和 `src` 必须为 `TileType::Vec`。
-    - 支持的目标元素类型：`uint32_t`、`int32_t`。
-    - 运行时检查遵循共享的行归约检查路径：
-        - `src.GetValidRow() != 0`
-        - `src.GetValidCol() != 0`
-        - `src.GetValidRow() == dst.GetValidRow()`
-    - `dst` 通过共享的行归约索引检查路径约束，可使用以下任一非分形布局：
-        - 单列 DN 布局（`BLayout::ColMajor`、`Cols == 1`），或
-        - 有效列数为 1 的 ND 布局。
-- 同时输出值和索引时：
-    - `dstVal`、`dstIdx`、`src` 必须为 `TileType::Vec`。
-    - `dstVal`的元素类型必须与`src`的元素类型一致。
-    - 支持的目标元素类型：
-        - 源元素类型为`float`时，支持`uint32_t`、`int32_t`。
-        - 源元素类型为`half`时，支持`uint16_t`、`int16_t`。
-    - 运行时检查遵循共享的行归约检查路径：
-        - `src.GetValidRow() != 0`
-        - `src.GetValidCol() != 0`
-        - `src.GetValidRow() == dstIdx.GetValidRow()`
-        - `src.GetValidRow() == dstVal.GetValidRow()`
-    - `dstVal`、`dstIdx`通过共享的行归约索引检查路径约束，可使用以下任一非分形布局：
-        - 单列 DN 布局（`BLayout::ColMajor`、`Cols == 1`），或
-        - 有效列数为 1 的 ND 布局。
-
-### `tmp`临时Tile相关说明
-
-- 仅A3使用`tmp`临时Tile，A5接收`tmp`但实际并不使用。
-- 仅输出索引时，`tmp`临时Tile在`srcValidCol <= ElementPerRepeat`时不使用。
-- 同时输出值和索引且`srcValidCol <= ElementPerRepeat`时，`tmp`临时Tile可使用以下任一非分形布局：
-    - 单列 DN 布局（`BLayout::ColMajor`、`Cols == 1`），有效行数为`srcValidRow * 2`。
-    - 有效行数为`srcValidRow`且有效列数为 2 的 ND 布局。
-- `srcValidCol > ElementPerRepeat`时：
-    - `tmp` tile的行数和`src` tile的行数相同。
-    - 按以下公式根据`src` tile的`validCol`算出`tmp` tile所需stride：
+- `dst` 和 `src` 必须为 `TileType::Vec`
+- 支持的源元素类型：`half`、`float`
+- 支持的目标元素类型：`uint32_t`、`int32_t`
+- `src.GetValidRow() != 0`
+- `src.GetValidCol() != 0`
+- `src.GetValidRow() == dst.GetValidRow()`
+- `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）
+- `dst` 可使用单列 DN 布局（`BLayout::ColMajor`、`Cols == 1`）或有效列数为 1 的 ND 布局
+
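+### Notes on the A3 `tmp` temporary tile
+
+- `tmp` is unused when `srcValidCol <= ElementPerRepeat` and required when `srcValidCol > ElementPerRepeat`.
+- `tmp` must have the same number of rows as `src`.
+- Compute the stride required by `tmp` from the `validCol` of `src` with the following formula: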
 
 ```text
 repeats = ceil(validCol / elementPerRepeat)
-stride = (ceil(repeats * 2 / elementPerBlock) + ceil(repeats / elementPerBlock)) * elementPerBlock
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
 ```
 
+## 异常与非法情形
+
+- 运行时检查失败时，行为由具体实现定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 是 | 是 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -119,18 +103,15 @@ using namespace pto;
 void example_auto() {
   using SrcT = Tile<TileType::Vec, float, 16, 16>;
   using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using DstValT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
   using TmpT = Tile<TileType::Vec, float, 16, 16>;
   SrcT src;
   DstT dst;
-  DstValT dst;
   TmpT tmp;
   TROWARGMAX(dst, src, tmp);
-  TROWARGMAX(dstVal, dst, src, tmp);
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -140,46 +121,24 @@ using namespace pto;
 void example_manual() {
   using SrcT = Tile<TileType::Vec, float, 16, 16>;
   using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using DstValT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
   using TmpT = Tile<TileType::Vec, float, 16, 16>;
   SrcT src;
   DstT dst;
-  DstValT dst;
   TmpT tmp;
   TASSIGN(src, 0x1000);
   TASSIGN(dst, 0x2000);
-  TASSIGN(dstVal, 0x3000);
-  TASSIGN(tmp, 0x4000);
+  TASSIGN(tmp, 0x3000);
   TROWARGMAX(dst, src, tmp);
-  TROWARGMAX(dstVal, dst, src, tmp);
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 自动模式
 %dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
-
-```text
-%dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+## 相关页面
 
-新的 PTO ISA 文档应直接链接到分组后的指令集路径。
+- 指令集总览：[行归约](./tile/ops/reduce-and-expand/trowargmax_zh.md)
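
The `pto.trowargmax` page above defines `dst[i][0]` as the argmax over the valid columns of row `i`. The sketch below is a plain C++ reference of that semantics, assuming row-major `float` input. The name `rowArgmaxRef` is hypothetical, and the tie-breaking rule (lowest column index wins) is an assumption, since the page does not state one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference semantics only: for each valid row i, return the column index of
// the largest element among the first `validCol` entries of that row.
// Assumption: ties resolve to the lowest column index.
inline std::vector<std::uint32_t> rowArgmaxRef(const std::vector<float> &src, std::size_t cols,
                                               std::size_t validRow, std::size_t validCol) {
  assert(validRow != 0 && validCol != 0);  // mirrors the page's non-zero checks
  assert(validCol <= cols && src.size() >= validRow * cols);
  std::vector<std::uint32_t> dst(validRow);
  for (std::size_t i = 0; i < validRow; ++i) {
    std::size_t best = 0;
    for (std::size_t j = 1; j < validCol; ++j) {
      if (src[i * cols + j] > src[i * cols + best]) {
        best = j;
      }
    }
    dst[i] = static_cast<std::uint32_t>(best);
  }
  return dst;
}
```

The result has one entry per valid row, matching the page's statement that `dst` ends up with the same valid row count as `src` and a valid column count of 1.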
diff --git a/docs/isa/TROWARGMIN_zh.md b/docs/isa/TROWARGMIN_zh.md
index bc93817a..2fb10f15 100644
--- a/docs/isa/TROWARGMIN_zh.md
+++ b/docs/isa/TROWARGMIN_zh.md
@@ -1,41 +1,35 @@
 # pto.trowargmin
 
-旧路径兼容入口。规范页见 [pto.trowargmin](./tile/ops/reduce-and-expand/trowargmin_zh.md)。
+`pto.trowargmin` 属于[行归约](./tile/ops/reduce-and-expand/trowargmin_zh.md)指令集。
 
-![TROWARGMIN tile operation](../figures/isa/TROWARGMIN.svg)
+## 概述
 
-## 简介
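+Returns the column index of each row's minimum. For every row of the source tile, the instruction computes the column index of the smallest element and writes it to the first position of the matching row in the destination tile.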
 
-获取每行最小值对应列索引，或同时获取每行最小值及其对应列索引。
-
-## 数学语义
+## 机制
 
 设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= i < R`：
 
 $$ \mathrm{dst}_{i,0} = \underset{0 \le j < C}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
 
-$$ \mathrm{dstval}_{i,0} = \min_{0 \le j < C} \mathrm{src}_{i,j} $$
-
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
+### PTO-AS
 
 ```text
 %dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
 ```
 Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
 
-### IR Level 1（SSA）
+### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### IR Level 2（DPS）
+### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -43,73 +37,63 @@ pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%ds
 
 声明于 `include/pto/common/pto_instr.hpp`:
 
-仅输出索引：
-
 ```cpp
 template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
 PTO_INST RecordEvent TROWARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
 ```
 
-同时输出值和索引：
+## 输入
 
-```cpp
-template <typename TileDataOutVal, typename TileDataOutIdx, typename TileDataIn, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWARGMIN(TileDataOutVal &dstVal, TileDataOutIdx &dstIdx, TileDataIn &src, TileDataTmp &tmp,
-                                WaitEvents &... events)
-```
+| Operand | Role | Description |
+| --- | --- | --- |
+| `src` | Input | Source tile; element type `half` or `float` |
+| `tmp` | Scratch | Temporary tile used only on A3, and only when `srcValidCol > ElementPerRepeat` |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 每行最小值的列索引，`uint32_t` 或 `int32_t` 类型 |
+
+## 副作用
+
+指令执行完成后，`dst` 有效行数与 `src` 相同，有效列数为 1。
 
 ## 约束
 
-### 通用约束或检查
-
-- 支持的源元素类型：`half`、`float`。
-- `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- 仅输出索引时：
-    -`dst` 和 `src` 必须为 `TileType::Vec`。
-    - 支持的目标元素类型：`uint32_t`、`int32_t`。
-    - 运行时检查遵循共享的行归约检查路径：
-        - `src.GetValidRow() != 0`
-        - `src.GetValidCol() != 0`
-        - `src.GetValidRow() == dst.GetValidRow()`
-    - `dst` 通过共享的行归约索引检查路径约束，可使用以下任一非分形布局：
-        - 单列 DN 布局（`BLayout::ColMajor`、`Cols == 1`），或
-        - 有效列数为 1 的 ND 布局。
-- 同时输出值和索引时：
-    - `dstVal`、`dstIdx`、`src` 必须为 `TileType::Vec`。
-    - `dstVal`的元素类型必须与`src`的元素类型一致。
-    - 支持的目标元素类型：
-        - 源元素类型为`float`时，支持`uint32_t`、`int32_t`。
-        - 源元素类型为`half`时，支持`uint16_t`、`int16_t`。
-    - 运行时检查遵循共享的行归约检查路径：
-        - `src.GetValidRow() != 0`
-        - `src.GetValidCol() != 0`
-        - `src.GetValidRow() == dstIdx.GetValidRow()`
-        - `src.GetValidRow() == dstVal.GetValidRow()`
-    - `dstVal`、`dstIdx`通过共享的行归约索引检查路径约束，可使用以下任一非分形布局：
-        - 单列 DN 布局（`BLayout::ColMajor`、`Cols == 1`），或
-        - 有效列数为 1 的 ND 布局。
-
-### `tmp`临时Tile相关说明
-
-- 仅A3使用`tmp`临时Tile，A5接收`tmp`但实际并不使用。
-- 仅输出索引时，`tmp`临时Tile在`srcValidCol <= ElementPerRepeat`时不使用。
-- 同时输出值和索引且`srcValidCol <= ElementPerRepeat`时，`tmp`临时Tile可使用以下任一非分形布局：
-    - 单列 DN 布局（`BLayout::ColMajor`、`Cols == 1`），有效行数为`srcValidRow * 2`。
-    - 有效行数为`srcValidRow`且有效列数为 2 的 ND 布局。
-- `srcValidCol > ElementPerRepeat`时：
-    - `tmp` tile的行数和`src` tile的行数相同。
-    - 按以下公式根据`src` tile的`validCol`算出`tmp` tile所需stride：
+- `dst` 和 `src` 必须为 `TileType::Vec`
+- 支持的源元素类型：`half`、`float`
+- 支持的目标元素类型：`uint32_t`、`int32_t`
+- `src.GetValidRow() != 0`
+- `src.GetValidCol() != 0`
+- `src.GetValidRow() == dst.GetValidRow()`
+- `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）
+- `dst` 可使用单列 DN 布局（`BLayout::ColMajor`、`Cols == 1`）或有效列数为 1 的 ND 布局
+
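+### Notes on the A3 `tmp` temporary tile
+
+- `tmp` is unused when `srcValidCol <= ElementPerRepeat` and required when `srcValidCol > ElementPerRepeat`.
+- `tmp` must have the same number of rows as `src`.
+- Compute the stride required by `tmp` from the `validCol` of `src` with the following formula: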
 
 ```text
 repeats = ceil(validCol / elementPerRepeat)
-stride = (ceil(repeats * 2 / elementPerBlock) + ceil(repeats / elementPerBlock)) * elementPerBlock
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
 ```
 
+## 异常与非法情形
+
+- 运行时检查失败时，行为由具体实现定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 是 | 是 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -119,18 +103,15 @@ using namespace pto;
 void example_auto() {
   using SrcT = Tile<TileType::Vec, float, 16, 16>;
   using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using DstValT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
   using TmpT = Tile<TileType::Vec, float, 16, 16>;
   SrcT src;
   DstT dst;
-  DstValT dst;
   TmpT tmp;
   TROWARGMIN(dst, src, tmp);
-  TROWARGMIN(dstVal, dst, src, tmp);
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -140,46 +121,24 @@ using namespace pto;
 void example_manual() {
   using SrcT = Tile<TileType::Vec, float, 16, 16>;
   using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using DstValT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
   using TmpT = Tile<TileType::Vec, float, 16, 16>;
   SrcT src;
   DstT dst;
-  DstValT dst;
   TmpT tmp;
   TASSIGN(src, 0x1000);
   TASSIGN(dst, 0x2000);
-  TASSIGN(dstVal, 0x3000);
-  TASSIGN(tmp, 0x4000);
+  TASSIGN(tmp, 0x3000);
   TROWARGMIN(dst, src, tmp);
-  TROWARGMIN(dstVal, dst, src, tmp);
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 自动模式
 %dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
-
-```text
-%dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+## 相关页面
 
-新的 PTO ISA 文档应直接链接到分组后的指令集路径。
+- 指令集总览：[行归约](./tile/ops/reduce-and-expand/trowargmin_zh.md)
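
The `tmp` stride formula given for `pto.trowargmax` and `pto.trowargmin` above can be checked with a small host-side calculation. The sketch below evaluates it in plain C++. The helper names `ceilDiv` and `tmpStride` are hypothetical, and the sample values in the note after the block (64 elements per repeat, 8 elements per block) are assumed for illustration rather than taken from the page.

```cpp
#include <cassert>
#include <cstddef>

// Integer ceiling division for non-negative operands.
inline std::size_t ceilDiv(std::size_t a, std::size_t b) {
  assert(b != 0);
  return (a + b - 1) / b;
}

struct TmpLayout {
  std::size_t repeats;  // ceil(validCol / elementPerRepeat)
  std::size_t stride;   // per-row stride of tmp, in elements
};

// Evaluates the page's formula:
//   repeats = ceil(validCol / elementPerRepeat)
//   stride  = ceil(repeats * 2 / elementPerBlock) * elementPerBlock
//           + ceil(repeats / elementPerBlock) * elementPerBlock
inline TmpLayout tmpStride(std::size_t validCol, std::size_t elementPerRepeat,
                           std::size_t elementPerBlock) {
  const std::size_t repeats = ceilDiv(validCol, elementPerRepeat);
  const std::size_t stride = ceilDiv(repeats * 2, elementPerBlock) * elementPerBlock +
                             ceilDiv(repeats, elementPerBlock) * elementPerBlock;
  return {repeats, stride};
}
```

With the assumed values, `validCol = 200` gives `repeats = 4` and `stride = 8 + 8 = 16`, while `validCol = 1024` gives `repeats = 16` and `stride = 32 + 16 = 48`.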
diff --git a/docs/isa/TROWEXPANDADD_zh.md b/docs/isa/TROWEXPANDADD_zh.md
index 7110c5d8..04746280 100644
--- a/docs/isa/TROWEXPANDADD_zh.md
+++ b/docs/isa/TROWEXPANDADD_zh.md
@@ -1,14 +1,12 @@
-﻿# TROWEXPANDADD
+﻿# pto.trowexpandadd
 
-## 指令示意图
+`pto.trowexpandadd` 属于[行归约](./tile/ops/reduce-and-expand/trowexpandadd_zh.md)指令集。
 
-![TROWEXPANDADD tile operation](../figures/isa/TROWEXPANDADD.svg)
+## 概述
 
-## 简介
+Row-broadcast addition: adds the per-row scalar from `src1` to every element of the matching row of `src0` and writes the result to `dst`.
 
-行广播加法：加上一个每行标量向量。
-
-## 数学语义
+## 机制
 
 设 `R = dst.GetValidRow()` 和 `C = dst.GetValidCol()`。设 `s_i` 为从 `src1` 中获取的每行标量（每行一个值）。
 
@@ -18,11 +16,9 @@ $$
 \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + s_i
 $$
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
@@ -30,13 +26,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -53,42 +49,82 @@ template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, ty
 PTO_INST RecordEvent TROWEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src0` | 输入 | 源 tile 0，`half` 或 `float` 类型 |
+| `src1` | 输入 | 每行一个标量（模式 1）或每行 32 字节数据（模式 2） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 行广播加法结果，与 `src0` 类型相同 |
+
+## 副作用
+
+`dst` 的有效区域定义结果的计算范围。
+
 ## 约束
 
 - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-- `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
-- Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
-- 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
-- 模式 2：`src1` 预期提供**每行 32 字节数据**。
-- 确切的布局/分形约束是目标特定的；参见 `include/pto/npu/*/TRowExpand*.hpp` 下的后端头文件。
+- `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`
+- Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`
+- 模式 1：`src1` 预期提供每行一个标量（即，其有效形状必须覆盖 `R` 个值）
+- 模式 2：`src1` 预期提供每行 32 字节数据
+- 确切的布局/分形约束是目标特定的；参见 `include/pto/npu/*/TRowExpand*.hpp` 下的后端头文件
+
+## 异常与非法情形
+
+- 运行时检查失败时，行为由具体实现定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 是 | 是 |
 
 ## 示例
 
-参见 `docs/isa/` 和 `docs/coding/tutorials/` 中的相关示例。
+### C++ 自动模式
 
-## 汇编示例（ASM）
+```cpp
+#include <pto/pto-inst.hpp>
 
-### 自动模式
+using namespace pto;
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TROWEXPANDADD(dst, src0, src1);
+}
 ```
 
-### 手动模式
+### C++ 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TROWEXPANDADD(dst, src0, src1);
+}
 ```
 
-### PTO 汇编形式
+### PTO-AS
 
 ```text
-%dst = trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+# 自动模式
+%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
+
+## 相关页面
+
+- 指令集总览：[行归约](./tile/ops/reduce-and-expand/trowexpandadd_zh.md)
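
The `pto.trowexpandadd` page above defines `dst[i][j] = src0[i][j] + s_i`, with one scalar `s_i` per row. A minimal reference of that broadcast follows, assuming row-major `float` buffers and "mode 1" (one scalar per row). The name `rowExpandAddRef` and the separate `rowScalars` vector are illustrative and do not reflect how `src1` is laid out on hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics only: dst[i][j] = src0[i][j] + rowScalars[i] over the
// valid region (validRow x validCol) of a row-major buffer with `cols`
// elements per row. Elements outside the valid region are left untouched.
inline void rowExpandAddRef(std::vector<float> &dst, const std::vector<float> &src0,
                            const std::vector<float> &rowScalars, std::size_t cols,
                            std::size_t validRow, std::size_t validCol) {
  assert(validCol <= cols && rowScalars.size() >= validRow);
  assert(dst.size() >= validRow * cols && src0.size() >= validRow * cols);
  for (std::size_t i = 0; i < validRow; ++i) {
    const float s = rowScalars[i];  // the per-row scalar s_i
    for (std::size_t j = 0; j < validCol; ++j) {
      dst[i * cols + j] = src0[i * cols + j] + s;
    }
  }
}
```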
diff --git a/docs/isa/TROWEXPANDEXPDIF_zh.md b/docs/isa/TROWEXPANDEXPDIF_zh.md
index 8d08fdc5..4ddf7bf3 100644
--- a/docs/isa/TROWEXPANDEXPDIF_zh.md
+++ b/docs/isa/TROWEXPANDEXPDIF_zh.md
@@ -1,14 +1,12 @@
-﻿# TROWEXPANDEXPDIF
+﻿# pto.trowexpandexpdif
 
-## 指令示意图
+`pto.trowexpandexpdif` 属于[行归约](./tile/ops/reduce-and-expand/trowexpandexpdif_zh.md)指令集。
 
-![TROWEXPANDEXPDIF tile operation](../figures/isa/TROWEXPANDEXPDIF.svg)
+## 概述
 
-## 简介
+Row-wise exponential difference: computes `exp(src0 - src1)`, where `src1` supplies one scalar per row, and writes the result to `dst`.
 
-行指数差运算：计算 exp(src0 - src1)，其中 src1 为每行标量。
-
-## 数学语义
+## 机制
 
 设 `R = dst.GetValidRow()` 和 `C = dst.GetValidCol()`。设 `s_i` 为从 `src1` 中获取的每行标量（每行一个值）。
 
@@ -18,11 +16,9 @@ $$
 \mathrm{dst}_{i,j} = \exp(\mathrm{src0}_{i,j} - s_i)
 $$
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
@@ -30,13 +26,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -53,42 +49,82 @@ template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, ty
 PTO_INST RecordEvent TROWEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src0` | 输入 | 源 tile 0，`half` 或 `float` 类型 |
+| `src1` | 输入 | 每行一个标量（模式 1）或每行 32 字节数据（模式 2） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 行指数差结果，与 `src0` 类型相同 |
+
+## 副作用
+
+`dst` 的有效区域定义结果的计算范围。
+
 ## 约束
 
 - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-- `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
-- Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
-- 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
-- 模式 2：`src1` 预期提供**每行 32 字节数据**。
-- 确切的布局/分形约束是目标特定的；参见 `include/pto/npu/*/TRowExpand*.hpp` 下的后端头文件。
+- `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`
+- Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`
+- 模式 1：`src1` 预期提供每行一个标量（即，其有效形状必须覆盖 `R` 个值）
+- 模式 2：`src1` 预期提供每行 32 字节数据
+- 确切的布局/分形约束是目标特定的；参见 `include/pto/npu/*/TRowExpand*.hpp` 下的后端头文件
+
+## 异常与非法情形
+
+- 运行时检查失败时，行为由具体实现定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 是 | 是 |
 
 ## 示例
 
-参见 `docs/isa/` 和 `docs/coding/tutorials/` 中的相关示例。
+### C++ 自动模式
 
-## 汇编示例（ASM）
+```cpp
+#include <pto/pto-inst.hpp>
 
-### 自动模式
+using namespace pto;
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TROWEXPANDEXPDIF(dst, src0, src1);
+}
 ```
 
-### 手动模式
+### C++ 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TROWEXPANDEXPDIF(dst, src0, src1);
+}
 ```
 
-### PTO 汇编形式
+### PTO-AS
 
 ```text
-%dst = trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+# 自动模式
+%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
+
+## 相关页面
+
+- 指令集总览：[行归约](./tile/ops/reduce-and-expand/trowexpandexpdif_zh.md)
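
The `pto.trowexpandexpdif` page above defines `dst[i][j] = exp(src0[i][j] - s_i)`. One plausible use, which is an assumption here and not stated in the page, is the numerically stable softmax numerator, where `s_i` is the maximum of row `i`. The sketch below gives the reference semantics in plain C++; `rowExpandExpDifRef` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference semantics only: dst[i][j] = exp(src0[i][j] - rowScalars[i]) over
// the valid region of a row-major buffer with `cols` elements per row.
inline void rowExpandExpDifRef(std::vector<float> &dst, const std::vector<float> &src0,
                               const std::vector<float> &rowScalars, std::size_t cols,
                               std::size_t validRow, std::size_t validCol) {
  assert(validCol <= cols && rowScalars.size() >= validRow);
  assert(dst.size() >= validRow * cols && src0.size() >= validRow * cols);
  for (std::size_t i = 0; i < validRow; ++i) {
    const float s = rowScalars[i];  // the per-row scalar s_i
    for (std::size_t j = 0; j < validCol; ++j) {
      dst[i * cols + j] = std::exp(src0[i * cols + j] - s);
    }
  }
}
```

When `s_i` is the row maximum, every exponent is at most zero, so each output lies in `(0, 1]` and the exponential cannot overflow.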
diff --git a/docs/isa/TROWEXPANDMAX_zh.md b/docs/isa/TROWEXPANDMAX_zh.md
index b2fc538a..85ae6649 100644
--- a/docs/isa/TROWEXPANDMAX_zh.md
+++ b/docs/isa/TROWEXPANDMAX_zh.md
@@ -1,14 +1,12 @@
-﻿# TROWEXPANDMAX
+﻿# pto.trowexpandmax
 
-## 指令示意图
+`pto.trowexpandmax` 属于[行归约](./tile/ops/reduce-and-expand/trowexpandmax_zh.md)指令集。
 
-![TROWEXPANDMAX tile operation](../figures/isa/TROWEXPANDMAX.svg)
+## 概述
 
-## 简介
+Row-broadcast maximum: takes the maximum of each element of `src0` and the per-row scalar from `src1` for the matching row, and writes the result to `dst`.
 
-行广播最大值：与每行标量向量取最大值。
-
-## 数学语义
+## 机制
 
 设 `R = dst.GetValidRow()` 和 `C = dst.GetValidCol()`。设 `s_i` 为从 `src1` 中获取的每行标量（每行一个值）。
 
@@ -18,11 +16,9 @@ $$
 \mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, s_i)
 $$
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
@@ -30,13 +26,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -53,42 +49,82 @@ template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, ty
 PTO_INST RecordEvent TROWEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src0` | 输入 | 源 tile 0，`half` 或 `float` 类型 |
+| `src1` | 输入 | 每行一个标量（模式 1）或每行 32 字节数据（模式 2） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 行广播最大值结果，与 `src0` 类型相同 |
+
+## 副作用
+
+`dst` 的有效区域定义结果的计算范围。
+
 ## 约束
 
 - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-- `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
-- Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
-- 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
-- 模式 2：`src1` 预期提供**每行 32 字节数据**。
-- 确切的布局/分形约束是目标特定的；参见 `include/pto/npu/*/TRowExpand*.hpp` 下的后端头文件。
+- `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`
+- Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`
+- 模式 1：`src1` 预期提供每行一个标量（即，其有效形状必须覆盖 `R` 个值）
+- 模式 2：`src1` 预期提供每行 32 字节数据
+- 确切的布局/分形约束是目标特定的；参见 `include/pto/npu/*/TRowExpand*.hpp` 下的后端头文件
+
+## 异常与非法情形
+
+- 运行时检查失败时，行为由具体实现定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 是 | 是 |
 
 ## 示例
 
-参见 `docs/isa/` 和 `docs/coding/tutorials/` 中的相关示例。
+### C++ 自动模式
 
-## 汇编示例（ASM）
+```cpp
+#include <pto/pto-inst.hpp>
 
-### 自动模式
+using namespace pto;
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TROWEXPANDMAX(dst, src0, src1);
+}
 ```
 
-### 手动模式
+### C++ 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TROWEXPANDMAX(dst, src0, src1);
+}
 ```
 
-### PTO 汇编形式
+### PTO-AS
 
 ```text
-%dst = trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+# 自动模式
+%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
+
+## 相关页面
+
+- 指令集总览：[行归约](./tile/ops/reduce-and-expand/trowexpandmax_zh.md)
diff --git a/docs/isa/TROWEXPANDMIN_zh.md b/docs/isa/TROWEXPANDMIN_zh.md
index 8a28598b..9470f967 100644
--- a/docs/isa/TROWEXPANDMIN_zh.md
+++ b/docs/isa/TROWEXPANDMIN_zh.md
@@ -1,14 +1,12 @@
-﻿# TROWEXPANDMIN
+﻿# pto.trowexpandmin
 
-## 指令示意图
+`pto.trowexpandmin` 属于[行归约](./tile/ops/reduce-and-expand/trowexpandmin_zh.md)指令集。
 
-![TROWEXPANDMIN tile operation](../figures/isa/TROWEXPANDMIN.svg)
+## 概述
 
-## 简介
+Row-broadcast minimum: takes the minimum of each element of `src0` and the per-row scalar from `src1` for the matching row, and writes the result to `dst`.
 
-行广播最小值：与每行标量向量取最小值。
-
-## 数学语义
+## 机制
 
 设 `R = dst.GetValidRow()` 和 `C = dst.GetValidCol()`。设 `s_i` 为从 `src1` 中获取的每行标量（每行一个值）。
 
@@ -18,11 +16,9 @@ $$
 \mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, s_i)
 $$
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
@@ -30,13 +26,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -53,42 +49,82 @@ template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, ty
 PTO_INST RecordEvent TROWEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src0` | 输入 | 源 tile 0，`half` 或 `float` 类型 |
+| `src1` | 输入 | 每行一个标量（模式 1）或每行 32 字节数据（模式 2） |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 行广播最小值结果，与 `src0` 类型相同 |
+
+## 副作用
+
+`dst` 的有效区域定义结果的计算范围。
+
 ## 约束
 
 - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-- `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
-- Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
-- 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
-- 模式 2：`src1` 预期提供**每行 32 字节数据**。
-- 确切的布局/分形约束是目标特定的；参见 `include/pto/npu/*/TRowExpand*.hpp` 下的后端头文件。
+- `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`
+- Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`
+- 模式 1：`src1` 预期提供每行一个标量（即，其有效形状必须覆盖 `R` 个值）
+- 模式 2：`src1` 预期提供每行 32 字节数据
+- 确切的布局/分形约束是目标特定的；参见 `include/pto/npu/*/TRowExpand*.hpp` 下的后端头文件
+
+## 异常与非法情形
+
+- 运行时检查失败时，行为由具体实现定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 是 | 是 |
 
 ## 示例
 
-参见 `docs/isa/` 和 `docs/coding/tutorials/` 中的相关示例。
+### C++ 自动模式
 
-## 汇编示例（ASM）
+```cpp
+#include <pto/pto-inst.hpp>
 
-### 自动模式
+using namespace pto;
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TROWEXPANDMIN(dst, src0, src1);
+}
 ```
 
-### 手动模式
+### C++ 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TROWEXPANDMIN(dst, src0, src1);
+}
 ```
 
-### PTO 汇编形式
+### PTO-AS
 
 ```text
-%dst = trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+# 自动模式
+%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
+
+## 相关页面
+
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)；规范页：[pto.trowexpandmin](./tile/ops/reduce-and-expand/trowexpandmin_zh.md)
diff --git a/docs/isa/TROWMAX_zh.md b/docs/isa/TROWMAX_zh.md
index 3038f9a0..dc92de16 100644
--- a/docs/isa/TROWMAX_zh.md
+++ b/docs/isa/TROWMAX_zh.md
@@ -1,39 +1,34 @@
-﻿# TROWMAX
+﻿# pto.trowmax
 
-## 指令示意图
+`pto.trowmax` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TROWMAX tile operation](../figures/isa/TROWMAX.svg)
-
-## 简介
+## 概述
 
 通过取列间最大值来归约每一行。
 
-## 数学语义
+## 机制
 
 设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= i < R`：
 
 $$ \mathrm{dst}_{i,0} = \max_{0 \le j < C} \mathrm{src}_{i,j} $$
 
-## 汇编语法
+迭代域由 `src` 的 valid region 决定。C++ 内建接口需要显式传入 `tmp` 操作数。若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
-```text
-%dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -46,36 +41,48 @@ template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typen
 PTO_INST RecordEvent TROWMAX(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收按行取最大值结果 |
+| `%tmp` | 临时 tile | 用于分阶段归约的中间存储 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
 
-### 通用约束或检查
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<R, 1>` | 每行的最大元素值 |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
+## 约束
 
 - `dst` 和 `src` 必须均为 `TileType::Vec`。
 - `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `dst` 必须使用以下两种非分形布局之一：
-    - ND 布局（`BLayout::RowMajor`、`SLayout::NoneBox`），或
-    - 列数严格为 1 的 DN 布局（`BLayout::ColMajor`、`SLayout::NoneBox`、`Cols == 1`）。
+- `dst` 必须使用以下两种非分形布局之一：ND 布局（`BLayout::RowMajor`、`SLayout::NoneBox`），或列数严格为 1 的 DN 布局（`BLayout::ColMajor`、`SLayout::NoneBox`、`Cols == 1`）。
 - `dst` 和 `src` 的元素类型必须一致。
-- 运行时有效区域检查：
-    - `src.GetValidRow() != 0`
-    - `src.GetValidCol() != 0`
-    - `src.GetValidRow() == dst.GetValidRow()`
-- 内建接口签名要求显式传入 `tmp` 操作数。
+- 运行时有效区域检查：`src.GetValidRow() != 0`、`src.GetValidCol() != 0`、`src.GetValidRow() == dst.GetValidRow()`。
+
+## 异常与非法情形
 
-### A2A3 实现检查
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
 
-- 支持的元素类型：`half`、`float`、`int32_t`、`int16_t`。
-- 实现同时接受 ND 输出和 `Cols == 1` 的 DN 输出。
-- 运行时检查遵循共享的行归约检查路径：
-    - `src.GetValidRow() != 0`
-    - `src.GetValidCol() != 0`
-    - `src.GetValidRow() == dst.GetValidRow()`
-- 当前实现路径会将 `tmp` 传入后端调用，但本文档不额外补充 checked implementation 未显式约束的 `tmp` shape/layout 要求。
+## Target-Profile 限制
 
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | — |
+| `half` | Simulated | Supported | — |
+| `int16_t` / `int32_t` | — | Supported | — |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -93,7 +100,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -114,29 +121,23 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)
diff --git a/docs/isa/TROWMIN_zh.md b/docs/isa/TROWMIN_zh.md
index eb06e854..32d3248f 100644
--- a/docs/isa/TROWMIN_zh.md
+++ b/docs/isa/TROWMIN_zh.md
@@ -1,39 +1,34 @@
-﻿# TROWMIN
+﻿# pto.trowmin
 
-## 指令示意图
+`pto.trowmin` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TROWMIN tile operation](../figures/isa/TROWMIN.svg)
-
-## 简介
+## 概述
 
 通过取列间最小值来归约每一行。
 
-## 数学语义
+## 机制
 
 设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= i < R`：
 
 $$ \mathrm{dst}_{i,0} = \min_{0 \le j < C} \mathrm{src}_{i,j} $$
 
-## 汇编语法
+迭代域由 `src` 的 valid region 决定。C++ 内建接口需要显式传入 `tmp` 操作数。若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
-```text
-%dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -46,36 +41,48 @@ template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typen
 PTO_INST RecordEvent TROWMIN(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收按行取最小值结果 |
+| `%tmp` | 临时 tile | 用于分阶段归约的中间存储 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
 
-### 通用约束或检查
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<R, 1>` | 每行的最小元素值 |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
+## 约束
 
 - `dst` 和 `src` 必须均为 `TileType::Vec`。
 - `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `dst` 必须使用以下两种非分形布局之一：
-    - ND 布局（`BLayout::RowMajor`、`SLayout::NoneBox`），或
-    - 列数严格为 1 的 DN 布局（`BLayout::ColMajor`、`SLayout::NoneBox`、`Cols == 1`）。
+- `dst` 必须使用以下两种非分形布局之一：ND 布局（`BLayout::RowMajor`、`SLayout::NoneBox`），或列数严格为 1 的 DN 布局（`BLayout::ColMajor`、`SLayout::NoneBox`、`Cols == 1`）。
 - `dst` 和 `src` 的元素类型必须一致。
-- 运行时有效区域检查：
-    - `src.GetValidRow() != 0`
-    - `src.GetValidCol() != 0`
-    - `src.GetValidRow() == dst.GetValidRow()`
-- 内建接口签名要求显式传入 `tmp` 操作数。
+- 运行时有效区域检查：`src.GetValidRow() != 0`、`src.GetValidCol() != 0`、`src.GetValidRow() == dst.GetValidRow()`。
+
+## 异常与非法情形
 
-### A2A3 实现检查
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
 
-- 支持的元素类型：`half`、`float`、`int32_t`、`int16_t`。
-- 实现同时接受 ND 输出和 `Cols == 1` 的 DN 输出。
-- 运行时检查遵循共享的行归约检查路径：
-    - `src.GetValidRow() != 0`
-    - `src.GetValidCol() != 0`
-    - `src.GetValidRow() == dst.GetValidRow()`
-- 当前实现路径会将 `tmp` 传入后端调用，但本文档不额外补充 checked implementation 未显式约束的 `tmp` shape/layout 要求。
+## Target-Profile 限制
 
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | — |
+| `half` | Simulated | Supported | — |
+| `int16_t` / `int32_t` | — | Supported | — |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -93,7 +100,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -114,29 +121,23 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)
diff --git a/docs/isa/TROWPROD_zh.md b/docs/isa/TROWPROD_zh.md
index b29c8d67..f7e5b3e7 100644
--- a/docs/isa/TROWPROD_zh.md
+++ b/docs/isa/TROWPROD_zh.md
@@ -1,43 +1,39 @@
-﻿# TROWPROD
+﻿# pto.trowprod
 
-## 指令示意图
+`pto.trowprod` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TROWPROD tile operation](../figures/isa/TROWPROD.svg)
+## 概述
 
-## 简介
+对每行元素进行乘积归约。将源 tile 每行的所有元素相乘，结果写入目标 tile 对应行的第一个位置。
 
-对每行元素进行乘积归约。
-
-## 数学定义
+## 机制
 
 设 `R = src.GetValidRow()` 且 `C = src.GetValidCol()`。对于 `0 <= i < R`：
 
 $$ \mathrm{dst}_{i,0} = \prod_{j=0}^{C-1} \mathrm{src}_{i,j} $$
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = trowprod %src : !pto.tile<...> -> !pto.tile<...>
 ```
 降级可能引入内部临时 tile；C++ 内建函数需要显式的 `tmp` 操作数。
 
-### AS Level 1 (SSA)
+### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### AS Level 2 (DPS)
+### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowprod ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
-## C++ 内建函数
+## C++ 内建接口
 
 声明于 `include/pto/common/pto_instr.hpp`：
 
@@ -46,41 +42,49 @@ template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typen
 PTO_INST RecordEvent TROWPROD(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
-## 约束条件
+## 输入
 
-### 通用约束或检查
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | 源 tile，支持 `half`、`float`、`int32_t`、`int16_t` 元素类型 |
+| `tmp` | 临时 | 临时 tile；接口保留该参数，但当前实现路径未对其施加特定的 shape/layout 约束 |
 
-- `dst` 和 `src` 必须均为 `TileType::Vec`。
-- `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `dst` 必须使用以下两种非分形布局之一：
-    - ND 布局（`BLayout::RowMajor`、`SLayout::NoneBox`），或
-    - 列数严格为 1 的 DN 布局（`BLayout::ColMajor`、`SLayout::NoneBox`、`Cols == 1`）。
-- `dst` 和 `src` 的元素类型必须一致。
-- 运行时有效区域检查：
-    - `src.GetValidRow() != 0`
-    - `src.GetValidCol() != 0`
-    - `src.GetValidRow() == dst.GetValidRow()`
-- 内建接口签名要求显式传入 `tmp` 操作数。
+## 预期输出
 
-### A5 实现检查
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 每行元素的乘积结果，类型与 `src` 一致 |
 
-- 支持的元素类型：`half`、`float`、`int32_t`、`int16_t`。
-- 当前检查到的实现路径中，实际受约束的是 `src` 和 `dst`。
-- 当前实现路径中，没有额外要求 `tmp` 必须满足特定 shape/layout 约束。
+## 副作用
 
-## 实现说明
+指令执行完成后，`dst` 有效行数与 `src` 相同，有效列数为 1。
 
-`TROWPROD` 在当前代码库中遵循已实现的 A5 后端路径。该实现会在校验 `src` / `dst` 约束后，直接完成按行乘积归约。
+## 约束
 
-C++ 内建接口中仍然保留 `tmp` 参数，以保持接口形式一致：
+- `dst` 和 `src` 必须均为 `TileType::Vec`
+- `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）
+- `dst` 必须使用以下两种非分形布局之一：
+    - ND 布局（`BLayout::RowMajor`、`SLayout::NoneBox`）
+    - 列数严格为 1 的 DN 布局（`BLayout::ColMajor`、`SLayout::NoneBox`、`Cols == 1`）
+- `dst` 和 `src` 的元素类型必须一致
+- `src.GetValidRow() != 0`
+- `src.GetValidCol() != 0`
+- `src.GetValidRow() == dst.GetValidRow()`
+- 内建接口签名要求显式传入 `tmp` 操作数
+
+## 异常与非法情形
+
+- 运行时检查失败时，行为由具体实现定义
 
-1. `tmp` 仍然保留在内建接口签名和 AS lowering 形式中。
-2. 当前检查到的实现路径中，实际被约束的是 `src` 和 `dst`。
-3. 如果后续该指令的其他后端实现对 `tmp` 引入额外要求，文档应再按对应实现同步更新。
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 是 | 是 |
 
 ## 示例
 
-### Auto 模式
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -98,7 +102,7 @@ void example_auto() {
 }
 ```
 
-### Manual 模式
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -119,29 +123,13 @@ void example_manual() {
 }
 ```
 
-## ASM 形式示例
-
-### Auto 模式
+### PTO-AS
 
 ```text
-# Auto 模式：编译器/运行时管理的放置和调度。
+# 自动模式
 %dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### Manual 模式
+## 相关页面
 
-```text
-# Manual 模式：在发出指令前显式绑定资源。
-# Tile 操作数可选：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = trowprod %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowprod ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)；规范页：[pto.trowprod](./tile/ops/reduce-and-expand/trowprod_zh.md)
diff --git a/docs/isa/TROWSUM_zh.md b/docs/isa/TROWSUM_zh.md
index d3b3c1df..2e39b4de 100644
--- a/docs/isa/TROWSUM_zh.md
+++ b/docs/isa/TROWSUM_zh.md
@@ -1,54 +1,68 @@
-﻿# TROWSUM
+﻿# pto.trowsum
 
-## 指令示意图
+`pto.trowsum` 属于[归约与扩展](./tile/reduce-and-expand_zh.md)指令集。
 
-![TROWSUM tile operation](../figures/isa/TROWSUM.svg)
-
-## 简介
+## 概述
 
 通过对列求和来归约每一行。
 
-## 数学语义
+## 机制
 
 设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= i < R`：
 
 $$ \mathrm{dst}_{i,0} = \sum_{j=0}^{C-1} \mathrm{src}_{i,j} $$
 
-## 汇编语法
+## 语法
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
 ```text
 %dst = trowsum %src : !pto.tile<...> -> !pto.tile<...>
 ```
+
 降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
 PTO_INST RecordEvent TROWSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
 
-### 通用约束或检查
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 源 Tile | 输入 Tile |
+| `tmp` | 临时 Tile | 用于内部计算的临时 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 按行归约求和后的目标 Tile |
+
+## 副作用
+
+无。
+
+## 约束
 
 - `dst` 和 `src` 必须均为 `TileType::Vec`。
 - `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
@@ -62,20 +76,20 @@ PTO_INST RecordEvent TROWSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp
     - `src.GetValidRow() == dst.GetValidRow()`
 - 内建接口签名要求显式传入 `tmp` 操作数。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`half`、`float`、`int32_t`、`int16_t`。
-- 实现同时接受 ND 输出和 `Cols == 1` 的 DN 输出，并非仅支持 DN 输出。
-- 运行时检查遵循共享的行归约检查路径：
-    - `src.GetValidRow() != 0`
-    - `src.GetValidCol() != 0`
-    - `src.GetValidRow() == dst.GetValidRow()`
-- 当前实现路径会将 `tmp` 传入后端调用，但本文档不额外补充 checked implementation 未显式约束的 `tmp` shape/layout 要求。
+- 未定义。
 
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持的元素类型 | - | `half`、`float`、`int32_t`、`int16_t` | - |
+| 输出布局 | - | ND 或 `Cols==1` DN | - |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -93,7 +107,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -114,29 +128,19 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
 # 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
 
-### 手动模式
-
-```text
 # 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# pto.tassign %src, @tile(0x1000)
+# pto.tassign %dst, @tile(0x2000)
+# pto.tassign %tmp, @tile(0x3000)
 %dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = trowsum %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[归约与扩展](./tile/reduce-and-expand_zh.md)
diff --git a/docs/isa/TSETTF32MODE_zh.md b/docs/isa/TSETTF32MODE_zh.md
index b1c75bf9..64ecb77c 100644
--- a/docs/isa/TSETTF32MODE_zh.md
+++ b/docs/isa/TSETTF32MODE_zh.md
@@ -1,22 +1,18 @@
-# TSETTF32MODE
+# pto.tsettf32mode
 
-## 指令示意图
+`pto.tsettf32mode` 属于[同步与配置](./tile/sync-and-config_zh.md)指令集。
 
-![TSETTF32MODE tile operation](../figures/isa/TSETTF32MODE.svg)
+## 概述
 
-## 简介
+`TSETTF32MODE` 设置 TF32 相关的变换模式。它本身不做张量算术，而是更新后续相关计算会读取的模式状态。该指令属于同步与配置路径，更接近“模式寄存器写入”，而不是普通 tile 运算。它的效果取决于目标实现如何解释 TF32 模式配置。
 
-`TSETTF32MODE` 设置 TF32 相关的变换模式。它本身不做张量算术，而是更新后续相关计算会读取的模式状态。
+## 机制
 
-## 语义
+该指令写入 TF32 模式寄存器，设置是否启用 TF32 以及具体的变换模式。后续的矩阵运算指令会读取此状态来决定是否使用 TF32 计算路径。
 
-该指令属于同步与配置路径，更接近“模式寄存器写入”，而不是普通 tile 运算。它的效果取决于目标实现如何解释 TF32 模式配置。
+## 语法
 
-## 汇编语法
-
-PTO-AS 形式见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-示意形式：
+### PTO-AS
 
 ```text
 tsettf32mode {enable = true, mode = ...}
@@ -24,13 +20,13 @@ tsettf32mode {enable = true, mode = ...}
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tsettf32mode {enable = true, mode = ...}
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tsettf32mode ins({enable = true, mode = ...}) outs()
 ```
 
@@ -43,14 +39,42 @@ template <bool isEnable, RoundMode tf32TransMode = RoundMode::CAST_ROUND, typena
 PTO_INST RecordEvent TSETTF32MODE(WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| 事件 | 可选 | 等待事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 事件 | RecordEvent | 同步事件 |
+
+## 副作用
+
+该指令设置 TF32 模式寄存器状态，影响后续所有使用 TF32 格式的矩阵运算行为。
+
 ## 约束
 
-- 仅在对应 backend capability macro 启用时可用。
-- 精确模式取值和硬件行为由目标实现定义。
-- 该指令具有控制状态副作用，应与依赖它的计算指令建立正确顺序。
+- 仅在对应 backend capability macro 启用时可用
+- 精确模式取值和硬件行为由目标实现定义
+- 该指令具有控制状态副作用，应与依赖它的计算指令建立正确顺序
+
+## 异常与非法情形
+
+- 未指定
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| TF32 模式设置 | - | 可选 | 可选 |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
 using namespace pto;
@@ -60,7 +84,13 @@ void example_enable_tf32() {
 }
 ```
 
+### PTO-AS
+
+```text
+pto.tsettf32mode {enable = true, mode = ...}
+```
+
 ## 相关页面
 
-- [同步与配置指令集](./tile/sync-and-config_zh.md)
-- [TSETHF32MODE](./TSETHF32MODE_zh.md)
+- 指令集总览：[同步与配置](./tile/sync-and-config_zh.md)
+- 相关指令：[TSETHF32MODE](./TSETHF32MODE_zh.md)
diff --git a/docs/isa/TSET_IMG2COL_PADDING_zh.md b/docs/isa/TSET_IMG2COL_PADDING_zh.md
index bf70ce99..7e1a8c75 100644
--- a/docs/isa/TSET_IMG2COL_PADDING_zh.md
+++ b/docs/isa/TSET_IMG2COL_PADDING_zh.md
@@ -1,22 +1,18 @@
-# TSET_IMG2COL_PADDING
+# pto.tset_img2col_padding
 
-## 指令示意图
+`pto.tset_img2col_padding` 属于[同步与配置](./tile/sync-and-config_zh.md)指令集。
 
-![TSET_IMG2COL_PADDING tile operation](../figures/isa/TSET_IMG2COL_PADDING.svg)
+## 概述
 
-## 简介
+从 IMG2COL 配置 Tile 设置 IMG2COL 填充元数据。该指令本身不产生直接的张量算术结果，而是更新供后续数据搬运操作消费的 IMG2COL padding 控制状态。
 
-从 IMG2COL 配置 Tile 设置 IMG2COL 填充元数据。
+## 机制
 
-## 数学语义
+该指令将配置 Tile 中的填充参数写入硬件状态寄存器，后续的 IMG2COL 操作会读取这些参数来确定填充行为。
 
-该指令本身不产生直接的张量算术结果，而是更新供后续数据搬运操作消费的 IMG2COL padding 控制状态。
+## 语法
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-示意形式：
+### PTO-AS
 
 ```text
 tset_img2col_padding %cfg
@@ -24,13 +20,13 @@ tset_img2col_padding %cfg
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tset_img2col_padding ins(%cfg : !pto.fmatrix_config) outs()
 ```
 
@@ -48,15 +44,44 @@ PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... eve
 
 For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | IMG2COL 配置 Tile |
+| `events` | 可选 | 等待事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 事件 | RecordEvent | 同步事件 |
+
+## 副作用
+
+该指令更新 IMG2COL padding 控制状态，影响后续 TIMG2COL 操作的行为。
+
 ## 约束
 
-- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
-- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
-- The exact padding fields updated by this instruction are implementation-defined.
-- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
+- 该指令是后端特定的，仅适用于暴露 IMG2COL 配置状态的后端
+- `src` 必须是后端实现可接受的 IMG2COL 配置 tile 类型
+- 该指令更新的确切填充字段由实现定义
+- 在同一执行流中，应在依赖的 `TIMG2COL` 操作之前使用此指令
+
+## 异常与非法情形
+
+- 未指定
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| IMG2COL Padding 配置 | - | 支持 | 支持 |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -66,3 +91,14 @@ void example_set_img2col_padding(Img2colTileConfig<uint64_t>& cfg) {
   TSET_IMG2COL_PADDING(cfg);
 }
 ```
+
+### PTO-AS
+
+```text
+pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
+```
+
+## 相关页面
+
+- 指令集总览：[同步与配置](./tile/sync-and-config_zh.md)
+- 相关指令：[TIMG2COL](./TIMG2COL_zh.md)、[TSET_IMG2COL_RPT](./TSET_IMG2COL_RPT_zh.md)
diff --git a/docs/isa/TSET_IMG2COL_RPT_zh.md b/docs/isa/TSET_IMG2COL_RPT_zh.md
index 4d84bbcf..31c7e163 100644
--- a/docs/isa/TSET_IMG2COL_RPT_zh.md
+++ b/docs/isa/TSET_IMG2COL_RPT_zh.md
@@ -1,22 +1,18 @@
-# TSET_IMG2COL_RPT
+# pto.tset_img2col_rpt
 
-## 指令示意图
+`pto.tset_img2col_rpt` 属于[同步与配置](./tile/sync-and-config_zh.md)指令集。
 
-![TSET_IMG2COL_RPT tile operation](../figures/isa/TSET_IMG2COL_RPT.svg)
+## 概述
 
-## 简介
+从 IMG2COL 配置 Tile 设置 IMG2COL 重复次数元数据。该指令本身不产生直接的张量算术结果，而是更新供后续数据搬运操作使用的 IMG2COL 控制状态。
 
-从 IMG2COL 配置 Tile 设置 IMG2COL 重复次数元数据。
+## 机制
 
-## 数学语义
+该指令将配置 Tile 中的重复次数参数写入硬件状态寄存器，后续的 IMG2COL 操作会读取这些参数来确定数据重复搬运的次数。
 
-该指令本身不产生直接的张量算术结果，而是更新供后续数据搬运操作使用的 IMG2COL 控制状态。
+## 语法
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-示意形式：
+### PTO-AS
 
 ```text
 tset_img2col_rpt %cfg
@@ -24,13 +20,13 @@ tset_img2col_rpt %cfg
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tset_img2col_rpt ins(%cfg : !pto.fmatrix_config) outs()
 ```
 
@@ -48,15 +44,44 @@ PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events)
 
 For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | IMG2COL 配置 Tile |
+| `events` | 可选 | 等待事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 事件 | RecordEvent | 同步事件 |
+
+## 副作用
+
+该指令更新 IMG2COL 重复次数控制状态，影响后续 TIMG2COL 操作的数据搬运行为。
+
 ## 约束
 
-- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
-- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
-- The exact register/metadata fields updated by this instruction are implementation-defined.
-- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
+- 该指令是后端特定的，仅适用于暴露 IMG2COL 配置状态的后端
+- `src` 必须是后端实现可接受的 IMG2COL 配置 tile 类型
+- 该指令更新的确切寄存器/元数据字段由实现定义
+- 在同一执行流中，应在依赖的 `TIMG2COL` 操作之前使用此指令
+
+## 异常与非法情形
+
+- 未指定
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| IMG2COL 重复次数配置 | - | 支持 | 支持 |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -66,3 +91,14 @@ void example_set_img2col_rpt(Img2colTileConfig<uint64_t>& cfg) {
   TSET_IMG2COL_RPT(cfg);
 }
 ```
+
+### PTO-AS
+
+```text
+pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
+```
+
+## 相关页面
+
+- 指令集总览：[同步与配置](./tile/sync-and-config_zh.md)
+- 相关指令：[TIMG2COL](./TIMG2COL_zh.md)、[TSET_IMG2COL_PADDING](./TSET_IMG2COL_PADDING_zh.md)
diff --git a/docs/isa/TSORT32_zh.md b/docs/isa/TSORT32_zh.md
index 2229f183..69607982 100644
--- a/docs/isa/TSORT32_zh.md
+++ b/docs/isa/TSORT32_zh.md
@@ -1,14 +1,12 @@
-﻿# TSORT32
+﻿# pto.tsort32
 
-## 指令示意图
+`pto.tsort32` 属于[不规则与复杂](./tile/irregular-and-complex_zh.md)指令集。
 
-![TSORT32 tile operation](../figures/isa/TSORT32.svg)
+## 概述
 
-## 简介
+对 `src` 的每个 32 元素块，与 `idx` 中对应的索引一起进行排序，并将排序后的值-索引对写入 `dst`。在 CPU 仿真实现中，按值降序排序；当值相同时，索引较小者优先。
 
-对 `src` 的每个 32 元素块，与 `idx` 中对应的索引一起进行排序，并将排序后的值-索引对写入 `dst`。
-
-## 数学语义
+## 机制
 
 对每一行，`TSORT32` 会按独立的 32 元素块处理 `src`。设第 `b` 个块覆盖列 `32b ... 32b+31`，该块的有效元素数为 `n_b = min(32, C - 32b)`。
 
@@ -27,14 +25,14 @@ $$
 其中 `π` 是该 32 元素块对应的排序置换。
 
 说明：
-
 - `idx` 是输入 Tile，不是输出 Tile。
 - `dst` 保存的是排序后的值-索引对，而不只是排序后的值。
-- 在 CPU 仿真实现中，按值降序排序；当值相同时，索引较小者优先。
 
-## 汇编语法
+## 语法
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -44,20 +42,18 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename DstTileData, typename SrcTileData, typename IdxTileData>
 PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx);
@@ -66,23 +62,53 @@ template <typename DstTileData, typename SrcTileData, typename IdxTileData, type
 PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx, TmpTileData &tmp);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 源 Tile | 包含待排序值的输入 Tile |
+| `idx` | 源 Tile | 与 `src` 一起参与重排的索引 Tile |
+| `tmp` | 临时 Tile | 可选，支持非 32 对齐尾块 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 排序后的值-索引对 |
+
+## 副作用
+
+无。
+
 ## 约束
 
 - `TSORT32` 不接受 `WaitEvents&...` 参数，也不在内部调用 `TSYNC(...)`；如有需要请显式同步。
 - `idx` 在两个重载中都是必需的输入操作数；它提供与 `src` 一起参与重排的索引。
-- **实现检查 (A2A3/A5)**:
+- 实现检查 (A2A3/A5):
     - `DstTileData::DType` 必须是 `half` 或 `float`。
     - `SrcTileData::DType` 必须与 `DstTileData::DType` 匹配。
     - `IdxTileData::DType` 必须是 `uint32_t`。
     - `dst`/`src`/`idx` Tile 位置必须是 `TileType::Vec`，且都必须是行主序（`isRowMajor`）。
-- **有效区域**:
+- 有效区域:
     - 实现使用 `dst.GetValidRow()` 作为行数。
     - 实现使用 `src.GetValidCol()` 确定每行参与排序的元素数量。
     - 排序按独立的 32 元素块进行；4 参数重载额外通过 `tmp` 支持非 32 对齐尾块。
 
+## 异常与非法情形
+
+- 未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| DstTileData::DType | - | `half` 或 `float` | `half` 或 `float` |
+| IdxTileData::DType | - | `uint32_t` | `uint32_t` |
+| 布局要求 | - | `isRowMajor` | `isRowMajor` |
+
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -100,7 +126,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -121,30 +147,19 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
 # 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
 
-### 手动模式
-
-```text
 # 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-# pto.tassign %arg2, @tile(0x3000)
+# pto.tassign %src, @tile(0x1000)
+# pto.tassign %idx, @tile(0x2000)
+# pto.tassign %dst, @tile(0x3000)
 %dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[不规则与复杂指令](./tile/irregular-and-complex_zh.md)
diff --git a/docs/isa/TSQRT_zh.md b/docs/isa/TSQRT_zh.md
index a0f8f75c..739d37b0 100644
--- a/docs/isa/TSQRT_zh.md
+++ b/docs/isa/TSQRT_zh.md
@@ -1,24 +1,24 @@
-﻿# TSQRT
-
-## 指令示意图
+﻿# pto.tsqrt
 
 ![TSQRT tile operation](../figures/isa/TSQRT.svg)
 
-## 简介
+`pto.tsqrt` 属于[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)指令集。
+
+## 概述
 
-逐元素平方根。
+对 tile 做逐元素平方根，结果写入目标 tile。迭代域由目标 tile 的 valid region 决定。
 
-## 数学语义
+## 机制
 
-对每个元素 `(i, j)` 在有效区域内：
+对目标 tile 的 valid region 中每个 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \sqrt{\mathrm{src}_{i,j}} $$
 
-## 汇编语法
+它是 tile 路径的一元平方根操作，适用于归一化、距离计算和数值预处理。对负输入的定义域行为由目标 profile 决定。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tsqrt %src : !pto.tile<...>
@@ -26,45 +26,66 @@ PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
 PTO_INST RecordEvent TSQRT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 输入 tile |
+| `%dst` | 目标 tile | 接收逐元素平方根结果 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | `dst` valid region 内的每个元素都等于 `sqrt(src)` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (NPU)**:
-    - `TileData::DType` 必须是以下之一：`float` 或 `half`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`);
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域.
-- **域 / NaN**:
-    - 行为由目标定义（例如，对于负数输入）。
+- 支持类型当前是 `float` / `half`。
+- tile 必须是行主序向量 tile。
+- 迭代域由 `dst.GetValidRow()` / `dst.GetValidCol()` 决定。
+- 对负输入的定义域行为由目标 profile 决定。
+
+## 异常与非法情形
+
+- 非法操作数组合、不支持的数据类型、不合法布局或不支持的 target-profile 模式，会被 verifier 或后端实现拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `float` | Simulated | Supported | Supported |
+| `half` | Simulated | Supported | Supported |
+| 布局 | Any | RowMajor only | RowMajor only |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example_auto() {
@@ -74,11 +95,10 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example_manual() {
@@ -90,29 +110,24 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 %dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
+pto.tassign %arg0, @tile(0x1000)
+pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
 
-```text
+# PTO 汇编形式
 %dst = tsqrt %src : !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[逐元素 Tile-Tile](./tile/elementwise-tile-tile_zh.md)
+- 规范页：[pto.tsqrt](./tile/ops/elementwise-tile-tile/tsqrt_zh.md)
diff --git a/docs/isa/TSTORE_zh.md b/docs/isa/TSTORE_zh.md
index d033dba5..5fae068f 100644
--- a/docs/isa/TSTORE_zh.md
+++ b/docs/isa/TSTORE_zh.md
@@ -1,22 +1,18 @@
-﻿# TSTORE
+﻿# pto.tstore
 
-## 指令示意图
+`pto.tstore` 属于[内存与数据搬运](./tile/memory-and-data-movement_zh.md)指令集。
 
-![TSTORE tile operation](../figures/isa/TSTORE.svg)
+## 概述
 
-## 简介
+将 Tile 中的数据存储到 GlobalTensor (GM)，可选使用原子写入或量化参数。符号表示取决于 GlobalTensor 的形状/步长和 Tile 的布局。概念上（二维视图，带基础偏移量）：$\mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{src}_{i,j}$
 
-将 Tile 中的数据存储到 GlobalTensor (GM)，可选使用原子写入或量化参数。
+## 机制
 
-## 数学语义
+该指令将源 tile 的有效区域数据写入全局内存目标地址，支持可选的原子操作类型和量化参数。原子写入模式允许并发安全地更新全局内存位置。
 
-符号表示取决于 `GlobalTensor` 的形状/步长和 `Tile` 的布局。概念上（二维视图，带基础偏移量）：
+## 语法
 
-$$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{src}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 同步形式：
 
@@ -26,13 +22,13 @@ tstore %t1, %sv_out[%c0, %c0]
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
 ```
 
@@ -54,39 +50,72 @@ template <typename TileData, typename GlobalData, typename FpTileData, AtomicTyp
 PTO_INST RecordEvent TSTORE_FP(GlobalData& dst, TileData& src, FpTileData& fp, WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| dst | 输出 | 全局内存目标（GlobalTensor） |
+| src | 输入 | 源 Tile |
+| preQuantScalar | 可选 | 标量量化参数 |
+| fp | 可选 | 浮点量化 Tile |
+| events | 可选 | 等待事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| dst | GlobalTensor& | 数据已写入的全局内存位置 |
+| 事件 | RecordEvent | 同步事件 |
+
+## 副作用
+
+该指令可能产生全局内存的原子写入副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - 源 tile 位置必须是以下之一：`TileType::Vec`、`TileType::Mat`、`TileType::Acc`。
-    - 运行时：所有 `dst.GetShape(dim)` 值和 `src.GetValidRow()/GetValidCol()` 必须 `> 0`。
-    - 对于 `TileType::Vec` / `TileType::Mat`：
-    - `TileData::DType` 必须是以下之一：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`int64_t`、`uint64_t`、`half`、`bfloat16_t`、`float`。
-    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`。
-    - 布局必须匹配 ND/DN/NZ（或特殊情况：`TileData::Rows == 1` 或 `TileData::Cols == 1`）。
-    - 对于 `int64_t/uint64_t`，仅支持 ND->ND 或 DN->DN。
-    - 对于 `TileType::Acc`（包括量化/原子变体）：
-    - 目标布局必须是 ND 或 NZ。
-    - 源数据类型必须是 `int32_t` 或 `float`。
-    - 不使用量化时，目标数据类型必须是 `__gm__ int32_t/float/half/bfloat16_t`。
-    - 静态形状约束：`1 <= TileData::Cols <= 4095`；如果是 ND 则 `1 <= TileData::Rows <= 8192`；如果是 NZ 则 `1 <= TileData::Rows <= 65535` 且 `TileData::Cols % 16 == 0`。
-    - 运行时：`1 <= src.GetValidCol() <= 4095`。
-- **实现检查 (A5)**:
-    - 源 tile 位置必须是 `TileType::Vec` 或 `TileType::Acc`（此目标不支持 `Mat` 存储）。
-    - 对于 `TileType::Vec`：
-    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`。
-    - `TileData::DType` 必须是以下之一：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`int64_t`、`uint64_t`、`half`、`bfloat16_t`、`float`、`float8_e4m3_t`、`float8_e5m2_t`、`hifloat8_t`、`float4_e1m2x2_t`、`float4_e2m1x2_t`。
-    - 布局必须匹配 ND/DN/NZ（或特殊情况：`TileData::Rows == 1` 或 `TileData::Cols == 1`）。
-    - 强制执行额外的对齐约束（例如，对于 ND，行主序宽度（以字节为单位）必须是 32 的倍数；对于 DN，列主序高度（以字节为单位）必须是 32 的倍数，但有特殊情况例外）。
-    - 对于 `TileType::Acc`：
-    - 目标布局必须是 ND 或 NZ；源数据类型必须是 `int32_t` 或 `float`。
-    - 不使用量化时，目标数据类型必须是 `__gm__ int32_t/float/half/bfloat16_t`。
-    - 静态形状约束与 A2A3 对于行/列的约束相同；`AtomicAdd` 额外限制目标数据类型为支持的原子类型。
-- **有效区域**:
-    - 实现使用 `src.GetValidRow()` / `src.GetValidCol()` 作为传输大小.
+- 实现检查 (A2A3):
+    - 源 tile 位置必须是以下之一：`TileType::Vec`、`TileType::Mat`、`TileType::Acc`
+    - 运行时：所有 `dst.GetShape(dim)` 值和 `src.GetValidRow()/GetValidCol()` 必须 `> 0`
+    - 对于 `TileType::Vec` / `TileType::Mat`：
+    - `TileData::DType` 必须是以下之一：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`int64_t`、`uint64_t`、`half`、`bfloat16_t`、`float`
+    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`
+    - 布局必须匹配 ND/DN/NZ（或特殊情况：`TileData::Rows == 1` 或 `TileData::Cols == 1`）
+    - 对于 `int64_t/uint64_t`，仅支持 ND->ND 或 DN->DN
+    - 对于 `TileType::Acc`（包括量化/原子变体）：
+    - 目标布局必须是 ND 或 NZ
+    - 源数据类型必须是 `int32_t` 或 `float`
+    - 不使用量化时，目标数据类型必须是 `__gm__ int32_t/float/half/bfloat16_t`
+    - 静态形状约束：`1 <= TileData::Cols <= 4095`；如果是 ND 则 `1 <= TileData::Rows <= 8192`；如果是 NZ 则 `1 <= TileData::Rows <= 65535` 且 `TileData::Cols % 16 == 0`
+    - 运行时：`1 <= src.GetValidCol() <= 4095`
+- 实现检查 (A5):
+    - 源 tile 位置必须是 `TileType::Vec` 或 `TileType::Acc`（此目标不支持 `Mat` 存储）
+    - 对于 `TileType::Vec`：
+    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`
+    - `TileData::DType` 必须是以下之一：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`int64_t`、`uint64_t`、`half`、`bfloat16_t`、`float`、`float8_e4m3_t`、`float8_e5m2_t`、`hifloat8_t`、`float4_e1m2x2_t`、`float4_e2m1x2_t`
+    - 布局必须匹配 ND/DN/NZ（或特殊情况：`TileData::Rows == 1` 或 `TileData::Cols == 1`）
+    - 强制执行额外的对齐约束（例如，对于 ND，行主序宽度（以字节为单位）必须是 32 的倍数；对于 DN，列主序高度（以字节为单位）必须是 32 的倍数，但有特殊情况例外）
+    - 对于 `TileType::Acc`：
+    - 目标布局必须是 ND 或 NZ；源数据类型必须是 `int32_t` 或 `float`
+    - 不使用量化时，目标数据类型必须是 `__gm__ int32_t/float/half/bfloat16_t`
+    - 静态形状约束与 A2A3 对于行/列的约束相同；`AtomicAdd` 额外限制目标数据类型为支持的原子类型
+- 有效区域:
+    - 实现使用 `src.GetValidRow()` / `src.GetValidCol()` 作为传输大小
+
+## 异常与非法情形
+
+- 未指定
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| Vec/Mat/Acc Tile 存储 | - | 支持 | Vec/Acc 有限支持 |
+| AtomicAdd | - | 支持 | 支持 |
+| 量化存储 | - | 支持 | 支持 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -106,7 +135,7 @@ void example_auto(__gm__ T* out) {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -127,29 +156,22 @@ void example_manual(__gm__ T* out) {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式
 pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### 手动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
+# 手动模式
 pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
 
-### PTO 汇编形式
-
-```text
+# PTO 汇编形式
 tstore %t1, %sv_out[%c0, %c0]
 # AS Level 2 (DPS)
 pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
 ```
+
+## 相关页面
+
+- 指令集总览：[内存与数据搬运](./tile/memory-and-data-movement_zh.md)
+- 相关指令：[TLOAD](./TLOAD_zh.md)
diff --git a/docs/isa/comm/README_zh.md b/docs/isa/comm/README_zh.md
index 07785ec2..0d4eafcf 100644
--- a/docs/isa/comm/README_zh.md
+++ b/docs/isa/comm/README_zh.md
@@ -1,130 +1,39 @@
-# PTO 通信 ISA 参考手册
+# PTO Communication ISA Reference
 
-本目录包含 PTO 通信 ISA 的逐指令参考文档。
+本目录包含 PTO 通信 ISA 的逐指令参考文档。通信操作实现跨执行代理和并行 rank 的数据移动和同步。
 
-- 权威来源（C++ 内建接口）：`include/pto/comm/pto_comm_inst.hpp`
-- 类型定义：`include/pto/comm/comm_types.hpp`
-
-## 点对点通信（同步）
-- [**TPUT**](TPUT_zh.md)：远程写（GM → UB → GM）
-- [**TGET**](TGET_zh.md)：远程读（GM → UB → GM）
-
-## 点对点通信（异步）
-- [**TPUT_ASYNC**](TPUT_ASYNC_zh.md)：异步远程写（GM → DMA 引擎 → GM）
-- [**TGET_ASYNC**](TGET_ASYNC_zh.md)：异步远程读（GM → DMA 引擎 → GM）
-
-## 基于信号的同步
-- [**TNOTIFY**](TNOTIFY_zh.md)：向远端 NPU 发送通知
-- [**TWAIT**](TWAIT_zh.md)：阻塞等待信号条件满足
-- [**TTEST**](TTEST_zh.md)：非阻塞检测信号条件
-
-## 集合通信
-
-- [**TGATHER**](TGATHER_zh.md)：从所有 rank 收集数据
-- [**TSCATTER**](TSCATTER_zh.md)：向所有 rank 分发数据
-- [**TREDUCE**](TREDUCE_zh.md)：从所有 rank 归约数据到本地
-- [**TBROADCAST**](TBROADCAST_zh.md)：从当前 NPU 广播数据到所有 rank
-
-## 类型定义
-
-### NotifyOp
-
-`TNOTIFY` 的操作类型：
-
-| 值 | 说明 |
-|-------|-------------|
-| `NotifyOp::Set` | 直接赋值（`signal = value`）|
-| `NotifyOp::AtomicAdd` | 原子加（`signal += value`）|
-
-### WaitCmp
-
-`TWAIT` 和 `TTEST` 的比较运算符：
-
-| 值 | 说明 |
-|-------|-------------|
-| `WaitCmp::EQ` | 等于（`==`）|
-| `WaitCmp::NE` | 不等于（`!=`）|
-| `WaitCmp::GT` | 大于（`>`）|
-| `WaitCmp::GE` | 大于等于（`>=`）|
-| `WaitCmp::LT` | 小于（`<`）|
-| `WaitCmp::LE` | 小于等于（`<=`）|
-
-```cpp
-// 用法示例（统一运行时参数风格）：
-comm::TNOTIFY(signal, 1, comm::NotifyOp::Set);
-comm::TWAIT(signal, 1, comm::WaitCmp::EQ);
-comm::TTEST(signal, 1, comm::WaitCmp::GE);
-```
-
-### ReduceOp
-
-`TREDUCE` 的归约运算符：
-
-| 值 | 说明 |
-|-------|-------------|
-| `ReduceOp::Sum` | 逐元素求和 |
-| `ReduceOp::Max` | 逐元素取最大值 |
-| `ReduceOp::Min` | 逐元素取最小值 |
+## 权威来源
 
-### AtomicType
-
-`TPUT` 的原子操作类型（定义于 `include/pto/common/constants.hpp`）：
-
-| 值 | 说明 |
-|-------|-------------|
-| `AtomicType::AtomicNone` | 无原子操作（默认）|
-| `AtomicType::AtomicAdd` | 原子加操作 |
-
-### DmaEngine
-
-`TPUT_ASYNC` 和 `TGET_ASYNC` 的 DMA 后端选择：
-
-| 值 | 说明 |
-|-------|-------------|
-| `DmaEngine::SDMA` | SDMA 引擎（支持一维传输，Ascend950 上仅支持TGET|
-| `DmaEngine::URMA` | URMA 引擎（支持一维传输，仅Ascend950 / NPU_ARCH 3510）支持|
-
-### AsyncEvent
-
-由 `TPUT_ASYNC` / `TGET_ASYNC` 返回，用于同步传输完成状态：
-
-```cpp
-struct AsyncEvent {
-    uint64_t handle;
-    DmaEngine engine;
-
-    bool valid() const;                        // handle != 0 时返回 true
-    bool Wait(const AsyncSession &session) const; // 阻塞直到传输完成
-    bool Test(const AsyncSession &session) const; // 非阻塞完成检测
-};
-```
-
-### AsyncSession
-
-用于异步 DMA 操作的引擎无关会话对象，构建一次后传递给所有异步调用：
-
-```cpp
-comm::AsyncSession session;
-comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, workspace, session);
-```
-
-定义于 `include/pto/comm/async/async_types.hpp`。构建参数详见 [TPUT_ASYNC](TPUT_ASYNC_zh.md)。
-
-### ParallelGroup
-
-用于多 NPU 集合通信的包装器：
-
-```cpp
-template <typename GlobalData>
-struct ParallelGroup {
-    // 指向 `GlobalData` 对象数组的指针（每个对象封装一个 GM 地址）。
-    // 数组本身是本地元数据；封装的地址可以指向本地或远端 GM，
-    // 具体取决于集合通信指令的语义。
-    GlobalData *tensors;
-    int nranks;   // rank 总数
-    int rootIdx;  // 根 NPU 的 rank 索引
+- C++ 内建接口：`include/pto/comm/pto_comm_inst.hpp`
+- 类型定义：`include/pto/comm/comm_types.hpp`
 
-    // 工厂函数（推荐）：从已有 tensor 数组构建。
-    static ParallelGroup Create(GlobalData *tensorArray, int size, int rank_id);
-};
-```
+## 按操作类型选择
+
+| 类型 | 指令 | 说明 |
+|------|------|------|
+| **点对点同步写** | [TPUT](./TPUT_zh.md) | 远程写（GM → UB → GM） |
+| **点对点同步读** | [TGET](./TGET_zh.md) | 远程读（GM → UB → GM） |
+| **点对点异步写** | [TPUT_ASYNC](./TPUT_ASYNC_zh.md) | 异步远程写（GM → DMA 引擎 → GM） |
+| **点对点异步读** | [TGET_ASYNC](./TGET_ASYNC_zh.md) | 异步远程读（GM → DMA 引擎 → GM） |
+| **通知** | [TNOTIFY](./TNOTIFY_zh.md) | 向远端 NPU 发送通知 |
+| **等待** | [TWAIT](./TWAIT_zh.md) | 阻塞等待信号条件满足 |
+| **非阻塞检测** | [TTEST](./TTEST_zh.md) | 非阻塞检测信号条件 |
+| **广播** | [TBROADCAST](./TBROADCAST_zh.md) | 从 root NPU 广播数据到所有 rank |
+| **收集** | [TGATHER](./TGATHER_zh.md) | 从所有 rank 收集数据到 root |
+| **散发** | [TSCATTER](./TSCATTER_zh.md) | 从 root 向所有 rank 分发数据 |
+| **归约** | [TREDUCE](./TREDUCE_zh.md) | 从所有 rank 归约数据到本地 |
+
+## 同步 vs 异步
+
+| 类型 | 特点 | CANN 版本要求 |
+|------|------|--------------|
+| 同步（`TPUT`/`TGET`） | 操作阻塞直到完成，通过 UB 暂存 tile | CANN 8.x 及以上 |
+| 异步（`TPUT_ASYNC`/`TGET_ASYNC`） | 非阻塞，通过 SDMA/URMA 引擎，支持轮询查询 | **CANN 9.0 及以上** |
+
+## 相关页面
+
+| 页面 | 内容 |
+|------|------|
+| [通信与运行时契约](../other/communication-and-runtime_zh.md) | 通信指令集的规范契约 |
+| [其他与通信](../other/README_zh.md) | 其他与通信指令集总入口 |
+| [tests/](../../../tests/README_zh.md) | 通信测试运行方式 |
diff --git a/docs/isa/comm/TBROADCAST.md b/docs/isa/comm/TBROADCAST.md
index 6cf6ad29..69e3b1b2 100644
--- a/docs/isa/comm/TBROADCAST.md
+++ b/docs/isa/comm/TBROADCAST.md
@@ -1,67 +1,112 @@
 ﻿# TBROADCAST
 
-## Introduction
+`TBROADCAST` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-Broadcast data from current NPU to all ranks in the parallel group. The calling NPU is the root and its data is copied to all other NPUs.
+## Summary
 
-Only the root needs to execute `TBROADCAST`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `TBROADCAST` on non-root ranks is undefined behavior.
+Broadcast data from the root NPU to all ranks in a parallel group. The calling NPU serves as the root; its data is replicated to every other NPU in the group.
 
-**Large Tile Support**: When the GlobalTensor exceeds the UB (Unified Buffer) tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.
+Only the root executes the broadcast. Non-root ranks only ensure their destination buffers are allocated and writable. Executing `TBROADCAST` on a non-root rank has undefined behavior.
 
-## Math Interpretation
+When the GlobalTensor exceeds the UB tile capacity, the transfer is automatically chunked via 2D sliding.
 
-After the operation:
+## Mechanism
 
-$$ \mathrm{dst}^{(k)}_{i,j} = \mathrm{src}^{(\text{root})}_{i,j} \quad \forall k \in [0, N) $$
+`TBROADCAST` copies data from the root NPU's source buffer to the corresponding destination buffer on every other NPU in the parallel group. The data path uses UB as a staging area: GM → UB → GM.
+
+For rank $k$ in a group of $N$ ranks, after the operation:
 
-where $N$ is the number of ranks and `root` is the calling NPU.
+$$ \mathrm{dst}^{(k)}_{i,j} = \mathrm{src}^{(\text{root})}_{i,j} \quad \forall k \in [0, N) $$
 
-## Assembly Syntax
+where `root` is the NPU that executes the broadcast.
 
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
+## Syntax
 
-Synchronous form:
+### PTO Assembly Form
 
 ```text
 tbroadcast %group, %src : (!pto.group<...>, !pto.memref<...>)
 ```
-Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+The assembly form takes a parallel group and a source memory reference. UB staging tiles are introduced during lowering; the C++ intrinsic exposes these explicitly.
 
 ## C++ Intrinsic
 
 Declared in `include/pto/comm/pto_comm_inst.hpp`:
 
 ```cpp
-// Basic broadcast (single staging tile)
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                                TileData &stagingTileData, WaitEvents&... events);
-
-// Ping-pong broadcast (double buffering with two staging tiles)
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                                TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+// Basic broadcast — single staging tile
+template <typename ParallelGroupType, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup,
+                                GlobalSrcData &srcGlobalData,
+                                TileData &stagingTileData,
+                                WaitEvents&... events);
+
+// Ping-pong broadcast — two staging tiles for double buffering
+template <typename ParallelGroupType, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup,
+                                GlobalSrcData &srcGlobalData,
+                                TileData &pingTile,
+                                TileData &pongTile,
+                                WaitEvents&... events);
 ```
 
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `parallelGroup` | `ParallelGroup` | Parallel group descriptor; `GetRootIdx()` identifies the broadcast root |
+| `srcGlobalData` | `GlobalTensor` | Source data on the root NPU; must point to local GM |
+| `stagingTileData` | `Tile` | Staging tile in UB for the GM→UB→GM transfer path |
+| `pingTile` / `pongTile` | `Tile` | Two staging tiles for ping-pong double buffering |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before issuing the broadcast |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | event | Token signaling broadcast completion; depends on async variant |
+
+## Side Effects
+
+This operation reads from and writes to global memory across multiple NPUs. It establishes synchronization edges through the returned event token.
+
 ## Constraints
 
-- **Type constraints**:
-    - `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
-    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
-- **Memory constraints**:
-    - `srcGlobalData` must point to local memory (current NPU).
-    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
-- **ParallelGroup constraints**:
-    - `parallelGroup.tensors[k]` must refer to rank `k`'s destination buffer (remote GM as seen by the root).
-    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the broadcast root.
-    - All destination tensors are assumed to have the same shape and strides.
-- **Chunked mode constraints** (when data exceeds a single UB tile):
-    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
-    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+### Type constraints
+
+- `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
+- `TileData::DType` must equal `GlobalSrcData::RawDType`.
+
+### Memory constraints
+
+- `srcGlobalData` must point to local memory (the calling NPU's GM).
+- `stagingTileData` (or `pingTile`/`pongTile`) must be pre-allocated in UB.
+
+### Parallel group constraints
+
+- `parallelGroup.tensors[k]` must refer to rank `k`'s destination buffer (remote GM as seen from the root).
+- `parallelGroup.GetRootIdx()` identifies the calling NPU as the broadcast root.
+- All destination tensors must have the same shape and strides.
+
+### Chunked mode constraints
+
+When the GlobalTensor exceeds a single UB tile in rows or columns:
+
+- If `TileData` has a static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` `ValidRow` for partial row support.
+- If `TileData` has a static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` `ValidCol` for partial column support.
+
+## Target-Profile Restrictions
+
+- Collective communication is supported on A2/A3 and A5 profiles. CPU simulation does not support collective operations.
+- The ping-pong double-buffering form is recommended for large transfers to overlap communication with computation.
+- `TBROADCAST` requires a properly initialized `ParallelGroup` covering all participating NPUs.
 
 ## Examples
 
-### Basic Broadcast
+### Basic broadcast
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
@@ -70,29 +115,26 @@ using namespace pto;
 
 template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
 void broadcast(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
-    // Tile dimensions can differ from tensor dimensions.
-    // The 2D sliding chunked path automatically tiles both row and column.
     using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
     using GTensor = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
 
     GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
+    for (int i = 0; i < NRANKS; ++i)
         tensors[i] = GTensor(group_addrs[i]);
-    }
 
     comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
     GTensor srcG(my_data);
     TileT stagingTile(TILE_ROWS, TILE_COLS);
 
-    // Current NPU broadcasts its data to all others
+    // Root NPU broadcasts its data to all others
     comm::TBROADCAST(group, srcG, stagingTile);
 }
 ```
 
-### Ping-Pong Broadcast (Double Buffering)
+### Ping-pong double buffering
 
-Uses two UB tiles to overlap TLOAD of the next chunk with TSTORE of the current chunk.
+Uses two UB staging tiles to overlap loading the next chunk with storing the current chunk:
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
@@ -101,22 +143,27 @@ using namespace pto;
 
 template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
 void broadcast_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
-
     using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
 
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GPerRank(group_addrs[i]);
-    }
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i)
+        tensors[i] = GTensor(group_addrs[i]);
 
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GPerRank srcG(my_data);
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor srcG(my_data);
     TileT pingTile(TILE_ROWS, TILE_COLS);
     TileT pongTile(TILE_ROWS, TILE_COLS);
 
-    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
+    // Overlaps TLOAD and TSTORE for better throughput
     comm::TBROADCAST(group, srcG, pingTile, pongTile);
 }
 ```
+
+## Related Ops / Instruction Set Links
+
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Related communication operations: [TGET](./TGET.md), [TPUT](./TPUT.md), [TREDUCE](./TREDUCE.md), [TSCATTER](./TSCATTER.md), [TGATHER](./TGATHER.md)
+- Instruction set: [Other and Communication](../other/README.md)
+- Machine model: [Ordering and Synchronization](../machine-model/ordering-and-synchronization.md)
diff --git a/docs/isa/comm/TBROADCAST_zh.md b/docs/isa/comm/TBROADCAST_zh.md
index a16eea07..1b22e7f5 100644
--- a/docs/isa/comm/TBROADCAST_zh.md
+++ b/docs/isa/comm/TBROADCAST_zh.md
@@ -1,45 +1,78 @@
 # TBROADCAST
 
-## 简介
+`TBROADCAST` 是[通信与运行时](../other/communication-and-runtime_zh.md)指令集的一部分。
 
-`TBROADCAST` 把当前 NPU 作为根节点的数据广播到并行组中的所有 rank。
+## 概述
 
-只有根节点执行 `TBROADCAST`。非根节点只需要保证目标缓冲区在操作期间已分配且可写；在非根节点上主动调用该指令属于未定义行为。
+`TBROADCAST` 将根节点 NPU 的数据广播到并行组中的所有 rank。调用方 NPU 即为根节点，其数据会被复制到组内所有其他 NPU。
 
-当数据超过单个 UB Tile 容量时，传输会自动按二维滑动方式分块。
+只有根节点执行广播。非根节点只需保证目标缓冲区在操作期间已分配且可写。在非根节点上执行 `TBROADCAST` 属于未定义行为。
 
-## 数学语义
+当 GlobalTensor 超过 UB Tile 容量时，传输会自动按二维滑动方式分块。
 
-广播完成后：
+## 机制
+
+`TBROADCAST` 将根节点 NPU 源缓冲区的数据复制到并行组内所有其他 NPU 的对应目标缓冲区。数据路径以 UB 作为暂存区：GM → UB → GM。
+
+对于包含 $N$ 个 rank 的组中编号为 $k$ 的节点，广播完成后：
 
 $$ \mathrm{dst}^{(k)}_{i,j} = \mathrm{src}^{(\text{root})}_{i,j} \quad \forall k \in [0, N) $$
 
-其中 `N` 为 rank 总数。
+其中 `root` 为执行广播的 NPU。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：
+### PTO 汇编形式
 
 ```text
-tbroadcast %group, %src : (!pto.group<...>, !pto.memref<...>)
+tbroadcast %group, %src : (!pto.group<...>, !pto.memref<...>)
 ```
 
-lowering 会引入 UB 暂存 Tile，因此 C++ 接口要求显式传入 `stagingTileData`，或在双缓冲模式下传入 `pingTile` / `pongTile`。
+汇编形式接收一个并行组和一个源内存引用。UB 暂存 Tile 在 lowering 阶段引入；C++ 内建接口显式暴露这些 Tile。
 
 ## C++ 内建接口
 
 声明于 `include/pto/comm/pto_comm_inst.hpp`：
 
 ```cpp
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                                TileData &stagingTileData, WaitEvents&... events);
-
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                                TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+// 基础广播 — 单个暂存 Tile
+template <typename ParallelGroupType, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup,
+                                GlobalSrcData &srcGlobalData,
+                                TileData &stagingTileData,
+                                WaitEvents&... events);
+
+// 乒乓广播 — 两个暂存 Tile，用于双缓冲
+template <typename ParallelGroupType, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup,
+                                GlobalSrcData &srcGlobalData,
+                                TileData &pingTile,
+                                TileData &pongTile,
+                                WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 类型 | 说明 |
+|---------|------|-------------|
+| `parallelGroup` | `ParallelGroup` | 并行组描述符；`GetRootIdx()` 标识广播根节点 |
+| `srcGlobalData` | `GlobalTensor` | 根节点上的源数据；必须指向本地 GM |
+| `stagingTileData` | `Tile` | GM→UB→GM 传输路径上的 UB 暂存 Tile |
+| `pingTile` / `pongTile` | `Tile` | 双缓冲用的两个 UB 暂存 Tile |
+| `WaitEvents...` | `RecordEvent...` | 发指令前要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+|--------|------|-------------|
+| `RecordEvent` | event | 标记广播完成的事件令牌，具体语义取决于异步变体 |
+
+## 副作用
+
+本指令对所有 rank 的全局内存做读和写操作。通过返回的事件令牌建立同步边。
+
 ## 约束
 
 ### 类型约束
@@ -47,17 +80,29 @@ PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData
 - `ParallelGroup::value_type::RawDType` 必须等于 `GlobalSrcData::RawDType`
 - `TileData::DType` 必须等于 `GlobalSrcData::RawDType`
 
-### 内存与并行组约束
+### 内存约束
 
 - `srcGlobalData` 必须指向根节点本地内存
-- `stagingTileData`、`pingTile`、`pongTile` 必须预先在 UB 中分配
+- `stagingTileData`（或 `pingTile`/`pongTile`）必须预先在 UB 中分配
+
+### 并行组约束
+
 - `parallelGroup.tensors[k]` 必须指向 rank `k` 的目标缓冲区
 - `parallelGroup.GetRootIdx()` 必须标识当前调用方是广播根节点
+- 所有目标 Tensor 必须具有相同的形状和步幅
 
 ### 分块约束
 
-- 静态 `ValidRow` / `ValidCol` 场景下，对应维度必须能整除
-- 若要支持不足整行或整列的边界情况，应使用动态 valid region 的 Tile
+当 GlobalTensor 在行方向或列方向超过单个 UB Tile 时：
+
+- 若 `TileData` 具有静态 `ValidRow`，`GetShape(DIM_3)` 必须能被 `ValidRow` 整除。若需支持不足整行的边界情况，应使用动态 `ValidRow` 的 Tile。
+- 若 `TileData` 具有静态 `ValidCol`，`GetShape(DIM_4)` 必须能被 `ValidCol` 整除。若需支持不足整列的边界情况，应使用动态 `ValidCol` 的 Tile。
+
+## Target-Profile 限制
+
+- 集合通信在 A2/A3 和 A5 上支持。CPU 模拟器不支持集合通信。
+- 大数据量传输建议使用乒乓双缓冲形式，使下一块的加载与当前块的存储相互重叠，从而提高吞吐率。
+- `TBROADCAST` 需要正确初始化的 `ParallelGroup`，覆盖所有参与的 NPU。
 
 ## 示例
 
@@ -72,27 +117,54 @@ template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRAN
 void broadcast(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
     using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
     using GTensor = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
 
     GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GTensor(group_addrs[i]);
+    for (int i = 0; i < NRANKS; ++i)
+        tensors[i] = GTensor(group_addrs[i]);
 
     comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
     GTensor srcG(my_data);
     TileT stagingTile(TILE_ROWS, TILE_COLS);
 
+    // Root NPU broadcasts its data to all others
     comm::TBROADCAST(group, srcG, stagingTile);
 }
 ```
 
 ### 乒乓双缓冲
 
+使用两个 UB 暂存 Tile 来重叠下一块的加载与当前块的存储：
+
 ```cpp
-comm::TBROADCAST(group, srcG, pingTile, pongTile);
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void broadcast_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
+    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i)
+        tensors[i] = GTensor(group_addrs[i]);
+
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor srcG(my_data);
+    TileT pingTile(TILE_ROWS, TILE_COLS);
+    TileT pongTile(TILE_ROWS, TILE_COLS);
+
+    // Overlaps TLOAD and TSTORE for better throughput
+    comm::TBROADCAST(group, srcG, pingTile, pongTile);
+}
 ```
 
 ## 相关页面
 
-- [通信与运行时](../other/communication-and-runtime_zh.md)
-- [TGATHER](./TGATHER_zh.md)
-- [TSCATTER](./TSCATTER_zh.md)
+- 通信概述：[通信与运行时](../other/communication-and-runtime_zh.md)
+- 集合通信：[TGATHER](./TGATHER_zh.md)、[TSCATTER](./TSCATTER_zh.md)、[TREDUCE](./TREDUCE_zh.md)
+- 点对点通信：[TGET](./TGET_zh.md)、[TPUT](./TPUT_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
+- 机器模型：[排序与同步](../machine-model/ordering-and-synchronization_zh.md)
diff --git a/docs/isa/comm/TGATHER.md b/docs/isa/comm/TGATHER.md
index 77ec4f9b..ff351344 100644
--- a/docs/isa/comm/TGATHER.md
+++ b/docs/isa/comm/TGATHER.md
@@ -1,72 +1,106 @@
 ﻿# TGATHER
 
-## Introduction
+`TGATHER` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-Gather operation: the calling NPU (root) collects data from all ranks in the parallel group and concatenates the results along **DIM_3** (row dimension) into a local output buffer.
+## Summary
 
+Collective gather: the root NPU collects data from all ranks in a parallel group and concatenates the results along DIM_3 (row dimension) into a local output buffer. Only the root executes `TGATHER`; non-root ranks only ensure their source buffers are ready. Executing `TGATHER` on a non-root rank has undefined behavior.
 
-Only the root needs to execute `TGATHER`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `TGATHER` on non-root ranks is undefined behavior.
+When the per-rank data exceeds the UB tile capacity, the transfer is automatically chunked via 2D sliding.
 
-**Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding — the same mechanism used by other PTO-COMM instructions.
+## Mechanism
 
-## Math Interpretation
-
-Each rank $r$ has source data of shape $(D_0, D_1, D_2, H, W)$. The gather concatenates all $N$ ranks along DIM_3:
+Each rank $r$ contributes source data of shape $(D_0, D_1, D_2, H, W)$. The gather concatenates all $N$ ranks along DIM_3:
 
 $$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, d_1, d_2,\; i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
 
 The destination tensor has shape $(D_0, D_1, D_2, N \times H, W)$.
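+
+The indexing rule above can be checked against a small host-side reference model. This is only a sketch in plain C++ with no PTO headers: `gather_rows_reference` and its flat row-major buffers are illustrative assumptions, not part of the PTO API, and it covers a single $(d_0, d_1, d_2)$ slice.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical reference model (not a PTO function): concatenates N per-rank
+// H x W row-major buffers along the row dimension, so that
+// dst[r * H + i][j] == src[r][i][j].
+template <typename T>
+std::vector<T> gather_rows_reference(const std::vector<std::vector<T>> &srcs,
+                                     std::size_t H, std::size_t W) {
+    std::vector<T> dst(srcs.size() * H * W);
+    for (std::size_t r = 0; r < srcs.size(); ++r)
+        for (std::size_t i = 0; i < H; ++i)
+            for (std::size_t j = 0; j < W; ++j)
+                dst[(r * H + i) * W + j] = srcs[r][i * W + j];
+    return dst;
+}
+```
+
+A model like this could serve as the expected result when validating a `TGATHER` kernel on small shapes.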
 
-## Assembly Syntax
-
-PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+## Syntax
 
-Synchronous form:
+### PTO Assembly Form
 
 ```text
 tgather %group, %dst : (!pto.group<...>, !pto.memref<...>)
 ```
-Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+UB staging tiles are introduced during lowering. The C++ intrinsic exposes them explicitly.
 
 ## C++ Intrinsic
 
 Declared in `include/pto/comm/pto_comm_inst.hpp`:
 
 ```cpp
-// Basic gather (single staging tile)
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+// Basic gather — single staging tile
+template <typename ParallelGroupType, typename GlobalDstData,
+          typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
                              TileData &stagingTileData, WaitEvents&... events);
 
-// Ping-pong gather (double buffering with two staging tiles)
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+// Ping-pong gather — two staging tiles for double buffering
+template <typename ParallelGroupType, typename GlobalDstData,
+          typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
                              TileData &pingTile, TileData &pongTile, WaitEvents&... events);
 ```
 
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `parallelGroup` | `ParallelGroup` | Parallel group descriptor; `GetRootIdx()` identifies the gather root |
+| `dstGlobalData` | `GlobalTensor` | Local destination buffer on the root NPU; must be large enough to hold concatenated data |
+| `stagingTileData` | `Tile` | UB staging tile for the GM→UB→GM transfer path |
+| `pingTile` / `pongTile` | `Tile` | Two UB staging tiles for ping-pong double buffering |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before issuing the gather |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | event | Token signaling gather completion |
+
+## Side Effects
+
+This operation reads from all ranks' global memory and writes to the root's global memory. It establishes synchronization edges through the returned event token.
+
 ## Constraints
 
-- **Type constraints**:
-    - `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
-    - `TileData::DType` must equal `GlobalDstData::RawDType`.
-- **Memory constraints**:
-    - `dstGlobalData` must point to local memory (current NPU) and be large enough to hold the concatenated result from all ranks. Specifically, `dstGlobalData.GetShape(DIM_3)` must be $\geq N \times H$ where $H$ is each rank's `GetShape(DIM_3)`.
-    - If `dstGlobalData.GetShape(DIM_3) > N × H`, only the first `N × H` rows are written; remaining rows are left unchanged.
-    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
-- **ParallelGroup constraints**:
-    - `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen by the root).
-    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the gather root.
-    - All source tensors are assumed to have the same shape and strides; behavior is undefined if they differ.
-- **Chunked mode constraints** (when source data exceeds a single UB tile):
-    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's source must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
-    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+### Type constraints
+
+- `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
+- `TileData::DType` must equal `GlobalDstData::RawDType`.
+
+### Memory constraints
+
+- `dstGlobalData` must point to local memory and be large enough to hold concatenated data from all ranks. Specifically, `dstGlobalData.GetShape(DIM_3)` must be $\geq N \times H$, where $H$ is each rank's `GetShape(DIM_3)`.
+- If `dstGlobalData.GetShape(DIM_3)` $> N \times H$, only the first $N \times H$ rows are written; the remaining rows are left unchanged.
+- `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in UB.
+
+### Parallel group constraints
+
+- `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen from the root).
+- `parallelGroup.GetRootIdx()` identifies the calling NPU as the gather root.
+- All source tensors must have the same shape and strides; behavior is undefined if they differ.
+
+### Chunked mode constraints
+
+When per-rank data exceeds a single UB tile in rows or columns:
+
+- If `TileData` has a static `ValidRow`, each rank's source `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` `ValidRow` for partial row support.
+- If `TileData` has a static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` `ValidCol` for partial column support.
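+
+As a rough illustration of the divisibility rule, the following host-side sketch checks whether a static tile extent covers a dimension exactly and counts the chunks a sliding transfer would need. `divides_evenly` and `chunks_needed` are hypothetical helpers, and rounding the chunk count up for a `DYNAMIC` valid extent is our assumption, not something taken from the implementation.
+
+```cpp
+#include <cstddef>
+
+// Hypothetical helper: true if a tile with a static valid extent `valid`
+// (ValidRow or ValidCol) covers `extent` (GetShape(DIM_3) or GetShape(DIM_4))
+// without leaving a partial chunk.
+constexpr bool divides_evenly(std::size_t extent, std::size_t valid) {
+    return valid != 0 && extent % valid == 0;
+}
+
+// Hypothetical helper: chunks needed along one dimension, rounding up.
+// Assumed to apply when the valid extent is DYNAMIC, so the last chunk
+// may be partial. Requires valid > 0.
+constexpr std::size_t chunks_needed(std::size_t extent, std::size_t valid) {
+    return (extent + valid - 1) / valid;
+}
+```
+
+For example, a 4096-row source with `ValidRow = 64` divides evenly into 64 chunks, while a 100-row source does not and would need a dynamic `ValidRow`.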
+
+## Target-Profile Restrictions
+
+- Collective communication is supported on A2/A3 and A5 profiles. CPU simulation does not support collective operations.
+- Use ping-pong double buffering for large transfers to overlap the TLOAD of the next chunk (MTE2) with the TSTORE of the current chunk (MTE3).
+- `TGATHER` requires a properly initialized `ParallelGroup` covering all participating NPUs.
 
 ## Examples
 
-### Basic Gather (Single Staging Tile)
+### Basic gather
 
-Each rank contributes `ROWS × COLS` data. The root collects them into `NRANKS * ROWS` rows.
-The tile size (`TILE_ROWS × TILE_COLS`) can be smaller than the per-rank data — when it is, the implementation automatically chunks the transfer along both DIM_3 and DIM_4 via 2D sliding.
+Each rank contributes `ROWS × COLS` data. The root collects them into `NRANKS × ROWS` rows:
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
@@ -77,14 +111,13 @@ template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRAN
 void gather(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
     using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
     using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
     using GResult = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
 
     GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
+    for (int i = 0; i < NRANKS; ++i)
         tensors[i] = GPerRank(group_addrs[i]);
-    }
 
     comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
     GResult dstG(result);
@@ -94,35 +127,34 @@ void gather(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
 }
 ```
 
-### Ping-Pong Gather (Double Buffering)
-
-Uses two UB tiles to overlap TLOAD of the next chunk (MTE2) with TSTORE of the current chunk (MTE3).
+### Ping-pong gather
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
 template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
 void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    // Tile can be smaller than the data in both dimensions
     using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
     using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
     using GResult = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
 
     GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
+    for (int i = 0; i < NRANKS; ++i)
         tensors[i] = GPerRank(group_addrs[i]);
-    }
 
     comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
     GResult dstG(result);
     TileT pingTile(TILE_ROWS, TILE_COLS);
     TileT pongTile(TILE_ROWS, TILE_COLS);
 
-    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
     comm::TGATHER(group, dstG, pingTile, pongTile);
 }
 ```
+
+## Related Ops / Instruction Set Links
+
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Inverse operation: [TSCATTER](./TSCATTER.md)
+- Collective operations: [TBROADCAST](./TBROADCAST.md), [TSCATTER](./TSCATTER.md), [TREDUCE](./TREDUCE.md)
+- Point-to-point: [TGET](./TGET.md), [TPUT](./TPUT.md)
+- Instruction set: [Other and Communication](../other/README.md)
diff --git a/docs/isa/comm/TGATHER_zh.md b/docs/isa/comm/TGATHER_zh.md
index 699e50b1..ccac732c 100644
--- a/docs/isa/comm/TGATHER_zh.md
+++ b/docs/isa/comm/TGATHER_zh.md
@@ -1,121 +1,160 @@
-# TGATHER
-
-## 简介
-
-Gather 操作：调用方 NPU（根节点）从并行组中所有 rank 收集数据，并沿 **DIM_3**（行维度）拼接到本地输出缓冲区。
-
-只有根节点需要执行 `TGATHER`。非根节点只需确保在操作期间其源缓冲区已就绪且保持有效。在非根节点上调用 `TGATHER` 属于未定义行为。
-
-**大 Tile 支持**：当 GlobalTensor 在行和/或列方向超出 UB Tile 容量时，传输将通过二维滑动自动分块——与其他 PTO-COMM 指令采用相同机制。
-
-## 数学语义
-
-每个 rank $r$ 的源数据形状为 $(D_0, D_1, D_2, H, W)$。gather 沿 DIM_3 拼接所有 $N$ 个 rank 的数据：
-
-$$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, d_1, d_2,\; i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
-
-目标 tensor 的形状为 $(D_0, D_1, D_2, N \times H, W)$。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-tgather %group, %dst : (!pto.group<...>, !pto.memref<...>)
-```
-
-降级时会为 GM→UB→GM 数据路径引入 UB 暂存 Tile；C++ 内建接口需要显式传入 `stagingTileData`（或 `pingTile` / `pongTile`）操作数。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-// 基础 gather（单暂存 Tile）
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                             TileData &stagingTileData, WaitEvents&... events);
-
-// 乒乓 gather（使用两个暂存 Tile 实现双缓冲）
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                             TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `ParallelGroup::value_type::RawDType` 必须等于 `GlobalDstData::RawDType`。
-    - `TileData::DType` 必须等于 `GlobalDstData::RawDType`。
-- **内存约束**：
-    - `dstGlobalData` 必须指向本地内存（当前 NPU），且足够容纳所有 rank 拼接后的结果。具体要求：`dstGlobalData.GetShape(DIM_3)` 必须 $\geq N \times H$，其中 $H$ 为每个 rank 的 `GetShape(DIM_3)`。
-    - 若 `dstGlobalData.GetShape(DIM_3) > N × H`，则只写入前 `N × H` 行，其余行保持不变。
-    - `stagingTileData`（或 `pingTile` / `pongTile`）必须预先在 UB 中分配。
-- **ParallelGroup 约束**：
-    - `parallelGroup.tensors[r]` 必须指向 rank `r` 的源缓冲区（从根节点视角看到的远端 GM）。
-    - `parallelGroup.GetRootIdx()` 标识调用方 NPU 为 gather 根节点。
-    - 所有源 tensor 假定具有相同的形状和步幅；否则行为未定义。
-- **分块模式约束**（源数据超出单个 UB Tile 时）：
-    - 若 `TileData` 具有静态 `ValidRow`，则每个 rank 源数据的 `GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足一行的情况，请使用 `DYNAMIC` ValidRow 的 Tile。
-    - 若 `TileData` 具有静态 `ValidCol`，则 `GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足一列的情况，请使用 `DYNAMIC` ValidCol 的 Tile。
-
-## 示例
-
-### 基础 Gather（单暂存 Tile）
-
-每个 rank 提供 `ROWS × COLS` 的数据，根节点将其收集到 `NRANKS * ROWS` 行中。
-Tile 大小（`TILE_ROWS × TILE_COLS`）可小于每 rank 的数据——此时实现会自动沿 DIM_3 和 DIM_4 通过二维滑动进行分块传输。
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void gather(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GResult  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GResult dstG(result);
-    TileT stagingTile(TILE_ROWS, TILE_COLS);
-    comm::TGATHER(group, dstG, stagingTile);
-}
-```
-
-### 乒乓 Gather（双缓冲）
-
-使用两个 UB Tile，将下一块的 TLOAD（MTE2）与当前块的 TSTORE（MTE3）重叠执行。
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GResult  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GResult dstG(result);
-    TileT pingTile(TILE_ROWS, TILE_COLS);
-    TileT pongTile(TILE_ROWS, TILE_COLS);
-    // 乒乓模式：将 TLOAD 与 TSTORE 重叠执行以提升吞吐量
-    comm::TGATHER(group, dstG, pingTile, pongTile);
-}
-```
+# TGATHER
+
+`TGATHER` 是[通信与运行时](../other/communication-and-runtime_zh.md)指令集的一部分。
+
+## 概述
+
+集合 Gather 操作：根节点 NPU 从并行组中所有 rank 收集数据，沿 DIM_3（行维度）拼接后写入本地输出缓冲区。只有根节点执行 `TGATHER`；非根节点只需确保源缓冲区已就绪，并在操作期间保持有效。在非根节点上调用 `TGATHER` 属于未定义行为。
+
+当每个 rank 的数据超出 UB Tile 容量时，传输会自动通过二维滑动分块。
+
+## 机制
+
+每个 rank $r$ 的源数据形状为 $(D_0, D_1, D_2, H, W)$。Gather 沿 DIM_3 拼接所有 $N$ 个 rank 的数据：
+
+$$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, d_1, d_2,\; i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
+
+目标 tensor 的形状为 $(D_0, D_1, D_2, N \times H, W)$。
+
+## 语法
+
+### PTO 汇编形式
+
+```text
+tgather %group, %dst : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering 会引入 UB 暂存 Tile。C++ 内建接口显式暴露这些 Tile。
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+// 基础 gather — 单个暂存 Tile
+template <typename ParallelGroupType, typename GlobalDstData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                           TileData &stagingTileData, WaitEvents&... events);
+
+// 乒乓 gather — 两个暂存 Tile 实现双缓冲
+template <typename ParallelGroupType, typename GlobalDstData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                           TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+## 输入
+
+| 操作数 | 类型 | 说明 |
+|--------|------|------|
+| `parallelGroup` | `ParallelGroup` | 并行组描述符；`GetRootIdx()` 标识 gather 根节点 |
+| `dstGlobalData` | `GlobalTensor` | 根节点上的本地目标缓冲区；必须足够容纳所有 rank 拼接后的数据 |
+| `stagingTileData` | `Tile` | GM→UB→GM 传输路径上的 UB 暂存 Tile |
+| `pingTile` / `pongTile` | `Tile` | 双缓冲用的两个 UB 暂存 Tile |
+| `WaitEvents...` | `RecordEvent...` | 发指令前要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+|------|------|------|
+| `RecordEvent` | event | 标记 gather 完成的事件令牌 |
+
+## 副作用
+
+本指令从所有 rank 的全局内存读取数据并写入根节点的全局内存。通过返回的事件令牌建立同步边界。
+
+## 约束
+
+### 类型约束
+
+- `ParallelGroup::value_type::RawDType` 必须等于 `GlobalDstData::RawDType`
+- `TileData::DType` 必须等于 `GlobalDstData::RawDType`
+
+### 内存约束
+
+- `dstGlobalData` 必须指向本地内存，且足够容纳所有 rank 拼接后的数据。具体要求：`dstGlobalData.GetShape(DIM_3)` 必须 $\geq N \times H$
+- 若 `dstGlobalData.GetShape(DIM_3)` $> N \times H$，只写入前 $N \times H$ 行，其余行保持不变
+- `stagingTileData` / `pingTile` / `pongTile` 必须预先在 UB 中分配
+
+### 并行组约束
+
+- `parallelGroup.tensors[r]` 必须指向 rank `r` 的源缓冲区（从根节点视角看到的远端 GM）
+- `parallelGroup.GetRootIdx()` 标识调用方 NPU 为 gather 根节点
+- 所有源 Tensor 必须具有相同的形状和步幅
+
+### 分块约束
+
+当每个 rank 的数据超出单个 UB Tile 的行或列时：
+
+- 若 `TileData` 具有静态 `ValidRow`，每个 rank 源数据的 `GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足整行，应使用动态 `ValidRow` 的 Tile
+- 若 `TileData` 具有静态 `ValidCol`，`GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足整列，应使用动态 `ValidCol` 的 Tile
+
+## 目标 Profile 限制
+
+- 集合通信在 A2/A3 和 A5 上支持。CPU 模拟器不支持集合通信。
+- 大数据量传输建议使用乒乓双缓冲，使下一块的 TLOAD（MTE2）与当前块的 TSTORE（MTE3）重叠执行。
+- `TGATHER` 需要正确初始化的 `ParallelGroup`，覆盖所有参与的 NPU。
+
+## 示例
+
+### 基础 gather
+
+每个 rank 提供 `ROWS × COLS` 的数据，根节点将其收集到 `NRANKS × ROWS` 行中：
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void gather(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                     BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GResult  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                     BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i)
+        tensors[i] = GPerRank(group_addrs[i]);
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GResult dstG(result);
+    TileT stagingTile(TILE_ROWS, TILE_COLS);
+
+    comm::TGATHER(group, dstG, stagingTile);
+}
+```
+
+### 乒乓 gather
+
+```cpp
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                     BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GResult  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                     BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i)
+        tensors[i] = GPerRank(group_addrs[i]);
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GResult dstG(result);
+    TileT pingTile(TILE_ROWS, TILE_COLS);
+    TileT pongTile(TILE_ROWS, TILE_COLS);
+
+    comm::TGATHER(group, dstG, pingTile, pongTile);
+}
+```
+
+## 相关页面
+
+- 通信概述：[通信与运行时](../other/communication-and-runtime_zh.md)
+- 逆操作：[TSCATTER](./TSCATTER_zh.md)
+- 集合通信：[TBROADCAST](./TBROADCAST_zh.md)、[TSCATTER](./TSCATTER_zh.md)、[TREDUCE](./TREDUCE_zh.md)
+- 点对点通信：[TGET](./TGET_zh.md)、[TPUT](./TPUT_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
diff --git a/docs/isa/comm/TGET.md b/docs/isa/comm/TGET.md
index 2135c034..75826d52 100644
--- a/docs/isa/comm/TGET.md
+++ b/docs/isa/comm/TGET.md
@@ -1,109 +1,147 @@
-# TGET
-
-## Introduction
-
-Remote read operation: read remote NPU's data to local memory. Data is transferred via a UB tile as intermediate staging buffer.
-
-When the GlobalTensor exceeds the UB tile capacity, TGET automatically performs **2D sliding**: chunking rows (DIM_3) and columns (DIM_4) to fit each chunk into the tile, iterating over all outer dimensions (DIM_0, DIM_1, DIM_2).
-
-## Math Interpretation
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \mathrm{src}^{\mathrm{remote}}_{i,j} $$
-
-Data flow: `srcGlobalData (remote GM)` ->`stagingTileData (UB)` ->`dstGlobalData (local GM)`
-
-## Assembly Syntax
-
-PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-tget %dst_local, %src_remote : (!pto.memref<...>, !pto.memref<...>)
-```
-Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`
-
-### Single-tile (auto-chunking)
-
-```cpp
-template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                          TileData &stagingTileData, WaitEvents&... events);
-```
-
-### Ping-pong double buffering
-
-Uses two staging tiles to overlap TLOAD and TSTORE for adjacent chunks, hiding one DMA transfer behind the other.
-
-```cpp
-template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Type constraints**:
-    - `GlobalSrcData::RawDType` must equal `GlobalDstData::RawDType`.
-    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
-    - `GlobalSrcData::layout` must equal `GlobalDstData::layout`.
-- **Memory constraints**:
-    - `srcGlobalData` must point to remote address (on source NPU).
-    - `dstGlobalData` must point to local address (on current NPU).
-    - `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in Unified Buffer.
-- **Valid region**:
-    - Transfer size is determined by `GlobalTensor` shape (auto-chunked to fit tile).
-- **Ping-pong**:
-    - `pingTile` and `pongTile` must have the same type and dimensions.
-    - Must reside at non-overlapping UB offsets.
-
-## Examples
-
-### Basic Usage
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_tget(__gm__ T* local_data, __gm__ T* remote_addr) {
-    using TileT = Tile<TileType::Vec, T, 16, 16>;
-    using GShape = Shape<1, 1, 1, 16, 16>;
-    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-    /*
-    If the globalTensor is larger than UB Tile, TGET will perform 2D sliding automatically.
-    using GShape = Shape<1, 1, 1, 4096, 4096>;
-    using GStride = BaseShape2D<T, 4096, 4096, Layout::ND>;
-    */
-    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-    GTensor srcG(remote_addr);
-    GTensor dstG(local_data);
-    TileT stagingTile;
-    TASSIGN(stagingTile, 0);
-
-    // Basic remote read
-    comm::TGET(dstG, srcG, stagingTile);
-}
-```
-
-### Ping-pong Double Buffering
-
-```cpp
-constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;
-TileT pingTile(64, 64);
-TileT pongTile(64, 64);
-TASSIGN(pingTile, 0);
-TASSIGN(pongTile, tileUBBytes);  // Non-overlapping UB region
-
-// Overlaps TLOAD[i+1] with TSTORE[i] for better pipeline utilization
-comm::TGET(dstG, srcG, pingTile, pongTile);
-```
+# TGET
+
+`TGET` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
+
+## Summary
+
+Remote read operation: copy data from a remote NPU's global memory to the local NPU's global memory. Data traverses a UB staging tile as an intermediate buffer. When the GlobalTensor exceeds the UB tile capacity, TGET automatically chunks the transfer via 2D sliding.
+
+Only the local NPU executes TGET; the remote NPU is passive.
+
+## Mechanism
+
+`TGET` reads from a remote NPU's global memory and writes to the local NPU's global memory. The data path is: remote GM → staging tile (UB) → local GM.
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \mathrm{src}^{\mathrm{remote}}_{i,j} $$
+
+Expressed in terms of the intrinsic's operands:
+
+```text
+srcGlobalData (remote GM) → stagingTileData (UB) → dstGlobalData (local GM)
+```
+
+## Syntax
+
+### PTO Assembly Form
+
+```text
+tget %dst_local, %src_remote : (!pto.memref<...>, !pto.memref<...>)
+```
+
+UB staging tiles are introduced during lowering. The C++ intrinsic exposes them explicitly.
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+// Single-tile form — auto-chunking for large tensors
+template <typename GlobalDstData, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &stagingTileData, WaitEvents&... events);
+
+// Ping-pong double buffering form
+template <typename GlobalDstData, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `dstGlobalData` | `GlobalTensor` | Local destination; must point to local GM |
+| `srcGlobalData` | `GlobalTensor` | Remote source; must point to remote NPU's GM |
+| `stagingTileData` | `Tile` | UB staging tile for the GM→UB→GM transfer path |
+| `pingTile` / `pongTile` | `Tile` | Two UB staging tiles for ping-pong double buffering |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before issuing the get |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | event | Token signaling completion of the remote read |
+
+## Side Effects
+
+This operation reads from remote global memory and writes to local global memory. It establishes synchronization edges through the returned event token.
+
+## Constraints
+
+### Type constraints
+
+- `GlobalSrcData::RawDType` must equal `GlobalDstData::RawDType`.
+- `TileData::DType` must equal `GlobalSrcData::RawDType`.
+- `GlobalSrcData::layout` must equal `GlobalDstData::layout`.
+
+### Memory constraints
+
+- `srcGlobalData` must point to a remote address (on the source NPU).
+- `dstGlobalData` must point to a local address (on the current NPU).
+- `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in UB.
+
+### Transfer constraints
+
+- Transfer size is determined by the `GlobalTensor` shape; auto-chunking tiles data to fit the UB staging buffer.
+- When auto-chunking, rows (DIM_3) and columns (DIM_4) are subdivided as needed.
+
+### Ping-pong constraints
+
+- `pingTile` and `pongTile` must have identical type and dimensions.
+- They must reside at non-overlapping UB offsets.
+
+## Target-Profile Restrictions
+
+- Point-to-point communication is supported on A2/A3 and A5 profiles. CPU simulation does not support remote memory access.
+- Use ping-pong double buffering when transferring large tensors, so that the TLOAD of chunk `i+1` overlaps the TSTORE of chunk `i`.
+- `TGET` requires a valid remote GM address; the remote NPU must have the corresponding memory region allocated.
+
+## Examples
+
+### Basic remote read
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_tget(__gm__ T* local_data, __gm__ T* remote_addr) {
+    using TileT = Tile<TileType::Vec, T, 16, 16>;
+    using GShape = Shape<1, 1, 1, 16, 16>;
+    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+    GTensor srcG(remote_addr);
+    GTensor dstG(local_data);
+    TileT stagingTile;
+    TASSIGN(stagingTile, 0);
+
+    comm::TGET(dstG, srcG, stagingTile);
+}
+```
+
+### Ping-pong double buffering
+
+```cpp
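+// Fragment: assumes TileT is a float tile type constructible with a runtime valid shape,
+// and that srcG / dstG are GlobalTensor objects set up as in the basic example.
+constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;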
+TileT pingTile(64, 64);
+TileT pongTile(64, 64);
+TASSIGN(pingTile, 0);
+TASSIGN(pongTile, tileUBBytes);  // Non-overlapping UB offsets
+
+// Overlaps TLOAD[i+1] with TSTORE[i] for better pipeline utilization
+comm::TGET(dstG, srcG, pingTile, pongTile);
+```
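+
+The offset arithmetic in this snippet rounds the tile footprint up to a 1024-byte boundary so that the two tiles cannot overlap. A standalone sketch of that calculation follows. `align_up` and `tile_ub_bytes` are hypothetical helper names, and the 1024-byte granularity is taken from the snippet above rather than from a documented UB alignment requirement.
+
+```cpp
+#include <cstddef>
+
+// Round `bytes` up to the next multiple of `align` (align must be > 0).
+constexpr std::size_t align_up(std::size_t bytes, std::size_t align) {
+    return ((bytes + align - 1) / align) * align;
+}
+
+// UB footprint of a rows x cols tile with `elem_bytes` bytes per element,
+// padded to 1024 bytes as in the snippet above (assumed granularity).
+constexpr std::size_t tile_ub_bytes(std::size_t rows, std::size_t cols,
+                                    std::size_t elem_bytes) {
+    return align_up(rows * cols * elem_bytes, 1024);
+}
+
+// A 64 x 64 tile of 4-byte elements occupies exactly 16 KiB, so the pong
+// tile in the snippet would start at byte offset 16384.
+static_assert(tile_ub_bytes(64, 64, 4) == 16384, "unexpected tile footprint");
+```
+
+Placing `pongTile` at `tile_ub_bytes(...)` past `pingTile` keeps the two regions disjoint as long as both tiles share the same type and dimensions.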
+
+## Related Ops / Instruction Set Links
+
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Inverse operation: [TPUT](./TPUT.md)
+- Collective operations: [TBROADCAST](./TBROADCAST.md), [TGATHER](./TGATHER.md), [TSCATTER](./TSCATTER.md), [TREDUCE](./TREDUCE.md)
+- Instruction set: [Other and Communication](../other/README.md)
diff --git a/docs/isa/comm/TGET_ASYNC.md b/docs/isa/comm/TGET_ASYNC.md
index 62349e8c..26faa0b1 100644
--- a/docs/isa/comm/TGET_ASYNC.md
+++ b/docs/isa/comm/TGET_ASYNC.md
@@ -1,26 +1,46 @@
 # TGET_ASYNC
 
-## Introduction
+`TGET_ASYNC` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-`TGET_ASYNC` is an asynchronous remote read primitive. It starts a transfer from remote GM to local GM and returns an `AsyncEvent` immediately.
+## Summary
 
-Data flow:
+Asynchronous remote read: initiates a transfer from a remote NPU's global memory to local global memory and returns an `AsyncEvent` immediately without blocking. The event is used later to wait for transfer completion.
 
-`srcGlobalData (remote GM) -> DMA engine -> dstGlobalData (local GM)`
+Two DMA engines are supported: SDMA (default, available on all targets) and URMA (hardware RDMA, available on Ascend950 / NPU_ARCH 3510 only).
 
-## Template Parameter
+## Mechanism
 
-- `engine`:
-    - `DmaEngine::SDMA` (default)
-    - `DmaEngine::URMA` (Ascend950, NPU_ARCH 3510 only)
+`TGET_ASYNC` starts a DMA transfer from remote GM to local GM and returns immediately:
 
-> **Important (SDMA path)**
-> `TGET_ASYNC` with `DmaEngine::SDMA` currently supports **only flat contiguous logical 1D tensors**.
-> Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.
+```text
+srcGlobalData (remote GM) → DMA engine → dstGlobalData (local GM)
+```
+
+The `AsyncSession` manages the engine-agnostic async state. After issuing one or more async operations, call `event.Wait(session)` to block until all pending operations complete (quiet semantics — a single `Wait` drains all operations issued since the last `Wait`).
+
+### Engine differences
+
+- **SDMA**: Submits data transfer SQEs; the flag SQE is deferred to `Wait`, which then polls for completion.
+- **URMA**: Submits an RDMA READ WQE and rings the doorbell immediately; `Wait` polls the Completion Queue.
+
+## Syntax
+
+### PTO Assembly Form
+
+Engine selection is a template parameter at the C++ level. The assembly form does not expose the engine choice.
+
+### Template Parameter
+
+| Value | Description |
+|-------|-------------|
+| `DmaEngine::SDMA` | Default. System DMA, available on all targets. |
+| `DmaEngine::URMA` | User-level RDMA, available on Ascend950 (NPU_ARCH 3510) only. |
+
+> **SDMA limitation**: `TGET_ASYNC` with `DmaEngine::SDMA` currently supports **only flat contiguous logical 1D tensors**. Non-1D or non-contiguous layouts are not supported; if this requirement is not met, the implementation returns an invalid async event (`handle == 0`).
 
 ## C++ Intrinsic
 
-Declared in `include/pto/comm/pto_comm_inst.hpp`.
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
 
 ```cpp
 template <DmaEngine engine = DmaEngine::SDMA,
@@ -29,17 +49,7 @@ PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcG
                                const AsyncSession &session, WaitEvents &... events);
 ```
 
-`AsyncSession` is an engine-agnostic session object. Build once with
-`BuildAsyncSession<engine>()`, then pass to all async calls and event waits.
-The template `engine` parameter selects the DMA backend at compile time, making the
-code forward-compatible with future engines (CCU, etc.).
-
-## AsyncSession Construction
-
-Use `BuildAsyncSession` from `include/pto/comm/async/async_event_impl.hpp`.
-There are two overloads — one for SDMA and one for URMA — with different parameter lists.
-
-### SDMA Construction (default)
+### AsyncSession construction (SDMA)
 
 ```cpp
 template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
@@ -51,18 +61,16 @@ PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
                                     uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
 ```
 
-| Parameter | Default | Description |
-|---|---|---|
-| `scratchTile` | — | UB scratch tile for SDMA control metadata (see [scratchTile Role](#scratchtile-role)). |
-| `workspace` | — | GM pointer allocated by host-side `SdmaWorkspaceManager`. |
-| `session` | — | Output `AsyncSession` object. |
-| `syncId` | `0` | MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id. |
-| `baseConfig` | `{kDefaultSdmaBlockBytes, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`. Suitable for most single-queue transfers. |
-| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA channel group index. Default uses `get_block_idx()` internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios. |
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `scratchTile` | — | UB scratch tile for SDMA control metadata |
+| `workspace` | — | GM pointer from host-side `SdmaWorkspaceManager` |
+| `session` | — | Output `AsyncSession` object |
+| `syncId` | `0` | MTE3/MTE2 pipe sync event id (0–7) |
+| `baseConfig` | `{kDefaultSdmaBlockBytes, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}` |
+| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA channel group index; defaults to current AI core |
 
-### URMA Construction (NPU_ARCH 3510 only)
-
-> URMA (User-level RDMA Memory Access) is a hardware-accelerated RDMA transport available on Ascend950 (NPU_ARCH 3510).
+### AsyncSession construction (URMA, NPU_ARCH 3510 only)
 
 ```cpp
 #ifdef PTO_URMA_SUPPORTED
@@ -73,72 +81,68 @@ PTO_INTERNAL bool BuildAsyncSession(__gm__ uint8_t *workspace,
 #endif
 ```
 
-| Parameter | Description |
-|---|---|
-| `workspace` | GM pointer allocated by host-side `UrmaWorkspaceManager`. |
-| `destRankId` | Remote PE rank id that this session communicates with. For `TGET_ASYNC` this is the source rank. |
-| `session` | Output `AsyncSession` object. |
+| Parameter | Description |
+|-----------|-------------|
+| `workspace` | GM pointer from host-side `UrmaWorkspaceManager` |
+| `destRankId` | Rank id of the remote NPU this session communicates with; for `TGET_ASYNC` this is the source rank |
+| `session` | Output `AsyncSession` object |
 
-URMA does not require `scratchTile` — polling uses `ld_dev`/`st_dev` hardware intrinsics directly.
+## Inputs
 
-## Constraints
+| Operand | Type | Description |
+|---------|------|-------------|
+| `dstGlobalData` | `GlobalTensor` | Local destination; must be flat contiguous 1D |
+| `srcGlobalData` | `GlobalTensor` | Remote source; must be flat contiguous 1D |
+| `session` | `AsyncSession` | Engine-agnostic session object |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before issuing the get |
 
-- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
-- `GlobalSrcData::layout == GlobalDstData::layout`
-- Both SDMA and URMA paths require source tensor to be **flat contiguous logical 1D only**
-- SDMA workspace must be a valid GM pointer allocated by host-side `SdmaWorkspaceManager`
-- URMA workspace must be a valid GM pointer allocated by host-side `UrmaWorkspaceManager`
-- URMA is only available on NPU_ARCH 3510 (Ascend950)
-- The symmetric data buffer passed to `UrmaWorkspaceManager::Init()` must be backed by huge-page memory (allocate with `ACL_MEM_MALLOC_HUGE_ONLY`). The underlying MR registration requires huge-page backing; `ACL_MEM_MALLOC_HUGE_FIRST` may silently fall back to 4KB pages for small allocations, causing registration to fail
+## Expected Outputs
 
-If the 1D contiguous requirement is not met, current implementation returns an invalid async event (`handle == 0`).
+| Result | Type | Description |
+|--------|------|-------------|
+| `AsyncEvent` | event | Handle for later `Wait` call; drain via `event.Wait(session)` |
 
-## scratchTile Role
+## Side Effects
 
-`scratchTile` is **not** used to hold transferred payload data.
-It is converted to `TmpBuffer` and used as temporary UB workspace for:
+This operation initiates a DMA transfer from remote global memory to local global memory. Completion is deferred to the `Wait` call.
 
-- writing/reading SDMA control words (flag, sq_tail, channel_info)
-- polling event completion flags
-- committing queue tail during completion
-
-The real payload path remains remote GM -> DMA engine -> local GM; `scratchTile` is only for control/synchronization metadata.
+## Constraints
 
-## scratchTile Type and Size Constraints
+### Type constraints
 
-- must be a `pto::Tile` type
-- must be UB/Vec tile (`ScratchTile::Loc == TileType::Vec`)
-- available bytes must be at least `sizeof(uint64_t)` (8 bytes)
+- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
+- `GlobalSrcData::layout == GlobalDstData::layout`
+- Both SDMA and URMA require **flat contiguous logical 1D tensors** only.
 
-Recommended: `Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>` (256B).
+### Memory constraints
 
-## Completion Semantics (Quiet Semantics)
+- SDMA: `workspace` must be allocated by host-side `SdmaWorkspaceManager`.
+- URMA: `workspace` must be allocated by host-side `UrmaWorkspaceManager`; the buffer must be backed by huge-page memory (`ACL_MEM_MALLOC_HUGE_ONLY`).
 
-The completion mechanism differs by engine, but user-facing quiet semantics are identical:
+### Platform constraints
 
-- **SDMA**: `TGET_ASYNC` only submits data transfer SQEs. The flag SQE is deferred to `Wait`, which polls the flag for completion.
-- **URMA**: `TGET_ASYNC` submits an RDMA READ WQE and rings the doorbell immediately. `Wait` polls the Completion Queue (CQ) until all expected CQEs have been consumed.
+- URMA is available on NPU_ARCH 3510 (Ascend950) only.
 
-- `event.Wait(session)` — blocks until **all async operations issued since the last Wait** are complete
+## scratchTile Role (SDMA)
 
-This means after multiple `TGET_ASYNC` calls, a single `Wait` on the last returned `AsyncEvent` drains all pending operations (similar to shmem's quiet semantics).
+`scratchTile` does **not** hold payload data. It is converted to `TmpBuffer` and used as temporary UB workspace for SDMA control words (flag, sq_tail, channel_info), polling completion flags, and committing queue tail. The payload path is always remote GM → DMA engine → local GM.
 
-After wait succeeds, all issued reads into `dstGlobalData` are complete.
+Requirements: must be `pto::Tile` with `TileType::Vec`, at least 8 bytes. Recommended: `Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>` (256B).
 
-## Example
+## Target-Profile Restrictions
 
-### Single Transfer
+- SDMA is available on all targets. URMA is Ascend950-only.
+- CPU simulation does not support async communication operations.
+- `AsyncSession` is engine-agnostic, but the engine is selected by a compile-time template parameter, so switching engines requires recompilation.
 
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
+## Examples
 
-using namespace pto;
+### Single transfer (SDMA)
 
+```cpp
 template <typename T>
 __global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
-                                 __gm__ uint8_t *sdmaWorkspace)
-{
+                                 __gm__ uint8_t *sdmaWorkspace) {
     using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
     using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
     using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
@@ -153,75 +157,40 @@ __global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
     TASSIGN(scratchTile, 0x0);
 
     comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
+    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session))
         return;
-    }
 
     auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
     (void)event.Wait(session);
 }
 ```
 
-### Batch Transfer (Quiet Semantics)
+### Batch transfer — quiet semantics
 
 ```cpp
-template <typename T>
-__global__ AICORE void BatchGet(__gm__ T *localDstBase, __gm__ T *remoteSrcBase,
-                                __gm__ uint8_t *sdmaWorkspace, int nranks)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    comm::AsyncEvent lastEvent;
-    for (int rank = 0; rank < nranks; ++rank) {
-        GT dstG(localDstBase + rank * 1024, shape, stride);
-        GT srcG(remoteSrcBase + rank * 1024, shape, stride);
-        lastEvent = comm::TGET_ASYNC(dstG, srcG, session);
-    }
-    (void)lastEvent.Wait(session);  // single Wait drains all pending ops
+comm::AsyncEvent lastEvent;
+for (int rank = 0; rank < nranks; ++rank) {
+    GT dstG(localDstBase + rank * 1024, shape, stride);
+    GT srcG(remoteSrcBase + rank * 1024, shape, stride);
+    lastEvent = comm::TGET_ASYNC(dstG, srcG, session);
 }
+(void)lastEvent.Wait(session);  // single Wait drains all pending ops
 ```
 
-### URMA Example (NPU_ARCH 3510)
+### URMA (Ascend950)
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
+comm::AsyncSession session;
+if (!comm::BuildAsyncSession<comm::DmaEngine::URMA>(urmaWorkspace, srcRankId, session))
+    return;
 
-using namespace pto;
-
-template <typename T>
-__global__ AICORE void SimpleGetUrma(__gm__ T *localDst, __gm__ T *remoteSrc,
-                                     __gm__ uint8_t *urmaWorkspace, uint32_t srcRankId)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(localDst, shape, stride);
-    GT srcG(remoteSrc, shape, stride);
+auto event = comm::TGET_ASYNC<comm::DmaEngine::URMA>(dstG, srcG, session);
+(void)event.Wait(session);
+```
 
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::URMA>(urmaWorkspace, srcRankId, session)) {
-        return;
-    }
+## Related Ops / Instruction Set Links
 
-    auto event = comm::TGET_ASYNC<comm::DmaEngine::URMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Synchronous counterpart: [TGET](./TGET.md)
+- Async write: [TPUT_ASYNC](./TPUT_ASYNC.md)
+- Instruction set: [Other and Communication](../other/README.md)
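Note on the `TGET_ASYNC.md` changes above: the quiet semantics described under Mechanism (a single `Wait` drains every operation issued since the previous `Wait`) can be illustrated with a small host-side model. This is only a sketch under stated assumptions: `MockSession`, `MockEvent`, `IssueGet`, and `PendingCount` are hypothetical names invented here and are not part of the PTO API, and the behaviour of `Wait` on an invalid event (`handle == 0`) is assumed, not taken from the implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of quiet semantics; not the PTO API.
struct MockSession {
    std::uint64_t issued = 0;   // operations issued so far
    std::uint64_t drained = 0;  // operations known to be complete
};

struct MockEvent {
    std::uint64_t handle;  // 0 marks an invalid event, mirroring `handle == 0`

    // Models event.Wait(session): completes everything issued since the
    // previous Wait, regardless of which event object the caller holds.
    bool Wait(MockSession &session) const {
        if (handle == 0) {
            return false;  // assumption: waiting on an invalid event fails
        }
        session.drained = session.issued;
        return true;
    }
};

// Models issuing one asynchronous get: records the operation and returns
// immediately with a non-zero handle.
inline MockEvent IssueGet(MockSession &session) {
    ++session.issued;
    return MockEvent{session.issued};
}

// Number of operations issued but not yet drained by a Wait.
inline std::uint64_t PendingCount(const MockSession &session) {
    assert(session.issued >= session.drained);
    return session.issued - session.drained;
}
```

In this model, issuing four gets and calling `Wait` on the last returned event leaves `PendingCount` at zero, which is the pattern the batch-transfer example relies on.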
diff --git a/docs/isa/comm/TGET_ASYNC_zh.md b/docs/isa/comm/TGET_ASYNC_zh.md
index d3e7dd60..980f07a1 100644
--- a/docs/isa/comm/TGET_ASYNC_zh.md
+++ b/docs/isa/comm/TGET_ASYNC_zh.md
@@ -1,224 +1,276 @@
-# TGET_ASYNC
-
-## 简介
-
-`TGET_ASYNC` 是异步远程读原语。它启动一次从远端 GM 到本地 GM 的传输，并立即返回 `AsyncEvent`。
-
-数据流：
-
-`srcGlobalData（远端 GM）` → DMA 引擎 → `dstGlobalData（本地 GM）`
-
-## 模板参数
-
-- `engine`：
-    - `DmaEngine::SDMA`（默认）
-    - `DmaEngine::URMA`（Ascend950，仅 NPU_ARCH 3510）
-
-> **注意（SDMA 路径）**
-> `TGET_ASYNC` 配合 `DmaEngine::SDMA` 目前**仅支持扁平连续的逻辑一维 tensor**。
-> 当前 SDMA 异步实现不支持非一维或非连续布局。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA,
-          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
-PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                               const AsyncSession &session, WaitEvents &... events);
-```
-
-`AsyncSession` 是引擎无关的会话对象。使用 `BuildAsyncSession<engine>()` 构建一次后，传递给所有异步调用和事件等待。模板参数 `engine` 在编译期选择 DMA 后端，使代码对未来引擎（CCU 等）保持前向兼容。
-
-## AsyncSession 构建
-
-使用 `include/pto/comm/async/async_event_impl.hpp` 中的 `BuildAsyncSession`。
-该函数有两个重载——分别用于 SDMA 和 URMA，参数列表不同。
-
-### SDMA 构建（默认）
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
-PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
-                                    __gm__ uint8_t *workspace,
-                                    AsyncSession &session,
-                                    uint32_t syncId = 0,
-                                    const sdma::SdmaBaseConfig &baseConfig = {sdma::kDefaultSdmaBlockBytes, 0, 1},
-                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
-```
-
-| 参数 | 默认值 | 说明 |
-|---|---|---|
-| `scratchTile` | — | 用于 SDMA 控制元数据的 UB scratch tile（参见 [scratchTile 的作用](#scratchtile-的作用)）。|
-| `workspace` | — | 由主机侧 `SdmaWorkspaceManager` 分配的 GM 指针。|
-| `session` | — | 输出的 `AsyncSession` 对象。|
-| `syncId` | `0` | MTE3/MTE2 管道同步事件 ID（0-7）。若 kernel 在相同 ID 上使用了其他管道屏障，则需覆盖此值。|
-| `baseConfig` | `{kDefaultSdmaBlockBytes, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`。适用于大多数单队列传输场景。|
-| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA 通道组索引。默认内部使用 `get_block_idx()` 映射到当前 AI Core。多 block 或自定义通道映射场景下需覆盖此值。|
-
-### URMA 构建（仅 NPU_ARCH 3510）
-
-> URMA（User-level RDMA Memory Access）是 Ascend950（NPU_ARCH 3510）上的硬件加速 RDMA 传输引擎。
-
-```cpp
-#ifdef PTO_URMA_SUPPORTED
-template <DmaEngine engine>
-PTO_INTERNAL bool BuildAsyncSession(__gm__ uint8_t *workspace,
-                                    uint32_t destRankId,
-                                    AsyncSession &session);
-#endif
-```
-
-| 参数 | 说明 |
-|---|---|
-| `workspace` | 由主机侧 `UrmaWorkspaceManager` 分配的 GM 指针。|
-| `destRankId` | 此会话通信的远端 PE rank id。对于 `TGET_ASYNC`，这是数据来源的源 rank。|
-| `session` | 输出的 `AsyncSession` 对象。|
-
-URMA 不需要 `scratchTile`——轮询通过 `ld_dev`/`st_dev` 硬件原语直接操作。
-
-## 约束
-
-- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
-- `GlobalSrcData::layout == GlobalDstData::layout`
-- SDMA 和 URMA 路径均要求源 tensor 为**扁平连续的逻辑一维**
-- SDMA workspace 必须是由主机侧 `SdmaWorkspaceManager` 分配的有效 GM 指针
-- URMA workspace 必须是由主机侧 `UrmaWorkspaceManager` 分配的有效 GM 指针
-- URMA 仅在 NPU_ARCH 3510（Ascend950）上可用
-- 传给 `UrmaWorkspaceManager::Init()` 的对称数据缓冲区必须由大页内存支撑（使用 `ACL_MEM_MALLOC_HUGE_ONLY` 分配）。底层 MR 注册要求大页背景；`ACL_MEM_MALLOC_HUGE_FIRST` 在小尺寸分配时可能静默回退到 4KB 小页，导致注册失败
-
-若不满足一维连续要求，当前实现返回无效 async event（`handle == 0`）。
-
-## scratchTile 的作用
-
-`scratchTile` **不是**用于传输数据负载的暂存缓冲区。
-它被转换为 `TmpBuffer`，用作临时 UB 工作区，用于：
-
-- 写入/读取 SDMA 控制字（flag、sq_tail、channel_info）
-- 轮询事件完成标志
-- 完成时提交队列尾部
-
-实际数据路径为远端 GM → DMA 引擎 → 本地 GM；`scratchTile` 仅用于控制和同步元数据。
-
-## scratchTile 类型与大小约束
-
-- 必须是 `pto::Tile` 类型
-- 必须是 UB/Vec tile（`ScratchTile::Loc == TileType::Vec`）
-- 可用字节数至少为 `sizeof(uint64_t)`（8 字节）
-
-推荐使用：`Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>`（256B）。
-
-## 完成语义（Quiet 语义）
-
-不同引擎的底层完成机制不同，但用户侧的 quiet 语义行为一致：
-
-- **SDMA**：`TGET_ASYNC` 仅提交数据传输 SQE，flag SQE 延迟到 `Wait` 时提交，通过轮询 flag 判断完成。
-- **URMA**：`TGET_ASYNC` 立即提交 RDMA READ WQE 并敲门铃。`Wait` 通过轮询 Completion Queue（CQ）等待所有预期的 CQE 被消费。
-
-- `event.Wait(session)` — 阻塞，直到**自上次 Wait 以来所有已发出的异步操作**全部完成
-
-这意味着多次 `TGET_ASYNC` 调用后，只需对最后一个返回的 `AsyncEvent` 调用一次 `Wait`，即可等待所有 pending 操作完成（类似 shmem 的 quiet 语义）。
-
-wait 成功后，所有已发出的 `dstGlobalData` 读入数据均已全部就绪。
-
-## 示例
-
-### 单次传输
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
-
-using namespace pto;
-
-template <typename T>
-__global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
-                                 __gm__ uint8_t *sdmaWorkspace)
-{
-    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(localDst,  shape, stride);
-    GT srcG(remoteSrc, shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
-
-### 批量传输（Quiet 语义）
-
-```cpp
-template <typename T>
-__global__ AICORE void BatchGet(__gm__ T *localDstBase, __gm__ T *remoteSrcBase,
-                                __gm__ uint8_t *sdmaWorkspace, int nranks)
-{
-    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    comm::AsyncEvent lastEvent;
-    for (int rank = 0; rank < nranks; ++rank) {
-        GT dstG(localDstBase + rank * 1024, shape, stride);
-        GT srcG(remoteSrcBase + rank * 1024, shape, stride);
-        lastEvent = comm::TGET_ASYNC(dstG, srcG, session);
-    }
-    (void)lastEvent.Wait(session);  // 一次 Wait 等待所有 pending 操作
-}
-```
-
-### URMA 示例（NPU_ARCH 3510）
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
-
-using namespace pto;
-
-template <typename T>
-__global__ AICORE void SimpleGetUrma(__gm__ T *localDst, __gm__ T *remoteSrc,
-                                     __gm__ uint8_t *urmaWorkspace, uint32_t srcRankId)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(localDst, shape, stride);
-    GT srcG(remoteSrc, shape, stride);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::URMA>(urmaWorkspace, srcRankId, session)) {
-        return;
-    }
-
-    auto event = comm::TGET_ASYNC<comm::DmaEngine::URMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
+# TGET_ASYNC
+
+## 概述
+
+`TGET_ASYNC` 是异步远程读原语：启动从远端 GM 到本地 GM 的 DMA 传输后立即返回 `AsyncEvent`，不阻塞。稍后通过事件等待传输完成。
+
+支持两种 DMA 引擎：SDMA（默认，所有目标均可用）和 URMA（硬件 RDMA，仅 Ascend950 / NPU_ARCH 3510 可用）。
+
+## 机制
+
+`TGET_ASYNC` 启动从远端 GM 到本地 GM 的 DMA 传输后立即返回：
+
+```
+srcGlobalData（远端 GM） → DMA 引擎 → dstGlobalData（本地 GM）
+```
+
+`AsyncSession` 管理引擎无关的异步状态。发出一个或多个异步操作后，调用 `event.Wait(session)` 阻塞直到所有 pending 操作完成（quiet 语义——一次 `Wait` 等待自上次 `Wait` 以来发出的所有操作）。
+
+不同引擎的完成机制：
+
+- **SDMA**：只提交数据传输 SQE，flag SQE 延迟到 `Wait` 时提交，通过轮询 flag 判断完成。
+- **URMA**：立即提交 RDMA READ WQE 并敲门铃，`Wait` 轮询 Completion Queue 直到所有预期的 CQE 被消费。
+
+## 模板参数
+
+- `engine`：
+    - `DmaEngine::SDMA`（默认）
+    - `DmaEngine::URMA`（Ascend950，仅 NPU_ARCH 3510）
+
+> **注意（SDMA 路径）**
+> `TGET_ASYNC` 配合 `DmaEngine::SDMA` 目前**仅支持扁平连续的逻辑一维 tensor**。
+> 当前 SDMA 异步实现不支持非一维或非连续布局。
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA,
+          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
+PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                               const AsyncSession &session, WaitEvents &... events);
+```
+
+`AsyncSession` 是引擎无关的会话对象。使用 `BuildAsyncSession<engine>()` 构建一次后，传递给所有异步调用和事件等待。模板参数 `engine` 在编译期选择 DMA 后端，使代码对未来引擎（CCU 等）保持前向兼容。
+
+## 输入
+
+| 操作数 | 类型 | 说明 |
+|--------|------|------|
+| `dstGlobalData` | `GlobalTensor` | 本地目标，必须为扁平连续一维 |
+| `srcGlobalData` | `GlobalTensor` | 远端源，必须为扁平连续一维 |
+| `session` | `AsyncSession` | 引擎无关的会话对象 |
+| `WaitEvents...` | `RecordEvent...` | 发指令前要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+|------|------|------|
+| `AsyncEvent` | event | 后续用于 `Wait` 调用的句柄 |
+
+## 副作用
+
+此操作启动从远端 GM 到本地 GM 的 DMA 传输。完成时机延后到 `Wait` 调用。
+
+## AsyncSession 构建
+
+使用 `include/pto/comm/async/async_event_impl.hpp` 中的 `BuildAsyncSession`。
+该函数有两个重载——分别用于 SDMA 和 URMA，参数列表不同。
+
+### SDMA 构建（默认）
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
+PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
+                                    __gm__ uint8_t *workspace,
+                                    AsyncSession &session,
+                                    uint32_t syncId = 0,
+                                    const sdma::SdmaBaseConfig &baseConfig = {sdma::kDefaultSdmaBlockBytes, 0, 1},
+                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
+```
+
+| 参数 | 默认值 | 说明 |
+|---|---|---|
+| `scratchTile` | — | 用于 SDMA 控制元数据的 UB scratch tile（参见 [scratchTile 的作用](#scratchtile-的作用)）。|
+| `workspace` | — | 由主机侧 `SdmaWorkspaceManager` 分配的 GM 指针。|
+| `session` | — | 输出的 `AsyncSession` 对象。|
+| `syncId` | `0` | MTE3/MTE2 管道同步事件 ID（0-7）。若 kernel 在相同 ID 上使用了其他管道屏障，则需覆盖此值。|
+| `baseConfig` | `{kDefaultSdmaBlockBytes, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`。适用于大多数单队列传输场景。|
+| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA 通道组索引。默认内部使用 `get_block_idx()` 映射到当前 AI Core。多 block 或自定义通道映射场景下需覆盖此值。|
+
+### URMA 构建（仅 NPU_ARCH 3510）
+
+> URMA（User-level RDMA Memory Access）是 Ascend950（NPU_ARCH 3510）上的硬件加速 RDMA 传输引擎。
+
+```cpp
+#ifdef PTO_URMA_SUPPORTED
+template <DmaEngine engine>
+PTO_INTERNAL bool BuildAsyncSession(__gm__ uint8_t *workspace,
+                                    uint32_t destRankId,
+                                    AsyncSession &session);
+#endif
+```
+
+| 参数 | 说明 |
+|---|---|
+| `workspace` | 由主机侧 `UrmaWorkspaceManager` 分配的 GM 指针。|
+| `destRankId` | 此会话通信的远端 PE rank id。对于 `TGET_ASYNC`，这是数据来源的源 rank。|
+| `session` | 输出的 `AsyncSession` 对象。|
+
+URMA 不需要 `scratchTile`——轮询通过 `ld_dev`/`st_dev` 硬件原语直接操作。
+
+## 约束
+
+### 类型约束
+
+- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
+- `GlobalSrcData::layout == GlobalDstData::layout`
+
+### 传输约束
+
+- SDMA 和 URMA 路径均要求源 tensor 为**扁平连续的逻辑一维**
+- 若不满足一维连续要求，当前实现返回无效 async event（`handle == 0`）
+
+### 内存约束
+
+- SDMA workspace 必须是由主机侧 `SdmaWorkspaceManager` 分配的有效 GM 指针
+- URMA workspace 必须是由主机侧 `UrmaWorkspaceManager` 分配的有效 GM 指针
+- URMA 仅在 NPU_ARCH 3510（Ascend950）上可用
+- 传给 `UrmaWorkspaceManager::Init()` 的对称数据缓冲区必须由大页内存支撑（使用 `ACL_MEM_MALLOC_HUGE_ONLY` 分配）
+
+## scratchTile 的作用
+
+`scratchTile` **不是**用于传输数据负载的暂存缓冲区。
+它被转换为 `TmpBuffer`，用作临时 UB 工作区，用于：
+
+- 写入/读取 SDMA 控制字（flag、sq_tail、channel_info）
+- 轮询事件完成标志
+- 完成时提交队列尾部
+
+实际数据路径为远端 GM → DMA 引擎 → 本地 GM；`scratchTile` 仅用于控制和同步元数据。
+
+## scratchTile 类型与大小约束
+
+- 必须是 `pto::Tile` 类型
+- 必须是 UB/Vec tile（`ScratchTile::Loc == TileType::Vec`）
+- 可用字节数至少为 `sizeof(uint64_t)`（8 字节）
+
+推荐使用：`Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>`（256B）。
+
+## 完成语义（Quiet 语义）
+
+不同引擎的底层完成机制不同，但用户侧的 quiet 语义行为一致：
+
+- **SDMA**：`TGET_ASYNC` 仅提交数据传输 SQE，flag SQE 延迟到 `Wait` 时提交，通过轮询 flag 判断完成。
+- **URMA**：`TGET_ASYNC` 立即提交 RDMA READ WQE 并敲门铃。`Wait` 通过轮询 Completion Queue（CQ）等待所有预期的 CQE 被消费。
+
+- `event.Wait(session)` — 阻塞，直到**自上次 Wait 以来所有已发出的异步操作**全部完成
+
+这意味着多次 `TGET_ASYNC` 调用后，只需对最后一个返回的 `AsyncEvent` 调用一次 `Wait`，即可等待所有 pending 操作完成（类似 shmem 的 quiet 语义）。
+
+wait 成功后，所有已发出的 `dstGlobalData` 读入数据均已全部就绪。
+
+## 示例
+
+### 单次传输
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/common/pto_tile.hpp>
+
+using namespace pto;
+
+template <typename T>
+__global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
+                                 __gm__ uint8_t *sdmaWorkspace)
+{
+    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+    GT dstG(localDst,  shape, stride);
+    GT srcG(remoteSrc, shape, stride);
+
+    ScratchTile scratchTile;
+    TASSIGN(scratchTile, 0x0);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
+        return;
+    }
+
+    auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
+    (void)event.Wait(session);
+}
+```
+
+### 批量传输（Quiet 语义）
+
+```cpp
+template <typename T>
+__global__ AICORE void BatchGet(__gm__ T *localDstBase, __gm__ T *remoteSrcBase,
+                                __gm__ uint8_t *sdmaWorkspace, int nranks)
+{
+    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+
+    ScratchTile scratchTile;
+    TASSIGN(scratchTile, 0x0);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
+        return;
+    }
+
+    comm::AsyncEvent lastEvent;
+    for (int rank = 0; rank < nranks; ++rank) {
+        GT dstG(localDstBase + rank * 1024, shape, stride);
+        GT srcG(remoteSrcBase + rank * 1024, shape, stride);
+        lastEvent = comm::TGET_ASYNC(dstG, srcG, session);
+    }
+    (void)lastEvent.Wait(session);  // 一次 Wait 等待所有 pending 操作
+}
+```
+
+### URMA 示例（NPU_ARCH 3510）
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/common/pto_tile.hpp>
+
+using namespace pto;
+
+template <typename T>
+__global__ AICORE void SimpleGetUrma(__gm__ T *localDst, __gm__ T *remoteSrc,
+                                     __gm__ uint8_t *urmaWorkspace, uint32_t srcRankId)
+{
+    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+    GT dstG(localDst, shape, stride);
+    GT srcG(remoteSrc, shape, stride);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession<comm::DmaEngine::URMA>(urmaWorkspace, srcRankId, session)) {
+        return;
+    }
+
+    auto event = comm::TGET_ASYNC<comm::DmaEngine::URMA>(dstG, srcG, session);
+    (void)event.Wait(session);
+}
+```
+
+## 目标Profile限制
+
+- SDMA 在所有目标上可用。URMA 仅在 Ascend950（NPU_ARCH 3510）上可用。
+- CPU 模拟器不支持异步通信操作。
+- `AsyncSession` 是引擎无关的；切换引擎需要重新编译。
+
+## 相关页面
+
+- 通信概述：[通信与运行时](../other/communication-and-runtime_zh.md)
+- 同步对应：[TGET](./TGET_zh.md)
+- 异步写：[TPUT_ASYNC](./TPUT_ASYNC_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
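Note on the `TGET_ASYNC_zh.md` changes above: the transfer constraint (flat contiguous logical 1D tensors only, otherwise an invalid event with `handle == 0` is returned) could be pre-checked on the caller side. The sketch below shows one plausible reading of that constraint: all outer extents are 1 and the innermost stride is 1. `IsFlatContiguous1D` is a hypothetical helper, not a PTO function, and the check performed by the actual implementation may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Dims5 = std::array<std::int64_t, 5>;

// Hypothetical pre-check, not part of PTO. Assumed reading of
// "flat contiguous logical 1D": every outer extent is 1, the innermost
// extent is positive, and innermost elements are adjacent (stride 1).
inline bool IsFlatContiguous1D(const Dims5 &shape, const Dims5 &stride) {
    for (std::size_t i = 0; i + 1 < shape.size(); ++i) {
        if (shape[i] != 1) {
            return false;  // more than one logical dimension carries data
        }
    }
    return shape[4] > 0 && stride[4] == 1;
}

// The shape/stride pair used in the documented examples passes the check.
inline void CheckDocumentedExample() {
    const Dims5 shape{1, 1, 1, 1, 1024};
    const Dims5 stride{1024, 1024, 1024, 1024, 1};
    assert(IsFlatContiguous1D(shape, stride));
}
```

A caller could run such a check before `TGET_ASYNC` and fall back to the synchronous `TGET` path for other layouts; whether that fallback is appropriate depends on the kernel.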
diff --git a/docs/isa/comm/TGET_zh.md b/docs/isa/comm/TGET_zh.md
index b6028a93..ed0bf671 100644
--- a/docs/isa/comm/TGET_zh.md
+++ b/docs/isa/comm/TGET_zh.md
@@ -1,6 +1,6 @@
 # pto.tget / TGET
 
-## 简介
+## 概述
 
 `TGET` 是远程读原语：把远端 NPU 上的 GM 数据读到当前 NPU 的本地 GM。`pto.tget` 是 IR 形式，`TGET` 是 C++ intrinsic 形式，两者描述的是同一条通信指令。
 
@@ -12,13 +12,9 @@
 
 当 `GlobalTensor` 的行或列超出单个 UB Tile 的容量时，`TGET` 会自动沿 `DIM_3` 和 `DIM_4` 做二维滑动分块，不需要手工把传输拆成小块。
 
-## 数学语义
+只有本地 NPU 执行 TGET；远端 NPU 是被动的。
 
-对有效区域中的每个元素 `(i, j)`：
-
-$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \mathrm{src}^{\mathrm{remote}}_{i,j} $$
-
-## 汇编语法
+## 语法
 
 PTO-AS 形式：
 
@@ -53,6 +49,26 @@ PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData,
                           WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 类型 | 描述 |
+|--------|------|------|
+| `dstGlobalData` | `GlobalTensor` | 本地目标，必须指向本地 GM |
+| `srcGlobalData` | `GlobalTensor` | 远端源，必须指向远端 NPU 的 GM |
+| `stagingTileData` | `Tile` | UB 暂存 Tile，用于 GM→UB→GM 传输路径 |
+| `pingTile` / `pongTile` | `Tile` | 用于乒乓双缓冲的两个 UB 暂存 Tile |
+| `WaitEvents...` | `RecordEvent...` | 在发起 GET 前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 描述 |
+|------|------|------|
+| `RecordEvent` | event | 标记远程读取完成的事件令牌 |
+
+## 副作用
+
+此操作从远端 GM 读取数据并写入本地 GM。它通过返回的事件令牌建立同步边界。
+
 ## 约束
 
 ### 类型约束
@@ -67,11 +83,22 @@ PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData,
 - `dstGlobalData` 必须指向本地地址（当前 NPU）
 - `stagingTileData`、`pingTile`、`pongTile` 必须预先在 UB 中分配
 
+### 传输约束
+
+- 传输大小由 `GlobalTensor` 的 shape 决定；自动分块以适配 UB 暂存缓冲区
+- 自动分块时，行（`DIM_3`）和列（`DIM_4`）会根据需要细分
+
 ### 乒乓约束
 
 - `pingTile` 与 `pongTile` 的类型和维度必须一致
 - 两者必须位于不重叠的 UB 偏移
 
+## 目标Profile限制
+
+- 点对点通信仅在 A2/A3 和 A5 Profile 上支持。CPU 模拟不支持远程内存访问。
+- 传输大张量时请使用乒乓双缓冲，以重叠连续传输，提高流水线利用率。
+- `TGET` 需要有效的远端 GM 地址；远端 NPU 必须已分配对应的内存区域。
+
 ## 示例
 
 ### 基础形式
@@ -128,6 +155,6 @@ comm::TGET(dstG, srcG, pingTile, pongTile);
 ## 相关页面
 
 - [通信与运行时](../other/communication-and-runtime_zh.md)
-- [TPUT](./TPUT_zh.md)
-- [TBROADCAST](./TBROADCAST_zh.md)
-- [TGATHER](./TGATHER_zh.md)
+- 逆操作：[TPUT](./TPUT_zh.md)
+- 集合通信：[TBROADCAST](./TBROADCAST_zh.md)、[TGATHER](./TGATHER_zh.md)、[TSCATTER](./TSCATTER_zh.md)、[TREDUCE](./TREDUCE_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
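Note on the `TGET_zh.md` changes above: the page states that `TGET` automatically applies two-dimensional sliding chunking along `DIM_3` (rows) and `DIM_4` (columns) when the tensor exceeds the capacity of one UB tile. The sketch below only illustrates what such a decomposition can look like. `Chunk` and `SlideChunks` are hypothetical names, and the row-major traversal order and the chunk sizes are assumptions, not the documented behaviour of the implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one sub-transfer; not a PTO type.
struct Chunk {
    std::int64_t row0;  // first row covered (DIM_3 offset)
    std::int64_t col0;  // first column covered (DIM_4 offset)
    std::int64_t rows;  // rows in this chunk, at most tileRows
    std::int64_t cols;  // columns in this chunk, at most tileCols
};

// Splits a rows x cols region into chunks that each fit a staging tile of
// tileRows x tileCols. Edge chunks are clipped to the region boundary.
inline std::vector<Chunk> SlideChunks(std::int64_t rows, std::int64_t cols,
                                      std::int64_t tileRows, std::int64_t tileCols) {
    assert(tileRows > 0 && tileCols > 0);
    std::vector<Chunk> chunks;
    for (std::int64_t r = 0; r < rows; r += tileRows) {
        for (std::int64_t c = 0; c < cols; c += tileCols) {
            chunks.push_back({r, c,
                              std::min(tileRows, rows - r),
                              std::min(tileCols, cols - c)});
        }
    }
    return chunks;
}
```

For a 5 x 300 region and a 2 x 128 staging tile this yields nine chunks, with the last one covering a single row of 44 columns.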
diff --git a/docs/isa/comm/TNOTIFY.md b/docs/isa/comm/TNOTIFY.md
index 65009b4a..470944f4 100644
--- a/docs/isa/comm/TNOTIFY.md
+++ b/docs/isa/comm/TNOTIFY.md
@@ -1,10 +1,16 @@
 ﻿# TNOTIFY
 
-## Introduction
+`TNOTIFY` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-Send flag notification to remote NPU. Used for lightweight synchronization between NPUs without transferring bulk data.
+## Summary
 
-## Math Interpretation
+Send a flag notification to a remote NPU. Used for lightweight inter-NPU synchronization without bulk data transfer. The remote signal is updated atomically.
+
+Used in conjunction with `TWAIT` (blocking) or `TTEST` (polling) on the receiver side to implement producer-consumer patterns.
+
+## Mechanism
+
+`TNOTIFY` writes to a remote signal location. The operation semantics depend on the selected operator:
 
 For `NotifyOp::Set`:
 
@@ -12,40 +18,63 @@ $$ \mathrm{signal}^{\mathrm{remote}} = \mathrm{value} $$
 
 For `NotifyOp::AtomicAdd`:
 
-$$ \mathrm{signal}^{\mathrm{remote}} \mathrel{+}= \mathrm{value} \quad (\text{atomic}) $$
+$$ \mathrm{signal}^{\mathrm{remote}} \mathrel{+}= \mathrm{value} \quad (\text{hardware atomic}) $$
 
-## Assembly Syntax
+## Syntax
 
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
+### PTO Assembly Form
 
 ```text
 tnotify %signal_remote, %value {op = #pto.notify_op<Set>} : (!pto.memref<i32>, i32)
 tnotify %signal_remote, %value {op = #pto.notify_op<AtomicAdd>} : (!pto.memref<i32>, i32)
 ```
 
-## C++ Intrinsic
+### C++ Intrinsic
 
 Declared in `include/pto/comm/pto_comm_inst.hpp`:
 
 ```cpp
 template <typename GlobalSignalData, typename... WaitEvents>
-PTO_INST void TNOTIFY(GlobalSignalData &dstSignalData, int32_t value, NotifyOp op, WaitEvents&... events);
+PTO_INST void TNOTIFY(GlobalSignalData &dstSignalData, int32_t value,
+                      NotifyOp op, WaitEvents&... events);
 ```
 
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `dstSignalData` | `GlobalSignalData` | Remote signal location; `GlobalSignalData::DType` must be `int32_t` |
+| `value` | `int32_t` | Value to set or add |
+| `op` | `NotifyOp` | Operator: `Set` (direct store) or `AtomicAdd` (hardware atomic) |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before issuing the notification |
+
+## Expected Outputs
+
+None. This is a fire-and-forget operation.
+
+## Side Effects
+
+This operation writes to remote global memory. It returns no event token; ordering against earlier operations is expressed through the `WaitEvents` arguments.
+
 ## Constraints
 
-- **Type constraints**:
-    - `GlobalSignalData::DType` must be `int32_t` (32-bit signal).
-- **Memory constraints**:
-    - `dstSignalData` must point to remote address (on target NPU).
-    - `dstSignalData` should be 4-byte aligned.
-- **Operation semantics**:
-    - `NotifyOp::Set`: Direct store to remote memory.
-    - `NotifyOp::AtomicAdd`: Hardware atomic add using `st_atomic` instruction.
+### Type constraints
+
+- `GlobalSignalData::DType` must be `int32_t`.
+
+### Memory constraints
+
+- `dstSignalData` must point to a remote address (on the target NPU).
+- The signal location should be 4-byte aligned.
+
+## Target-Profile Restrictions
+
+- `TNOTIFY` is supported on A2/A3 and A5 profiles. CPU simulation does not implement remote signal operations.
+- `AtomicAdd` uses hardware atomic store instructions; ensure the target NPU supports the atomic operation.
 
 ## Examples
 
-### Basic Set Notification
+### Basic set notification
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
@@ -54,47 +83,40 @@ using namespace pto;
 
 void notify_set(__gm__ int32_t* remote_signal) {
     comm::Signal sig(remote_signal);
-
-    // Set remote signal to 1
     comm::TNOTIFY(sig, 1, comm::NotifyOp::Set);
 }
 ```
 
-### Atomic Counter Increment
+### Atomic counter increment
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
 void atomic_increment(__gm__ int32_t* remote_counter) {
     comm::Signal counter(remote_counter);
-
-    // Atomically add 1 to remote counter
     comm::TNOTIFY(counter, 1, comm::NotifyOp::AtomicAdd);
 }
 ```
 
-### Producer-Consumer Pattern
+### Producer-consumer pattern
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// Producer: notify when data is ready
+// Producer: signals when data is ready
 void producer(__gm__ int32_t* remote_flag) {
     // ... produce data ...
-
     comm::Signal flag(remote_flag);
     comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
 }
 
-// Consumer: wait for data
+// Consumer: blocks until data is signaled
 void consumer(__gm__ int32_t* local_flag) {
     comm::Signal flag(local_flag);
     comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
-
     // ... consume data ...
 }
 ```
+
+## Related Ops / Instruction Set Links
+
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Blocking counterpart: [TWAIT](./TWAIT.md)
+- Non-blocking counterpart: [TTEST](./TTEST.md)
+- Instruction set: [Other and Communication](../other/README.md)
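Review note on the Mechanism section of TNOTIFY.md above: the two `NotifyOp` formulas can be modeled on the host with a `std::atomic` standing in for the remote signal word. This is a sketch under stated assumptions, not the PTO implementation: `notify_model` and `signal_equals` are hypothetical helpers, the local `NotifyOp` enum only mirrors the documented names, and the memory orders are illustrative choices.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical host-side stand-in for the documented operator names.
enum class NotifyOp { Set, AtomicAdd };

// Apply the documented effect of one notification to a 32-bit signal word.
inline void notify_model(std::atomic<std::int32_t> &signal, std::int32_t value,
                         NotifyOp op) {
    if (op == NotifyOp::Set) {
        // signal = value
        signal.store(value, std::memory_order_release);
    } else {
        // signal += value, performed as one indivisible read-modify-write
        signal.fetch_add(value, std::memory_order_acq_rel);
    }
}

// Hypothetical receiver-side check: one equality poll of the signal word.
inline bool signal_equals(const std::atomic<std::int32_t> &signal,
                          std::int32_t expected) {
    return signal.load(std::memory_order_acquire) == expected;
}
```

The model shows why `AtomicAdd` suits counters that several producers increment, while `Set` suits a single-writer ready flag.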
diff --git a/docs/isa/comm/TNOTIFY_zh.md b/docs/isa/comm/TNOTIFY_zh.md
index 029286bc..1764bedc 100644
--- a/docs/isa/comm/TNOTIFY_zh.md
+++ b/docs/isa/comm/TNOTIFY_zh.md
@@ -1,10 +1,14 @@
 # TNOTIFY
 
-## 简介
+`TNOTIFY` 是[通信与运行时](../other/communication-and-runtime_zh.md)指令集的一部分。
 
-`TNOTIFY` 向远端 NPU 发送标志通知，用于在不搬运大量数据的前提下建立轻量级同步。
+## 概述
 
-## 数学语义
+向远端 NPU 发送标志通知，用于在不搬运大量数据的前提下建立轻量级同步。常与 `TWAIT`（阻塞等待）或 `TTEST`（轮询）配合使用，实现基于标志的生产者-消费者同步。
+
+## 机制
+
+`TNOTIFY` 向远端信号地址执行写操作。行为取决于选定的运算符：
 
 `NotifyOp::Set`：
 
@@ -12,37 +16,63 @@ $$ \mathrm{signal}^{\mathrm{remote}} = \mathrm{value} $$
 
 `NotifyOp::AtomicAdd`：
 
-$$ \mathrm{signal}^{\mathrm{remote}} \mathrel{+}= \mathrm{value} $$
+$$ \mathrm{signal}^{\mathrm{remote}} \mathrel{+}= \mathrm{value} \quad (\text{硬件原子}) $$
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：
+### PTO 汇编形式
 
 ```text
 tnotify %signal_remote, %value {op = #pto.notify_op<Set>} : (!pto.memref<i32>, i32)
 tnotify %signal_remote, %value {op = #pto.notify_op<AtomicAdd>} : (!pto.memref<i32>, i32)
 ```
 
-## C++ 内建接口
+### C++ 内建接口
 
 声明于 `include/pto/comm/pto_comm_inst.hpp`：
 
 ```cpp
 template <typename GlobalSignalData, typename... WaitEvents>
-PTO_INST void TNOTIFY(GlobalSignalData &dstSignalData, int32_t value, NotifyOp op, WaitEvents&... events);
+PTO_INST void TNOTIFY(GlobalSignalData &dstSignalData, int32_t value,
+                      NotifyOp op, WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 类型 | 说明 |
+|--------|------|------|
+| `dstSignalData` | `GlobalSignalData` | 远端信号地址；必须为 `int32_t` |
+| `value` | `int32_t` | 要写入或加上的值 |
+| `op` | `NotifyOp` | 运算符：`Set`（直接写入）或 `AtomicAdd`（硬件原子） |
+| `WaitEvents...` | `RecordEvent...` | 发指令前要等待的事件 |
+
+## 预期输出
+
+无。此操作是即发即忘型（fire-and-forget）。
+
+## 副作用
+
+此操作向远端全局内存写入数据。`TNOTIFY` 没有返回值；与先前操作的先后顺序通过 `WaitEvents` 参数表达。
+
 ## 约束
 
+### 类型约束
+
 - `GlobalSignalData::DType` 必须为 `int32_t`
-- `dstSignalData` 必须指向远端地址
-- `dstSignalData` 建议满足 4 字节对齐
-- `NotifyOp::Set` 表示直接写入
-- `NotifyOp::AtomicAdd` 表示原子加
+
+### 内存约束
+
+- `dstSignalData` 必须指向远端地址（目标 NPU）
+- 信号地址建议满足 4 字节对齐
+
+## 目标 Profile 限制
+
+- `TNOTIFY` 在 A2/A3 和 A5 上支持。CPU 模拟器不支持远端信号操作。
+- `AtomicAdd` 使用硬件原子存储指令；确保目标 NPU 支持该原子操作。
 
 ## 示例
 
-### 基础通知
+### 基础 Set 通知
 
 ```cpp
 void notify_set(__gm__ int32_t* remote_signal) {
@@ -59,3 +89,26 @@ void atomic_increment(__gm__ int32_t* remote_counter) {
     comm::TNOTIFY(counter, 1, comm::NotifyOp::AtomicAdd);
 }
 ```
+
+### 生产者-消费者模式
+
+```cpp
+// 生产者：数据就绪后发信号
+void producer(__gm__ int32_t* remote_flag) {
+    comm::Signal flag(remote_flag);
+    comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
+}
+
+// 消费者：阻塞等待数据
+void consumer(__gm__ int32_t* local_flag) {
+    comm::Signal flag(local_flag);
+    comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
+}
+```
+
+## 相关页面
+
+- 通信概述：[通信与运行时](../other/communication-and-runtime_zh.md)
+- 阻塞等待：[TWAIT](./TWAIT_zh.md)
+- 非阻塞轮询：[TTEST](./TTEST_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
diff --git a/docs/isa/comm/TPUT.md b/docs/isa/comm/TPUT.md
index dbbccafa..27d8daff 100644
--- a/docs/isa/comm/TPUT.md
+++ b/docs/isa/comm/TPUT.md
@@ -1,83 +1,129 @@
 ﻿# TPUT
 
-## Introduction
+`TPUT` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-Remote write operation: write local data to remote NPU's memory. Data is transferred via a UB tile as intermediate staging buffer.
+## Summary
 
-When the GlobalTensor exceeds the UB tile capacity, TPUT automatically performs **2D sliding** — chunking rows (DIM_3) and columns (DIM_4) to fit each chunk into the tile, iterating over all outer dimensions (DIM_0, DIM_1, DIM_2).
+Remote write operation: copy data from the local NPU's global memory to a remote NPU's global memory. Data traverses a UB staging tile as an intermediate buffer. When the GlobalTensor exceeds the UB tile capacity, TPUT automatically chunks the transfer via 2D sliding.
 
-## Math Interpretation
+Only the local NPU executes TPUT; the remote NPU is passive.
+
+## Mechanism
+
+`TPUT` reads from local global memory and writes to remote global memory. The data path is: local GM → staging tile (UB) → remote GM.
 
 For each element `(i, j)` in the valid region:
 
 $$ \mathrm{dst}^{\mathrm{remote}}_{i,j} = \mathrm{src}^{\mathrm{local}}_{i,j} $$
 
-Data flow: `srcGlobalData (local GM)` → `stagingTileData (UB)` → `dstGlobalData (remote GM)`
+The data flow in full:
 
-## Assembly Syntax
+```text
+srcGlobalData (local GM) → stagingTileData (UB) → dstGlobalData (remote GM)
+```
 
-PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+## Syntax
 
-Synchronous form:
+### PTO Assembly Form
 
 ```text
 tput %dst_remote, %src_local : (!pto.memref<...>, !pto.memref<...>)
 ```
-Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
 
-## C++ Intrinsic
+UB staging tiles are introduced during lowering. The C++ intrinsic exposes them explicitly.
 
-Declared in `include/pto/comm/pto_comm_inst.hpp`
+## C++ Intrinsic
 
-### Single-tile (auto-chunking)
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
 
 ```cpp
+// Single-tile form — auto-chunking for large tensors
 template <AtomicType atomicType = AtomicType::AtomicNone,
-          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+          typename GlobalDstData, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                           TileData &stagingTileData, WaitEvents&... events);
-```
-
-### Ping-pong double buffering
 
-Uses two staging tiles to overlap TLOAD and TSTORE for adjacent chunks, hiding one DMA transfer behind the other.
-
-```cpp
+// Ping-pong double buffering form
 template <AtomicType atomicType = AtomicType::AtomicNone,
-          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+          typename GlobalDstData, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                           TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-### Runtime atomic type
 
-```cpp
-template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+// Runtime atomic type selection
+template <typename GlobalDstData, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                           TileData &stagingTileData, AtomicType atomicType, WaitEvents&... events);
 ```
 
+### Atomic types
+
+| Value | Behavior |
+|-------|----------|
+| `AtomicType::AtomicNone` | Direct write, no atomic semantics |
+| `AtomicType::AtomicAdd` | Atomically add source value to destination |
+
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `dstGlobalData` | `GlobalTensor` | Remote destination; must point to target NPU's GM |
+| `srcGlobalData` | `GlobalTensor` | Local source; must point to current NPU's GM |
+| `stagingTileData` | `Tile` | UB staging tile for the GM→UB→GM transfer path |
+| `pingTile` / `pongTile` | `Tile` | Two UB staging tiles for ping-pong double buffering |
+| `atomicType` | `AtomicType` | Atomic operation mode (optional; defaults to `AtomicNone`) |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before issuing the put |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | event | Token signaling completion of the remote write |
+
+## Side Effects
+
+This operation reads from local global memory and writes to remote global memory. It establishes synchronization edges through the returned event token.
+
 ## Constraints
 
-- **Type constraints**:
-    - `GlobalSrcData::RawDType` must equal `GlobalDstData::RawDType`.
-    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
-    - `GlobalSrcData::layout` must equal `GlobalDstData::layout`.
-- **Memory constraints**:
-    - `dstGlobalData` must point to remote address (on target NPU).
-    - `srcGlobalData` must point to local address (on current NPU).
-    - `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in Unified Buffer.
-- **Valid region**:
-    - Transfer size is determined by `GlobalTensor` shape (auto-chunked to fit tile).
-- **Atomic operation**:
-    - `atomicType` supports `AtomicNone` and `AtomicAdd`.
-- **Ping-pong**:
-    - `pingTile` and `pongTile` must have the same type and dimensions.
-    - Must reside at non-overlapping UB offsets.
+### Type constraints
+
+- `GlobalSrcData::RawDType` must equal `GlobalDstData::RawDType`.
+- `TileData::DType` must equal `GlobalSrcData::RawDType`.
+- `GlobalSrcData::layout` must equal `GlobalDstData::layout`.
+
+### Memory constraints
+
+- `dstGlobalData` must point to a remote address (on the target NPU).
+- `srcGlobalData` must point to a local address (on the current NPU).
+- `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in UB.
+
+### Transfer constraints
+
+- Transfer size is determined by the `GlobalTensor` shape; auto-chunking tiles data to fit the UB staging buffer.
+- When auto-chunking, rows (DIM_3) and columns (DIM_4) are subdivided as needed.
+
+### Atomic constraints
+
+- `atomicType` supports `AtomicNone` and `AtomicAdd`.
+
+### Ping-pong constraints
+
+- `pingTile` and `pongTile` must have identical type and dimensions.
+- They must reside at non-overlapping UB offsets.
+
+## Target-Profile Restrictions
+
+- Point-to-point communication is supported on A2/A3 and A5 profiles. CPU simulation does not support remote memory access.
+- Use ping-pong double buffering when transferring large tensors to overlap consecutive transfers.
+- `TPUT` requires a valid remote GM address; the remote NPU must have the corresponding memory region allocated.
+- `AtomicAdd` is useful for distributed accumulation patterns.
 
 ## Examples
 
-### Basic Usage
+### Basic remote write
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
@@ -90,11 +136,6 @@ void example_tput(__gm__ T* local_data, __gm__ T* remote_addr) {
     using TileT = Tile<TileType::Vec, T, 16, 16>;
     using GShape = Shape<1, 1, 1, 16, 16>;
     using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-    /*
-    If the globalTensor is larger than UB Tile, TPUT will perform 2D sliding automatically.
-    using GShape = Shape<1, 1, 1, 4096, 4096>;
-    using GStride = BaseShape2D<T, 4096, 4096, Layout::ND>;
-    */
     using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
 
     GTensor srcG(local_data);
@@ -102,30 +143,38 @@ void example_tput(__gm__ T* local_data, __gm__ T* remote_addr) {
     TileT stagingTile;
     TASSIGN(stagingTile, 0);
 
-    // Basic remote write
     comm::TPUT(dstG, srcG, stagingTile);
-
-    // Remote write with atomic add
-    comm::TPUT<AtomicType::AtomicAdd>(dstG, srcG, stagingTile);
 }
 ```
 
-### Ping-pong Double Buffering
+### Remote write with atomic add
+
+```cpp
+comm::TPUT<AtomicType::AtomicAdd>(dstG, srcG, stagingTile);
+```
+
+### Ping-pong double buffering
 
 ```cpp
 constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;
 TileT pingTile(64, 64);
 TileT pongTile(64, 64);
 TASSIGN(pingTile, 0);
-TASSIGN(pongTile, tileUBBytes);  // Non-overlapping UB region
+TASSIGN(pongTile, tileUBBytes);
 
-// Overlaps TLOAD[i+1] with TSTORE[i] for better pipeline utilization
 comm::TPUT(dstG, srcG, pingTile, pongTile);
 ```
 
-### Runtime Atomic Type
+### Runtime atomic type
 
 ```cpp
 // Select atomic type at runtime instead of compile-time template parameter
 comm::TPUT(dstG, srcG, stagingTile, AtomicType::AtomicAdd);
 ```
+
+## Related Ops / Instruction Set Links
+
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Inverse operation: [TGET](./TGET.md)
+- Collective operations: [TBROADCAST](./TBROADCAST.md), [TGATHER](./TGATHER.md), [TSCATTER](./TSCATTER.md), [TREDUCE](./TREDUCE.md)
+- Instruction set: [Other and Communication](../other/README.md)
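Review note on the ping-pong constraints in TPUT.md above: the requirement that `pingTile` and `pongTile` sit at non-overlapping UB offsets can be checked with a few lines of arithmetic. The sketch below reuses the rounding from the example's `tileUBBytes` expression; `align_up`, `UbRange`, `ping_range`, `pong_range`, and `overlaps` are hypothetical helpers, and the 1024-byte granularity is taken from that example rather than from a stated hardware rule.

```cpp
#include <cstddef>

// Round `bytes` up to the next multiple of `align` (align must be non-zero).
constexpr std::size_t align_up(std::size_t bytes, std::size_t align) {
    return ((bytes + align - 1) / align) * align;
}

// Hypothetical byte range inside the Unified Buffer.
struct UbRange {
    std::size_t offset;
    std::size_t bytes;
};

// True when the two half-open ranges [offset, offset + bytes) intersect.
constexpr bool overlaps(UbRange a, UbRange b) {
    return a.offset < b.offset + b.bytes && b.offset < a.offset + a.bytes;
}

// Ping tile at offset 0, as in TASSIGN(pingTile, 0).
constexpr UbRange ping_range(std::size_t tileBytes) {
    return {0, tileBytes};
}

// Pong tile placed right after the rounded-up ping tile,
// as in TASSIGN(pongTile, tileUBBytes).
constexpr UbRange pong_range(std::size_t tileBytes, std::size_t align = 1024) {
    return {align_up(tileBytes, align), tileBytes};
}
```

For a 64 x 64 `float` tile the tile size is already a multiple of 1024, so the pong tile starts exactly where the ping tile ends.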
diff --git a/docs/isa/comm/TPUT_ASYNC.md b/docs/isa/comm/TPUT_ASYNC.md
index b6e09255..6900dd39 100644
--- a/docs/isa/comm/TPUT_ASYNC.md
+++ b/docs/isa/comm/TPUT_ASYNC.md
@@ -1,28 +1,42 @@
 # TPUT_ASYNC
 
-## Introduction
+`TPUT_ASYNC` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-`TPUT_ASYNC` is an asynchronous remote write primitive. It starts a transfer from local GM to remote GM and returns an `AsyncEvent` immediately.
+## Summary
 
-Data flow:
+Asynchronous remote write: initiates a transfer from local global memory to a remote NPU's global memory and returns an `AsyncEvent` immediately without blocking. The event is used later to wait for transfer completion.
 
-`srcGlobalData (local GM) -> DMA engine -> dstGlobalData (remote GM)`
+Two DMA engines are supported: SDMA (default, available on all targets) and URMA (hardware RDMA, available on Ascend950 / NPU_ARCH 3510 only).
 
+## Mechanism
 
-## Template Parameter
+`TPUT_ASYNC` starts a DMA transfer from local GM to remote GM and returns immediately:
 
-- `engine`:
-    - `DmaEngine::SDMA` (default)
-    - `DmaEngine::URMA` (Ascend950, NPU_ARCH 3510 only)
+```text
+srcGlobalData (local GM) → DMA engine → dstGlobalData (remote GM)
+```
+
+The `AsyncSession` manages engine-agnostic async state. After issuing one or more async operations, call `event.Wait(session)` to block until all pending operations complete (quiet semantics — a single `Wait` drains all operations issued since the last `Wait`).
+
+### Engine differences
+
+- **SDMA**: Submits data transfer SQEs; flag SQE is deferred to `Wait`, which polls for completion.
+- **URMA**: Submits an RDMA WRITE WQE and rings the doorbell immediately; `Wait` polls the Completion Queue.
+
+## Syntax
+
+### Template Parameter
 
-> **Important (SDMA path)**
-> `TPUT_ASYNC` with `DmaEngine::SDMA` currently supports **only flat contiguous logical 1D tensors**.
-> Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.
+| Value | Description |
+|-------|-------------|
+| `DmaEngine::SDMA` | Default. System DMA; available on all targets. |
+| `DmaEngine::URMA` | User-level RDMA; Ascend950 (NPU_ARCH 3510) only. |
 
+> **SDMA limitation**: Currently supports **only flat contiguous logical 1D tensors**. Non-1D or non-contiguous layouts are not supported. If this requirement is not met, the implementation returns an invalid async event (`handle == 0`).
 
 ## C++ Intrinsic
 
-Declared in `include/pto/comm/pto_comm_inst.hpp`.
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
 
 ```cpp
 template <DmaEngine engine = DmaEngine::SDMA,
@@ -31,17 +45,7 @@ PTO_INST AsyncEvent TPUT_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcG
                                const AsyncSession &session, WaitEvents &... events);
 ```
 
-`AsyncSession` is an engine-agnostic session object. Build once with
-`BuildAsyncSession<engine>()`, then pass to all async calls and event waits.
-The template `engine` parameter selects the DMA backend at compile time, making the
-code forward-compatible with future engines (CCU, etc.).
-
-## AsyncSession Construction
-
-Use `BuildAsyncSession` from `include/pto/comm/async/async_event_impl.hpp`.
-There are two overloads — one for SDMA and one for URMA — with different parameter lists.
-
-### SDMA Construction (default)
+### AsyncSession construction (SDMA)
 
 ```cpp
 template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
@@ -53,18 +57,16 @@ PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
                                     uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
 ```
 
-| Parameter | Default | Description |
-|---|---|---|
-| `scratchTile` | — | UB scratch tile for SDMA control metadata (see [scratchTile Role](#scratchtile-role)). |
-| `workspace` | — | GM pointer allocated by host-side `SdmaWorkspaceManager`. |
-| `session` | — | Output `AsyncSession` object. |
-| `syncId` | `0` | MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id. |
-| `baseConfig` | `{kDefaultSdmaBlockBytes, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`. Suitable for most single-queue transfers. |
-| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA channel group index. Default uses `get_block_idx()` internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios. |
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `scratchTile` | — | UB scratch tile for SDMA control metadata |
+| `workspace` | — | GM pointer from host-side `SdmaWorkspaceManager` |
+| `session` | — | Output `AsyncSession` object |
+| `syncId` | `0` | MTE3/MTE2 pipe sync event id (0–7) |
+| `baseConfig` | `{kDefaultSdmaBlockBytes, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}` |
+| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA channel group index; defaults to current AI core |
 
-### URMA Construction (NPU_ARCH 3510 only)
-
-> URMA (User-level RDMA Memory Access) is a hardware-accelerated RDMA transport available on Ascend950 (NPU_ARCH 3510).
+### AsyncSession construction (URMA, NPU_ARCH 3510 only)
 
 ```cpp
 #ifdef PTO_URMA_SUPPORTED
@@ -75,72 +77,68 @@ PTO_INTERNAL bool BuildAsyncSession(__gm__ uint8_t *workspace,
 #endif
 ```
 
-| Parameter | Description |
-|---|---|
-| `workspace` | GM pointer allocated by host-side `UrmaWorkspaceManager`. |
-| `destRankId` | Remote PE rank id that this session communicates with. For `TPUT_ASYNC` this is the destination rank. |
-| `session` | Output `AsyncSession` object. |
-
-URMA does not require `scratchTile` — polling uses `ld_dev`/`st_dev` hardware intrinsics directly.
+| Parameter | Description |
+|-----------|-------------|
+| `workspace` | GM pointer from host-side `UrmaWorkspaceManager` |
+| `destRankId` | Destination rank id (remote NPU for `TPUT_ASYNC`) |
+| `session` | Output `AsyncSession` object |
 
-## Constraints
+## Inputs
 
-- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
-- `GlobalSrcData::layout == GlobalDstData::layout`
-- Both SDMA and URMA paths require source tensor to be **flat contiguous logical 1D only**
-- SDMA workspace must be a valid GM pointer allocated by host-side `SdmaWorkspaceManager`
-- URMA workspace must be a valid GM pointer allocated by host-side `UrmaWorkspaceManager`
-- URMA is only available on NPU_ARCH 3510 (Ascend950)
-- The symmetric data buffer passed to `UrmaWorkspaceManager::Init()` must be backed by huge-page memory (allocate with `ACL_MEM_MALLOC_HUGE_ONLY`). The underlying MR registration requires huge-page backing; `ACL_MEM_MALLOC_HUGE_FIRST` may silently fall back to 4KB pages for small allocations, causing registration to fail
+| Operand | Type | Description |
+|---------|------|-------------|
+| `dstGlobalData` | `GlobalTensor` | Remote destination; must be flat contiguous 1D |
+| `srcGlobalData` | `GlobalTensor` | Local source; must be flat contiguous 1D |
+| `session` | `AsyncSession` | Engine-agnostic session object |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before issuing the put |
 
-If the 1D contiguous requirement is not met, current implementation returns an invalid async event (`handle == 0`).
+## Expected Outputs
 
-## scratchTile Role
+| Result | Type | Description |
+|--------|------|-------------|
+| `AsyncEvent` | event | Handle for later `Wait` call; drain via `event.Wait(session)` |
 
-`scratchTile` is **not** the payload staging buffer for user data.
-It is converted to `TmpBuffer` and used as temporary UB workspace for:
+## Side Effects
 
-- writing/reading SDMA control words (flag, sq_tail, channel_info)
-- polling event completion flags
-- committing queue tail during completion
+This operation initiates a DMA transfer from local global memory to remote global memory. Completion is deferred to the `Wait` call.
 
-Data payload moves between GM buffers directly; `scratchTile` only supports control and synchronization metadata.
+## Constraints
 
-## scratchTile Type and Size Constraints
+### Type constraints
 
-- must be a `pto::Tile` type
-- must be UB/Vec tile (`ScratchTile::Loc == TileType::Vec`)
-- available bytes must be at least `sizeof(uint64_t)` (8 bytes)
+- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
+- `GlobalSrcData::layout == GlobalDstData::layout`
+- Both SDMA and URMA require **flat contiguous logical 1D tensors** only.
 
-Recommended: `Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>` (256B).
+### Memory constraints
 
-## Completion Semantics (Quiet Semantics)
+- SDMA: `workspace` must be allocated by host-side `SdmaWorkspaceManager`.
+- URMA: `workspace` must be allocated by host-side `UrmaWorkspaceManager`; the buffer must be backed by huge-page memory (`ACL_MEM_MALLOC_HUGE_ONLY`).
 
-The completion mechanism differs by engine, but user-facing quiet semantics are identical:
+### Platform constraints
 
-- **SDMA**: `TPUT_ASYNC` only submits data transfer SQEs. The flag SQE is deferred to `Wait`, which polls the flag for completion.
-- **URMA**: `TPUT_ASYNC` submits an RDMA WRITE WQE and rings the doorbell immediately. `Wait` polls the Completion Queue (CQ) until all expected CQEs have been consumed.
+- URMA is available on NPU_ARCH 3510 (Ascend950) only.
 
-- `event.Wait(session)` — blocks until **all async operations issued since the last Wait** are complete
+## scratchTile Role (SDMA)
 
-This means after multiple `TPUT_ASYNC` calls, a single `Wait` on the last returned `AsyncEvent` drains all pending operations (similar to shmem's quiet semantics).
+`scratchTile` does **not** hold payload data. It is converted to `TmpBuffer` and used as temporary UB workspace for writing and reading SDMA control words (flag, sq_tail, channel_info), polling completion flags, and committing the queue tail. The payload path is always local GM → DMA engine → remote GM.
 
-After wait succeeds, all issued writes to `dstGlobalData` are complete.
+Requirements: `scratchTile` must be a `pto::Tile` located in UB (`ScratchTile::Loc == TileType::Vec`) with at least `sizeof(uint64_t)` (8 bytes) available. Recommended: `Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>` (256B).
 
-## Example
+## Target-Profile Restrictions
 
-### Single Transfer
+- SDMA is available on all NPU targets. URMA is Ascend950 (NPU_ARCH 3510) only.
+- CPU simulation does not support async communication operations.
+- The `AsyncSession` is engine-agnostic, but the engine is selected by a compile-time template parameter, so switching engines requires recompilation.
 
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
+## Examples
 
-using namespace pto;
+### Single transfer (SDMA)
 
+```cpp
 template <typename T>
 __global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
-                                 __gm__ uint8_t *sdmaWorkspace)
-{
+                                 __gm__ uint8_t *sdmaWorkspace) {
     using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
     using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
     using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
@@ -155,75 +153,40 @@ __global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
     TASSIGN(scratchTile, 0x0);
 
     comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
+    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session))
         return;
-    }
 
     auto event = comm::TPUT_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
     (void)event.Wait(session);
 }
 ```
 
-### Batch Transfer (Quiet Semantics)
+### Batch transfer — quiet semantics
 
 ```cpp
-template <typename T>
-__global__ AICORE void BatchPut(__gm__ T *remoteDstBase, __gm__ T *localSrc,
-                                __gm__ uint8_t *sdmaWorkspace, int nranks)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT srcG(localSrc, shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    comm::AsyncEvent lastEvent;
-    for (int rank = 0; rank < nranks; ++rank) {
-        GT dstG(remoteDstBase + rank * 1024, shape, stride);
-        lastEvent = comm::TPUT_ASYNC(dstG, srcG, session);
-    }
-    (void)lastEvent.Wait(session);  // single Wait drains all pending ops
+comm::AsyncEvent lastEvent;  // GT, shape, stride, session: set up as in the single-transfer example
+for (int rank = 0; rank < nranks; ++rank) {
+    GT dstG(remoteDstBase + rank * 1024, shape, stride);
+    GT srcG(localSrcBase + rank * 1024, shape, stride);
+    lastEvent = comm::TPUT_ASYNC(dstG, srcG, session);
 }
+(void)lastEvent.Wait(session);  // single Wait drains all pending ops
 ```
 
-### URMA Example (NPU_ARCH 3510)
+### URMA (Ascend950)
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
-
-using namespace pto;
+comm::AsyncSession session;
+if (!comm::BuildAsyncSession<comm::DmaEngine::URMA>(urmaWorkspace, destRankId, session))
+    return;
 
-template <typename T>
-__global__ AICORE void SimplePutUrma(__gm__ T *remoteDst, __gm__ T *localSrc,
-                                     __gm__ uint8_t *urmaWorkspace, uint32_t destRankId)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(remoteDst, shape, stride);
-    GT srcG(localSrc, shape, stride);
+auto event = comm::TPUT_ASYNC<comm::DmaEngine::URMA>(dstG, srcG, session);
+(void)event.Wait(session);
+```
 
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::URMA>(urmaWorkspace, destRankId, session)) {
-        return;
-    }
+## Related Ops / Instruction Set Links
 
-    auto event = comm::TPUT_ASYNC<comm::DmaEngine::URMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Synchronous counterpart: [TPUT](./TPUT.md)
+- Async read: [TGET_ASYNC](./TGET_ASYNC.md)
+- Instruction set: [Other and Communication](../other/README.md)
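Review note on the quiet semantics described in TPUT_ASYNC.md above: the rule that one `Wait` drains every operation issued since the previous `Wait` can be captured in a small host-side model. `SessionModel` and its members are hypothetical and do not correspond to the real `AsyncSession` or `AsyncEvent` types; the model only mirrors two documented behaviors, the drain-all `Wait` and the invalid handle (`0`) returned when the tensor is not flat contiguous 1D.

```cpp
#include <cstdint>

// Hypothetical bookkeeping model of an async session; no data is moved.
class SessionModel {
public:
    // Issue one async put. Returns a non-zero handle on success, or 0 when
    // the tensor does not meet the flat contiguous 1D requirement.
    std::uint64_t issue(bool flatContiguous1D) {
        if (!flatContiguous1D) {
            return 0;  // invalid event, nothing was queued
        }
        ++pending_;
        return ++nextHandle_;
    }

    // Quiet semantics: complete everything issued since the last wait and
    // report how many operations were drained.
    std::uint32_t wait() {
        const std::uint32_t drained = pending_;
        pending_ = 0;
        return drained;
    }

    std::uint32_t pending() const { return pending_; }

private:
    std::uint64_t nextHandle_ = 0;
    std::uint32_t pending_ = 0;
};
```

In the batch example this corresponds to calling `Wait` once on the last returned event instead of once per rank.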
diff --git a/docs/isa/comm/TPUT_ASYNC_zh.md b/docs/isa/comm/TPUT_ASYNC_zh.md
index a314b070..16e2aab8 100644
--- a/docs/isa/comm/TPUT_ASYNC_zh.md
+++ b/docs/isa/comm/TPUT_ASYNC_zh.md
@@ -1,224 +1,276 @@
-# TPUT_ASYNC
-
-## 简介
-
-`TPUT_ASYNC` 是异步远程写原语。它启动一次从本地 GM 到远端 GM 的传输，并立即返回 `AsyncEvent`。
-
-数据流：
-
-`srcGlobalData（本地 GM）` → DMA 引擎 → `dstGlobalData（远端 GM）`
-
-## 模板参数
-
-- `engine`：
-    - `DmaEngine::SDMA`（默认）
-    - `DmaEngine::URMA`（Ascend950，仅 NPU_ARCH 3510）
-
-> **注意（SDMA 路径）**
-> `TPUT_ASYNC` 配合 `DmaEngine::SDMA` 目前**仅支持扁平连续的逻辑一维 tensor**。
-> 当前 SDMA 异步实现不支持非一维或非连续布局。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA,
-          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
-PTO_INST AsyncEvent TPUT_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                               const AsyncSession &session, WaitEvents &... events);
-```
-
-`AsyncSession` 是引擎无关的会话对象。使用 `BuildAsyncSession<engine>()` 构建一次后，传递给所有异步调用和事件等待。模板参数 `engine` 在编译期选择 DMA 后端，使代码对未来引擎（CCU 等）保持前向兼容。
-
-## AsyncSession 构建
-
-使用 `include/pto/comm/async/async_event_impl.hpp` 中的 `BuildAsyncSession`。
-该函数有两个重载——分别用于 SDMA 和 URMA，参数列表不同。
-
-### SDMA 构建（默认）
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
-PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
-                                    __gm__ uint8_t *workspace,
-                                    AsyncSession &session,
-                                    uint32_t syncId = 0,
-                                    const sdma::SdmaBaseConfig &baseConfig = {sdma::kDefaultSdmaBlockBytes, 0, 1},
-                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
-```
-
-| 参数 | 默认值 | 说明 |
-|---|---|---|
-| `scratchTile` | — | 用于 SDMA 控制元数据的 UB scratch tile（参见 [scratchTile 的作用](#scratchtile-的作用)）。|
-| `workspace` | — | 由主机侧 `SdmaWorkspaceManager` 分配的 GM 指针。|
-| `session` | — | 输出的 `AsyncSession` 对象。|
-| `syncId` | `0` | MTE3/MTE2 管道同步事件 ID（0-7）。若 kernel 在相同 ID 上使用了其他管道屏障，则需覆盖此值。|
-| `baseConfig` | `{kDefaultSdmaBlockBytes, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`。适用于大多数单队列传输场景。|
-| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA 通道组索引。默认内部使用 `get_block_idx()` 映射到当前 AI Core。多 block 或自定义通道映射场景下需覆盖此值。|
-
-### URMA 构建（仅 NPU_ARCH 3510）
-
-> URMA（User-level RDMA Memory Access）是 Ascend950（NPU_ARCH 3510）上的硬件加速 RDMA 传输引擎。
-
-```cpp
-#ifdef PTO_URMA_SUPPORTED
-template <DmaEngine engine>
-PTO_INTERNAL bool BuildAsyncSession(__gm__ uint8_t *workspace,
-                                    uint32_t destRankId,
-                                    AsyncSession &session);
-#endif
-```
-
-| 参数 | 说明 |
-|---|---|
-| `workspace` | 由主机侧 `UrmaWorkspaceManager` 分配的 GM 指针。|
-| `destRankId` | 此会话通信的远端 PE rank id。对于 `TPUT_ASYNC`，这是数据写入的目标 rank。|
-| `session` | 输出的 `AsyncSession` 对象。|
-
-URMA 不需要 `scratchTile`——轮询通过 `ld_dev`/`st_dev` 硬件原语直接操作。
-
-## 约束
-
-- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
-- `GlobalSrcData::layout == GlobalDstData::layout`
-- SDMA 和 URMA 路径均要求源 tensor 为**扁平连续的逻辑一维**
-- SDMA workspace 必须是由主机侧 `SdmaWorkspaceManager` 分配的有效 GM 指针
-- URMA workspace 必须是由主机侧 `UrmaWorkspaceManager` 分配的有效 GM 指针
-- URMA 仅在 NPU_ARCH 3510（Ascend950）上可用
-- 传给 `UrmaWorkspaceManager::Init()` 的对称数据缓冲区必须由大页内存支撑（使用 `ACL_MEM_MALLOC_HUGE_ONLY` 分配）。底层 MR 注册要求大页背景；`ACL_MEM_MALLOC_HUGE_FIRST` 在小尺寸分配时可能静默回退到 4KB 小页，导致注册失败
-
-若不满足一维连续要求，当前实现返回无效 async event（`handle == 0`）。
-
-## scratchTile 的作用
-
-`scratchTile` **不是**用于存放用户数据负载的暂存缓冲区。
-它被转换为 `TmpBuffer`，用作临时 UB 工作区，用于：
-
-- 写入/读取 SDMA 控制字（flag、sq_tail、channel_info）
-- 轮询事件完成标志
-- 完成时提交队列尾部
-
-实际数据负载直接在 GM 缓冲区之间传输；`scratchTile` 仅用于控制和同步元数据。
-
-## scratchTile 类型与大小约束
-
-- 必须是 `pto::Tile` 类型
-- 必须是 UB/Vec tile（`ScratchTile::Loc == TileType::Vec`）
-- 可用字节数至少为 `sizeof(uint64_t)`（8 字节）
-
-推荐使用：`Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>`（256B）。
-
-## 完成语义（Quiet 语义）
-
-不同引擎的底层完成机制不同，但用户侧的 quiet 语义行为一致：
-
-- **SDMA**：`TPUT_ASYNC` 仅提交数据传输 SQE，flag SQE 延迟到 `Wait` 时提交，通过轮询 flag 判断完成。
-- **URMA**：`TPUT_ASYNC` 立即提交 RDMA WRITE WQE 并敲门铃。`Wait` 通过轮询 Completion Queue（CQ）等待所有预期的 CQE 被消费。
-
-- `event.Wait(session)` — 阻塞，直到**自上次 Wait 以来所有已发出的异步操作**全部完成
-
-这意味着多次 `TPUT_ASYNC` 调用后，只需对最后一个返回的 `AsyncEvent` 调用一次 `Wait`，即可等待所有 pending 操作完成（类似 shmem 的 quiet 语义）。
-
-wait 成功后，所有已发出的 `dstGlobalData` 写入均已全部完成。
-
-## 示例
-
-### 单次传输
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
-
-using namespace pto;
-
-template <typename T>
-__global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
-                                 __gm__ uint8_t *sdmaWorkspace)
-{
-    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(remoteDst, shape, stride);
-    GT srcG(localSrc,  shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    auto event = comm::TPUT_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
-
-### 批量传输（Quiet 语义）
-
-```cpp
-template <typename T>
-__global__ AICORE void BatchPut(__gm__ T *remoteDstBase, __gm__ T *localSrc,
-                                __gm__ uint8_t *sdmaWorkspace, int nranks)
-{
-    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT srcG(localSrc, shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    comm::AsyncEvent lastEvent;
-    for (int rank = 0; rank < nranks; ++rank) {
-        GT dstG(remoteDstBase + rank * 1024, shape, stride);
-        lastEvent = comm::TPUT_ASYNC(dstG, srcG, session);
-    }
-    (void)lastEvent.Wait(session);  // 一次 Wait 等待所有 pending 操作
-}
-```
-
-### URMA 示例（NPU_ARCH 3510）
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
-
-using namespace pto;
-
-template <typename T>
-__global__ AICORE void SimplePutUrma(__gm__ T *remoteDst, __gm__ T *localSrc,
-                                     __gm__ uint8_t *urmaWorkspace, uint32_t destRankId)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(remoteDst, shape, stride);
-    GT srcG(localSrc, shape, stride);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::URMA>(urmaWorkspace, destRankId, session)) {
-        return;
-    }
-
-    auto event = comm::TPUT_ASYNC<comm::DmaEngine::URMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
+# TPUT_ASYNC
+
+## 概述
+
+`TPUT_ASYNC` 是异步远程写原语：启动从本地 GM 到远端 GM 的 DMA 传输后立即返回 `AsyncEvent`，不阻塞。稍后通过事件等待传输完成。
+
+支持两种 DMA 引擎：SDMA（默认，所有目标均可用）和 URMA（硬件 RDMA，仅 Ascend950 / NPU_ARCH 3510 可用）。
+
+## 机制
+
+`TPUT_ASYNC` 启动从本地 GM 到远端 GM 的 DMA 传输后立即返回：
+
+```
+srcGlobalData（本地 GM） → DMA 引擎 → dstGlobalData（远端 GM）
+```
+
+`AsyncSession` 管理引擎无关的异步状态。发出一个或多个异步操作后，调用 `event.Wait(session)` 阻塞直到所有 pending 操作完成（quiet 语义）。
+
+不同引擎的完成机制：
+
+- **SDMA**：只提交数据传输 SQE，flag SQE 延迟到 `Wait` 时提交，通过轮询 flag 判断完成。
+- **URMA**：立即提交 RDMA WRITE WQE 并敲门铃，`Wait` 轮询 Completion Queue 直到所有预期的 CQE 被消费。
+
+## 模板参数
+
+- `engine`：
+    - `DmaEngine::SDMA`（默认）
+    - `DmaEngine::URMA`（Ascend950，仅 NPU_ARCH 3510）
+
+> **注意（SDMA 路径）**
+> `TPUT_ASYNC` 配合 `DmaEngine::SDMA` 目前**仅支持扁平连续的逻辑一维 tensor**。
+> 当前 SDMA 异步实现不支持非一维或非连续布局。
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA,
+          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
+PTO_INST AsyncEvent TPUT_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                               const AsyncSession &session, WaitEvents &... events);
+```
+
+`AsyncSession` 是引擎无关的会话对象。使用 `BuildAsyncSession<engine>()` 构建一次后，传递给所有异步调用和事件等待。模板参数 `engine` 在编译期选择 DMA 后端。
+
+## 输入
+
+| 操作数 | 类型 | 说明 |
+|--------|------|------|
+| `dstGlobalData` | `GlobalTensor` | 远端目标，必须为扁平连续一维 |
+| `srcGlobalData` | `GlobalTensor` | 本地源，必须为扁平连续一维 |
+| `session` | `AsyncSession` | 引擎无关的会话对象 |
+| `WaitEvents...` | `RecordEvent...` | 发指令前要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+|------|------|------|
+| `AsyncEvent` | event | 后续用于 `Wait` 调用的句柄 |
+
+## 副作用
+
+此操作启动从本地 GM 到远端 GM 的 DMA 传输。完成时机延后到 `Wait` 调用。
+
+## AsyncSession 构建
+
+使用 `include/pto/comm/async/async_event_impl.hpp` 中的 `BuildAsyncSession`。
+该函数有两个重载——分别用于 SDMA 和 URMA，参数列表不同。
+
+### SDMA 构建（默认）
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
+PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
+                                    __gm__ uint8_t *workspace,
+                                    AsyncSession &session,
+                                    uint32_t syncId = 0,
+                                    const sdma::SdmaBaseConfig &baseConfig = {sdma::kDefaultSdmaBlockBytes, 0, 1},
+                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
+```
+
+| 参数 | 默认值 | 说明 |
+|---|---|---|
+| `scratchTile` | — | 用于 SDMA 控制元数据的 UB scratch tile（参见 [scratchTile 的作用](#scratchtile-的作用)）。|
+| `workspace` | — | 由主机侧 `SdmaWorkspaceManager` 分配的 GM 指针。|
+| `session` | — | 输出的 `AsyncSession` 对象。|
+| `syncId` | `0` | MTE3/MTE2 管道同步事件 ID（0-7）。若 kernel 在相同 ID 上使用了其他管道屏障，则需覆盖此值。|
+| `baseConfig` | `{kDefaultSdmaBlockBytes, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`。适用于大多数单队列传输场景。|
+| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA 通道组索引。默认内部使用 `get_block_idx()` 映射到当前 AI Core。多 block 或自定义通道映射场景下需覆盖此值。|
+
+### URMA 构建（仅 NPU_ARCH 3510）
+
+> URMA（User-level RDMA Memory Access）是 Ascend950（NPU_ARCH 3510）上的硬件加速 RDMA 传输引擎。
+
+```cpp
+#ifdef PTO_URMA_SUPPORTED
+template <DmaEngine engine>
+PTO_INTERNAL bool BuildAsyncSession(__gm__ uint8_t *workspace,
+                                    uint32_t destRankId,
+                                    AsyncSession &session);
+#endif
+```
+
+| 参数 | 说明 |
+|---|---|
+| `workspace` | 由主机侧 `UrmaWorkspaceManager` 分配的 GM 指针。|
+| `destRankId` | 此会话通信的远端 PE rank id。对于 `TPUT_ASYNC`，这是数据写入的目标 rank。|
+| `session` | 输出的 `AsyncSession` 对象。|
+
+URMA 不需要 `scratchTile`——轮询通过 `ld_dev`/`st_dev` 硬件原语直接操作。
+
+## 约束
+
+### 类型约束
+
+- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
+- `GlobalSrcData::layout == GlobalDstData::layout`
+
+### 传输约束
+
+- SDMA 和 URMA 路径均要求源 tensor 为**扁平连续的逻辑一维**
+- 若不满足一维连续要求，当前实现返回无效 async event（`handle == 0`）
+
+### 内存约束
+
+- SDMA workspace 必须是由主机侧 `SdmaWorkspaceManager` 分配的有效 GM 指针
+- URMA workspace 必须是由主机侧 `UrmaWorkspaceManager` 分配的有效 GM 指针
+- URMA 仅在 NPU_ARCH 3510（Ascend950）上可用
+- 传给 `UrmaWorkspaceManager::Init()` 的对称数据缓冲区必须由大页内存支撑（使用 `ACL_MEM_MALLOC_HUGE_ONLY` 分配）
+
+## scratchTile 的作用
+
+`scratchTile` **不是**用于存放用户数据负载的暂存缓冲区。
+它被转换为 `TmpBuffer`，用作临时 UB 工作区，用于：
+
+- 写入/读取 SDMA 控制字（flag、sq_tail、channel_info）
+- 轮询事件完成标志
+- 完成时提交队列尾部
+
+实际数据负载直接在 GM 缓冲区之间传输；`scratchTile` 仅用于控制和同步元数据。
+
+## scratchTile 类型与大小约束
+
+- 必须是 `pto::Tile` 类型
+- 必须是 UB/Vec tile（`ScratchTile::Loc == TileType::Vec`）
+- 可用字节数至少为 `sizeof(uint64_t)`（8 字节）
+
+推荐使用：`Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>`（256B）。
+
+## 完成语义（Quiet 语义）
+
+不同引擎的底层完成机制不同，但用户侧的 quiet 语义行为一致：
+
+- **SDMA**：`TPUT_ASYNC` 仅提交数据传输 SQE，flag SQE 延迟到 `Wait` 时提交，通过轮询 flag 判断完成。
+- **URMA**：`TPUT_ASYNC` 立即提交 RDMA WRITE WQE 并敲门铃。`Wait` 通过轮询 Completion Queue（CQ）等待所有预期的 CQE 被消费。
+
+- `event.Wait(session)` — 阻塞，直到**自上次 Wait 以来所有已发出的异步操作**全部完成
+
+这意味着多次 `TPUT_ASYNC` 调用后，只需对最后一个返回的 `AsyncEvent` 调用一次 `Wait`，即可等待所有 pending 操作完成（类似 shmem 的 quiet 语义）。
+
+wait 成功后，所有已发出的 `dstGlobalData` 写入均已全部完成。
+
+## 示例
+
+### 单次传输
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/common/pto_tile.hpp>
+
+using namespace pto;
+
+template <typename T>
+__global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
+                                 __gm__ uint8_t *sdmaWorkspace)
+{
+    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+    GT dstG(remoteDst, shape, stride);
+    GT srcG(localSrc,  shape, stride);
+
+    ScratchTile scratchTile;
+    TASSIGN(scratchTile, 0x0);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
+        return;
+    }
+
+    auto event = comm::TPUT_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
+    (void)event.Wait(session);
+}
+```
+
+### 批量传输（Quiet 语义）
+
+```cpp
+template <typename T>
+__global__ AICORE void BatchPut(__gm__ T *remoteDstBase, __gm__ T *localSrc,
+                                __gm__ uint8_t *sdmaWorkspace, int nranks)
+{
+    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+    GT srcG(localSrc, shape, stride);
+
+    ScratchTile scratchTile;
+    TASSIGN(scratchTile, 0x0);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
+        return;
+    }
+
+    comm::AsyncEvent lastEvent;
+    for (int rank = 0; rank < nranks; ++rank) {
+        GT dstG(remoteDstBase + rank * 1024, shape, stride);
+        lastEvent = comm::TPUT_ASYNC(dstG, srcG, session);
+    }
+    (void)lastEvent.Wait(session);  // 一次 Wait 等待所有 pending 操作
+}
+```
+
+### URMA 示例（NPU_ARCH 3510）
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/common/pto_tile.hpp>
+
+using namespace pto;
+
+template <typename T>
+__global__ AICORE void SimplePutUrma(__gm__ T *remoteDst, __gm__ T *localSrc,
+                                     __gm__ uint8_t *urmaWorkspace, uint32_t destRankId)
+{
+    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+    GT dstG(remoteDst, shape, stride);
+    GT srcG(localSrc, shape, stride);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession<comm::DmaEngine::URMA>(urmaWorkspace, destRankId, session)) {
+        return;
+    }
+
+    auto event = comm::TPUT_ASYNC<comm::DmaEngine::URMA>(dstG, srcG, session);
+    (void)event.Wait(session);
+}
+```
+
+## 目标Profile限制
+
+- SDMA 在所有目标上可用。URMA 仅在 Ascend950（NPU_ARCH 3510）上可用。
+- CPU 模拟器不支持异步通信操作。
+- `AsyncSession` 是引擎无关的；切换引擎需要重新编译。
+
+## 相关页面
+
+- 通信概述：[通信与运行时](../other/communication-and-runtime_zh.md)
+- 同步对应：[TPUT](./TPUT_zh.md)
+- 异步读：[TGET_ASYNC](./TGET_ASYNC_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
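
Note on the `TPUT_ASYNC` completion semantics documented above: a minimal host-side sketch of the quiet behaviour, where one `Wait` completes every transfer issued since the previous `Wait`, and a non-contiguous request yields an invalid event (`handle == 0`). `MockSession`, `MockEvent`, and `MockTensor1D` are hypothetical stand-ins written for this illustration. They are not PTO types, and the sketch does not model SDMA SQEs or URMA CQEs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side toy model only; none of these types belong to the PTO API.
struct MockEvent {
    std::uint64_t handle;                    // 0 marks an invalid event
    bool Valid() const { return handle != 0; }
};

struct MockTensor1D {
    std::vector<int>* data;                  // backing buffer (stands in for GM)
    std::size_t count;                       // number of elements
    std::size_t stride;                      // element stride; 1 means flat contiguous
};

class MockSession {
public:
    // Queue a copy and return immediately; nothing is written yet.
    MockEvent PutAsync(MockTensor1D dst, MockTensor1D src) {
        assert(dst.data != nullptr && src.data != nullptr);
        if (dst.stride != 1 || src.stride != 1 || dst.count != src.count) {
            return MockEvent{};              // rejected: handle == 0, nothing queued
        }
        pending_.push_back({dst, src});
        return MockEvent{++nextHandle_};
    }

    // Quiet semantics: complete every transfer issued since the last Wait.
    bool Wait(const MockEvent& ev) {
        if (!ev.Valid()) {
            return false;
        }
        for (const Transfer& t : pending_) {
            for (std::size_t i = 0; i < t.src.count; ++i) {
                (*t.dst.data)[i] = (*t.src.data)[i];
            }
        }
        pending_.clear();
        return true;
    }

    std::size_t Pending() const { return pending_.size(); }

private:
    struct Transfer {
        MockTensor1D dst;
        MockTensor1D src;
    };
    std::vector<Transfer> pending_;
    std::uint64_t nextHandle_ = 0;
};
```

In this model, issuing two `PutAsync` calls and waiting only on the second event still fills both destinations, which mirrors the "wait on the last event" pattern of the batch example. Checking `Valid()` before `Wait` is the analogue of testing for `handle == 0`.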
diff --git a/docs/isa/comm/TPUT_zh.md b/docs/isa/comm/TPUT_zh.md
index 05fe864c..b8a2c8c3 100644
--- a/docs/isa/comm/TPUT_zh.md
+++ b/docs/isa/comm/TPUT_zh.md
@@ -1,18 +1,14 @@
 # TPUT
 
-## 简介
+## 概述
 
 `TPUT` 是远程写原语：把当前 NPU 本地 GM 中的数据写到远端 NPU 的 GM。它通过 UB 中的暂存 Tile 完成 GM→UB→GM 路径。
 
 当 `GlobalTensor` 的行或列超出单个 UB Tile 容量时，`TPUT` 会自动沿 `DIM_3` 和 `DIM_4` 做二维滑动分块。
 
-## 数学语义
+只有本地 NPU 执行 TPUT；远端 NPU 是被动的。
 
-对有效区域内每个元素 `(i, j)`：
-
-$$ \mathrm{dst}^{\mathrm{remote}}_{i,j} = \mathrm{src}^{\mathrm{local}}_{i,j} $$
-
-## 汇编语法
+## 语法
 
 PTO-AS 形式：
 
@@ -52,6 +48,34 @@ PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobal
                           TileData &stagingTileData, AtomicType atomicType, WaitEvents&... events);
 ```
 
+### 原子类型
+
+| 值 | 行为 |
+|----|------|
+| `AtomicType::AtomicNone` | 直接写入，无原子语义 |
+| `AtomicType::AtomicAdd` | 原子地将源值加到目标地址 |
+
+## 输入
+
+| 操作数 | 类型 | 描述 |
+|--------|------|------|
+| `dstGlobalData` | `GlobalTensor` | 远端目标，必须指向目标 NPU 的 GM |
+| `srcGlobalData` | `GlobalTensor` | 本地源，必须指向当前 NPU 的 GM |
+| `stagingTileData` | `Tile` | UB 暂存 Tile，用于 GM→UB→GM 传输路径 |
+| `pingTile` / `pongTile` | `Tile` | 用于乒乓双缓冲的两个 UB 暂存 Tile |
+| `atomicType` | `AtomicType` | 原子操作模式（可选，默认为 `AtomicNone`） |
+| `WaitEvents...` | `RecordEvent...` | 在发起 PUT 前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 描述 |
+|------|------|------|
+| `RecordEvent` | event | 标记远程写入完成的事件令牌 |
+
+## 副作用
+
+此操作从本地 GM 读取数据并写入远端 GM。它通过返回的事件令牌建立同步边界。
+
 ## 约束
 
 ### 类型约束
@@ -66,12 +90,27 @@ PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobal
 - `srcGlobalData` 必须指向本地地址（当前 NPU）
 - `stagingTileData`、`pingTile`、`pongTile` 必须预先在 UB 中分配
 
-### 原子与双缓冲约束
+### 传输约束
+
+- 传输大小由 `GlobalTensor` 的 shape 决定；自动分块以适配 UB 暂存缓冲区
+- 自动分块时，行（`DIM_3`）和列（`DIM_4`）会根据需要细分
+
+### 原子约束
+
+- `atomicType` 支持 `AtomicNone` 和 `AtomicAdd`
+
+### 乒乓约束
 
-- 当前接口支持 `AtomicNone` 与 `AtomicAdd`
 - `pingTile` 与 `pongTile` 的类型和维度必须一致
 - 两者必须位于不重叠的 UB 偏移处
 
+## 目标Profile限制
+
+- 点对点通信仅在 A2/A3 和 A5 Profile 上支持。CPU 模拟不支持远程内存访问。
+- 传输大张量时请使用乒乓双缓冲，以重叠连续传输，提高流水线利用率。
+- `TPUT` 需要有效的远端 GM 地址；远端 NPU 必须已分配对应的内存区域。
+- `AtomicAdd` 可用于分布式归约累积模式（如梯度聚合）。
+
 ## 示例
 
 ### 基础形式
@@ -120,5 +159,6 @@ comm::TPUT(dstG, srcG, stagingTile, AtomicType::AtomicAdd);
 ## 相关页面
 
 - [通信与运行时](../other/communication-and-runtime_zh.md)
-- [TGET](./TGET_zh.md)
-- [TSCATTER](./TSCATTER_zh.md)
+- 逆操作：[TGET](./TGET_zh.md)
+- 集合通信：[TBROADCAST](./TBROADCAST_zh.md)、[TGATHER](./TGATHER_zh.md)、[TSCATTER](./TSCATTER_zh.md)、[TREDUCE](./TREDUCE_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
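
Note on the `TPUT` transfer path documented above: a minimal host-side sketch of the GM→UB→GM route with 2D sliding over rows (`DIM_3`) and columns (`DIM_4`), plus the difference between a plain write and an accumulating write. `StagedPut` and `MockAtomic` are hypothetical names for this illustration, not the PTO intrinsic or `AtomicType`. The sketch assumes partial edge chunks are allowed, and it models `AtomicAdd` as a plain read-modify-write with no real atomicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for AtomicType; not the PTO enum.
enum class MockAtomic { None, Add };

// Copy a rows x cols row-major matrix from src to dst through a staging
// buffer holding at most tileRows x tileCols elements.
// Returns the number of chunks that were moved.
inline std::size_t StagedPut(std::vector<float>& dst, const std::vector<float>& src,
                             std::size_t rows, std::size_t cols,
                             std::size_t tileRows, std::size_t tileCols,
                             MockAtomic mode) {
    assert(tileRows > 0 && tileCols > 0);
    assert(src.size() == rows * cols && dst.size() == rows * cols);

    std::vector<float> staging(tileRows * tileCols);  // models the UB staging tile
    std::size_t chunks = 0;

    for (std::size_t r0 = 0; r0 < rows; r0 += tileRows) {
        const std::size_t h = std::min(tileRows, rows - r0);      // rows in this chunk
        for (std::size_t c0 = 0; c0 < cols; c0 += tileCols) {
            const std::size_t w = std::min(tileCols, cols - c0);  // columns in this chunk

            // Step 1: source GM -> staging tile.
            for (std::size_t r = 0; r < h; ++r) {
                for (std::size_t c = 0; c < w; ++c) {
                    staging[r * tileCols + c] = src[(r0 + r) * cols + (c0 + c)];
                }
            }
            // Step 2: staging tile -> destination GM (overwrite or accumulate).
            for (std::size_t r = 0; r < h; ++r) {
                for (std::size_t c = 0; c < w; ++c) {
                    float& out = dst[(r0 + r) * cols + (c0 + c)];
                    const float v = staging[r * tileCols + c];
                    out = (mode == MockAtomic::Add) ? out + v : v;
                }
            }
            ++chunks;
        }
    }
    return chunks;
}
```

With a 5 x 7 tensor and a 2 x 4 staging tile, this loop visits 3 row chunks times 2 column chunks. A ping-pong variant would alternate two staging buffers so that step 1 of one chunk can overlap step 2 of the previous one; that overlap is not modelled here.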
diff --git a/docs/isa/comm/TREDUCE.md b/docs/isa/comm/TREDUCE.md
index 577beb46..f46133fe 100644
--- a/docs/isa/comm/TREDUCE.md
+++ b/docs/isa/comm/TREDUCE.md
@@ -1,70 +1,116 @@
 ﻿# TREDUCE
 
-## Introduction
+`TREDUCE` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-Reduce operation: gather data from multiple remote NPUs and perform element-wise reduction locally.
+## Summary
 
+Collective reduction: gather data from all ranks in a parallel group and perform element-wise reduction locally. Only the root NPU executes `TREDUCE`; non-root ranks only ensure their source buffers are ready. Executing `TREDUCE` on a non-root rank has undefined behavior.
 
-Only the root needs to execute `TREDUCE`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `TREDUCE` on non-root ranks is undefined behavior.
+When the GlobalTensor exceeds the UB tile capacity, the reduction is automatically chunked via 2D sliding.
 
-**Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the reduction is automatically chunked via 2D sliding.
+## Mechanism
 
-## Math Interpretation
-
-For each element `(i, j)` in the valid region:
+`TREDUCE` gathers source data from all ranks in the parallel group and reduces it into the root NPU's destination buffer. For each element `(i, j)` in the valid region:
 
 $$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \bigoplus_{r=0}^{N-1} \mathrm{src}^{(r)}_{i,j} $$
 
-where $N$ is the number of ranks and $\oplus$ is the reduction operation (sum, max, min, etc.).
+where $N$ is the number of ranks and $\oplus$ is the reduction operator.
+
+### Supported reduction operators
 
-## Assembly Syntax
+| Operator | Symbol | Notes |
+|----------|--------|-------|
+| Sum | `Sum` | Additive reduction |
+| Max | `Max` | Maximum |
+| Min | `Min` | Minimum |
 
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
+## Syntax
 
-Synchronous form:
+### PTO Assembly Form
 
 ```text
 treduce %group, %dst {op = #pto.reduce_op<Sum>} : (!pto.group<...>, !pto.memref<...>)
 treduce %group, %dst {op = #pto.reduce_op<Max>} : (!pto.group<...>, !pto.memref<...>)
 ```
-Lowering introduces internal accumulator and receive tiles for the reduce pipeline; the C++ intrinsic requires explicit `accTileData`, `recvTileData` (or `accTileData`, `pingTileData`, `pongTileData`) operand(s).
+
+Lowering introduces accumulator and receive tiles internally. The C++ intrinsic exposes these explicitly.
 
 ## C++ Intrinsic
 
 Declared in `include/pto/comm/pto_comm_inst.hpp`:
 
 ```cpp
-// Basic reduce (accumulator + receive tile)
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+// Basic reduce — accumulator + receive tile
+template <typename ParallelGroupType, typename GlobalDstData,
+          typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                              TileData &accTileData, TileData &recvTileData, ReduceOp op, WaitEvents&... events);
+                              TileData &accTileData, TileData &recvTileData,
+                              ReduceOp op, WaitEvents&... events);
 
-// Ping-pong reduce (accumulator + ping + pong tiles for double buffering)
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+// Ping-pong reduce — accumulator + ping + pong tiles for double buffering
+template <typename ParallelGroupType, typename GlobalDstData,
+          typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
                               TileData &accTileData, TileData &pingTileData, TileData &pongTileData,
                               ReduceOp op, WaitEvents&... events);
 ```
 
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `parallelGroup` | `ParallelGroup` | Parallel group descriptor; `GetRootIdx()` identifies the reduce root |
+| `dstGlobalData` | `GlobalTensor` | Local destination buffer on the root NPU |
+| `accTileData` | `Tile` | Accumulator tile in UB for partial reduction |
+| `recvTileData` | `Tile` | Receive tile in UB for incoming remote data |
+| `pingTileData` / `pongTileData` | `Tile` | Two UB tiles for ping-pong double buffering |
+| `op` | `ReduceOp` | Reduction operator (`Sum`, `Max`, `Min`, etc.) |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before issuing the reduction |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | event | Token signaling reduction completion |
+
+## Side Effects
+
+This operation reads from all ranks' global memory and writes to the root's global memory. It establishes synchronization edges through the returned event token.
+
 ## Constraints
 
-- **Type constraints**:
-    - `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
-    - `TileData::DType` must equal `GlobalDstData::RawDType`.
-- **Memory constraints**:
-    - `dstGlobalData` must point to local address (on current NPU).
-    - `accTileData`, `recvTileData` (or `accTileData`, `pingTileData`, `pongTileData`) must be pre-allocated UB tiles.
-- **ParallelGroup constraints**:
-    - `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen by the root).
-    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the reduce root.
-    - All source tensors are assumed to have the same shape and strides.
-- **Chunked mode constraints** (when data exceeds a single UB tile):
-    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
-    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+### Type constraints
+
+- `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
+- `TileData::DType` must equal `GlobalDstData::RawDType`.
+
+### Memory constraints
+
+- `dstGlobalData` must point to a local address (on the root NPU).
+- `accTileData` and `recvTileData` (or `accTileData`, `pingTileData`, `pongTileData`) must be pre-allocated in UB.
+
+### Parallel group constraints
+
+- `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen from the root).
+- `parallelGroup.GetRootIdx()` identifies the calling NPU as the reduce root.
+- All source tensors must have the same shape and strides.
+
+### Chunked mode constraints
+
+When the GlobalTensor exceeds a single UB tile in rows or columns:
+
+- If `TileData` has a static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` `ValidRow` for partial row support.
+- If `TileData` has a static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` `ValidCol` for partial column support.
+
+## Target-Profile Restrictions
+
+- Collective communication is supported on A2/A3 and A5 profiles. CPU simulation does not support collective operations.
+- Use ping-pong double buffering for large transfers to overlap communication with computation.
+- `TREDUCE` requires a properly initialized `ParallelGroup` covering all participating NPUs.
 
 ## Examples
 
-### Basic Reduce Sum
+### Reduce sum
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
@@ -75,13 +121,11 @@ template <typename T, int SIZE, int NRANKS>
 void reduce_sum(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
     using TileT = Tile<TileType::Vec, T, 1, SIZE>;
     using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
-                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
 
-    // Stack-allocated tensors
     GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
+    for (int i = 0; i < NRANKS; ++i)
         tensors[i] = GTensor(group_addrs[i]);
-    }
 
     comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
     GTensor dstG(result);
@@ -91,23 +135,18 @@ void reduce_sum(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
 }
 ```
 
-### Max Reduce
+### Reduce max
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
 template <typename T, int SIZE, int NRANKS>
 void reduce_max(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
     using TileT = Tile<TileType::Vec, T, 1, SIZE>;
     using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
-                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
 
     GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
+    for (int i = 0; i < NRANKS; ++i)
         tensors[i] = GTensor(group_addrs[i]);
-    }
 
     comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
     GTensor dstG(result);
@@ -116,3 +155,10 @@ void reduce_max(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
     comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Max);
 }
 ```
+
+## Related Ops / Instruction Set Links
+
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Collective operations: [TBROADCAST](./TBROADCAST.md), [TGATHER](./TGATHER.md), [TSCATTER](./TSCATTER.md)
+- Point-to-point: [TGET](./TGET.md), [TPUT](./TPUT.md)
+- Instruction set: [Other and Communication](../other/README.md)
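
Note on the `TREDUCE` semantics documented above: a minimal host-side reference for the element-wise formula, which can be handy when checking device results. It starts the accumulator from rank 0 and folds in each remaining rank, loosely mirroring the accumulator/receive tile split. `ReferenceReduce`, `MockReduceOp`, and `DividesEvenly` are hypothetical names for this illustration, not PTO APIs, and the fold order is an assumption rather than the documented device order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for comm::ReduceOp; not the PTO enum.
enum class MockReduceOp { Sum, Max, Min };

// Computes dst[i] = src(0)[i] (+) src(1)[i] (+) ... (+) src(N-1)[i],
// where rankBuffers[r] plays the role of rank r's source buffer.
inline std::vector<int> ReferenceReduce(const std::vector<std::vector<int>>& rankBuffers,
                                        MockReduceOp op) {
    assert(!rankBuffers.empty());
    std::vector<int> acc = rankBuffers[0];        // accumulator starts from rank 0
    for (std::size_t r = 1; r < rankBuffers.size(); ++r) {
        const std::vector<int>& recv = rankBuffers[r];
        assert(recv.size() == acc.size());        // all ranks share one shape
        for (std::size_t i = 0; i < acc.size(); ++i) {
            switch (op) {
                case MockReduceOp::Sum: acc[i] += recv[i]; break;
                case MockReduceOp::Max: acc[i] = std::max(acc[i], recv[i]); break;
                case MockReduceOp::Min: acc[i] = std::min(acc[i], recv[i]); break;
            }
        }
    }
    return acc;
}

// Chunked-mode rule for a tile with a static valid size: the tensor extent
// along that dimension must be an exact multiple of the valid size.
inline bool DividesEvenly(std::size_t extent, std::size_t validSize) {
    return validSize != 0 && extent % validSize == 0;
}
```

For floating-point `Sum`, the result can depend on the order in which ranks are folded, so a comparison against device output would normally use a tolerance rather than exact equality. `DividesEvenly(GetShape(DIM_3), ValidRow)` style checks can be run on the host before launch when the tile uses static valid sizes.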
diff --git a/docs/isa/comm/TREDUCE_zh.md b/docs/isa/comm/TREDUCE_zh.md
index 5f165efe..a5b56173 100644
--- a/docs/isa/comm/TREDUCE_zh.md
+++ b/docs/isa/comm/TREDUCE_zh.md
@@ -1,112 +1,164 @@
-# TREDUCE
-
-## 简介
-
-Reduce 操作：从多个远端 NPU 收集数据并在本地执行逐元素归约。
-
-只有根节点需要执行 `TREDUCE`。非根节点只需确保在操作期间其源缓冲区已就绪且保持有效。在非根节点上调用 `TREDUCE` 属于未定义行为。
-
-**大 Tile 支持**：当 GlobalTensor 在行和/或列方向超出 UB Tile 容量时，归约操作将通过二维滑动自动分块。
-
-## 数学语义
-
-对有效区域内每个元素 `(i, j)`：
-
-$$\mathrm{dst}^{\mathrm{local}}_{i,j} = \bigoplus_{r=0}^{N-1} \mathrm{src}^{(r)}_{i,j}$$
-
-其中 $N$ 为 rank 总数，$\oplus$ 为归约运算（求和、取最大值、取最小值等）。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-treduce %group, %dst {op = #pto.reduce_op<Sum>} : (!pto.group<...>, !pto.memref<...>)
-treduce %group, %dst {op = #pto.reduce_op<Max>} : (!pto.group<...>, !pto.memref<...>)
-```
-
-降级时会为 reduce 流水线引入内部累加 Tile 和接收 Tile；C++ 内建接口需要显式传入 `accTileData`、`recvTileData`（或 `accTileData`、`pingTileData`、`pongTileData`）操作数。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-// 基础 reduce（累加 Tile + 接收 Tile）
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                              TileData &accTileData, TileData &recvTileData, ReduceOp op, WaitEvents&... events);
-
-// 乒乓 reduce（累加 Tile + ping/pong Tile 实现双缓冲）
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                              TileData &accTileData, TileData &pingTileData, TileData &pongTileData,
-                              ReduceOp op, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `ParallelGroup::value_type::RawDType` 必须等于 `GlobalDstData::RawDType`。
-    - `TileData::DType` 必须等于 `GlobalDstData::RawDType`。
-- **内存约束**：
-    - `dstGlobalData` 必须指向本地内存（当前 NPU）。
-    - `accTileData`、`recvTileData`（或 `accTileData`、`pingTileData`、`pongTileData`）必须为预先分配的 UB Tile。
-- **ParallelGroup 约束**：
-    - `parallelGroup.tensors[r]` 必须指向 rank `r` 的源缓冲区（从根节点视角看到的远端 GM）。
-    - `parallelGroup.GetRootIdx()` 标识调用方 NPU 为 reduce 根节点。
-    - 所有源 tensor 假定具有相同的形状和步幅。
-- **分块模式约束**（数据超出单个 UB Tile 时）：
-    - 若 `TileData` 具有静态 `ValidRow`，则 `GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足一行的情况，请使用 `DYNAMIC` ValidRow 的 Tile。
-    - 若 `TileData` 具有静态 `ValidCol`，则 `GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足一列的情况，请使用 `DYNAMIC` ValidCol 的 Tile。
-
-## 示例
-
-### 基础求和归约
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int SIZE, int NRANKS>
-void reduce_sum(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT   = Tile<TileType::Vec, T, 1, SIZE>;
-    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
-                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
-
-    GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GTensor(group_addrs[i]);
-
-    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
-    GTensor dstG(result);
-    TileT accTile, recvTile;
-    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Sum);
-}
-```
-
-### 最大值归约
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int SIZE, int NRANKS>
-void reduce_max(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT   = Tile<TileType::Vec, T, 1, SIZE>;
-    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
-                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
-
-    GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GTensor(group_addrs[i]);
-
-    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
-    GTensor dstG(result);
-    TileT accTile, recvTile;
-    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Max);
-}
-```
-
+# TREDUCE
+
+`TREDUCE` 是[通信与运行时](../other/communication-and-runtime_zh.md)指令集的一部分。
+
+## 概述
+
+归约操作：从并行组中所有 rank 收集数据，并在根节点 NPU 上执行逐元素归约。只有根节点执行 `TREDUCE`；非根节点只需确保源缓冲区已就绪。在非根节点上调用 `TREDUCE` 属于未定义行为。
+
+当 GlobalTensor 在行或列方向超出 UB Tile 容量时，归约会自动通过二维滑动分块。
+
+## 机制
+
+`TREDUCE` 从并行组内所有 rank 收集源数据，并在根节点 NPU 的目标缓冲区中完成归约。对有效区域中每个元素 `(i, j)`：
+
+$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \bigoplus_{r=0}^{N-1} \mathrm{src}^{(r)}_{i,j} $$
+
+其中 $N$ 为 rank 总数，$\oplus$ 为归约运算符。
+
+### 支持的归约运算符
+
+| 运算符 | 说明 |
+|--------|------|
+| `Sum` | 加法归约 |
+| `Max` | 最大值 |
+| `Min` | 最小值 |
+
+## 语法
+
+### PTO 汇编形式
+
+```text
+treduce %group, %dst {op = #pto.reduce_op<Sum>} : (!pto.group<...>, !pto.memref<...>)
+treduce %group, %dst {op = #pto.reduce_op<Max>} : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering 会引入累加器和接收 Tile。C++ 内建接口显式暴露这些 Tile。
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+// 基础归约 — 累加 Tile + 接收 Tile
+template <typename ParallelGroupType, typename GlobalDstData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                            TileData &accTileData, TileData &recvTileData,
+                            ReduceOp op, WaitEvents&... events);
+
+// 乒乓归约 — 累加 Tile + ping/pong Tile 实现双缓冲
+template <typename ParallelGroupType, typename GlobalDstData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                            TileData &accTileData, TileData &pingTileData, TileData &pongTileData,
+                            ReduceOp op, WaitEvents&... events);
+```
+
+## 输入
+
+| 操作数 | 类型 | 说明 |
+|--------|------|------|
+| `parallelGroup` | `ParallelGroup` | 并行组描述符；`GetRootIdx()` 标识归约根节点 |
+| `dstGlobalData` | `GlobalTensor` | 根节点上的本地目标缓冲区 |
+| `accTileData` | `Tile` | 用于部分归约的 UB 累加 Tile |
+| `recvTileData` | `Tile` | 用于接收远端数据的 UB 接收 Tile |
+| `pingTileData` / `pongTileData` | `Tile` | 乒乓双缓冲用的两个 UB Tile |
+| `op` | `ReduceOp` | 归约运算符（`Sum`、`Max`、`Min` 等） |
+| `WaitEvents...` | `RecordEvent...` | 发指令前要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+|------|------|------|
+| `RecordEvent` | event | 标记归约完成的事件令牌 |
+
+## 副作用
+
+本指令从所有 rank 的全局内存读取数据并写入根节点的全局内存。通过返回的事件令牌建立同步边界。
+
+## 约束
+
+### 类型约束
+
+- `ParallelGroup::value_type::RawDType` 必须等于 `GlobalDstData::RawDType`
+- `TileData::DType` 必须等于 `GlobalDstData::RawDType`
+
+### 内存约束
+
+- `dstGlobalData` 必须指向本地地址（根节点 NPU）
+- `accTileData` 和 `recvTileData`（或 `accTileData`、`pingTileData`、`pongTileData`）必须预先在 UB 中分配
+
+### 并行组约束
+
+- `parallelGroup.tensors[r]` 必须指向 rank `r` 的源缓冲区（从根节点视角看到的远端 GM）
+- `parallelGroup.GetRootIdx()` 标识调用方 NPU 为归约根节点
+- 所有源 Tensor 必须具有相同的形状和步幅
+
+### 分块约束
+
+当 GlobalTensor 在行方向或列方向超过单个 UB Tile 时：
+
+- 若 `TileData` 具有静态 `ValidRow`，`GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足整行，应使用动态 `ValidRow` 的 Tile
+- 若 `TileData` 具有静态 `ValidCol`，`GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足整列，应使用动态 `ValidCol` 的 Tile
+
+## 目标Profile限制
+
+- 集合通信在 A2/A3 和 A5 上支持。CPU 模拟器不支持集合通信。
+- 大数据量传输建议使用乒乓双缓冲，以重叠通信与计算。
+- `TREDUCE` 需要正确初始化的 `ParallelGroup`，覆盖所有参与的 NPU。
+
+## 示例
+
+### 求和归约
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int SIZE, int NRANKS>
+void reduce_sum(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT   = Tile<TileType::Vec, T, 1, SIZE>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
+                                     BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
+
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i)
+        tensors[i] = GTensor(group_addrs[i]);
+
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor dstG(result);
+    TileT accTile, recvTile;
+
+    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Sum);
+}
+```
+
+### 最大值归约
+
+```cpp
+template <typename T, int SIZE, int NRANKS>
+void reduce_max(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT   = Tile<TileType::Vec, T, 1, SIZE>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
+                                     BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
+
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i)
+        tensors[i] = GTensor(group_addrs[i]);
+
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor dstG(result);
+    TileT accTile, recvTile;
+
+    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Max);
+}
+```
+
+## 相关页面
+
+- 通信概述：[通信与运行时](../other/communication-and-runtime_zh.md)
+- 集合通信：[TBROADCAST](./TBROADCAST_zh.md)、[TGATHER](./TGATHER_zh.md)、[TSCATTER](./TSCATTER_zh.md)
+- 点对点通信：[TGET](./TGET_zh.md)、[TPUT](./TPUT_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
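
Review note on the TREDUCE page above: the mechanism formula (an element-wise reduction over all ranks with `Sum`, `Max`, or `Min`) can be checked against a small host-side reference model. The sketch below is plain C++ and does not use the PTO headers; `RefReduceOp` and `reference_reduce` are hypothetical names introduced only for illustration, and each rank's source buffer is modelled as a `std::vector`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the three operators listed on the page.
enum class RefReduceOp { Sum, Max, Min };

// Reference model only (not the PTO device API):
// dst[i] = srcs[0][i] (+) srcs[1][i] (+) ... (+) srcs[N-1][i].
// `srcs` holds one equally sized buffer per rank.
template <typename T>
std::vector<T> reference_reduce(const std::vector<std::vector<T>>& srcs, RefReduceOp op) {
    assert(!srcs.empty());
    std::vector<T> dst = srcs[0];  // the accumulator starts from rank 0's data
    for (std::size_t r = 1; r < srcs.size(); ++r) {
        assert(srcs[r].size() == dst.size());  // all sources share one shape
        for (std::size_t i = 0; i < dst.size(); ++i) {
            switch (op) {
                case RefReduceOp::Sum: dst[i] = dst[i] + srcs[r][i]; break;
                case RefReduceOp::Max: dst[i] = std::max(dst[i], srcs[r][i]); break;
                case RefReduceOp::Min: dst[i] = std::min(dst[i], srcs[r][i]); break;
            }
        }
    }
    return dst;
}
```

A model like this could serve as the expected-value generator in a test that compares against the root's destination buffer after `TREDUCE` completes; floating-point sums may differ slightly if the device accumulates in a different order.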
diff --git a/docs/isa/comm/TSCATTER.md b/docs/isa/comm/TSCATTER.md
index 97c37270..3893c203 100644
--- a/docs/isa/comm/TSCATTER.md
+++ b/docs/isa/comm/TSCATTER.md
@@ -1,70 +1,104 @@
 ﻿# TSCATTER
 
-## Introduction
+`TSCATTER` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-Scatter operation: the calling NPU (root) distributes data to all ranks in the parallel group by splitting the local source tensor along **DIM_3** (row dimension). This is the inverse of `TGATHER`.
+## Summary
 
+Collective scatter: the root NPU distributes data to all ranks in a parallel group by splitting the local source tensor along DIM_3 (row dimension). This is the inverse of `TGATHER`. Only the root executes `TSCATTER`; non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Executing `TSCATTER` on a non-root rank is undefined behavior.
 
-Only the root needs to execute `TSCATTER`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `TSCATTER` on non-root ranks is undefined behavior.
+When per-rank data exceeds the UB tile capacity, the transfer is automatically chunked via 2D sliding.
 
-**Large Tile Support**: When the per-rank data exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.
-
-## Math Interpretation
+## Mechanism
 
 The local source tensor has shape $(D_0, D_1, D_2, N \times H, W)$, where $N$ is the number of ranks and each rank receives $H$ rows. After the operation:
 
 $$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
 
-## Assembly Syntax
-
-PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+## Syntax
 
-Synchronous form:
+### PTO Assembly Form
 
 ```text
 tscatter %group, %src : (!pto.group<...>, !pto.memref<...>)
 ```
-Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+UB staging tiles are introduced during lowering. The C++ intrinsic exposes them explicitly.
 
 ## C++ Intrinsic
 
 Declared in `include/pto/comm/pto_comm_inst.hpp`:
 
 ```cpp
-// Basic scatter (single staging tile)
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+// Basic scatter — single staging tile
+template <typename ParallelGroupType, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
                               TileData &stagingTileData, WaitEvents&... events);
 
-// Ping-pong scatter (double buffering with two staging tiles)
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+// Ping-pong scatter — two staging tiles for double buffering
+template <typename ParallelGroupType, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
 PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
                               TileData &pingTile, TileData &pongTile, WaitEvents&... events);
 ```
 
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `parallelGroup` | `ParallelGroup` | Parallel group descriptor; `GetRootIdx()` identifies the scatter root |
+| `srcGlobalData` | `GlobalTensor` | Local source buffer on the root NPU; must contain data for all ranks |
+| `stagingTileData` | `Tile` | UB staging tile for the GM→UB→GM transfer path |
+| `pingTile` / `pongTile` | `Tile` | Two UB staging tiles for ping-pong double buffering |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before issuing the scatter |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | event | Token signaling scatter completion |
+
+## Side Effects
+
+This operation reads from the root's global memory and writes to all ranks' global memory. It establishes synchronization edges through the returned event token.
+
 ## Constraints
 
-- **Type constraints**:
-    - `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
-    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
-- **Memory constraints**:
-    - `srcGlobalData` must point to local memory (current NPU) and be large enough to hold data for all ranks. Specifically, `srcGlobalData.GetShape(DIM_3)` must be $\geq N \times H$ where $H$ is each rank's `GetShape(DIM_3)`.
-    - If `srcGlobalData.GetShape(DIM_3) > N × H`, only the first `N × H` rows are read; remaining rows are ignored.
-    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
-- **ParallelGroup constraints**:
-    - `parallelGroup.tensors[r]` must refer to rank `r`'s destination buffer (remote GM as seen by the root).
-    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the scatter root.
-    - All destination tensors are assumed to have the same shape and strides; behavior is undefined if they differ.
-- **Chunked mode constraints** (when per-rank data exceeds a single UB tile):
-    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's destination must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
-    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+### Type constraints
+
+- `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
+- `TileData::DType` must equal `GlobalSrcData::RawDType`.
+
+### Memory constraints
+
+- `srcGlobalData` must point to local memory and be large enough to hold data for all ranks. Specifically, `srcGlobalData.GetShape(DIM_3)` must be $\geq N \times H$.
+- If `srcGlobalData.GetShape(DIM_3)` $> N \times H$, only the first $N \times H$ rows are read; remaining rows are ignored.
+- `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in UB.
+
+### Parallel group constraints
+
+- `parallelGroup.tensors[r]` must refer to rank `r`'s destination buffer (remote GM as seen from the root).
+- `parallelGroup.GetRootIdx()` identifies the calling NPU as the scatter root.
+- All destination tensors must have the same shape and strides.
+
+### Chunked mode constraints
+
+When per-rank data exceeds a single UB tile in rows or columns:
+
+- If `TileData` has a static `ValidRow`, each rank's destination `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` `ValidRow` for partial row support.
+- If `TileData` has a static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` `ValidCol` for partial column support.
+
+## Target-Profile Restrictions
+
+- Collective communication is supported on A2/A3 and A5 profiles. CPU simulation does not support collective operations.
+- Use ping-pong double buffering for large transfers to overlap communication with computation.
+- `TSCATTER` requires a properly initialized `ParallelGroup` covering all participating NPUs.
 
 ## Examples
 
-### Basic Scatter (Single Staging Tile)
+### Basic scatter
 
-Root has `NRANKS * ROWS` rows of width `COLS`. Each rank receives `ROWS × COLS`, split along DIM_3.
-The tile size (`TILE_ROWS × TILE_COLS`) can be smaller than the per-rank data — when it is, the implementation automatically chunks the transfer along both DIM_3 and DIM_4 via 2D sliding.
+Root has `NRANKS × ROWS` rows of width `COLS`. Each rank receives `ROWS × COLS`:
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
@@ -75,14 +109,13 @@ template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRAN
 void scatter(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
     using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
     using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
     using GSource = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
 
     GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
+    for (int i = 0; i < NRANKS; ++i)
         tensors[i] = GPerRank(group_addrs[i]);
-    }
 
     comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
     GSource srcG(local_data);
@@ -92,35 +125,34 @@ void scatter(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
 }
 ```
 
-### Ping-Pong Scatter (Double Buffering)
-
-Uses two UB tiles to overlap TLOAD of the next chunk (MTE2) with TSTORE of the current chunk (MTE3).
+### Ping-pong scatter
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
 template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
 void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
-    // Tile can be smaller than the data in both dimensions
     using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
     using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
     using GSource = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+                                     BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
 
     GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
+    for (int i = 0; i < NRANKS; ++i)
         tensors[i] = GPerRank(group_addrs[i]);
-    }
 
     comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
     GSource srcG(local_data);
     TileT pingTile(TILE_ROWS, TILE_COLS);
     TileT pongTile(TILE_ROWS, TILE_COLS);
 
-    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
     comm::TSCATTER(group, srcG, pingTile, pongTile);
 }
 ```
+
+## Related Ops / Instruction Set Links
+
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Inverse operation: [TGATHER](./TGATHER.md)
+- Collective operations: [TBROADCAST](./TBROADCAST.md), [TGATHER](./TGATHER.md), [TREDUCE](./TREDUCE.md)
+- Point-to-point: [TGET](./TGET.md), [TPUT](./TPUT.md)
+- Instruction set: [Other and Communication](../other/README.md)
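
Review note on the TSCATTER page above: the row-split formula in the Mechanism section can be illustrated with a small host-side reference model. This is a sketch in plain C++ that does not use the PTO headers; `reference_scatter` is a hypothetical helper, the leading dimensions $D_0, D_1, D_2$ are assumed to be 1, and the source is assumed to be a dense row-major buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference model only (not the PTO device API): split a row-major source of at
// least (nranks * rows_per_rank) rows by `cols` columns into one
// rows_per_rank x cols block per rank, i.e. dst[r][i][j] = src[r * H + i][j].
template <typename T>
std::vector<std::vector<T>> reference_scatter(const std::vector<T>& src,
                                              std::size_t nranks,
                                              std::size_t rows_per_rank,
                                              std::size_t cols) {
    // The source may hold more than N * H rows; the extra rows are ignored.
    assert(src.size() >= nranks * rows_per_rank * cols);
    std::vector<std::vector<T>> dst(nranks, std::vector<T>(rows_per_rank * cols));
    for (std::size_t r = 0; r < nranks; ++r) {
        for (std::size_t i = 0; i < rows_per_rank; ++i) {
            for (std::size_t j = 0; j < cols; ++j) {
                dst[r][i * cols + j] = src[(r * rows_per_rank + i) * cols + j];
            }
        }
    }
    return dst;
}
```

For example, with three ranks, two rows per rank, and two columns, rank 1 receives source rows 2 and 3. Rows past the first $N \times H$ never appear in any destination, matching the memory constraint on the page.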
diff --git a/docs/isa/comm/TSCATTER_zh.md b/docs/isa/comm/TSCATTER_zh.md
index c353beab..3b8f2e26 100644
--- a/docs/isa/comm/TSCATTER_zh.md
+++ b/docs/isa/comm/TSCATTER_zh.md
@@ -1,120 +1,158 @@
-# TSCATTER
-
-## 简介
-
-Scatter 操作：调用方 NPU（根节点）将本地源 tensor 沿 **DIM_3**（行维度）拆分后分发到并行组中所有 rank。该操作是 `TGATHER` 的逆操作。
-
-只有根节点需要执行 `TSCATTER`。非根节点只需确保在操作期间其目标缓冲区已分配且可写。在非根节点上调用 `TSCATTER` 属于未定义行为。
-
-**大 Tile 支持**：当每 rank 的数据在行和/或列方向超出 UB Tile 容量时，传输将通过二维滑动自动分块。
-
-## 数学语义
-
-本地源 tensor 的形状为 $(D_0, D_1, D_2, N \times H, W)$，其中 $N$ 为 rank 总数，每个 rank 接收 $H$ 行。操作完成后：
-
-$$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-tscatter %group, %src : (!pto.group<...>, !pto.memref<...>)
-```
-
-降级时会为 GM→UB→GM 数据路径引入 UB 暂存 Tile；C++ 内建接口需要显式传入 `stagingTileData`（或 `pingTile` / `pongTile`）操作数。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-// 基础 scatter（单暂存 Tile）
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                              TileData &stagingTileData, WaitEvents&... events);
-
-// 乒乓 scatter（使用两个暂存 Tile 实现双缓冲）
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                              TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `ParallelGroup::value_type::RawDType` 必须等于 `GlobalSrcData::RawDType`。
-    - `TileData::DType` 必须等于 `GlobalSrcData::RawDType`。
-- **内存约束**：
-    - `srcGlobalData` 必须指向本地内存（当前 NPU），且足够容纳所有 rank 的数据。具体要求：`srcGlobalData.GetShape(DIM_3)` 必须 $\geq N \times H$，其中 $H$ 为每个 rank 的 `GetShape(DIM_3)`。
-    - 若 `srcGlobalData.GetShape(DIM_3) > N × H`，则只读取前 `N × H` 行，其余行被忽略。
-    - `stagingTileData`（或 `pingTile` / `pongTile`）必须预先在 UB 中分配。
-- **ParallelGroup 约束**：
-    - `parallelGroup.tensors[r]` 必须指向 rank `r` 的目标缓冲区（从根节点视角看到的远端 GM）。
-    - `parallelGroup.GetRootIdx()` 标识调用方 NPU 为 scatter 根节点。
-    - 所有目标 tensor 假定具有相同的形状和步幅；否则行为未定义。
-- **分块模式约束**（每 rank 数据超出单个 UB Tile 时）：
-    - 若 `TileData` 具有静态 `ValidRow`，则每个 rank 目标数据的 `GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足一行的情况，请使用 `DYNAMIC` ValidRow 的 Tile。
-    - 若 `TileData` 具有静态 `ValidCol`，则 `GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足一列的情况，请使用 `DYNAMIC` ValidCol 的 Tile。
-
-## 示例
-
-### 基础 Scatter（单暂存 Tile）
-
-根节点拥有 `NRANKS * ROWS` 行、宽度为 `COLS` 的数据，每个 rank 接收 `ROWS × COLS`，沿 DIM_3 拆分。
-Tile 大小可小于每 rank 的数据——此时实现会自动通过二维滑动进行分块传输。
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void scatter(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
-    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GSource  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GSource srcG(local_data);
-    TileT stagingTile(TILE_ROWS, TILE_COLS);
-    comm::TSCATTER(group, srcG, stagingTile);
-}
-```
-
-### 乒乓 Scatter（双缓冲）
-
-使用两个 UB Tile，将下一块的 TLOAD（MTE2）与当前块的 TSTORE（MTE3）重叠执行。
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
-    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GSource  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GSource srcG(local_data);
-    TileT pingTile(TILE_ROWS, TILE_COLS);
-    TileT pongTile(TILE_ROWS, TILE_COLS);
-    // 乒乓模式：将 TLOAD 与 TSTORE 重叠执行以提升吞吐量
-    comm::TSCATTER(group, srcG, pingTile, pongTile);
-}
-```
-
+# TSCATTER
+
+`TSCATTER` 是[通信与运行时](../other/communication-and-runtime_zh.md)指令集的一部分。
+
+## 概述
+
+集合 Scatter 操作：根节点 NPU 将本地源 tensor 沿 DIM_3（行维度）拆分后分发到并行组中所有 rank。该操作是 `TGATHER` 的逆操作。只有根节点执行 `TSCATTER`；非根节点只需确保目标缓冲区已分配且在操作期间可写。在非根节点上调用 `TSCATTER` 属于未定义行为。
+
+当每个 rank 的数据超出 UB Tile 容量时，传输会自动通过二维滑动分块。
+
+## 机制
+
+本地源 tensor 的形状为 $(D_0, D_1, D_2, N \times H, W)$，其中 $N$ 为 rank 总数，每个 rank 接收 $H$ 行。操作完成后：
+
+$$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
+
+## 语法
+
+### PTO 汇编形式
+
+```text
+tscatter %group, %src : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering 会引入 UB 暂存 Tile。C++ 内建接口显式暴露这些 Tile。
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+// 基础 scatter — 单个暂存 Tile
+template <typename ParallelGroupType, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                           TileData &stagingTileData, WaitEvents&... events);
+
+// 乒乓 scatter — 两个暂存 Tile 实现双缓冲
+template <typename ParallelGroupType, typename GlobalSrcData,
+          typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                           TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+## 输入
+
+| 操作数 | 类型 | 说明 |
+|--------|------|------|
+| `parallelGroup` | `ParallelGroup` | 并行组描述符；`GetRootIdx()` 标识 scatter 根节点 |
+| `srcGlobalData` | `GlobalTensor` | 根节点上的本地源缓冲区；必须包含所有 rank 的数据 |
+| `stagingTileData` | `Tile` | GM→UB→GM 传输路径上的 UB 暂存 Tile |
+| `pingTile` / `pongTile` | `Tile` | 双缓冲用的两个 UB 暂存 Tile |
+| `WaitEvents...` | `RecordEvent...` | 发指令前要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+|------|------|------|
+| `RecordEvent` | event | 标记 scatter 完成的事件令牌 |
+
+## 副作用
+
+本指令从根节点的全局内存读取数据并写入所有 rank 的全局内存。通过返回的事件令牌建立同步边界。
+
+## 约束
+
+### 类型约束
+
+- `ParallelGroup::value_type::RawDType` 必须等于 `GlobalSrcData::RawDType`
+- `TileData::DType` 必须等于 `GlobalSrcData::RawDType`
+
+### 内存约束
+
+- `srcGlobalData` 必须指向本地内存，且足够容纳所有 rank 的数据。具体要求：`srcGlobalData.GetShape(DIM_3)` 必须 $\geq N \times H$
+- 若 `srcGlobalData.GetShape(DIM_3)` $> N \times H$，只读取前 $N \times H$ 行，其余行被忽略
+- `stagingTileData` / `pingTile` / `pongTile` 必须预先在 UB 中分配
+
+### 并行组约束
+
+- `parallelGroup.tensors[r]` 必须指向 rank `r` 的目标缓冲区（从根节点视角看到的远端 GM）
+- `parallelGroup.GetRootIdx()` 标识调用方 NPU 为 scatter 根节点
+- 所有目标 Tensor 必须具有相同的形状和步幅
+
+### 分块约束
+
+当每个 rank 的数据超出单个 UB Tile 的行或列时：
+
+- 若 `TileData` 具有静态 `ValidRow`，每个 rank 目标数据的 `GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足整行，应使用动态 `ValidRow` 的 Tile
+- 若 `TileData` 具有静态 `ValidCol`，`GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足整列，应使用动态 `ValidCol` 的 Tile
+
+## 目标Profile限制
+
+- 集合通信在 A2/A3 和 A5 上支持。CPU 模拟器不支持集合通信。
+- 大数据量传输建议使用乒乓双缓冲，以重叠通信与计算。
+- `TSCATTER` 需要正确初始化的 `ParallelGroup`，覆盖所有参与的 NPU。
+
+## 示例
+
+### 基础 scatter
+
+根节点拥有 `NRANKS × ROWS` 行、宽度为 `COLS` 的数据，每个 rank 接收 `ROWS × COLS`：
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void scatter(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
+    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                     BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GSource  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                     BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i)
+        tensors[i] = GPerRank(group_addrs[i]);
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GSource srcG(local_data);
+    TileT stagingTile(TILE_ROWS, TILE_COLS);
+
+    comm::TSCATTER(group, srcG, stagingTile);
+}
+```
+
+### 乒乓 scatter
+
+```cpp
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
+    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                     BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GSource  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                     BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i)
+        tensors[i] = GPerRank(group_addrs[i]);
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GSource srcG(local_data);
+    TileT pingTile(TILE_ROWS, TILE_COLS);
+    TileT pongTile(TILE_ROWS, TILE_COLS);
+
+    comm::TSCATTER(group, srcG, pingTile, pongTile);
+}
+```
+
+## 相关页面
+
+- 通信概述：[通信与运行时](../other/communication-and-runtime_zh.md)
+- 逆操作：[TGATHER](./TGATHER_zh.md)
+- 集合通信：[TBROADCAST](./TBROADCAST_zh.md)、[TGATHER](./TGATHER_zh.md)、[TREDUCE](./TREDUCE_zh.md)
+- 点对点通信：[TGET](./TGET_zh.md)、[TPUT](./TPUT_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
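
Review note on the chunking constraints in the TSCATTER and TREDUCE pages above: the divisibility rule (a static `ValidRow` or `ValidCol` must divide the corresponding `GetShape` extent, while a dynamic valid size allows a partial last chunk) can be expressed as a pair of compile-time checks. The sketch below is plain C++ and does not use the PTO headers. `kDynamic`, `chunking_ok`, and `chunk_count` are hypothetical names; using `-1` for "dynamic" mirrors the `-1, -1` template arguments in the page's `Tile` examples and is otherwise an assumption, as is counting chunks by ceiling division.

```cpp
#include <cassert>

// Assumption: -1 marks a dynamic valid size, as in the Tile<..., -1, -1> examples.
constexpr int kDynamic = -1;

// True if a dimension of `extent` elements satisfies the documented rule:
// a static valid size must divide the extent; a dynamic one always passes.
constexpr bool chunking_ok(int extent, int static_valid) {
    return static_valid == kDynamic || extent % static_valid == 0;
}

// Number of chunks along one dimension, assuming ceil(extent / tile_extent).
constexpr int chunk_count(int extent, int tile_extent) {
    return (extent + tile_extent - 1) / tile_extent;
}

// Compile-time examples: 64 rows with a static 16-row tile is fine,
// 60 rows is not, and a dynamic tile accepts the 60-row case.
static_assert(chunking_ok(64, 16), "64 is divisible by 16");
static_assert(!chunking_ok(60, 16), "60 is not divisible by 16");
static_assert(chunking_ok(60, kDynamic), "dynamic valid size allows a partial chunk");
```

A check of this kind could be placed next to the `Tile` type alias in a kernel so that an unsupported static shape fails at compile time instead of at run time.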
diff --git a/docs/isa/comm/TTEST.md b/docs/isa/comm/TTEST.md
index 4628d1ac..ec1f2744 100644
--- a/docs/isa/comm/TTEST.md
+++ b/docs/isa/comm/TTEST.md
@@ -1,28 +1,28 @@
 ﻿# TTEST
 
-## Introduction
+`TTEST` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-Non-blocking test if signal(s) meet comparison condition. Returns `true` if condition is satisfied, `false` otherwise. Used for polling-based synchronization with timeout or interleaved work.
+## Summary
 
-Supports single signal or multi-dimensional signal tensor (up to 5-D, shape derived from GlobalTensor). For tensor, returns `true` only if ALL signals meet the condition.
+Non-blocking test whether signal(s) satisfy a comparison condition. Returns `true` if the condition is met, `false` otherwise. Use `TTEST` for polling-based synchronization when you need to interleave work with waiting, or to avoid blocking indefinitely. For blocking wait semantics, use `TWAIT` instead.
 
-## Math Interpretation
+Supports single scalar signals and multi-dimensional signal tensors (up to 5-D). For tensors, returns `true` only if all signals in the tensor satisfy the condition.
 
-Test and return result:
+## Mechanism
 
-Single signal:
+`TTEST` checks the signal condition and returns immediately. For single signals:
 
 $$ \mathrm{result} = (\mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue}) $$
 
-Signal tensor (all must satisfy):
+For signal tensors (all must satisfy):
 
 $$ \mathrm{result} = \bigwedge_{d_0, d_1, d_2, d_3, d_4} (\mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue}) $$
 
-where `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
+where `cmp` is one of `EQ`, `NE`, `GT`, `GE`, `LT`, `LE`.
 
-## Assembly Syntax
+## Syntax
 
-PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+### PTO Assembly Form
 
 ```text
 %result = ttest %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32) -> i1
@@ -38,31 +38,48 @@ template <typename GlobalSignalData, typename... WaitEvents>
 PTO_INST bool TTEST(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
 ```
 
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `signalData` | `GlobalSignalData` | Signal or signal tensor; must be `int32_t` |
+| `cmpValue` | `int32_t` | Comparison threshold value |
+| `cmp` | `WaitCmp` | Comparison operator |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before testing |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `bool` | `true`/`false` | `true` if condition is satisfied, `false` otherwise |
+
+## Side Effects
+
+This operation reads signal state and returns immediately. It does not block and does not modify the signal.
+
 ## Constraints
 
-- **Type constraints**:
-    - `GlobalSignalData::DType` must be `int32_t` (32-bit signal).
-- **Memory constraints**:
-    - `signalData` must point to local address (on current NPU).
-- **Return value**:
-    - Returns `true` if condition is satisfied, `false` otherwise.
-    - For signal tensor, returns `true` only if ALL signals satisfy the condition.
-- **Shape semantics**:
-    - For single signal: Shape is `<1,1,1,1,1>`.
-    - For signal tensor: Shape determines the multi-dimensional region (up to 5-D) to test.
-- **Comparison operators** (WaitCmp):
-  | Value | Condition |
-  |-------|-----------|
-  | `EQ` | `signal == cmpValue` |
-  | `NE` | `signal != cmpValue` |
-  | `GT` | `signal > cmpValue` |
-  | `GE` | `signal >= cmpValue` |
-  | `LT` | `signal < cmpValue` |
-  | `LE` | `signal <= cmpValue` |
+### Type constraints
+
+- `GlobalSignalData::DType` must be `int32_t`.
+
+### Memory constraints
+
+- `signalData` must point to a local address.
+
+### Return value
+
+- Returns `true` if the condition is satisfied, `false` otherwise.
+- For signal tensors, returns `true` only if all signals satisfy the condition.
+
+## Target-Profile Restrictions
+
+- `TTEST` is supported on A2/A3 and A5 profiles. CPU simulation may implement simplified polling semantics.
+- Use `TNOTIFY` on the producer side to update the signal.
 
 ## Examples
 
-### Basic Test
+### Basic test
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
@@ -71,61 +88,43 @@ using namespace pto;
 
 bool check_ready(__gm__ int32_t* local_signal) {
     comm::Signal sig(local_signal);
-
-    // Check if signal == 1
     return comm::TTEST(sig, 1, comm::WaitCmp::EQ);
 }
 ```
 
-### Test Signal Matrix
+### Test signal matrix
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// Test if all signals from a 4x8 dense grid of workers are ready
 bool check_worker_grid(__gm__ int32_t* signal_matrix) {
     comm::Signal2D<4, 8> grid(signal_matrix);
-
     // Returns true only if all 32 signals == 1
     return comm::TTEST(grid, 1, comm::WaitCmp::EQ);
 }
 ```
 
-### Polling with Timeout
+### Polling with timeout
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
 bool poll_with_timeout(__gm__ int32_t* local_signal, int max_iterations) {
     comm::Signal sig(local_signal);
 
     for (int i = 0; i < max_iterations; ++i) {
-        if (comm::TTEST(sig, 1, comm::WaitCmp::EQ)) {
-            return true;  // Signal received
-        }
-        // Could do other work here between polls
+        if (comm::TTEST(sig, 1, comm::WaitCmp::EQ))
+            return true;
+        // Do other useful work between polls
     }
-    return false;  // Timeout
+    return false;
 }
 ```
 
-### Progress-Based Polling
+### Progress-based polling
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
 void process_with_progress(__gm__ int32_t* local_counter, int expected_count) {
     comm::Signal counter(local_counter);
 
     while (!comm::TTEST(counter, expected_count, comm::WaitCmp::GE)) {
-        // Do some useful work while waiting
-        // ...
+        // Do useful work while waiting
     }
     // All expected signals received
 }
@@ -134,10 +133,6 @@ void process_with_progress(__gm__ int32_t* local_counter, int expected_count) {
 ### Compare TWAIT vs TTEST
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
 void compare_wait_test(__gm__ int32_t* local_signal) {
     comm::Signal sig(local_signal);
 
@@ -148,3 +143,10 @@ void compare_wait_test(__gm__ int32_t* local_signal) {
     bool ready = comm::TTEST(sig, 1, comm::WaitCmp::EQ);
 }
 ```
+
+## Related Ops / Instruction Set Links
+
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Blocking counterpart: [TWAIT](./TWAIT.md)
+- Signal producer: [TNOTIFY](./TNOTIFY.md)
+- Instruction set: [Other and Communication](../other/README.md)
diff --git a/docs/isa/comm/TTEST_zh.md b/docs/isa/comm/TTEST_zh.md
index 12a4acef..7199dd40 100644
--- a/docs/isa/comm/TTEST_zh.md
+++ b/docs/isa/comm/TTEST_zh.md
@@ -1,32 +1,30 @@
 # TTEST
 
-## 简介
+`TTEST` 是[通信与运行时](../other/communication-and-runtime_zh.md)指令集的一部分。
 
-`TTEST` 是非阻塞检测原语：检查一个或一组信号是否满足比较条件，满足时返回 `true`，否则立即返回 `false`。
+## 概述
 
-它适合：
+非阻塞检测原语：检查一个或一组信号是否满足比较条件，满足时返回 `true`，否则立即返回 `false`。适合轮询式同步、带超时的等待，或在等待期间穿插其他工作。
 
-- 轮询式同步
-- 带超时的等待
-- 在等待期间穿插其他工作
+支持单个标量信号和最多 5 维的信号 tensor。对 tensor 形式，只有当所有元素都满足条件时才返回 `true`。
 
-既支持单个信号，也支持最多 5 维的信号 tensor。对 tensor 形式，只有当 **所有** 元素都满足条件时才返回 `true`。
+如需阻塞等待，请使用 `TWAIT`。
 
-## 数学语义
+## 机制
 
-单个信号：
+`TTEST` 检查信号条件后立即返回。对单个信号：
 
 $$ \mathrm{result} = (\mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue}) $$
 
-信号 tensor：
+对信号 tensor（所有元素必须满足）：
 
 $$ \mathrm{result} = \bigwedge_{d_0, d_1, d_2, d_3, d_4} (\mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue}) $$
 
-其中 `cmp ∈ {EQ, NE, GT, GE, LT, LE}`。
+其中 `cmp` 为 `EQ`、`NE`、`GT`、`GE`、`LT`、`LE` 之一。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：
+### PTO 汇编形式
 
 ```text
 %result = ttest %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32) -> i1
@@ -42,23 +40,50 @@ template <typename GlobalSignalData, typename... WaitEvents>
 PTO_INST bool TTEST(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 类型 | 说明 |
+|--------|------|------|
+| `signalData` | `GlobalSignalData` | 信号或信号 tensor；必须为 `int32_t` |
+| `cmpValue` | `int32_t` | 比较阈值 |
+| `cmp` | `WaitCmp` | 比较运算符 |
+| `WaitEvents...` | `RecordEvent...` | 检测前要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+|------|------|------|
+| 返回值 | `bool` | 条件满足时为 `true`，否则为 `false` |
+
+## 副作用
+
+此操作读取信号状态后立即返回。不阻塞，不修改状态。
+
 ## 约束
 
+### 类型约束
+
 - `GlobalSignalData::DType` 必须为 `int32_t`
-- `signalData` 必须指向本地地址（当前 NPU）
-- 单个信号的形状为 `<1,1,1,1,1>`
-- tensor 形式由其 shape 决定检测区域，并要求所有元素都满足条件
+
+### 内存约束
+
+- `signalData` 必须指向本地地址
 
 ### 比较运算符
 
-| 值 | 条件 |
-| --- | --- |
-| `EQ` | `signal == cmpValue` |
-| `NE` | `signal != cmpValue` |
-| `GT` | `signal > cmpValue` |
-| `GE` | `signal >= cmpValue` |
-| `LT` | `signal < cmpValue` |
-| `LE` | `signal <= cmpValue` |
+| 值 | 条件 |
+|---|------|
+| `WaitCmp::EQ` | `signal == cmpValue` |
+| `WaitCmp::NE` | `signal != cmpValue` |
+| `WaitCmp::GT` | `signal > cmpValue` |
+| `WaitCmp::GE` | `signal >= cmpValue` |
+| `WaitCmp::LT` | `signal < cmpValue` |
+| `WaitCmp::LE` | `signal <= cmpValue` |
+
+## 目标 Profile 限制
+
+- `TTEST` 支持 A2/A3 和 A5。CPU 模拟器可能只实现简化的轮询语义。
+- 生产者侧使用 `TNOTIFY` 更新信号。
 
 ## 示例
 
@@ -76,10 +101,39 @@ bool check_ready(__gm__ int32_t* local_signal) {
 ```cpp
 bool check_worker_grid(__gm__ int32_t* signal_matrix) {
     comm::Signal2D<4, 8> grid(signal_matrix);
+    // 所有 32 个信号都 == 1 时才返回 true
     return comm::TTEST(grid, 1, comm::WaitCmp::EQ);
 }
 ```
 
-### 与 TWAIT 的区别
+### 带超时的轮询
+
+```cpp
+bool poll_with_timeout(__gm__ int32_t* local_signal, int max_iterations) {
+    comm::Signal sig(local_signal);
+
+    for (int i = 0; i < max_iterations; ++i) {
+        if (comm::TTEST(sig, 1, comm::WaitCmp::EQ))
+            return true;
+        // 在轮询间隔内做其他有用的工作
+    }
+    return false;
+}
+```
+
+### TWAIT vs TTEST
+
+```cpp
+// 阻塞：直到 signal == 1 才返回
+comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
+
+// 非阻塞：立即返回当前检测结果
+bool ready = comm::TTEST(sig, 1, comm::WaitCmp::EQ);
+```
+
+## 相关页面
 
-`TWAIT` 会阻塞直到条件满足；`TTEST` 只返回当前检测结果，不会阻塞调用方。
+- 通信概述：[通信与运行时](../other/communication-and-runtime_zh.md)
+- 阻塞等待：[TWAIT](./TWAIT_zh.md)
+- 信号发送：[TNOTIFY](./TNOTIFY_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
diff --git a/docs/isa/comm/TWAIT.md b/docs/isa/comm/TWAIT.md
index fa0ed8df..849a5171 100644
--- a/docs/isa/comm/TWAIT.md
+++ b/docs/isa/comm/TWAIT.md
@@ -1,29 +1,30 @@
 ﻿# TWAIT
 
-## Introduction
+`TWAIT` is part of the [Communication and Runtime](../other/communication-and-runtime.md) instruction set.
 
-Blocking wait until signal(s) meet comparison condition. Used in conjunction with `TNOTIFY` for flag-based synchronization.
+## Summary
 
-Supports single signal or multi-dimensional signal tensor (up to 5-D, shape derived from GlobalTensor).
+Blocking wait for signal(s) to satisfy a comparison condition. Used in conjunction with `TNOTIFY` to implement flag-based producer-consumer synchronization. Supports single scalar signals and multi-dimensional signal tensors (up to 5-D).
 
+`TWAIT` is a blocking call: it does not return until the condition is satisfied. For non-blocking polling, see `TTEST`.
 
-## Math Interpretation
+## Mechanism
 
-Wait (spin) until the following condition is satisfied:
+`TWAIT` spins until all checked signals satisfy the specified comparison. For single signals, it waits on one value. For signal tensors, all elements in the tensor must satisfy the condition simultaneously.
 
 Single signal:
 
-$$ \mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
+$$ \text{wait until}\ \mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
 
 Signal tensor (all elements must satisfy):
 
 $$ \forall d_0, d_1, d_2, d_3, d_4: \mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
 
-where `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
+where `cmp` is one of `EQ`, `NE`, `GT`, `GE`, `LT`, `LE`.
 
-## Assembly Syntax
+## Syntax
 
-PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+### PTO Assembly Form
 
 ```text
 twait %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32)
@@ -39,28 +40,57 @@ template <typename GlobalSignalData, typename... WaitEvents>
 PTO_INST void TWAIT(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
 ```
 
+### Comparison operators
+
+| Value | Condition |
+|-------|-----------|
+| `WaitCmp::EQ` | `signal == cmpValue` |
+| `WaitCmp::NE` | `signal != cmpValue` |
+| `WaitCmp::GT` | `signal > cmpValue` |
+| `WaitCmp::GE` | `signal >= cmpValue` |
+| `WaitCmp::LT` | `signal < cmpValue` |
+| `WaitCmp::LE` | `signal <= cmpValue` |
+
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `signalData` | `GlobalSignalData` | Signal or signal tensor; must be `int32_t` |
+| `cmpValue` | `int32_t` | Comparison threshold value |
+| `cmp` | `WaitCmp` | Comparison operator |
+| `WaitEvents...` | `RecordEvent...` | Events to wait on before entering the spin loop |
+
+## Expected Outputs
+
+None. This operation blocks until the condition is satisfied and then returns.
+
+## Side Effects
+
+This operation may block indefinitely if the signal never satisfies the condition. No architectural state is modified.
+
 ## Constraints
 
-- **Type constraints**:
-    - `GlobalSignalData::DType` must be `int32_t` (32-bit signal).
-- **Memory constraints**:
-    - `signalData` must point to local address (on current NPU).
-- **Shape semantics**:
-    - For single signal: Shape is `<1,1,1,1,1>`.
-    - For signal tensor: Shape determines the multi-dimensional region (up to 5-D) to wait on. All signals in the tensor must satisfy the condition.
-- **Comparison operators** (WaitCmp):
-  | Value | Condition |
-  |-------|-----------|
-  | `EQ` | `signal == cmpValue` |
-  | `NE` | `signal != cmpValue` |
-  | `GT` | `signal > cmpValue` |
-  | `GE` | `signal >= cmpValue` |
-  | `LT` | `signal < cmpValue` |
-  | `LE` | `signal <= cmpValue` |
+### Type constraints
+
+- `GlobalSignalData::DType` must be `int32_t`.
+
+### Memory constraints
+
+- `signalData` must point to a local address.
+
+### Shape semantics
+
+- For a single signal, the effective shape is `<1,1,1,1,1>`.
+- For a signal tensor, the shape determines the multi-dimensional region to wait on; all elements must satisfy the condition.
+
+## Target-Profile Restrictions
+
+- `TWAIT` is supported on A2/A3 and A5 profiles. CPU simulation may not implement blocking wait semantics.
+- Use `TNOTIFY` on the producer side to signal when data is ready.
 
 ## Examples
 
-### Wait for Single Signal
+### Wait for single signal
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
@@ -69,63 +99,54 @@ using namespace pto;
 
 void wait_for_ready(__gm__ int32_t* local_signal) {
     comm::Signal sig(local_signal);
-
-    // Wait until signal == 1
     comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
 }
 ```
 
-### Wait for Signal Matrix
+### Wait for signal matrix
 
 ```cpp
 #include <pto/comm/pto_comm_inst.hpp>
 
 using namespace pto;
 
-// Wait for signals from a 4x8 dense grid of workers
 void wait_worker_grid(__gm__ int32_t* signal_matrix) {
     comm::Signal2D<4, 8> grid(signal_matrix);
-
-    // Wait until all 32 signals == 1
+    // Waits until all 32 signals == 1
     comm::TWAIT(grid, 1, comm::WaitCmp::EQ);
 }
 ```
 
-### Wait for Counter Threshold
+### Wait for counter threshold
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
 void wait_for_count(__gm__ int32_t* local_counter, int expected_count) {
     comm::Signal counter(local_counter);
-
-    // Wait until counter >= expected_count
     comm::TWAIT(counter, expected_count, comm::WaitCmp::GE);
 }
 ```
 
-### Producer-Consumer Pattern
+### Producer-consumer pattern
 
 ```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// Producer: notify when data is ready
+// Producer: signals when data is ready
 void producer(__gm__ int32_t* remote_flag) {
     // ... produce data ...
-
     comm::Signal flag(remote_flag);
     comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
 }
 
-// Consumer: wait for data
+// Consumer: blocks until data is signaled
 void consumer(__gm__ int32_t* local_flag) {
     comm::Signal flag(local_flag);
     comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
-
     // ... consume data ...
 }
 ```
+
+## Related Ops / Instruction Set Links
+
+- Communication overview: [Communication and Runtime](../other/communication-and-runtime.md)
+- Signal producer: [TNOTIFY](./TNOTIFY.md)
+- Non-blocking counterpart: [TTEST](./TTEST.md)
+- Instruction set: [Other and Communication](../other/README.md)
diff --git a/docs/isa/comm/TWAIT_zh.md b/docs/isa/comm/TWAIT_zh.md
index f37e6fa4..39fb8205 100644
--- a/docs/isa/comm/TWAIT_zh.md
+++ b/docs/isa/comm/TWAIT_zh.md
@@ -1,26 +1,28 @@
 # TWAIT
 
-## 简介
+`TWAIT` 是[通信与运行时](../other/communication-and-runtime_zh.md)指令集的一部分。
 
-`TWAIT` 是阻塞等待原语：在本地信号满足比较条件之前一直等待。它通常与 `TNOTIFY` 配合使用，实现基于标志的同步。
+## 概述
 
-既支持单个信号，也支持最多 5 维的信号 tensor。对 tensor 形式，要求所有元素都满足比较条件后才结束等待。
+阻塞等待原语：在本地信号满足比较条件之前一直等待。常与 `TNOTIFY` 配合使用，实现基于标志的生产者-消费者同步。支持单个标量信号和最多 5 维的信号 tensor。对 tensor 形式，要求所有元素都满足比较条件后才结束等待。
 
-## 数学语义
+`TWAIT` 是阻塞调用：不满足条件时会一直等待。如需非阻塞轮询，请使用 `TTEST`。
 
-单个信号：
+## 机制
 
-$$ \mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
+`TWAIT` 轮询信号条件直到满足。对单个信号：
 
-信号 tensor：
+$$ \text{wait until}\ \mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
+
+对信号 tensor（所有元素必须同时满足）：
 
 $$ \forall d_0, d_1, d_2, d_3, d_4:\ \mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
 
-其中 `cmp ∈ {EQ, NE, GT, GE, LT, LE}`。
+其中 `cmp` 为 `EQ`、`NE`、`GT`、`GE`、`LT`、`LE` 之一。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：
+### PTO 汇编形式
 
 ```text
 twait %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32)
@@ -36,23 +38,48 @@ template <typename GlobalSignalData, typename... WaitEvents>
 PTO_INST void TWAIT(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 类型 | 说明 |
+|--------|------|------|
+| `signalData` | `GlobalSignalData` | 信号或信号 tensor；必须为 `int32_t` |
+| `cmpValue` | `int32_t` | 比较阈值 |
+| `cmp` | `WaitCmp` | 比较运算符 |
+| `WaitEvents...` | `RecordEvent...` | 进入轮询前要等待的事件 |
+
+## 预期输出
+
+无。此操作阻塞直到条件满足，然后返回。
+
+## 副作用
+
+此操作可能无限阻塞（如果信号永不满足条件）。不修改架构状态。
+
 ## 约束
 
+### 类型约束
+
 - `GlobalSignalData::DType` 必须为 `int32_t`
-- `signalData` 必须指向本地地址（当前 NPU）
-- 单个信号的形状为 `<1,1,1,1,1>`
-- tensor 形式由其 shape 决定等待区域，并要求所有元素满足条件
+
+### 内存约束
+
+- `signalData` 必须指向本地地址
 
 ### 比较运算符
 
-| 值 | 条件 |
-| --- | --- |
-| `EQ` | `signal == cmpValue` |
-| `NE` | `signal != cmpValue` |
-| `GT` | `signal > cmpValue` |
-| `GE` | `signal >= cmpValue` |
-| `LT` | `signal < cmpValue` |
-| `LE` | `signal <= cmpValue` |
+| 值 | 条件 |
+|---|------|
+| `WaitCmp::EQ` | `signal == cmpValue` |
+| `WaitCmp::NE` | `signal != cmpValue` |
+| `WaitCmp::GT` | `signal > cmpValue` |
+| `WaitCmp::GE` | `signal >= cmpValue` |
+| `WaitCmp::LT` | `signal < cmpValue` |
+| `WaitCmp::LE` | `signal <= cmpValue` |
+
+## 目标 Profile 限制
+
+- `TWAIT` 支持 A2/A3 和 A5。CPU 模拟器可能不实现阻塞等待语义。
+- 生产者侧使用 `TNOTIFY` 更新信号。
 
 ## 示例
 
@@ -65,6 +92,16 @@ void wait_for_ready(__gm__ int32_t* local_signal) {
 }
 ```
 
+### 等待信号矩阵
+
+```cpp
+void wait_worker_grid(__gm__ int32_t* signal_matrix) {
+    comm::Signal2D<4, 8> grid(signal_matrix);
+    // 等待所有 32 个信号 == 1
+    comm::TWAIT(grid, 1, comm::WaitCmp::EQ);
+}
+```
+
 ### 等待计数器达到阈值
 
 ```cpp
@@ -74,16 +111,26 @@ void wait_for_count(__gm__ int32_t* local_counter, int expected_count) {
 }
 ```
 
-### 与 TNOTIFY 配合
+### 生产者-消费者模式
 
 ```cpp
+// 生产者：数据就绪后发信号
 void producer(__gm__ int32_t* remote_flag) {
     comm::Signal flag(remote_flag);
     comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
 }
 
+// 消费者：阻塞等待数据
 void consumer(__gm__ int32_t* local_flag) {
     comm::Signal flag(local_flag);
     comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
+    // ... 处理数据 ...
 }
 ```
+
+## 相关页面
+
+- 通信概述：[通信与运行时](../other/communication-and-runtime_zh.md)
+- 信号发送：[TNOTIFY](./TNOTIFY_zh.md)
+- 非阻塞轮询：[TTEST](./TTEST_zh.md)
+- 指令集：[其他与通信](../other/README_zh.md)
diff --git a/docs/isa/instruction-families/README.md b/docs/isa/instruction-families/README.md
index 3292d793..549ce178 100644
--- a/docs/isa/instruction-families/README.md
+++ b/docs/isa/instruction-families/README.md
@@ -1,21 +1,59 @@
 # Instruction Set Contracts
 
-Instruction set pages describe shared contracts that apply across related PTO operations. They sit between the model chapters and the per-op reference pages. For how individual opcode pages are structured, see [format of instruction descriptions](../reference/format-of-instruction-descriptions.md).
+Instruction set pages describe shared contracts that apply across related PTO operations. They sit between the model chapters and the per-op reference pages.
 
-## Overview
+## Four Named Instruction Sets
 
-PTO ISA groups its instructions into four named instruction sets:
+| Instruction Set | Prefix | Families | Description |
+|----------------|--------|----------|-------------|
+| [Tile Instruction Set](./tile-families.md) | `pto.t*` | 8 | Tile-oriented compute, data movement, layout operations |
+| [Vector Instruction Set](./vector-families.md) | `pto.v*` | 9 | Micro-instructions for vector pipeline execution |
+| [Scalar and Control Instruction Set](./scalar-and-control-families.md) | `pto.*` | 6 | Configuration, synchronization, DMA, predicate operations |
+| [Other Instruction Set](./other-families.md) | `pto.*` | 2 | Collective communication and runtime support |
 
-| Instruction Set | Prefix | Pipeline | Description |
-|-----------------|--------|----------|-------------|
-| [Tile Instruction Set](./tile-families.md) | `pto.t*` | Tile | Primary tile-oriented compute, data movement, layout operations |
-| [Vector Instruction Set](./vector-families.md) | `pto.v*` | Vector | Micro-instructions for vector pipeline execution |
-| [Scalar And Control Instruction Set](./scalar-and-control-families.md) | `pto.*` | Scalar/Control | Configuration, synchronization, DMA, predicate operations |
-| [Other Instruction Set](./other-families.md) | `pto.*` | Communication | Collective communication and runtime support |
+## Navigation Map
+
+```
+Tile Instruction Set
+├── Sync and Config             → tassign, tsync, tsetf32mode, tsetfmatrix, tset_img2col_*, tsubview
+├── Elementwise Tile-Tile      → tadd, tsub, tmul, tdiv, tmin, tmax, tcmp, tcvt, tsel, etc.
+├── Tile-Scalar and Immediate  → tadds, tsubs, tmuls, tdivs, tcmps, tsels, texpands, etc.
+├── Reduce and Expand          → trowsum, tcolmax, trowexpand, tcolexpand, etc.
+├── Memory and Data Movement   → tload, tprefetch, tstore, tstore_fp, mgather, mscatter
+├── Matrix and Matrix-Vector    → tgemv, tgemv_mx, tmatmul, tmatmul_acc, tmatmul_bias, etc.
+├── Layout and Rearrangement   → tmov, ttrans, textract, tinsert, timg2col, tfillpad, etc.
+└── Irregular and Complex     → tprint, tsort32, tgather, tscatter, tquant, etc.
+
+Vector Instruction Set
+├── Vector Load Store           → vlds, vldas, vgather2, vsld, vsst, vscatter, etc.
+├── Predicate and Materialization → vbr, vdup
+├── Unary Vector Ops            → vabs, vneg, vexp, vln, vsqrt, vrec, vrelu, etc.
+├── Binary Vector Ops            → vadd, vsub, vmul, vdiv, vmax, vmin, vand, vor, etc.
+├── Vector-Scalar Ops           → vadds, vsubs, vmuls, vshls, vlrelu, etc.
+├── Conversion Ops               → vci, vcvt, vtrc
+├── Reduction Ops               → vcadd, vcmax, vcmin, vcgadd, vcgmax, etc.
+├── Compare and Select          → vcmp, vcmps, vsel, vselr, vselrv2
+├── Data Rearrangement          → vintlv, vslide, vshift, vpack, vzunpack, etc.
+└── SFU and DSA                → vprelu, vexpdiff, vaxpy, vtranspose, vsort32, etc.
+
+Scalar and Control Instruction Set
+├── Control and Configuration   → nop, barrier, yield; tsetf32mode, tsetfmatrix
+├── Pipeline Sync              → set_flag, wait_flag, pipe_barrier, mem_bar, get_buf
+├── DMA Copy                  → copy_gm_to_ubuf, copy_ubuf_to_gm, copy_ubuf_to_ubuf
+├── Predicate Load Store       → pld, plds, psts, pst, pstu
+├── Predicate Generation        → pset_b8/b16/b32, pge_b8/b16/b32, plt_b8/b16/b32
+│                               → pand, por, pxor, pnot, psel, ppack, punpack
+├── Shared Arithmetic          → Scalar arithmetic shared across instruction sets
+└── Shared SCF                → Scalar structured control flow
+
+Other Instruction Set
+├── Collective Ops              → tbroadcast, tget, tput, tgather, tscatter, treduce, tnotify, ttest, twait
+└── Non-ISA Supporting Ops     → talias, taxpy, tconcat, tdequant, tfree, thistogram, tpack, tpop, tpush, trandom
+```
 
-## What An Instruction Set Contract Must State
+## What an Instruction Set Contract Must State
 
-Each instruction set page provides the following:
+Each instruction set page provides:
 
 1. **Mechanism** — What the instruction set is for, explained in one short section.
 2. **Shared operand model** — Common input/output roles and how they interact.
@@ -25,54 +63,12 @@ Each instruction set page provides the following:
 6. **Target-profile narrowing** — Where A2/A3 and A5 differ in what the set accepts.
 7. **Operation list** — Pointers to each per-op page under `ops/`.
 
-Instruction set pages do not repeat per-op details; they set the contract for the group.
-
-## Navigation Map
-
-```
-Instruction Sets
-├── Tile Instruction Set
-│   ├── Sync and Config            → pto.tassign, pto.tsync, pto.tsettf32mode, pto.tset_img2col_*, etc.
-│   ├── Elementwise Tile-Tile      → pto.tadd, pto.tmul, pto.tcmp, pto.tcvt, pto.tsel, etc.
-│   ├── Tile-Scalar and Immediate  → pto.tadds, pto.tmuls, pto.tmins, pto.texpands, etc.
-│   ├── Reduce and Expand          → pto.trowsum, pto.tcolmax, pto.trowexpand, pto.tcolexpand, etc.
-│   ├── Memory and Data Movement   → pto.tload, pto.tstore, pto.tstore_fp, pto.mgather, pto.mscatter
-│   ├── Matrix and Matrix-Vector    → pto.tgemv, pto.tgemv_mx, pto.tmatmul, pto.tmatmul_acc, pto.tmatmul_bias, etc.
-│   ├── Layout and Rearrangement   → pto.tmov, pto.ttrans, pto.textract, pto.tinsert, pto.timg2col, etc.
-│   └── Irregular and Complex      → pto.tmrgsort, pto.tsort32, pto.tquant, pto.tprint, pto.tci, pto.ttri, etc.
-│
-├── Vector Instruction Set
-│   ├── Vector Load Store          → pto.vlds, pto.vldas, pto.vgather2, pto.vsld, pto.vsst, pto.vscatter, etc.
-│   ├── Predicate and Materialization → pto.vbr, pto.vdup
-│   ├── Unary Vector Instructions          → pto.vabs, pto.vneg, pto.vexp, pto.vsqrt, pto.vrec, pto.vrelu, pto.vnot, etc.
-│   ├── Binary Vector Instructions          → pto.vadd, pto.vsub, pto.vmul, pto.vmax, pto.vmin, pto.vand, pto.vor, etc.
-│   ├── Vector-Scalar Instructions            → pto.vadds, pto.vmuls, pto.vshls, pto.vlrelu, etc.
-│   ├── Conversion Ops             → pto.vci, pto.vcvt, pto.vtrc
-│   ├── Reduction Instructions              → pto.vcadd, pto.vcmax, pto.vcmin, pto.vcgadd, pto.vcgmax, pto.vcpadd, etc.
-│   ├── Compare and Select         → pto.vcmp, pto.vcmps, pto.vsel, pto.vselr, pto.vselrv2
-│   ├── Data Rearrangement         → pto.vintlv, pto.vdintlv, pto.vslide, pto.vshift, pto.vpack, pto.vzunpack, etc.
-│   └── SFU and DSA Instructions      → pto.vprelu, pto.vexpdiff, pto.vaxpy, pto.vtranspose, pto.vsort32, etc.
-│
-├── Scalar And Control Instruction Set
-│   ├── Control and Configuration  → pto.nop, pto.barrier, pto.yield, legacy mode/config ops such as pto.tsethf32mode and pto.tsetfmatrix
-│   ├── Pipeline Sync             → pto.set_flag, pto.wait_flag, pto.pipe_barrier, pto.mem_bar, etc.
-│   ├── DMA Copy                  → pto.copy_gm_to_ubuf, pto.copy_ubuf_to_gm, pto.copy_ubuf_to_ubuf, etc.
-│   ├── Predicate Load Store       → pto.pld, pto.plds, pto.pldi, pto.pst, pto.psts, pto.psti, pto.pstu
-│   ├── Predicate Generation       → pto.pset_b8, pto.pge_b8, pto.plt_b8, pto.pand, pto.por, pto.pxor, pto.pnot, etc.
-│   ├── Shared Arithmetic          → Scalar arithmetic ops shared across instruction sets
-│   └── Shared SCF               → Scalar structured control flow
-│
-└── Other Instruction Set
-    ├── Communication and Runtime  → pto.tbroadcast, pto.tget, pto.tput, pto.treduce, pto.tscatter, pto.tgather, pto.tnotify, pto.ttest, pto.twait, etc.
-    └── Non-ISA Supporting Ops    → pto.talias, pto.tconcat, pto.tfree, pto.tquant, pto.tdequant, pto.tpack, pto.thistogram, pto.tpop, pto.tpush, pto.trandom, etc.
-```
-
 ## Normative Language
 
 Instruction set pages use **MUST**, **SHOULD**, and **MAY** only for rules that a test, verifier, or review can check. Prefer plain language for explanation.
 
 ## See Also
 
-- [Instruction overview](../instruction-surfaces/README.md) — High-level instruction-set descriptions
+- [Instruction overview](../instruction-surfaces/README.md) — High-level map of all four instruction sets
 - [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page format standard
 - [Diagnostics and illegal cases](../reference/diagnostics-and-illegal-cases.md) — What makes a PTO program illegal
diff --git a/docs/isa/instruction-families/README_zh.md b/docs/isa/instruction-families/README_zh.md
index c8622939..9fa5cdf8 100644
--- a/docs/isa/instruction-families/README_zh.md
+++ b/docs/isa/instruction-families/README_zh.md
@@ -1,19 +1,62 @@
 # 指令族
 
-本章描述 PTO ISA 的指令族（Instruction Set）——共享约束和行为的指令分组。每个族定义了该族所有指令共同遵循的规则。
+本章描述 PTO ISA 的**指令族**（Instruction Set）——共享约束和行为的指令分组。每个族定义了该族所有指令共同遵循的规则，是 per-op 页面之上的抽象层。
 
-## 本章内容
+## 四类指令集
 
-- [指令族总览](README_zh.md) — 完整导航地图和族规范模板
-- [Tile 指令族](tile-families_zh.md) — Tile 指令集下的 8 个指令族（逐元素、归约、布局等）
-- [Vector 指令族](vector-families_zh.md) — Vector 指令集下的 9 个指令族
-- [标量与控制指令族](scalar-and-control-families_zh.md) — 标量、控制和配置的 6 个指令族
-- [其他指令族](other-families_zh.md) — 通信和其他支持指令族
+| 指令集 | 前缀 | 指令族数 | 说明 |
+|--------|------|----------|------|
+| [Tile 指令族](./tile-families_zh.md) | `pto.t*` | 8 | 逐元素、归约、布局、矩阵乘、数据搬运等 |
+| [Vector 指令族](./vector-families_zh.md) | `pto.v*` | 9 | 向量加载存储、一元/二元向量、归约、SFU 等 |
+| [标量与控制指令族](./scalar-and-control-families_zh.md) | `pto.*` | 6 | 同步、DMA、谓词、标量算术等 |
+| [其他指令族](./other-families_zh.md) | `pto.*` | 2 | 通信与支撑操作 |
 
 ## 指令集与指令族的关系
 
-- **指令集（Instruction Set）** 按功能角色分类指令（Tile / Vector / Scalar&Control / Other）
-- **族（Instruction Set）** 共享约束、行为模式和规范语言；同一族的指令共享家族概览页中的共同约束
+| 层级 | 定义 |
+|------|------|
+| **指令集（Instruction Set）** | 按功能角色（Tile / Vector / Scalar&Control / Other）分类指令 |
+| **族（Family）** | 同一族内共享约束、行为模式和规范语言 |
+
+## 导航地图
+
+```
+Tile 指令族
+├── 同步与配置             → tassign、tsync、tsetf32mode、tsetfmatrix、tset_img2col_*、tsubview
+├── 逐元素 Tile-Tile       → tadd、tsub、tmul、tdiv、tmin、tmax、tcmp、tcvt、tsel 等
+├── Tile-标量与立即数      → tadds、tsubs、tmuls、tdivs、tcmps、tsels 等
+├── 归约与扩展            → trowsum、tcolmax、trowexpand、tcolexpand 等
+├── 内存与数据搬运         → tload、tprefetch、tstore、tstore_fp、mgather、mscatter
+├── 矩阵与矩阵-向量        → tgemv、tgemv_mx、tmatmul、tmatmul_acc、tmatmul_bias 等
+├── 布局与重排            → tmov、ttrans、textract、tinsert、timg2col、tfillpad 等
+└── 非常规与复杂操作       → tprint、tsort32、tgather、tscatter、tquant 等
+
+Vector 指令族
+├── 向量加载存储             → vlds、vldas、vgather2、vsld、vsst、vscatter 等
+├── 谓词与物化              → vbr、vdup
+├── 一元向量运算            → vabs、vneg、vexp、vln、vsqrt、vrec、vrelu 等
+├── 二元向量运算            → vadd、vsub、vmul、vdiv、vmax、vmin、vand、vor 等
+├── 向量-标量运算           → vadds、vsubs、vmuls、vshls、vlrelu 等
+├── 类型转换                → vci、vcvt、vtrc
+├── 归约指令                → vcadd、vcmax、vcmin、vcgadd、vcgmax 等
+├── 比较与选择              → vcmp、vcmps、vsel、vselr、vselrv2
+├── 数据重排                → vintlv、vslide、vshift、vpack、vzunpack 等
+└── SFU 与 DSA             → vprelu、vexpdiff、vaxpy、vtranspose、vsort32 等
+
+标量与控制指令族
+├── 控制与配置              → nop、barrier、yield；tsetf32mode、tsetfmatrix
+├── 流水线同步             → set_flag、wait_flag、pipe_barrier、mem_bar、get_buf
+├── DMA 拷贝               → copy_gm_to_ubuf、copy_ubuf_to_gm、copy_ubuf_to_ubuf
+├── 谓词加载存储            → pld、plds、psts、pst、pstu
+├── 谓词生成                → pset_b8/b16/b32、pge_b8/b16/b32、plt_b8/b16/b32
+│                             → pand、por、pxor、pnot、psel、ppack、punpack
+├── 共享标量算术            → 跨指令集共享的标量算术运算
+└── 共享结构化控制流        → 标量结构化控制流
+
+其他指令族
+├── 集体操作                → tbroadcast、tget、tput、tgather、tscatter、treduce、tnotify、ttest、twait
+└── 非 ISA 支撑操作          → talias、taxpy、tconcat、tdequant、tfree、thistogram、tpack、tpop、tpush、trandom
+```
 
 ## 每个族必须定义的内容
 
@@ -21,10 +64,18 @@
 2. **Shared Operand Model** — 共同的操作数模型和交互方式
 3. **Common 副作用** — 所有族内操作共享的副作用
 4. **Shared Constraints** — 适用于全族的合法性规则
-5. **Cases That Are Not Allowed** — 全族禁止的条件
+5. **Cases That Are Not Allowed** — 全族禁止的条件
 6. **Target-Profile Narrowing** — A2/A3 和 A5 的差异
 7. **Operation List** — 指向各 per-op 页面的链接
 
 ## 章节定位
 
 本章属于手册第 7 章（指令集）的一部分。族文档是 per-op 页面的上一层抽象，同一族的指令共享家族概览页中的共同约束。
+
+## 相关页面
+
+- [指令集总览](../instruction-surfaces/README_zh.md) — 四类指令集总览与数据流关系
+- [Tile 参考](../tile/README_zh.md) — Tile 指令逐条参考
+- [Vector 参考](../vector/README_zh.md) — 向量指令逐条参考
+- [标量与控制参考](../scalar/README_zh.md) — 标量与控制指令逐条参考
+- [其他与通信参考](../other/README_zh.md) — 通信与支撑操作逐条参考
diff --git a/docs/isa/instruction-surfaces/README.md b/docs/isa/instruction-surfaces/README.md
index be1f25a9..9eea7641 100644
--- a/docs/isa/instruction-surfaces/README.md
+++ b/docs/isa/instruction-surfaces/README.md
@@ -4,116 +4,73 @@ PTO ISA is organized into four instruction sets, each representing a distinct me
 
 ## Overview
 
-| Instruction Set | Prefix | Pipeline | Primary Role | Operands |
-|-----------------|--------|----------|-------------|----------|
+| Instruction Set | Prefix | Pipeline | Primary Role | Typical Operands |
+|----------------|--------|----------|-------------|-----------------|
 | [Tile Instructions](./tile-instructions.md) | `pto.t*` | All (via tile buffers) | Tile-oriented compute, data movement, layout transforms, synchronization | `!pto.tile<...>`, `!pto.tile_buf<...>`, `!pto.partition_tensor_view<...>` |
 | [Vector Instructions](./vector-instructions.md) | `pto.v*` | Vector Pipe (V) | Vector micro-instructions: lane-level compute, masking, alignment state | `!pto.vreg<NxT>`, `!pto.mask`, `!pto.ptr<T, ub>` |
-| [Scalar And Control](./scalar-and-control-instructions.md) | `pto.*` | Scalar Unit, DMA | Configuration, control flow, DMA setup, synchronization, predicates | Scalar regs, pipe ids, event ids, buffer ids |
+| [Scalar and Control](./scalar-and-control-instructions.md) | `pto.*` | Scalar Unit, DMA | Configuration, control flow, DMA setup, synchronization, predicates | Scalar regs, pipe ids, event ids, buffer ids |
 | [Other Instructions](./other-instructions.md) | `pto.*` | Inter-NPU | Collective communication, runtime support, tile sequence operations | `!pto.group<N>`, tile sequences, allocation handles |
 
-## Why These Instruction Sets Exist
+## Why Four Instruction Sets
 
-PTO has four instruction sets because different parts of the architecture expose different kinds of state. Mixing tile-level and vector-level state in one opcode space would blur the ISA contract.
+PTO is not a flat list of opcodes; it is layered by architecturally visible state. Tile, vector, scalar/control, and communication instructions each expose different kinds of state, and mixing them into one flat layer would blur the ISA contract.
 
-### Tile Instructions (`pto.t*`)
-
-Tile instructions reason about tiles: bounded multi-dimensional arrays with architecturally visible shape, layout, role, and valid-region metadata. The primary operands are tile registers (`!pto.tile<T, R, C>` or `!pto.tile_buf<...>`). Tile instructions produce destination tiles, change valid-region interpretations, or establish synchronization edges.
-
-```
-Input:   Tile operands, scalar modifiers, GlobalTensor views
-Output:  Tile payload, synchronization edges
-Domain:  Valid regions, tile layouts, tile shapes, location intents
-```
-
-### Vector Instructions (`pto.v*`)
-
-Vector instructions expose the vector pipeline directly. Operands are vector registers (`!pto.vreg<NxT>`), scalar values, and predicate masks. Vector instructions are the fine-grained compute layer beneath tile instructions. The full register width is always meaningful — there is no valid-region abstraction at the vector level.
-
-```
-Input:   Vector registers, scalar registers, predicates, memory addresses
-Output:  Vector registers, scalar registers, memory writes
-Domain:  Vector length N, lane masks, alignment state, distribution modes
-```
-
-### Scalar And Control Instructions (`pto.*`)
-
-Scalar/control instructions handle configuration, control flow, synchronization, DMA setup, and predicate state. They set up the execution shell around tile and vector payload regions. Most do not produce tile or vector payloads; they produce control effects, event tokens, or predicate masks.
-
-```
-Input:   Scalar registers, pipe ids, event ids, buffer ids, DMA loop parameters
-Output:  Control state, event tokens, predicate masks, configured DMA state
-Domain:  Configuration tuples, pipe/event spaces, DMA loop sizes and strides
-```
-
-### Other Instructions (`pto.*`)
-
-Communication and supporting operations carry their own side effects and ordering rules that do not fit into the tile/vector/scalar model. Examples include collective broadcasts across NPUs and alias/concatenation operations on tile sequences.
-
-```
-Input:   Collective groups, tile sequences, allocation handles
-Output:  Collective results, modified tile sequences, allocation state
-Domain:  Parallel groups, tile sequences, memory allocation
-```
+| Instruction Set | Core Abstraction | Primary Responsibilities |
+|----------------|-----------------|-------------------------|
+| Tile (`pto.t*`) | tile: architecturally visible objects with shape, layout, role, valid region | GM↔tile movement, elementwise/reduce/layout/matmul ops, sync edges |
+| Vector (`pto.v*`) | vreg, predicates, vector-visible UB | Vector register ops, lane-level masking, UB↔vreg movement |
+| Scalar/Control (`pto.*`) | Scalar regs, pipe/event ids, buffer ids | Sync edges, DMA config, predicate construction, control flow |
+| Other (`pto.*`) | Collective groups, tile sequences, allocation handles | Collective comm, tile sequence ops, memory management |
 
 ## Instruction Data Flow
 
 The four instruction sets form a layered execution model:
 
 ```
-┌─────────────────────────────────────────────────────────────┐
-│  GM (off-chip device memory)                                │
-└──────────┬──────────────────────────────────────┬───────────┘
-           │                                      │
-           │  Tile Instructions: TLOAD/TSTORE          │
-           │  Vector Instructions: copy_gm_to_ubuf / copy_ubuf_to_gm
-           ▼                                      ▼
-┌─────────────────────────────────────────────────────────────┐
-│  Unified Buffer (UB, 256 KB on-chip)                      │
-│  !pto.ptr<T, ub> — shared staging area                    │
-└──────┬──────────────────────────────────────────┬──────────┘
-       │                                      │
-       │  Tile Instructions: implicit tile↔UB       │
-       │  Vector Instructions: vlds / vsts          │
-       ▼                                      ▼
+GM (off-chip device memory)
+        │
+        ├── Tile instructions: TLOAD / TSTORE
+        └── Vector path: copy_gm_to_ubuf / copy_ubuf_to_gm
+        ▼
+Vector tile buffer (implemented in hardware as UB)
+        │
+        ├── Tile instructions: direct read/write to tile buffer
+        └── Vector instructions: vlds / vsts
+        ▼
 ┌─────────────────┐              ┌─────────────────────────────┐
-│  Tile Buffers   │              │  Vector Registers          │
-│  !pto.tile_buf  │              │  !pto.vreg<NxT>           │
-│  (Vec/Mat/Acc/  │              │  (N lanes)                │
-│   Left/Right)   │              │                           │
+│  Tile Buffers   │              │  Vector Registers           │
+│  (Vec/Mat/Acc/  │              │  !pto.vreg<NxT>            │
+│   Left/Right)   │              │                             │
 └────────┬─────────┘              └──────────────┬────────────┘
-         │                                     │
-         │  Tile Instructions: pto.t*                       │  Vector Instructions: pto.v*
-         │  (TMATMUL via Mat/Acc slots)       │  (vadd, vmul, etc.)
-         │                                     │
-         │  ◄── Matrix Multiply Unit (M)       │  ◄── Vector Pipeline (V)
+         │                                       │
+         │  Tile instructions: pto.t*         │  Vector instructions: pto.v*
+         │  (TMATMUL via Mat/Left/Right/Acc)  │  (vadd, vmul, vcmp, ...)
+         │                                       │
+         │  ◄── Matrix Multiply Unit            │  ◄── Vector Pipeline
          └─────────────────────────────────────┘
                        │
-                       │  Tile Instructions: TSTORE
-                       │  Vector Instructions: vsts → copy_ubuf_to_gm
                        ▼
-         [vector tile buffer → GM]
+              [tile buffer → GM]
 ```
 
 ## Instruction Count Summary
 
-| Instruction Set | Groups | Operations | Notes |
-|-----------------|--------|------------|-------|
-| Tile | 8 | ~120 | Full matmul, elementwise, reduce, layout |
+| Instruction Set | Families | Operations | Notes |
+|----------------|----------|------------|-------|
+| Tile | 8 | ~120 | Full matmul, elementwise, reduce, layout, data movement |
 | Vector | 9 | ~99 | Full vector compute, load/store, SFU |
-| Scalar/Control | 6 | ~60 | Sync, DMA, predicates |
+| Scalar/Control | 6 | ~60 | Sync, DMA, predicates, control |
 | Other/Communication | 2 | ~24 | Collective ops, supporting ops |
 
 ## Normative Language
 
-Instruction text always means what happens in the declared valid region unless the page explicitly defines behavior outside it. PTO is **tile-first** and **valid-region-first**.
-
-Use **MUST**, **SHOULD**, and **MAY** only for rules that a test, verifier, or review can check. Prefer plain language for explanation.
+Instruction set pages describe shared contracts for groups of operations — they do not repeat per-op details. Use **MUST / SHOULD / MAY** only for rules that a verifier, test, or review can check. Prefer plain language for explanation.
 
 ## See Also
 
-- [Instruction set contracts](../instruction-families/README.md) — Group-level contracts
+- [Instruction set contracts](../instruction-families/README.md) — Group-level contracts for all four sets
+- [Tile reference](../tile/README.md) — Tile instruction per-op reference
+- [Vector reference](../vector/README.md) — Vector instruction per-op reference
+- [Scalar reference](../scalar/README.md) — Scalar and control per-op reference
+- [Other reference](../other/README.md) — Communication and supporting ops
 - [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page format standard
-- [Tile ISA reference](../tile/README.md) — Tile instruction per-op reference
-- [Vector ISA reference](../vector/README.md) — Vector instruction per-op reference
-- [Scalar ISA reference](../scalar/README.md) — Scalar instruction per-op reference
-- [Other ISA reference](../other/README.md) — Communication and supporting ops
diff --git a/docs/isa/instruction-surfaces/README_zh.md b/docs/isa/instruction-surfaces/README_zh.md
index d95b4b98..a6d33015 100644
--- a/docs/isa/instruction-surfaces/README_zh.md
+++ b/docs/isa/instruction-surfaces/README_zh.md
@@ -1,6 +1,6 @@
 # 指令集总览
 
-PTO ISA 被组织成四类指令集。每一类指令集都对应不同的机制、不同的操作数域，以及不同的执行路径。在阅读单条指令页之前，先理解这一层的分工非常重要。
+PTO ISA 被组织成四类指令集，每类代表一种不同的机制、不同的操作数域和不同的执行路径。在阅读单条指令页之前，先理解这一层的分工非常重要。
 
 ## 总览
 
@@ -15,88 +15,42 @@ PTO ISA 被组织成四类指令集。每一类指令集都对应不同的机制
 
 PTO 不是把所有 opcode 塞进一个扁平列表里，而是按架构可见状态来分层。原因很直接：tile、vector、scalar/control、communication 各自暴露的是不同类型的状态，如果把它们混成一层，会让 ISA 契约变得模糊。
 
-### Tile 指令集（`pto.t*`）
-
-Tile 指令围绕 **tile** 建模。tile 是带 shape、layout、role、valid region 元数据的架构可见对象。它们的主要职责是：
-
-- 在 GM 和本地 tile buffer 之间搬运数据
-- 在 tile 这一层执行逐元素运算、归约、扩展、布局变换和 matmul
-- 建立 tile 级同步边
-
-```text
-输入：tile、标量修饰符、GlobalTensor 视图
-输出：tile payload、valid region 变化、同步边
-关注点：shape / layout / role / valid region / 目标 profile 收缩
-```
-
-### 向量指令集（`pto.v*`）
-
-向量指令直接暴露向量流水线。它们处理的是向量寄存器、谓词寄存器和向量 tile buffer（硬件实现为 UB），而不是 tile 级 valid region。
-
-```text
-输入：vreg、标量、谓词、UB 指针、对齐状态
-输出：vreg、谓词、UB 写回
-关注点：lane、mask、对齐状态、distribution mode、目标 profile 限制
-```
-
-### 标量与控制指令集（`pto.*`）
-
-标量与控制指令不直接产生 tile 或向量 payload。它们负责建立执行外壳：
-
-- 同步与 producer-consumer 顺序
-- DMA 配置与启动
-- 谓词构造与谓词搬运
-- 标量控制流与控制状态
-
-```text
-输入：标量值、pipe/event id、buffer id、DMA 参数
-输出：控制状态、事件边、谓词、DMA 配置
-关注点：顺序、配置、控制、可见状态
-```
-
-### 其他指令集（`pto.*`）
-
-这一类用于放不能自然归入 tile / vector / scalar-control 的内容，例如：
-
-- 通信与运行时
-- 非 ISA 但仍与手册主线相关的支撑操作
-- tile 序列、分配句柄等辅助结构
+| 指令集 | 核心抽象 | 主要职责 |
+|--------|----------|----------|
+| Tile（`pto.t*`） | tile：带 shape、layout、role、valid region 的架构可见对象 | GM ↔ tile 搬运、逐元素/归约/布局/matmul 运算、同步 |
+| 向量（`pto.v*`） | vreg、谓词、向量可见 UB | 向量寄存器操作、lane 级 mask、UB ↔ vreg 搬运 |
+| 标量与控制（`pto.*`） | 标量寄存器、pipe/event id、buffer id | 同步边、DMA 配置、谓词构造、控制流 |
+| 其他（`pto.*`） | 集体组、tile 序列、分配句柄 | 集体通信、tile 序列操作、内存管理 |
 
 ## 指令级数据流关系
 
 四类指令集共同组成 PTO 的执行层次：
 
 ```text
-┌─────────────────────────────────────────────────────────────┐
-│  GM（片外全局内存）                                         │
-└──────────┬──────────────────────────────────────┬───────────┘
-           │                                      │
-           │  Tile 指令：TLOAD / TSTORE                │
-           │  Vector 路径：copy_gm_to_ubuf / copy_ubuf_to_gm
-           ▼                                      ▼
-┌─────────────────────────────────────────────────────────────┐
-│  本地 tile buffer                                           │
-│  其中 Vec tile buffer 的硬件实现就是 UB                      │
-└──────┬──────────────────────────────────────────┬──────────┘
-       │                                      │
-       │  Tile 指令：直接读写 tile buffer           │
-       │  Vector 指令：vlds / vsts                 │
-       ▼                                      ▼
+GM（片外全局内存）
+        │
+        ├── Tile 指令：TLOAD / TSTORE
+        └── Vector 路径：copy_gm_to_ubuf / copy_ubuf_to_gm
+        ▼
+向量 tile buffer（硬件实现为 UB）
+        │
+        ├── Tile 指令：直接读写 tile buffer
+        └── Vector 指令：vlds / vsts
+        ▼
 ┌─────────────────┐              ┌─────────────────────────────┐
 │  Tile Buffers   │              │  Vector Registers           │
-│  !pto.tile_buf  │              │  !pto.vreg<NxT>            │
-│  (Vec/Mat/Acc/  │              │                             │
-│   Left/Right)   │              │                             │
+│  (Vec/Mat/Acc/  │              │  !pto.vreg<NxT>            │
+│   Left/Right)   │              │                             │
 └────────┬─────────┘              └──────────────┬────────────┘
-         │                                     │
-         │  Tile 指令：pto.t*                         │  向量指令：pto.v*
-         │  (TMATMUL 通过 Mat / Left / Right / Acc) │  (vadd, vmul, vcmp, ...)
-         │                                     │
-         │  ◄── Matrix Multiply Unit           │  ◄── Vector Pipeline
+         │                                       │
+         │  Tile 指令：pto.t*                  │  向量指令：pto.v*
+         │  (TMATMUL 通过 Mat/Left/Right/Acc)  │  (vadd, vmul, vcmp, ...)
+         │                                       │
+         │  ◄── Matrix Multiply Unit            │  ◄── Vector Pipeline
          └─────────────────────────────────────┘
                        │
                        ▼
-                [本地 tile buffer → GM]
+              [tile buffer → GM]
 ```
 
 ## 指令数量摘要
@@ -110,13 +64,13 @@ Tile 指令围绕 **tile** 建模。tile 是带 shape、layout、role、valid re
 
 ## 规范语言
 
-指令集页描述的是“这一组操作共同遵守什么契约”，不是逐条重复 opcode 说明。文中使用 **MUST / SHOULD / MAY** 时，应只用于 verifier、测试或 review 能够检查的规则；解释性内容应尽量用自然语言而不是模板句。
+指令集页描述的是"这一组操作共同遵守什么契约"，不是逐条重复 opcode 说明。文中使用 **MUST / SHOULD / MAY** 时，应只用于 verifier、测试或 review 能够检查的规则；解释性内容应尽量用自然语言。
 
 ## 相关页面
 
-- [指令族](../instruction-families/README_zh.md)
-- [指令描述格式](../reference/format-of-instruction-descriptions_zh.md)
-- [Tile 参考入口](../tile/README_zh.md)
-- [Vector 参考入口](../vector/README_zh.md)
-- [标量与控制参考入口](../scalar/README_zh.md)
-- [其他与通信参考入口](../other/README_zh.md)
+- [指令族](../instruction-families/README_zh.md) — Tile / Vector / 标量 / 通信指令族
+- [Tile 参考入口](../tile/README_zh.md) — Tile 指令逐条参考
+- [Vector 参考入口](../vector/README_zh.md) — 向量指令逐条参考
+- [标量与控制参考入口](../scalar/README_zh.md) — 标量与控制指令逐条参考
+- [其他与通信参考入口](../other/README_zh.md) — 通信与支撑操作逐条参考
+- [指令描述格式](../reference/format-of-instruction-descriptions_zh.md) — per-op 页面格式标准
diff --git a/docs/isa/instruction-surfaces/scalar-and-control-instructions_zh.md b/docs/isa/instruction-surfaces/scalar-and-control-instructions_zh.md
index 1cb8ab36..cb6ac971 100644
--- a/docs/isa/instruction-surfaces/scalar-and-control-instructions_zh.md
+++ b/docs/isa/instruction-surfaces/scalar-and-control-instructions_zh.md
@@ -2,27 +2,9 @@
 
 `pto.*` 中的标量与控制指令集负责建立执行外壳：配置、同步、DMA、谓词和控制流。它们围绕 tile 与 vector 有效载荷工作，而不是直接产出 tile / vector payload。
 
-## 指令集概览
+## 概述
 
-标量与控制指令不直接生成 tile 或向量结果，它们主要生成：
-
-- 控制效果（barrier、控制流推进、状态更新）
-- 明确的 producer-consumer 顺序边
-- 谓词 mask
-- DMA 参数与 DMA 状态
-- 标量结果值
-
-标量操作数是单值的，它们的职责是把 tile / vector 这两条 payload 路径串起来。
-
-## 指令分类
-
-| 类别 | 说明 | 示例 |
-|------|------|------|
-| 控制与配置 | NOP、barrier、yield 以及模式配置 | `nop`、`barrier`、`yield`、`tsethf32mode`、`tsetfmatrix` |
-| 流水线同步 | 跨流水线的事件同步与 barrier | `set_flag`、`wait_flag`、`pipe_barrier` |
-| DMA 拷贝 | GM↔向量 tile buffer 的搬运配置与发起 | `copy_gm_to_ubuf`、`copy_ubuf_to_gm`、`set_loop_size_outtoub` |
-| 谓词加载存储 | 谓词在内存与谓词寄存器之间搬运 | `pld`、`plds`、`pst`、`pstu` |
-| 谓词生成与代数 | 构造 tail mask、布尔组合、谓词重排 | `pset_b8`、`pge_b8`、`plt_b8`、`pand`、`por`、`pxor` |
+标量与控制指令不直接生成 tile 或向量结果，它们主要产生控制效果、明确的 producer-consumer 顺序边、谓词 mask、DMA 参数与状态，以及标量结果值。标量操作数是单值的，它们的职责是把 tile / vector 这两条 payload 路径串起来。
 
 ## 输入
 
@@ -53,39 +35,6 @@
 | DMA 拷贝 | 发起 GM 与向量 tile buffer 之间的数据搬运 |
 | 谓词加载存储 | 读写谓词对应的内存表示 |
 
-## 事件模型
-
-标量 / 控制同步使用显式事件模型。一个事件由 `(src_pipe, dst_pipe, event_id)` 三元组标识：
-
-| 字段 | 典型取值 | 含义 |
-|------|----------|------|
-| `src_pipe` | `PIPE_MTE1`、`PIPE_MTE2`、`PIPE_MTE3`、`PIPE_V`、`PIPE_M` | 产生事件的流水线 |
-| `dst_pipe` | `PIPE_MTE1`、`PIPE_MTE2`、`PIPE_MTE3`、`PIPE_V`、`PIPE_M` | 消费事件的流水线 |
-| `event_id` | 0–15（取决于目标 profile） | 事件槽编号 |
-
-```text
-Producer pipe                               Consumer pipe
-   │                                            │
-   │  发起 DMA 或计算                            │
-   ▼                                            │
-set_flag(src_pipe, dst_pipe, EVENT_ID)          │
-   │                                            │
-   │                               wait_flag(src_pipe, dst_pipe, EVENT_ID)
-   │                                            │
-   ▼                                            ▼
-结果或数据可见                                 后续操作继续
-```
-
-## 不同 profile 的 pipe 空间
-
-| Pipe | CPU Sim | A2A3 | A5 |
-|------|:-------:|:----:|:--:|
-| `PIPE_MTE1` | 模拟 | 支持 | 支持 |
-| `PIPE_MTE2` | 模拟 | 支持 | 支持 |
-| `PIPE_MTE3` | 模拟 | 支持 | 支持 |
-| `PIPE_V` | 模拟 | 桥接/模拟 | 原生 |
-| `PIPE_M` | 模拟 | 支持 | 支持 |
-
 ## 约束
 
 - pipe / event 空间必须符合所选 profile。
@@ -94,7 +43,7 @@ set_flag(src_pipe, dst_pipe, EVENT_ID)          │
 - 谓词宽度必须和目标操作匹配。
 - 依赖外壳不能跳过：例如，在 `copy_gm_to_ubuf` 完成前就直接执行依赖其结果的 `vlds` 是非法的。
 
-## 不允许的情形
+## 异常与非法情形
 
 - 等待一个从未建立的事件
 - 使用目标 profile 不支持的 pipe 或 event 标识
@@ -102,6 +51,16 @@ set_flag(src_pipe, dst_pipe, EVENT_ID)          │
 - 混用和目标向量宽度不匹配的谓词宽度
 - 没有正确等待就跨越 DMA / vector / tile 顺序边
 
+## 指令分类
+
+| 类别 | 说明 | 示例 |
+|------|------|------|
+| 控制与配置 | NOP、barrier、yield 以及模式配置 | `nop`、`barrier`、`yield`、`tsethf32mode`、`tsetfmatrix` |
+| 流水线同步 | 跨流水线的事件同步与 barrier | `set_flag`、`wait_flag`、`pipe_barrier` |
+| DMA 拷贝 | GM↔向量 tile buffer 的搬运配置与发起 | `copy_gm_to_ubuf`、`copy_ubuf_to_gm`、`set_loop_size_outtoub` |
+| 谓词加载存储 | 谓词在内存与谓词寄存器之间搬运 | `pld`、`plds`、`pst`、`pstu` |
+| 谓词生成与代数 | 构造 tail mask、布尔组合、谓词重排 | `pset_b8`、`pge_b8`、`plt_b8`、`pand`、`por`、`pxor` |
+
 ## 语法
 
 ### PTO-AS 形式
@@ -141,6 +100,39 @@ PTO_INST void copy_ubuf_to_gm(gm_ptr dst, ub_ptr src, uint64_t sid,
                               uint64_t reserved, uint64_t dst_stride, uint64_t src_stride);
 ```
 
+## 事件模型
+
+标量 / 控制同步使用显式事件模型。一个事件由 `(src_pipe, dst_pipe, event_id)` 三元组标识：
+
+| 字段 | 典型取值 | 含义 |
+|------|----------|------|
+| `src_pipe` | `PIPE_MTE1`、`PIPE_MTE2`、`PIPE_MTE3`、`PIPE_V`、`PIPE_M` | 产生事件的流水线 |
+| `dst_pipe` | `PIPE_MTE1`、`PIPE_MTE2`、`PIPE_MTE3`、`PIPE_V`、`PIPE_M` | 消费事件的流水线 |
+| `event_id` | 0–15（取决于目标 profile） | 事件槽编号 |
+
+```text
+Producer pipe                               Consumer pipe
+   │                                            │
+   │  发起 DMA 或计算                            │
+   ▼                                            │
+set_flag(src_pipe, dst_pipe, EVENT_ID)          │
+   │                                            │
+   │                               wait_flag(src_pipe, dst_pipe, EVENT_ID)
+   │                                            │
+   ▼                                            ▼
+结果或数据可见                                 后续操作继续
+```
+
+## 不同 profile 的 pipe 空间
+
+| Pipe | CPU Sim | A2A3 | A5 |
+|------|:-------:|:----:|:--:|
+| `PIPE_MTE1` | 模拟 | 支持 | 支持 |
+| `PIPE_MTE2` | 模拟 | 支持 | 支持 |
+| `PIPE_MTE3` | 模拟 | 支持 | 支持 |
+| `PIPE_V` | 模拟 | 桥接/模拟 | 原生 |
+| `PIPE_M` | 模拟 | 支持 | 支持 |
+
 ## 相关页面
 
 - [标量与控制指令族](../instruction-families/scalar-and-control-families_zh.md)
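
Review note on the event model moved in the hunk above: the `(src_pipe, dst_pipe, event_id)` contract can be illustrated with a small host-side model. This is a minimal sketch under stated assumptions, not the PTO API: `Pipe`, `EventModel`, and `kMaxEventId` are hypothetical names introduced for illustration, the 0-15 slot range is taken from the table and may differ per target profile, and a real `wait_flag` blocks instead of returning a status.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical pipe identifiers, named after the values in the event table.
enum class Pipe : std::uint8_t { MTE1, MTE2, MTE3, V, M };

// Toy model of the event table: each (src_pipe, dst_pipe, event_id) triple
// identifies one event slot that counts pending, not yet consumed, events.
class EventModel {
public:
    // Producer side: mark the event as established.
    void set_flag(Pipe src, Pipe dst, std::uint32_t event_id) {
        assert(event_id < kMaxEventId);  // assumed slot range 0-15
        pending_[Key{src, dst, event_id}] += 1;
    }

    // Consumer side: consume one pending event.
    // Returns false when nothing was set, which models the illegal case of
    // waiting on an event that was never established.
    bool wait_flag(Pipe src, Pipe dst, std::uint32_t event_id) {
        auto it = pending_.find(Key{src, dst, event_id});
        if (it == pending_.end() || it->second == 0) {
            return false;
        }
        it->second -= 1;
        return true;
    }

private:
    using Key = std::tuple<Pipe, Pipe, std::uint32_t>;
    static constexpr std::uint32_t kMaxEventId = 16;
    std::map<Key, std::uint32_t> pending_;
};
```

In this model, a `wait_flag` that precedes its matching `set_flag`, or that uses a different `event_id`, reports failure. That is the same condition the page lists under illegal cases, so a checker built along these lines could flag it before the program reaches hardware.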
diff --git a/docs/isa/other/README.md b/docs/isa/other/README.md
index 1ec57dfb..c9ad779d 100644
--- a/docs/isa/other/README.md
+++ b/docs/isa/other/README.md
@@ -2,27 +2,29 @@
 
 Other and communication operations cover behavior that does not fit cleanly into the tile, vector, or scalar/control buckets.
 
-## Communication And Runtime
+## Two Categories
 
-Inter-NPU collective communication and synchronization.
+### Communication and Runtime
 
-| Instruction Set | Description |
-|--------|-------------|
-| [TBROADCAST](../comm/TBROADCAST.md) | Broadcast data from root NPU to all ranks |
-| [TGET](../comm/TGET.md) | Get data from a remote NPU |
-| [TGET_ASYNC](../comm/TGET_ASYNC.md) | Asynchronous variant of TGET |
-| [TNOTIFY](../comm/TNOTIFY.md) | Notify other ranks of an event |
-| [TPUT](../comm/TPUT.md) | Put data to a remote NPU |
-| [TPUT_ASYNC](../comm/TPUT_ASYNC.md) | Asynchronous variant of TPUT |
-| [TREDUCE](../comm/TREDUCE.md) | Collective reduction across all ranks |
-| [TSCATTER](../comm/TSCATTER.md) | Scatter data from root NPU to all ranks |
-| [TGATHER](../comm/TGATHER.md) | Gather data from all ranks to root NPU |
-| [TTEST](../comm/TTEST.md) | Test if a notification has been received |
-| [TWAIT](../comm/TWAIT.md) | Wait for a notification |
+Inter-NPU collective communication and synchronization primitives.
 
-See [Communication and Runtime](./communication-and-runtime.md) for the instruction set contract.
+| Instruction | Description | Sync Type |
+|-------------|-------------|-----------|
+| [TBROADCAST](../comm/TBROADCAST.md) | Broadcast data from root NPU to all ranks | Sync |
+| [TGET](../comm/TGET.md) | Get data from a remote NPU | Sync |
+| [TGET_ASYNC](../comm/TGET_ASYNC.md) | Asynchronously get data from a remote NPU | Async |
+| [TPUT](../comm/TPUT.md) | Put data to a remote NPU | Sync |
+| [TPUT_ASYNC](../comm/TPUT_ASYNC.md) | Asynchronously put data to a remote NPU | Async |
+| [TNOTIFY](../comm/TNOTIFY.md) | Notify other ranks of an event | Sync |
+| [TWAIT](../comm/TWAIT.md) | Wait for a notification | Sync |
+| [TTEST](../comm/TTEST.md) | Test if a notification has been received | Sync |
+| [TGATHER](../comm/TGATHER.md) | Gather data from all ranks to root NPU | Sync |
+| [TSCATTER](../comm/TSCATTER.md) | Scatter data from root NPU to all ranks | Sync |
+| [TREDUCE](../comm/TREDUCE.md) | Collective reduction across all ranks | Sync |
 
-## Non-ISA Supporting Operations
+[Communication and Runtime contract →](./communication-and-runtime.md)
+
+### Non-ISA Supporting Operations
 
 Convenience operations over tile sequences or memory management.
 
@@ -40,10 +42,12 @@ Convenience operations over tile sequences or memory management.
 | [TRANDOM](../TRANDOM.md) | Fill tile with random values | Generation |
 | [TQUANT](../TQUANT.md) | Quantize a tile to integer format | Quantize |
 
-See [Non-ISA and Supporting Ops](./non-isa-and-supporting-ops.md) for the instruction set contract.
+[Non-ISA and Supporting Ops contract →](./non-isa-and-supporting-ops.md)
 
 ## See Also
 
-- [Other instruction set](../instruction-surfaces/other-instructions.md) — High-level instruction set description
-- [Instruction set contracts](../instruction-families/README.md) — Normative contracts for all instruction sets
-- [Instruction set overview](../instruction-surfaces/README.md) — Map of all four instruction sets
+| Page | Content |
+|------|---------|
+| [Instruction overview](../instruction-surfaces/other-instructions.md) | High-level description of Other instruction set |
+| [Instruction families](../instruction-families/README.md) | Normative contracts for all instruction sets |
+| [Instruction set overview](../instruction-surfaces/README.md) | Map of all four instruction sets |
diff --git a/docs/isa/other/README_zh.md b/docs/isa/other/README_zh.md
index 2a6e71dd..f9b36d2c 100644
--- a/docs/isa/other/README_zh.md
+++ b/docs/isa/other/README_zh.md
@@ -2,11 +2,52 @@
 
 本节包含不属于 Tile、Vector 或标量/控制主干的残余指令和通信操作。
 
-## 本章内容
+## 两大分类
 
-- [通信与运行时](communication-and-runtime_zh.md) — 点对点通信、集合操作和运行时支持
-- [非 ISA 与支持操作](non-isa-and-supporting-ops_zh.md) — 边界外的支持操作
+### 通信与运行时（Communication and Runtime）
 
-## 章节定位
+NPU 间集合通信与同步原语。
 
-本章属于手册第 7 章（指令集）的补充部分。当一个操作不属于 Tile/Vector/标量主干时，归入本节。
+| 指令 | 说明 | 同步类型 |
+|------|------|----------|
+| [TBROADCAST](../comm/TBROADCAST_zh.md) | 从 root NPU 广播数据到所有 rank | 同步 |
+| [TGET](../comm/TGET_zh.md) | 从远程 NPU 获取数据 | 同步 |
+| [TGET_ASYNC](../comm/TGET_ASYNC_zh.md) | 从远程 NPU 异步获取数据 | 异步 |
+| [TPUT](../comm/TPUT_zh.md) | 向远程 NPU 发送数据 | 同步 |
+| [TPUT_ASYNC](../comm/TPUT_ASYNC_zh.md) | 向远程 NPU 异步发送数据 | 异步 |
+| [TNOTIFY](../comm/TNOTIFY_zh.md) | 通知其他 rank 某个事件发生 | 同步 |
+| [TWAIT](../comm/TWAIT_zh.md) | 等待通知到达 | 同步 |
+| [TTEST](../comm/TTEST_zh.md) | 测试通知是否已到达 | 同步 |
+| [TGATHER](../comm/TGATHER_zh.md) | 从所有 rank 收集数据到 root NPU | 同步 |
+| [TSCATTER](../comm/TSCATTER_zh.md) | 从 root NPU 散发数据到所有 rank | 同步 |
+| [TREDUCE](../comm/TREDUCE_zh.md) | 在所有 rank 上做集体归约 | 同步 |
+
+[通信与运行时契约 →](./communication-and-runtime_zh.md)
+
+### 非 ISA 支撑操作（Non-ISA Supporting Operations）
+
+面向 tile 序列或内存管理的便利操作。
+
+| 操作 | 说明 | 分类 |
+|------|------|------|
+| [TALIAS](../TALIAS_zh.md) | 创建 tile 的别名视图（无数据拷贝） | 别名 |
+| [TAXPY](../TAXPY_zh.md) | 融合乘加：`dst = src0 * scalar + src1` | 融合计算 |
+| [TCONCAT](../TCONCAT_zh.md) | 沿维度拼接两个 tile | tile 序列 |
+| [TDEQUANT](../TDEQUANT_zh.md) | 从量化格式反量化 tile | 量化 |
+| [TFREE](../TFREE_zh.md) | 释放先前分配的 tile 或 buffer | 内存 |
+| [THISTOGRAM](../THISTOGRAM_zh.md) | 计算 tile 值的直方图 | 统计 |
+| [TPACK](../TPACK_zh.md) | 将多个 tile 打包进单个 tile buffer | tile 序列 |
+| [TPOP](../TPOP_zh.md) | 谓词 mask 的置 1 位计数 | 谓词 |
+| [TPUSH](../TPUSH_zh.md) | 谓词 mask 的置 0 位计数 | 谓词 |
+| [TRANDOM](../TRANDOM_zh.md) | 用随机值填充 tile | 生成 |
+| [TQUANT](../TQUANT_zh.md) | 将 tile 量化为整数格式 | 量化 |
+
+[非 ISA 支撑操作契约 →](./non-isa-and-supporting-ops_zh.md)
+
+## 相关页面
+
+| 页面 | 内容 |
+|------|------|
+| [其他指令集](../instruction-surfaces/other-instructions_zh.md) | 其他指令集的高层描述 |
+| [指令族](../instruction-families/README_zh.md) | 所有指令集的规范契约 |
+| [指令集总览](../instruction-surfaces/README_zh.md) | 四大指令集地图 |
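
Review note on the TAXPY row added above: the formula `dst = src0 * scalar + src1` reads as an elementwise fused multiply-add. The sketch below spells out that reading on plain host arrays. It is an assumption-laden reference, not the PTO intrinsic: `axpy_reference` is a hypothetical helper, tiles are flattened to `std::vector<float>`, and valid-region handling is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics only: dst[i] = src0[i] * scalar + src1[i].
// Both inputs are assumed to have the same number of elements.
std::vector<float> axpy_reference(const std::vector<float>& src0,
                                  float scalar,
                                  const std::vector<float>& src1) {
    assert(src0.size() == src1.size());
    std::vector<float> dst(src0.size());
    for (std::size_t i = 0; i < src0.size(); ++i) {
        dst[i] = src0[i] * scalar + src1[i];
    }
    return dst;
}
```

A host-side reference of this kind is mainly useful for checking CPU-SIM or device results element by element.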
diff --git a/docs/isa/reference/README.md b/docs/isa/reference/README.md
index 8b3b9fda..ce25f603 100644
--- a/docs/isa/reference/README.md
+++ b/docs/isa/reference/README.md
@@ -1,9 +1,30 @@
 # Reference Notes
 
-These notes support the main PTO ISA manual.
+These notes support the main PTO ISA manual. They cover the instruction-page format, the glossary, diagnostics, portability, and the source of truth.
 
-- [Format Of Instruction Descriptions](./format-of-instruction-descriptions.md)
-- [Glossary](./glossary.md)
-- [Diagnostics And Illegal Cases](./diagnostics-and-illegal-cases.md)
-- [Portability And Target Profiles](./portability-and-target-profiles.md)
-- [Source Of Truth](./source-of-truth.md)
+## Choose by Need
+
+| Your need | Start here |
+|-----------|-----------|
+| Understanding per-instruction page format | [Format of instruction descriptions](./format-of-instruction-descriptions.md) |
+| Looking up PTO terminology | [Glossary](./glossary.md) |
+| Understanding what makes a PTO program illegal | [Diagnostics and illegal cases](./diagnostics-and-illegal-cases.md) |
+| Understanding A2/A3 vs A5 feature differences | [Portability and target profiles](./portability-and-target-profiles.md) |
+| Understanding authoritative spec sources | [Source of truth](./source-of-truth.md) |
+
+## Document Index
+
+| Document | Content |
+|----------|---------|
+| [Format of instruction descriptions](./format-of-instruction-descriptions.md) | Per-op page format standard |
+| [Glossary](./glossary.md) | PTO ISA key term definitions |
+| [Diagnostics and illegal cases](./diagnostics-and-illegal-cases.md) | Operational failures and illegal case handling |
+| [Portability and target profiles](./portability-and-target-profiles.md) | PTO portability across target profiles |
+| [Source of truth](./source-of-truth.md) | Authoritative spec sources and priorities for PTO ISA |
+
+## Relationship to Other Docs
+
+This section is the appendix of the manual: consult it on demand rather than reading it linearly. See also:
+
+- [docs/README.md](../../README.md) — Documentation hub
+- [isa/README.md](../README.md) — ISA reference entry
diff --git a/docs/isa/reference/README_zh.md b/docs/isa/reference/README_zh.md
index 04ce15d6..a1f1b4ae 100644
--- a/docs/isa/reference/README_zh.md
+++ b/docs/isa/reference/README_zh.md
@@ -2,14 +2,31 @@
 
 这些注释支持主要的 PTO ISA 手册，涵盖格式规范、术语表、诊断、可移植性和规范来源。
 
-## 本章内容
+## 按任务选择
 
-- [指令描述格式](format-of-instruction-descriptions_zh.md) — per-op 页面的标准格式规范
-- [术语表](glossary_zh.md) — PTO ISA 中的关键术语定义
-- [诊断与非法情况](diagnostics-and-illegal-cases_zh.md) — 操作失败和非法情况的处理
-- [可移植性与目标 Profile](portability-and-target-profiles_zh.md) — PTO 在不同目标 Profile 之间的可移植性
-- [规范来源](source-of-truth_zh.md) — PTO ISA 规范的权威来源与优先级
+| 你的需求 | 从这里开始 |
+|----------|----------|
+| 了解单条指令页面的格式标准 | [指令描述格式](format-of-instruction-descriptions_zh.md) |
+| 查找 PTO 术语定义 | [术语表](glossary_zh.md) |
+| 了解什么会导致 PTO 程序非法 | [诊断与非法情况](diagnostics-and-illegal-cases_zh.md) |
+| 了解 A2/A3 vs A5 的特性差异 | [可移植性与目标 Profile](portability-and-target-profiles_zh.md) |
+| 了解规范的权威来源与优先级 | [规范来源](source-of-truth_zh.md) |
+
+## 文档索引
+
+| 文档 | 内容 |
+|------|------|
+| [指令描述格式](format-of-instruction-descriptions_zh.md) | per-op 页面的标准格式规范 |
+| [术语表](glossary_zh.md) | PTO ISA 中的关键术语定义 |
+| [诊断与非法情况](diagnostics-and-illegal-cases_zh.md) | 操作失败和非法情况的处理 |
+| [可移植性与目标 Profile](portability-and-target-profiles_zh.md) | PTO 在不同目标 Profile 之间的可移植性 |
+| [规范来源](source-of-truth_zh.md) | PTO ISA 规范的权威来源与优先级 |
 
 ## 章节定位
 
-本章属于手册的第 8 章（支持性参考章节），可在需要时查阅。
+本章属于手册的附录部分，可在需要时查阅。与正文不同，本章不是线性阅读材料，而是按需检索的参考手册。
+
+## 相关页面
+
+- [docs/README_zh.md](../../README_zh.md) — 文档总入口
+- [isa/README_zh.md](../README_zh.md) — ISA 参考入口
diff --git a/docs/isa/scalar/README.md b/docs/isa/scalar/README.md
index 51671455..5e3d44fb 100644
--- a/docs/isa/scalar/README.md
+++ b/docs/isa/scalar/README.md
@@ -1,16 +1,49 @@
-# Scalar And Control Reference
+# Scalar and Control Reference
 
-This tree documents the `pto.*` scalar and control instructions of PTO ISA: synchronization, DMA configuration, predicate-state movement, predicate construction, and the shared scalar source shell around tile and vector payload execution.
+`pto.*` scalar and control instructions manage synchronization, DMA, predicates, control flow, and shared scalar support logic. They provide the execution shell around tile and vector payload regions.
 
-The key distinction is architectural role, not only spelling. `pto.*` pages live here when they expose control, DMA, predicate, or other non-payload state directly. When an instruction set exists only to summarize how those forms interact with vector execution, the vector instruction-set overviews remain linked as related material rather than acting as the primary per-op reference.
+## Organization
 
-## Instruction Sets
+The scalar reference is organized by instruction family, with individual per-op pages under `scalar/ops/`.
 
-- [Control and configuration](./control-and-configuration.md)
-- [PTO micro-instruction reference](./ops/micro-instruction/README.md)
-- [Pipeline sync](./pipeline-sync.md)
-- [DMA copy](./dma-copy.md)
-- [Predicate load store](./predicate-load-store.md)
-- [Predicate generation and algebra](./predicate-generation-and-algebra.md)
-- [Shared scalar arithmetic](./shared-arith.md)
-- [Shared structured control flow](./shared-scf.md)
+## Instruction Families
+
+| Family | Description | Operations |
+|--------|-------------|-----------|
+| [Control and Configuration](./control-and-configuration.md) | NOP, barrier, yield; tsetf32mode, tsethf32mode, tsetfmatrix | `nop`, `barrier`, `yield`, etc. |
+| [PTO Micro-Instruction Reference](./ops/micro-instruction/README.md) | Scalar micro-instructions: BlockDim, pointer ops, vector scope, alignment state | `pto.get_block_idx`, `pto.castptr`, `pto.vecscope`, etc. |
+| [Pipeline Sync](./pipeline-sync.md) | Event-based synchronization between pipes | `set_flag`, `wait_flag`, `wait_flag_dev`, `pipe_barrier`, `mem_bar`, `get_buf`, `rls_buf`, `set_cross_core`, `set_intra_block`, `wait_intra_core` |
+| [DMA Copy](./dma-copy.md) | GM↔UB and UB↔UB data movement | `copy_gm_to_ubuf`, `copy_ubuf_to_gm`, `copy_ubuf_to_ubuf`, loop size/stride setters |
+| [Predicate Load Store](./predicate-load-store.md) | Predicate-aware scalar load/store | `pld`, `plds`, `pldi`, `psts`, `pst`, `psti`, `pstu` |
+| [Predicate Generation and Algebra](./predicate-generation-and-algebra.md) | Predicate construction and logic | `pset_b8/b16/b32`, `pge_b8/b16/b32`, `plt_b8/b16/b32`, `pand`, `por`, `pxor`, `pnot`, `psel`, `ppack`, `punpack`, `pdintlv_b8`, `pintlv_b16` |
+| [Shared Arithmetic](./shared-arith.md) | Scalar arithmetic shared across instruction sets | Scalar arithmetic ops |
+| [Shared SCF](./shared-scf.md) | Scalar structured control flow | `scf.for`, `scf.if`, `scf.while` |
+
+## Common Constraints
+
+- Pipe / event spaces are constrained by the target profile.
+- DMA parameters must be self-consistent.
+- Predicate widths and control parameters must match the target operation.
+- Ordering edges must align with subsequent tile / vector payloads.
+
+## Key Architectural Concepts
+
+### Pipe Types
+
+| Pipe | Role |
+|------|------|
+| `PIPE_V` | Vector pipeline |
+| `PIPE_MTE1` | Memory transfer engine 1 (GM↔UB inbound) |
+| `PIPE_MTE2` | Memory transfer engine 2 (UB↔tile buffer inbound) |
+| `PIPE_MTE3` | Memory transfer engine 3 (tile buffer↔UB↔GM outbound) |
+| `PIPE_CUBE` | Cube/matrix multiply unit |
+
+### Event Synchronization
+
+Events (`event_t`) coordinate asynchronous operations across pipes. Programs set flags (`set_flag`) from one pipe and wait on them from another (`wait_flag`).
+
+## See Also
+
+- [Scalar and control instruction surface](../instruction-surfaces/scalar-and-control-instructions.md) — High-level description
+- [Scalar and control instruction families](../instruction-families/scalar-and-control-families.md) — Normative contracts
+- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page format standard
diff --git a/docs/isa/scalar/README_zh.md b/docs/isa/scalar/README_zh.md
index 49bc1828..e983486c 100644
--- a/docs/isa/scalar/README_zh.md
+++ b/docs/isa/scalar/README_zh.md
@@ -8,24 +8,42 @@
 
 ## 指令族
 
-- 控制与配置
-- PTO 微指令参考
-- 流水线同步
-- DMA 拷贝
-- 谓词加载存储
-- 谓词生成与代数
-- 共享算术
-- 共享 SCF
+| 族 | 说明 | 典型操作 |
+|----|------|----------|
+| [控制与配置](./control-and-configuration_zh.md) | NOP、barrier、yield；tsetf32mode、tsethf32mode、tsetfmatrix | `nop`、`barrier`、`yield` 等 |
+| [PTO 微指令参考](./ops/micro-instruction/README_zh.md) | 标量微指令：BlockDim、指针操作、向量作用域、对齐状态 | `pto.get_block_idx`、`pto.castptr`、`pto.vecscope` 等 |
+| [流水线同步](./pipeline-sync_zh.md) | 基于事件的跨 pipe 同步 | `set_flag`、`wait_flag`、`pipe_barrier`、`mem_bar`、`get_buf`、`rls_buf` 等 |
+| [DMA 拷贝](./dma-copy_zh.md) | GM↔UB 和 UB↔UB 数据搬运 | `copy_gm_to_ubuf`、`copy_ubuf_to_gm`、`copy_ubuf_to_ubuf`、loop size/stride setters |
+| [谓词加载存储](./predicate-load-store_zh.md) | 谓词感知的标量加载/存储 | `pld`、`plds`、`pldi`、`psts`、`pst`、`psti`、`pstu` |
+| [谓词生成与代数](./predicate-generation-and-algebra_zh.md) | 谓词构造与逻辑运算 | `pset_b8/b16/b32`、`pge_b8/b16/b32`、`plt_b8/b16/b32`、`pand`、`por`、`pxor`、`pnot`、`psel` 等 |
+| [共享算术](./shared-arith_zh.md) | 跨指令集共享的标量算术运算 | 标量算术操作 |
+| [共享 SCF](./shared-scf_zh.md) | 标量结构化控制流 | `scf.for`、`scf.if`、`scf.while` |
 
 ## 共享约束
 
-- pipe / event 空间受目标 profile 约束
-- DMA 参数必须自洽
-- 谓词宽度和控制参数必须与目标操作匹配
-- 顺序边必须与后续 tile / vector 有效载荷对齐
+- pipe / event 空间受目标 profile 约束。
+- DMA 参数必须自洽。
+- 谓词宽度和控制参数必须与目标操作匹配。
+- 顺序边必须与后续 tile / vector 有效载荷对齐。
+
+## 关键架构概念
+
+### Pipe 类型
+
+| Pipe | 角色 |
+|------|------|
+| `PIPE_V` | 向量流水线 |
+| `PIPE_MTE1` | 内存传输引擎 1（GM↔UB 入方向） |
+| `PIPE_MTE2` | 内存传输引擎 2（UB↔tile buffer 入方向） |
+| `PIPE_MTE3` | 内存传输引擎 3（tile buffer↔UB↔GM 出方向） |
+| `PIPE_CUBE` | CUBE / 矩阵乘法单元 |
+
+### 事件同步
+
+事件（`event_t`）协调跨 pipe 的异步操作。程序从一个 pipe 设置标志（`set_flag`），从另一个 pipe 等待（`wait_flag`）。
 
 ## 相关页面
 
-- [标量与控制指令集](../instruction-surfaces/scalar-and-control-instructions_zh.md)
-- [标量与控制指令族](../instruction-families/scalar-and-control-families_zh.md)
-- [指令描述格式](../reference/format-of-instruction-descriptions_zh.md)
+- [标量与控制指令集](../instruction-surfaces/scalar-and-control-instructions_zh.md) — 高层描述
+- [标量与控制指令族](../instruction-families/scalar-and-control-families_zh.md) — 规范契约
+- [指令描述格式](../reference/format-of-instruction-descriptions_zh.md) — per-op 页面格式标准
diff --git a/docs/isa/scalar/ops/control-and-configuration/tsetfmatrix_zh.md b/docs/isa/scalar/ops/control-and-configuration/tsetfmatrix_zh.md
index ea5d65e9..e021022e 100644
--- a/docs/isa/scalar/ops/control-and-configuration/tsetfmatrix_zh.md
+++ b/docs/isa/scalar/ops/control-and-configuration/tsetfmatrix_zh.md
@@ -1,36 +1,100 @@
 # pto.tsetfmatrix
 
-`pto.tsetfmatrix` 虽然保留历史 `t` 前缀，但在手册中归入[控制与配置](../../control-and-configuration_zh.md)路径，因为它配置的是标量可见寄存器状态，而不是 tile payload。
+`pto.tsetfmatrix` 属于[控制与配置指令](../../control-and-configuration_zh.md)集。
 
-## 简介
+## 概述
 
-配置后续 IMG2COL 一类路径会读取的 FMATRIX 寄存器状态。
+配置后续 IMG2COL 一类路径会读取的 FMATRIX 寄存器状态。`pto.tsetfmatrix` 从 `Img2colTileConfig` 一类配置对象中提取输入特征图几何信息与 padding 信息，并把它们写入 FMATRIX 寄存器。
 
 ## 机制
 
-`pto.tsetfmatrix` 从 `Img2colTileConfig` 一类配置对象中提取输入特征图几何信息与 padding 信息，并把它们写入 FMATRIX 寄存器。该操作本身不直接变换 tile 数据，因此其架构角色属于控制 / 配置。
+该操作本身不直接变换 tile 数据，只把上述几何信息与 padding 信息写入 FMATRIX 寄存器，因此其架构角色属于控制/配置。
 
-## 汇编语法
+## 语法
+
+### PTO-AS
 
 ```text
 tsetfmatrix %cfg : !pto.fmatrix_config -> ()
 ```
 
+### AS Level 1（SSA）
+
+```mlir
+pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`。
+
 ## 输入
 
-- `%cfg`：包含 feature-map 几何与 padding 信息的配置对象
-- `FmatrixMode`：选择写入 A 侧还是 B 侧 FMATRIX 寄存器
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%cfg` | 配置 | 包含 feature-map 几何与 padding 信息的配置对象 |
+| `FmatrixMode` | 配置 | 选择写入 A 侧还是 B 侧 FMATRIX 寄存器 |
 
-## 输出
+## 预期输出
 
-该指令不产生新的 SSA payload 值，只更新 FMATRIX 配置状态。
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | - | 该指令不产生新的 SSA payload 值，只更新 FMATRIX 配置状态 |
+
+## 副作用
+
+更新 FMATRIX 寄存器状态，后续 IMG2COL 指令会读取该配置。
 
 ## 约束
 
-- `%cfg` 必须满足所选 target profile 的 IMG2COL 配置要求。
-- 该配置必须出现在依赖它的消费指令之前。
+- `%cfg` 必须满足所选 target profile 的 IMG2COL 配置要求
+- 该配置必须出现在依赖它的消费指令之前
+
+## 异常与非法情形
+
+- 若 `%cfg` 不满足目标 profile 的 IMG2COL 要求，行为未定义
+- 若该配置在依赖它的消费指令之后发出，行为未定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 是 | 是 |
+
+## 示例
+
+### C++
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  // 创建 IMG2COL 配置
+  Img2colTileConfig cfg;
+  cfg.input_h = 224;
+  cfg.input_w = 224;
+  cfg.padding_h = 1;
+  cfg.padding_w = 1;
+
+  // 配置 FMATRIX（写入 A 侧；调用形式为示意性假设，实际签名以 pto_instr.hpp 为准）
+  TSETFMATRIX(cfg, FmatrixMode::A);
+}
+```
+
+### PTO-AS
+
+```text
+# 创建并配置 FMATRIX（写入 A 侧）
+%cfg = pto.create_fmatrix_config {input_h = 224, input_w = 224, pad_h = 1, pad_w = 1} : !pto.fmatrix_config
+pto.tsetfmatrix %cfg {mode = a} : !pto.fmatrix_config -> ()
+
+# 写入 B 侧
+pto.tsetfmatrix %cfg {mode = b} : !pto.fmatrix_config -> ()
+```
 
 ## 相关页面
 
-- [控制与配置](../../control-and-configuration_zh.md)
+- 指令集总览：[控制与配置指令](../../control-and-configuration_zh.md)
 - [旧 tile 路径兼容入口](../../../tile/ops/sync-and-config/tsetfmatrix_zh.md)
diff --git a/docs/isa/scalar/ops/control-and-configuration/tsethf32mode_zh.md b/docs/isa/scalar/ops/control-and-configuration/tsethf32mode_zh.md
index 6bcfc9eb..1beed137 100644
--- a/docs/isa/scalar/ops/control-and-configuration/tsethf32mode_zh.md
+++ b/docs/isa/scalar/ops/control-and-configuration/tsethf32mode_zh.md
@@ -1,36 +1,95 @@
 # pto.tsethf32mode
 
-`pto.tsethf32mode` 虽然保留历史 `t` 前缀，但在手册中归入[控制与配置](../../control-and-configuration_zh.md)路径，因为它配置的是标量可见模式状态，而不是 tile payload。
+`pto.tsethf32mode` 属于[控制与配置指令](../../control-and-configuration_zh.md)集。
 
-## 简介
+## 概述
 
-配置后续计算路径使用的 HF32 模式。
+配置后续计算路径使用的 HF32 模式。HF32 是昇腾上用于内部计算的降精度单精度浮点格式，而不是半精度类型。该指令只更新模式状态，因此其架构角色属于控制/配置，而不是 tile 算术。
 
 ## 机制
 
-`pto.tsethf32mode` 不会修改 tile payload。本指令更新后续 HF32 相关计算路径会读取的模式状态，因此它的架构角色属于控制 / 配置，而不是 tile 算术。
+`pto.tsethf32mode` 不会修改 tile payload。本指令更新后续 HF32 相关计算路径会读取的模式状态。
 
-## 汇编语法
+## 语法
+
+### PTO-AS
 
 ```text
 tsethf32mode {enable = true, mode = ...}
 ```
 
+### AS Level 1（SSA）
+
+```mlir
+pto.tsethf32mode {enable = true, mode = ...}
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`。
+
 ## 输入
 
-- `enable`：启用或关闭 HF32 模式
-- `mode`：选择 HF32 的 rounding mode
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `enable` | 配置 | 布尔值，启用或关闭 HF32 模式 |
+| `mode` | 配置 | HF32 的 rounding mode 选择 |
 
-## 输出
+## 预期输出
 
-该指令不产生新的 SSA payload 值，只更新模式状态。
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | - | 该指令不产生新的 SSA payload 值，只更新模式状态 |
+
+## 副作用
+
+更新后续 HF32 相关计算路径读取的全局模式状态。
 
 ## 约束
 
-- 具体 mode 取值和硬件行为由目标实现定义。
-- 该配置必须出现在依赖它的计算指令之前。
+- 具体 mode 取值和硬件行为由目标实现定义
+- 该配置必须出现在依赖它的计算指令之前
+
+## 异常与非法情形
+
+- 若该配置在依赖它的计算指令之后发出，行为未定义
+- 若指定的 mode 不被目标硬件支持，结果未定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 支持 | 是 | 是 | 是 |
+
+## 示例
+
+### C++
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  // 启用 HF32 模式（调用形式与枚举名为示意性假设，实际签名以 pto_instr.hpp 为准）
+  TSETHF32MODE(true, Hf32RoundingMode::ToEven);
+
+  // 后续计算使用 HF32 模式
+  // ...
+}
+```
+
+### PTO-AS
+
+```text
+# 启用 HF32 模式
+tsethf32mode {enable = true, mode = to_even}
+
+# 关闭 HF32 模式
+tsethf32mode {enable = false}
+```
 
 ## 相关页面
 
-- [控制与配置](../../control-and-configuration_zh.md)
+- 指令集总览：[控制与配置指令](../../control-and-configuration_zh.md)
 - [旧 tile 路径兼容入口](../../../tile/ops/sync-and-config/tsethf32mode_zh.md)
diff --git a/docs/isa/tile/README.md b/docs/isa/tile/README.md
index 56b197b4..df5d8ce0 100644
--- a/docs/isa/tile/README.md
+++ b/docs/isa/tile/README.md
@@ -1,33 +1,36 @@
 # Tile ISA Reference
 
-The `pto.t*` tile instruction set of PTO ISA is organized by instruction set, with standalone per-op pages under `tile/ops/`.
+`pto.t*` is the tile-centric execution surface of the PTO instruction set architecture. It covers tile data loading, elementwise compute, reduction and expansion, layout rearrangement, matrix multiply, explicit synchronization, and a small set of irregular specialized operations.
 
-## Instruction Sets
+These pages are meant to be read family page first, then individual instruction page. Family pages explain shared mechanisms, roles, constraints, and profile boundaries. Leaf pages under `tile/ops/` give the per-instruction contracts.
 
-| Instruction Set | Description | Operations |
-|--------|-------------|------------|
-| [Sync and Config](./sync-and-config.md) | Resource binding, event setup, mode control | 9 |
-| [Elementwise Tile-Tile](./elementwise-tile-tile.md) | Lane-wise binary and unary operations | 28 |
-| [Tile-Scalar and Immediate](./tile-scalar-and-immediate.md) | Tile combined with scalar operand | 20 |
-| [Reduce and Expand](./reduce-and-expand.md) | Row/column reductions and expansions | 28 |
-| [Memory and Data Movement](./memory-and-data-movement.md) | GM↔tile transfer, gather/scatter | 6 |
-| [Matrix and Matrix-Vector](./matrix-and-matrix-vector.md) | GEMV, matmul, and variants | 8 |
-| [Layout and Rearrangement](./layout-and-rearrangement.md) | Reshape, transpose, extract, insert | 13 |
-| [Irregular and Complex](./irregular-and-complex.md) | Sort, quantize, histogram, print | 14 |
+## Instruction Families
 
-## Quick Reference
+| Family | Description | Typical Operations |
+|--------|-------------|-------------------|
+| [Sync and Config](./sync-and-config.md) | Resource binding, event setup, tile-side mode control | `TASSIGN`, `TSYNC` |
+| [Elementwise Tile-Tile](./elementwise-tile-tile.md) | Tile-to-tile elementwise arithmetic, comparison, and selection | `TADD`, `TMUL`, `TSEL` |
+| [Tile-Scalar and Immediate](./tile-scalar-and-immediate.md) | Tile combined with scalar or immediate operand | `TADDS`, `TMULS` |
+| [Reduce and Expand](./reduce-and-expand.md) | Row/column reductions and axis-wise expansion | `TROWSUM`, `TROWEXPAND` |
+| [Memory and Data Movement](./memory-and-data-movement.md) | GM↔tile transfer and tile-side gather/scatter | `TLOAD`, `TSTORE` |
+| [Matrix and Matrix-Vector](./matrix-and-matrix-vector.md) | Cube-path matrix multiply, GEMV, and variants | `TMATMUL`, `TGEMV` |
+| [Layout and Rearrangement](./layout-and-rearrangement.md) | Reshape, transpose, extract, insert, img2col | `TTRANS`, `TIMG2COL` |
+| [Irregular and Complex](./irregular-and-complex.md) | Sort, quantization, indexed movement, partial reduction | `TSORT32`, `TQUANT` |
 
-### Common Tile Types
+## Common Tile Roles
 
-| Type | Location | Typical Use |
-|------|----------|-------------|
-| `TileType::Vec` | UB | General elementwise operations |
-| `TileType::Mat` | L1 | Matrix multiply operations |
-| `TileType::Left` | L0A | Matrix multiply A operand |
-| `TileType::Right` | L0B | Matrix multiply B operand |
-| `TileType::Acc` | L0C | Matrix multiply accumulator |
+PTO tile roles are architectural abstractions and should not be conflated with a single physical implementation on a given backend. When reading tile instructions, first distinguish the role, then examine dtype, shape, layout, and valid region.
 
-### Memory Capacities (A5)
+| Role | Meaning | Typical Use |
+|------|---------|-------------|
+| `Vec` | Vector tile buffer abstraction | Elementwise, reduction, movement, rearrangement |
+| `Left` | Left matrix operand tile, L0A path | matmul / GEMV left input |
+| `Right` | Right matrix operand tile, L0B path | matmul / GEMV right input |
+| `Acc` | Accumulator / output tile | matmul / GEMV result |
+| `Bias` | Bias tile | `*_bias` variants |
+| `ScaleLeft` / `ScaleRight` | Left/right scale tile for MX block-scale | `*_mx` variants |
+
+## Memory Capacities (A5)
 
 | Tile Type | Memory | Capacity | Alignment |
 |-----------|--------|----------|----------|
@@ -38,12 +41,24 @@ The `pto.t*` tile instruction set of PTO ISA is organized by instruction set, wi
 | `Acc` | L0C | 256 KB | 32 B |
 | `Bias` | Bias | 4 KB | 32 B |
 
-## Navigation
+## Reading Order
+
+If you are new to PTO tile instructions, read in this order:
+
+1. Read [Tile instruction surface](../instruction-surfaces/tile-instructions.md) to understand the boundary between the tile path and scalar/vector paths.
+2. Read [Location intent and legality](../state-and-types/location-intent-and-legality.md) and [Layout](../state-and-types/layout.md) to learn the role and layout constraints.
+3. Then enter the corresponding family page.
+4. Finally read the specific leaf page.
+
+## Shared Constraints
 
-The left sidebar provides standalone per-op pages for all tile instructions. Use the instruction set overviews above to understand shared constraints and mechanisms before reading individual opcode pages.
+- Tile `dtype`, `shape`, `layout`, `role`, and `valid region` can all affect legality checking.
+- Most elementwise and rearrangement operations iterate over the destination tile's valid region.
+- Matrix multiply operations are additionally constrained by `Left`/`Right`/`Acc`/`Bias`/scale tile roles.
+- Some high-performance or special-format paths are only available on specific profiles (e.g., A5-only MX block-scale).
 
 ## See Also
 
-- [Tile instructions](../instruction-surfaces/tile-instructions.md)
-- [Tile Instruction Set](../instruction-families/tile-families.md)
+- [Tile instruction surface](../instruction-surfaces/tile-instructions.md)
+- [Tile instruction families](../instruction-families/tile-families.md)
 - [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md)
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tadds_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tadds_zh.md
index 5151b237..c706aade 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tadds_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tadds_zh.md
@@ -1,24 +1,22 @@
-# TADDS
+# pto.tadds
 
-## 指令示意图
+`pto.tadds` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TADDS tile operation](../../../../figures/isa/TADDS.svg)
+## 概述
 
-## 简介
+对 Tile 与标量做逐元素加法，结果写入目标 tile。
 
-Tile 与标量的逐元素加法。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} + \mathrm{scalar} $$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。
 
-PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md).
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tadds %src, %scalar : !pto.tile<...>, f32
@@ -26,45 +24,72 @@ PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md)
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tadds %src, %scalar : (!pto.tile<...>,dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
 PTO_INST RecordEvent TADDS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 逐元素加法的左操作数 |
+| `%scalar` | 标量 | 逐元素加法的右操作数 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `src + scalar` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **实现检查 (A5)**:
-    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- `dst` 和 `src` 必须使用相同的元素类型。
+- Tile 位置必须是向量。
+- 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+- 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`（A2/A3）。
+- 运行时：`src0.GetValidCol() == dst.GetValidCol()`（A5）。
+- 在 A2/A3 与 A5 上，Tile 布局必须是行主序（`TileData::isRowMajor`）。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+
+## 异常与非法情形
+
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| `bf16` | Simulated | No | Supported |
+| `i32` | Simulated | Supported | Supported |
+| `i16` | Simulated | Supported | Supported |
+| `i8 / u8` | Simulated | No | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -78,7 +103,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -93,3 +118,15 @@ void example_manual() {
   TADDS(dst, src, 1.0f);
 }
 ```
+
+### PTO-AS
+
+```text
+%dst = tadds %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
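Reviewer note on `tadds_zh.md`: the mechanism and constraints sections state that the iteration domain is the destination tile's valid region and that `src` and `dst` must agree on it. A host-side reference model makes that concrete. The sketch below is not the PTO `TADDS` intrinsic. `RefTile` and `ref_tadds` are hypothetical names, and the row-major storage is an assumption taken from the RowMajor entries in the profile table.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major tile model for illustration; not the PTO Tile type.
template <typename T>
struct RefTile {
    std::size_t rows, cols;              // allocated shape
    std::size_t valid_rows, valid_cols;  // valid region, must fit inside the shape
    std::vector<T> data;                 // rows * cols elements, row-major

    RefTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc, T fill = T{})
        : rows(r), cols(c), valid_rows(vr), valid_cols(vc), data(r * c, fill) {
        assert(vr <= r && vc <= c);  // static valid-bounds constraint
    }
    T &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const T &at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// dst(i, j) = src(i, j) + scalar, iterating over dst's valid region only.
template <typename T>
void ref_tadds(RefTile<T> &dst, const RefTile<T> &src, T scalar) {
    assert(src.valid_rows == dst.valid_rows && src.valid_cols == dst.valid_cols);
    for (std::size_t i = 0; i < dst.valid_rows; ++i)
        for (std::size_t j = 0; j < dst.valid_cols; ++j)
            dst.at(i, j) = src.at(i, j) + scalar;
}
```

The sketch leaves elements outside the valid region untouched, but the page makes no such promise: programs must not rely on values outside `dst`'s valid region.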
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md
index 7a83378b..d8b79212 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md
@@ -1,24 +1,22 @@
-﻿# TANDS
+﻿# pto.tands
 
-## 指令示意图
+`pto.tands` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TANDS tile operation](../../../../figures/isa/TANDS.svg)
+## 概述
 
-## 简介
+对 Tile 与标量做逐元素按位与，结果写入目标 tile。
 
-Tile 与标量的逐元素按位与。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;\&\; \mathrm{scalar} $$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tands %src, %scalar : !pto.tile<...>, i32
@@ -26,43 +24,67 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
 PTO_INST RecordEvent TANDS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 逐元素按位与的左操作数 |
+| `%scalar` | 标量 | 逐元素按位与的右操作数 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `src & scalar` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - 适用于整数元素类型。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - 在手动模式下，不支持将源 Tile 和目标 Tile 设置为相同的内存。
-- **实现检查 (A5)**:
-    - 适用于 `TEXPANDS` 和 `TAND` 支持的整数元素类型。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 在手动模式下，不支持将源 Tile 和目标 Tile 设置为相同的内存。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- `dst` 和 `src` 必须使用相同的元素类型。
+- `dst` 和 `src` 必须是向量 Tile。
+- 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
+- 布局必须彼此兼容。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+- 在手动模式下，不支持将源 Tile 和目标 Tile 设置为相同的内存。
+
+## 异常与非法情形
+
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 整数类型 | Simulated | Supported | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -77,29 +99,14 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = tands %src, %scalar : !pto.tile<...>, i32
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = tands %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
 pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tcmps_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tcmps_zh.md
index 37875342..c945cce5 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tcmps_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tcmps_zh.md
@@ -1,26 +1,22 @@
-# TCMPS
+# pto.tcmps
 
-## 指令示意图
+`pto.tcmps` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TCMPS tile operation](../../../../figures/isa/TCMPS.svg)
+## 概述
 
-## 简介
+将 Tile 与标量比较并写入逐元素比较结果到目标 tile。
 
-将 Tile 与标量比较并写入逐元素比较结果。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \left(\mathrm{src}_{i,j}\ \mathrm{cmpMode}\ \mathrm{scalar}\right) $$
 
-The encoding/type of `dst` is implementation-defined (often a mask-like tile).
-
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。支持 `EQ`、`NE`、`LT`、`GT`、`LE`、`GE` 比较模式。
 
-PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md).
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tcmps %src, %scalar {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
@@ -28,45 +24,71 @@ PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md)
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp` and `include/pto/common/type.hpp`:
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc0, typename T, typename... WaitEvents>
 PTO_INST RecordEvent TCMPS(TileDataDst& dst, TileDataSrc0& src0, T src1, CmpMode cmpMode, WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 比较的左操作数 |
+| `%scalar` | 标量 | 比较的右操作数 |
+| `%dst` | 目标 tile | 比较结果写入 |
+| `cmpMode` | 比较模式 | EQ、NE、LT、GT、LE、GE |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素的比较结果 |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **实现检查 (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0` and `dst` must have the same valid row/col.
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-- **Comparison modes**:
-    - Supports `CmpMode::EQ`, `CmpMode::NE`, `CmpMode::LT`, `CmpMode::GT`, `CmpMode::LE`, `CmpMode::GE`.
+- `TileData::DType` 必须是以下之一：`int32_t`、`float`、`half`、`uint16_t`、`int16_t`。
+- Tile 布局必须是行主序。
+- Tile 位置必须是向量。
+- 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+- 运行时：`src0` 和 `dst` 必须有相同的 valid row/col。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+
+## 异常与非法情形
+
+- 不支持的比较模式会被 verifier 拒绝。
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| `i32` | Simulated | Supported | Supported |
+| `i16 / u16` | Simulated | Supported | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -82,7 +104,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -99,3 +121,15 @@ void example_manual() {
   TCMPS(dst, src, 0.0f, CmpMode::GT);
 }
 ```
+
+### PTO-AS
+
+```text
+%dst = tcmps %src, %scalar {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
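Reviewer note on `tcmps_zh.md`: the page lists six comparison modes applied elementwise against a scalar. The sketch below spells out that semantics on the host. `RefCmpMode`, `ref_compare`, and `ref_tcmps` are hypothetical names, not the PTO `CmpMode` or `TCMPS` declarations. The one-byte-per-element result is an assumption of the sketch; the text removed by this diff described the `dst` encoding as implementation-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-in for the comparison modes listed on the page.
enum class RefCmpMode { EQ, NE, LT, GT, LE, GE };

template <typename T>
bool ref_compare(T lhs, T rhs, RefCmpMode mode) {
    switch (mode) {
        case RefCmpMode::EQ: return lhs == rhs;
        case RefCmpMode::NE: return lhs != rhs;
        case RefCmpMode::LT: return lhs < rhs;
        case RefCmpMode::GT: return lhs > rhs;
        case RefCmpMode::LE: return lhs <= rhs;
        case RefCmpMode::GE: return lhs >= rhs;
    }
    return false;  // unreachable for valid enumerators
}

// Compares a row-major valid region (valid_rows x valid_cols inside a buffer
// with `cols` elements per row) against a scalar. The packed one-byte mask is
// an assumption of this sketch; the real dst encoding is target-defined.
template <typename T>
std::vector<std::uint8_t> ref_tcmps(const std::vector<T> &src, std::size_t cols,
                                    std::size_t valid_rows, std::size_t valid_cols,
                                    T scalar, RefCmpMode mode) {
    assert(valid_cols <= cols && valid_rows * cols <= src.size());
    std::vector<std::uint8_t> mask(valid_rows * valid_cols, 0);
    for (std::size_t i = 0; i < valid_rows; ++i)
        for (std::size_t j = 0; j < valid_cols; ++j)
            mask[i * valid_cols + j] = ref_compare(src[i * cols + j], scalar, mode) ? 1 : 0;
    return mask;
}
```

Calling `ref_tcmps(src, 4, 1, 3, 0.0f, RefCmpMode::GT)` corresponds to the manual-mode example's `TCMPS(dst, src, 0.0f, CmpMode::GT)` restricted to a 1 x 3 valid region.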
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md
index 7ac67496..03f59ba7 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md
@@ -1,105 +1,118 @@
-﻿# TDIVS
+﻿# pto.tdivs
 
-## 指令示意图
+`pto.tdivs` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TDIVS tile operation](../../../../figures/isa/TDIVS.svg)
+## 概述
 
-## 简介
+对 Tile 与标量做逐元素除法（Tile/标量 或 标量/Tile），结果写入目标 tile。
 
-与标量的逐元素除法（Tile/标量 或 标量/Tile）。
-
-## 数学语义
+## 机制
 
 对有效区域内的每个元素 `(i, j)`：
 
-- Tile/标量形式：
-
-  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{src}_{i,j}}{\mathrm{scalar}} $$
+- Tile/标量形式：$\mathrm{dst}_{i,j} = \frac{\mathrm{src}_{i,j}}{\mathrm{scalar}}$
+- 标量/Tile 形式：$\mathrm{dst}_{i,j} = \frac{\mathrm{scalar}}{\mathrm{src}_{i,j}}$
 
-- 标量/Tile 形式：
+迭代域由目标 tile 的 valid region 决定。除零行为由目标定义；在 A5 上，Tile/标量形式映射到乘以倒数，并对 `scalar == 0` 使用 `1/0 -> +inf`。
 
-  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{scalar}}{\mathrm{src}_{i,j}} $$
+## 语法
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 Tile/标量形式：
-
 ```text
 %dst = tdivs %src, %scalar : !pto.tile<...>, f32
 ```
 
 标量/Tile 形式：
-
 ```text
 %dst = tdivs %scalar, %src : f32, !pto.tile<...>
 ```
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 %dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
           typename... WaitEvents>
 PTO_INST RecordEvent TDIVS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar,
-                           WaitEvents &... events);
+                           WaitEvents & ... events);
 
 template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
           typename... WaitEvents>
 PTO_INST RecordEvent TDIVS(TileDataDst &dst, typename TileDataDst::DType scalar, TileDataSrc &src0,
-                           WaitEvents &... events)
+                           WaitEvents & ... events);
 ```
 
-`PrecisionType`可指定以下值：
+`PrecisionType` 可指定以下值：
+- `DivAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
+- `DivAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢，仅在 A5 上有效。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 被除数或除数 |
+| `%scalar` | 标量 | 除数或被除数 |
+| `PrecisionType` | 算法选项 | DEFAULT 或 HIGH_PRECISION |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `src / scalar` 或 `scalar / src` |
+
+## 副作用
 
-* `DivAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
-* `DivAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
+除产生目标 tile 外，没有额外架构副作用。
 
 ## 约束
 
-- **实现检查 (A2A3)**（两个重载）:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **实现检查 (A5)**（两个重载）:
-    - `TileData::DType` 必须是以下之一：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **除零**:
-    - 行为由目标定义；在 A5 上，Tile/标量形式映射到乘以倒数，并对 `scalar == 0` 使用 `1/0 -> +inf`。dst.GetValidRow()`且`src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域.
-- **除零**:
-    - 行为由目标定义；在 A5 上，tile/标量形式映射到乘以倒数，并对 `scalar == 0` 使用 `1/0 -> +inf`。
-- **高精度算法**
-    - 仅在A5上有效，`PrecisionType`选项A3上将被忽略。
+- `dst` 和 `src` 必须使用相同的元素类型。
+- Tile 位置必须是向量。
+- 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+- 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
+- 在 A2/A3 与 A5 上，Tile 布局必须是行主序（`TileData::isRowMajor`）。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+- A2/A3 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
+- A5 支持的元素类型：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`。
+- `HIGH_PRECISION` 算法选项仅在 A5 上有效，在 A2/A3 上将被忽略。
+
+## 异常与非法情形
+
+- 除零行为由目标定义。
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| `i32` | Simulated | Supported | Supported |
+| `i16` | Simulated | Supported | Supported |
+| `i8 / u8` | Simulated | No | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -114,7 +127,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -131,29 +144,16 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = tdivs %src, %scalar : !pto.tile<...>, f32
+%dst = tdivs %scalar, %src : f32, !pto.tile<...>
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-# AS Level 2 (DPS)
 pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
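Reviewer note on `tdivs_zh.md`: the page describes two operand orders and says that on A5 the Tile/scalar form maps to multiplication by the reciprocal, with `1/0 -> +inf` for `scalar == 0`. The sketch below models both forms on the host so the zero-divisor consequences are visible. `ref_tdivs_tile_scalar` and `ref_tdivs_scalar_tile` are hypothetical names, the code assumes IEEE-754 `float`, and it says nothing about the precision of either `DivAlgorithm` option.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Tile/scalar form, modeled as multiplication by the reciprocal.
// For scalar == 0 the reciprocal is taken to be +inf, following the page's
// A5 note; everything else here is an assumption of the sketch.
inline std::vector<float> ref_tdivs_tile_scalar(const std::vector<float> &src, float scalar) {
    const float recip = (scalar == 0.0f) ? std::numeric_limits<float>::infinity()
                                         : 1.0f / scalar;
    std::vector<float> dst(src.size());
    for (std::size_t k = 0; k < src.size(); ++k) dst[k] = src[k] * recip;
    return dst;
}

// Scalar/tile form, modeled as plain per-element division.
inline std::vector<float> ref_tdivs_scalar_tile(float scalar, const std::vector<float> &src) {
    std::vector<float> dst(src.size());
    for (std::size_t k = 0; k < src.size(); ++k) dst[k] = scalar / src[k];
    return dst;
}
```

Under this model a zero scalar turns positive elements into `+inf`, negative elements into `-inf`, and zero elements into NaN (`0 * inf`), which is one reason the page tells programs not to depend on divide-by-zero results across targets.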
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md
index 917504ed..9af35581 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md
@@ -1,24 +1,22 @@
-﻿# TFMODS
+﻿# pto.tfmods
 
-## 指令示意图
+`pto.tfmods` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TFMODS tile operation](../../../../figures/isa/TFMODS.svg)
+## 概述
 
-## 简介
+对浮点 Tile 与标量取逐元素浮点余数，结果写入目标 tile。
 
-与标量的逐元素余数：`fmod(src, scalar)`。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src}_{i,j}, \mathrm{scalar})$$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。除零行为由目标定义，CPU 模拟器在调试构建中会断言。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tfmods %src, %scalar : !pto.tile<...>, f32
@@ -26,43 +24,65 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TFMODS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
+PTO_INST RecordEvent TFMODS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 被除数 |
+| `%scalar` | 标量 | 除数 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `fmod(src, scalar)` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - 支持的元素类型为 `float` 和 `float32_t`。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - `dst` 和 `src` 必须是行主序。
-    - 运行时：`dst.GetValidRow() == src.GetValidRow() > 0` 且 `dst.GetValidCol() == src.GetValidCol() > 0`。
-- **实现检查 (A5)**:
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - 支持的元素类型为目标实现支持的 2 字节或 4 字节类型（包括 `half` 和 `float`）。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 两个 Tile 的静态有效边界都必须满足 `ValidRow <= Rows` 且 `ValidCol <= Cols`。
-    - 运行时：`dst.GetValidRow() == src.GetValidRow()` 且 `dst.GetValidCol() == src.GetValidCol()`。
-- **除零**:
-    - 行为由目标定义；CPU 模拟器在调试构建中会断言。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- `dst` 和 `src` 必须使用相同的元素类型。
+- A2/A3 支持的元素类型：`float` 和 `float32_t`。
+- A5 支持的元素类型：目标实现支持的 2 字节或 4 字节浮点类型。
+- `dst` 和 `src` 必须是向量 Tile。
+- `dst` 和 `src` 必须是行主序。
+- 运行时：`dst.GetValidRow() == src.GetValidRow()` 且 `dst.GetValidCol() == src.GetValidCol()`。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+
+## 异常与非法情形
+
+- 除零行为由目标定义。
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | No | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
@@ -78,29 +98,14 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
+%dst = tfmods %src, %scalar : !pto.tile<...>, f32
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = tfmods %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
 pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
 ```
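
Reviewer note on `tfmods`: the semantics described in this page can be sketched as a plain host-side C++ model. This is only an illustration under stated assumptions, not the PTO implementation: `RefTile` and `ref_tfmods` are hypothetical names, the tile is assumed to be row-major `float`, and the valid region is assumed to be the top-left `validRow x validCol` window. Division by zero is left to `std::fmod` here, whereas the page states that the real behavior is target-defined.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a row-major tile; not a PTO type.
struct RefTile {
    std::size_t rows, cols;          // static shape
    std::size_t validRow, validCol;  // valid region, must fit inside the shape
    std::vector<float> data;         // rows * cols elements, zero-initialized

    RefTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc)
        : rows(r), cols(c), validRow(vr), validCol(vc), data(r * c, 0.0f) {
        assert(vr <= r && vc <= c);  // static valid bounds
    }
    float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// dst[i][j] = fmod(src[i][j], scalar) for every (i, j) in dst's valid region.
// Elements outside the valid region are left untouched by this model.
void ref_tfmods(RefTile &dst, const RefTile &src, float scalar) {
    // Documented runtime check: both valid regions must match.
    assert(dst.validRow == src.validRow && dst.validCol == src.validCol);
    for (std::size_t i = 0; i < dst.validRow; ++i) {
        for (std::size_t j = 0; j < dst.validCol; ++j) {
            dst.at(i, j) = std::fmod(src.at(i, j), scalar);
        }
    }
}
```

With `std::fmod` the result takes the sign of the dividend, so a model like this is useful mainly for checking the iteration domain and the sign convention against a target's output.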
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md
index 9e142475..1f4ee9ec 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md
@@ -1,24 +1,22 @@
-﻿# TMAXS
+# pto.tmaxs
 
-## 指令示意图
+`pto.tmaxs` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TMAXS tile operation](../../../../figures/isa/TMAXS.svg)
+## 概述
 
-## 简介
+对 Tile 与标量取逐元素最大值，结果写入目标 tile。
 
-Tile 与标量的逐元素最大值：`max(src, scalar)`。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \max(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tmaxs %src, %scalar : !pto.tile<...>, f32
@@ -26,40 +24,68 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
 PTO_INST RecordEvent TMAXS(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 取最大值的左操作数 |
+| `%scalar` | 标量 | 取最大值的右操作数 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `max(src, scalar)` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`int16_t`、`half`、`float`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **实现检查 (A5)**:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`uint32_t`、`float`、`int16_t`、`uint16_t`、`half`、`bfloat16_t`、`uint8_t`、`int8_t`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **通用约束**:
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`dst` 和 `src` 的有效行列数必须相同。
-    - 标量类型必须与 Tile 数据类型一致。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- `dst` 和 `src` 必须使用相同的元素类型。
+- 标量类型必须与 Tile 数据类型一致。
+- Tile 位置必须是向量。
+- 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+- 运行时：`dst` 和 `src` 的有效行列数必须相同。
+- Tile 布局必须是行主序。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+
+## 异常与非法情形
+
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| `bf16` | Simulated | No | Supported |
+| `i32 / u32` | Simulated | 仅 `i32` | Supported |
+| `i16 / u16` | Simulated | 仅 `i16` | Supported |
+| `i8 / u8` | Simulated | No | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
@@ -75,29 +101,14 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
 pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md
index 6da7c9d9..59544dc0 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md
@@ -1,24 +1,22 @@
-﻿# TMINS
+# pto.tmins
 
-## 指令示意图
+`pto.tmins` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TMINS tile operation](../../../../figures/isa/TMINS.svg)
+## 概述
 
-## 简介
+对 Tile 与标量取逐元素最小值，结果写入目标 tile。
 
-Tile 与标量的逐元素最小值。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \min(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tmins %src, %scalar : !pto.tile<...>, f32
@@ -26,43 +24,73 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TMINS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
+PTO_INST RecordEvent TMINS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 取最小值的左操作数 |
+| `%scalar` | 标量 | 取最小值的右操作数 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `min(src, scalar)` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-- **实现检查 (A5)**:
-    - `TileData::DType` 必须是以下之一：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
-    - 运行时：`src.GetValidCol() == dst.GetValidCol()`。
-- **通用约束**:
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - 标量类型必须与 Tile 数据类型一致。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- `dst` 和 `src` 必须使用相同的元素类型。
+- 标量类型必须与 Tile 数据类型一致。
+- Tile 位置必须是向量。
+- 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+- 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`（A2/A3）。
+- 运行时：`src.GetValidCol() == dst.GetValidCol()`（A5）。
+- 布局必须彼此兼容。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+
+## 异常与非法情形
+
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| `bf16` | Simulated | No | Supported |
+| `i32` | Simulated | Supported | Supported |
+| `i16` | Simulated | Supported | Supported |
+| `i8 / u8` | Simulated | No | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -76,7 +104,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -92,29 +120,14 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = tmins %src, %scalar : !pto.tile<...>, f32
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = tmins %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
 pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tmuls_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tmuls_zh.md
index 6ff8bdbd..ce84c4db 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tmuls_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tmuls_zh.md
@@ -1,24 +1,22 @@
-# TMULS
+# pto.tmuls
 
-## 指令示意图
+`pto.tmuls` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TMULS tile operation](../../../../figures/isa/TMULS.svg)
+## 概述
 
-## 简介
+对 Tile 与标量做逐元素乘法，结果写入目标 tile。
 
-Tile 与标量的逐元素乘法。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \cdot \mathrm{scalar} $$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。
 
-PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md).
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tmuls %src, %scalar : !pto.tile<...>, f32
@@ -26,45 +24,72 @@ PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md)
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TMULS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
+PTO_INST RecordEvent TMULS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 逐元素乘法的左操作数 |
+| `%scalar` | 标量 | 逐元素乘法的右操作数 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `src * scalar` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **实现检查 (A5)**:
-    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- `dst` 和 `src` 必须使用相同的元素类型。
+- Tile 位置必须是向量。
+- 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+- 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`（A2/A3）。
+- 运行时：`src0.GetValidCol() == dst.GetValidCol()`（A5）。
+- 布局必须彼此兼容。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+
+## 异常与非法情形
+
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| `bf16` | Simulated | No | Supported |
+| `i32` | Simulated | Supported | Supported |
+| `i16` | Simulated | Supported | Supported |
+| `i8 / u8` | Simulated | No | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -78,7 +103,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -93,3 +118,15 @@ void example_manual() {
   TMULS(dst, src, 2.0f);
 }
 ```
+
+### PTO-AS
+
+```text
+%dst = tmuls %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md
index 67bc5dd5..e474651f 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md
@@ -1,39 +1,38 @@
-﻿# TORS
+# pto.tors
 
-## 指令示意图
+`pto.tors` 属于[逐元素 Tile-Tile](../../elementwise-tile-tile_zh.md)指令集。
 
-![TORS tile operation](../../../../figures/isa/TORS.svg)
+## 概述
 
-## 简介
+对源 tile 的每个元素与一个立即数（标量）做按位或，结果写入目标 tile。迭代域由目标 tile 的 valid region 决定。
 
-Tile 与标量的逐元素按位或。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对目标 tile 的 valid region 中每个 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;|\; \mathrm{scalar} $$
 
-## 汇编语法
+标量值在发射时广播到所有参与 lane；超出源 tile valid region 的坐标读到的值由实现定义（implementation-defined）。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
-%dst = tors %src, %scalar : !pto.tile<...>, i32
+%dst = tors %src, %scalar : !pto.tile<...>
 ```
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
-pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```mlir
+pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype)
+         outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
@@ -42,30 +41,54 @@ pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_bu
 
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TORS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+PTO_INST RecordEvent TORS(TileDataDst &dst, TileDataSrc &src,
+                          typename TileDataDst::DType scalar, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 在 `dst` valid region 上逐坐标读取 |
+| `%scalar` | 立即数 | 广播到所有 lane 的整数标量 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | `dst` valid region 内每个元素等于 `src \| scalar` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用，不会隐式为无关 tile 流量建立栅栏。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - 适用于整数元素类型。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - 在手动模式下，不支持将源 Tile 和目标 Tile 设置为相同的内存。
-- **实现检查 (A5)**:
-    - 适用于 `TEXPANDS` 和 `TOR` 支持的整数元素类型。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 在手动模式下，不支持将源 Tile 和目标 Tile 设置为相同的内存。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- **类型约束**：源 tile 和目标 tile 必须有相同元素类型，且均为整数类型。
+- **布局约束**：源 tile 和目标 tile 必须有兼容布局。
+- **有效区域**：迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+- **手动模式**：不支持将源 tile 和目标 tile 设置为同一地址（禁止 in-place）。
+
+## 异常与非法情形
+
+- Verifier 拒绝类型不匹配。
+- 后端拒绝不支持的元素类型、布局或目标 profile。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 整数类型 | Simulated | Supported | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
-
 using namespace pto;
 
 void example() {
@@ -77,29 +100,32 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
+### C++ 手动模式
 
-### 自动模式
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+void example_manual() {
+  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x3000);
+  TORS(dst, src, 0xffu);
+}
 ```
 
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = tors %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[逐元素 Tile-Tile](../../elementwise-tile-tile_zh.md)
+- 上一条指令：[pto.txors](./txors_zh.md)
+- 下一条指令：[pto.tshls](./tshl_zh.md)
+- 类似指令：[pto.tands](./tands_zh.md)、[pto.tor](./tor_zh.md)
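
Reviewer note on `tors`: the iteration-domain and no-in-place rules in this page can be sketched with a small host-side model. This is only an illustration, not the PTO API: `ref_tors` is a hypothetical name, the buffers are assumed to be row-major `uint16_t`, and the valid region is assumed to be the top-left `validRow x validCol` window of the tile.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// OR `scalar` into every element of the valid window of a row-major buffer.
// `cols` is the static column count (the row stride) of both buffers.
void ref_tors(std::vector<std::uint16_t> &dst, const std::vector<std::uint16_t> &src,
              std::size_t cols, std::size_t validRow, std::size_t validCol,
              std::uint16_t scalar) {
    assert(validCol <= cols);
    assert(dst.size() == src.size() && validRow * cols <= dst.size());
    // Mirrors the documented manual-mode rule: src and dst must not alias.
    assert(&dst != &src);
    for (std::size_t i = 0; i < validRow; ++i) {
        for (std::size_t j = 0; j < validCol; ++j) {
            const std::size_t k = i * cols + j;
            // The scalar is the same for every element (broadcast).
            dst[k] = static_cast<std::uint16_t>(src[k] | scalar);
        }
    }
}
```

Elements outside the valid window keep whatever `dst` held before the call, which is why the page tells programs not to rely on them.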
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md
index 2c9add6d..d36c0ed2 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md
@@ -1,24 +1,22 @@
-﻿# TREMS
+# pto.trems
 
-## 指令示意图
+`pto.trems` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TREMS tile operation](../../../../figures/isa/TREMS.svg)
+## 概述
 
-## 简介
+对 Tile 与标量取逐元素余数，结果写入目标 tile。
 
-与标量的逐元素余数：`remainder(src, scalar)`。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$\mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \bmod \mathrm{scalar}$$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。除零行为由目标定义，CPU 模拟器在调试构建中会断言。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = trems %src, %scalar : !pto.tile<...>, f32
@@ -26,50 +24,72 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
 PTO_INST RecordEvent TREMS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar,
-                           TileDataTmp &tmp, WaitEvents &... events);
+                           TileDataTmp &tmp, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 被除数 |
+| `%scalar` | 标量 | 除数 |
+| `%tmp` | 临时 tile | 临时缓冲区 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `src % scalar` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - 支持的元素类型：`float` 和 `int32_t`。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - `dst` 和 `src` 必须是行主序。
-    - 运行时：`dst.GetValidRow() == src.GetValidRow() > 0` 且 `dst.GetValidCol() == src.GetValidCol() > 0`。
-    - **tmp 缓冲区要求**：
-      - `tmp.GetValidCol() >= dst.GetValidCol()`（至少与 dst 相同的列数）
-      - `tmp.GetValidRow() >= 1`（至少 1 行）
-      - 数据类型必须与 `TileDataDst::DType` 匹配。
-- **实现检查 (A5)**:
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - 支持的元素类型：`float`、`int32_t`、`uint32_t`、`half`、`int16_t` 和 `uint16_t`。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 两个 Tile 的静态有效边界都必须满足 `ValidRow <= Rows` 且 `ValidCol <= Cols`。
-    - 运行时：`dst.GetValidRow() == src.GetValidRow()` 且 `dst.GetValidCol() == src.GetValidCol()`。
-    - 注意：tmp 参数在 A5 上被接受但不进行验证或使用。
-- **除零**:
-    - 行为由目标定义；CPU 模拟器在调试构建中会断言。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **对于 `int32_t` 输入（仅 A2A3）**：`src` 的元素和 `scalar` 必须在 `[-2^24, 2^24]` 范围内（即 `[-16777216, 16777216]`），以确保在计算过程中能精确转换为 float32。
+- `dst` 和 `src` 必须使用相同的元素类型。
+- A2/A3 支持的元素类型：`float` 和 `int32_t`。
+- A5 支持的元素类型：`float`、`int32_t`、`uint32_t`、`half`、`int16_t` 和 `uint16_t`。
+- `dst` 和 `src` 必须是向量 Tile。
+- `dst` 和 `src` 必须是行主序。
+- 运行时：`dst.GetValidRow() == src.GetValidRow()` 且 `dst.GetValidCol() == src.GetValidCol()`。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+- A2/A3 的 tmp 缓冲区要求：`tmp.GetValidCol() >= dst.GetValidCol()`，`tmp.GetValidRow() >= 1`。
+- A5 的 tmp 参数被接受但不进行验证或使用。
+- 对于 `int32_t` 输入（A2/A3）：`src` 的元素和 `scalar` 必须在 `[-2^24, 2^24]` 范围内。
+
+## 异常与非法情形
+
+- 除零行为由目标定义。
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | No | Supported |
+| `i32 / u32` | Simulated | 仅 `i32` | Supported |
+| `i16 / u16` | Simulated | No | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
@@ -86,29 +106,14 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = trems %src, %scalar : !pto.tile<...>, f32
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = trems %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
 pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
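
Reviewer note on `trems`: the removed constraint text explained that the `[-2^24, 2^24]` range for `int32_t` inputs on A2/A3 exists so that values convert to float32 exactly during the computation; the new bullet drops that rationale. The sketch below illustrates the reason in plain C++. The helper names `exact_in_f32` and `in_trems_i32_range` are hypothetical and are not part of PTO.

```cpp
#include <cassert>
#include <cstdint>

// True when v survives an int32 -> float32 -> integer round trip unchanged.
bool exact_in_f32(std::int32_t v) {
    const float f = static_cast<float>(v);
    // Compare in 64 bits so values near INT32_MAX cannot overflow on the way back.
    return static_cast<std::int64_t>(f) == static_cast<std::int64_t>(v);
}

// Hypothetical host-side pre-check mirroring the documented A2/A3 range.
bool in_trems_i32_range(std::int32_t v) {
    constexpr std::int32_t kLimit = 1 << 24;  // 16777216
    return v >= -kLimit && v <= kLimit;
}
```

float32 has a 24-bit significand, so every integer in the documented range is exactly representable, while `2^24 + 1` is the first positive integer that is not. The range is a sufficient condition rather than a necessary one, since some larger values (for example `2^24 + 2`) also convert exactly.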
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md
index bd3500b6..b019a028 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md
@@ -1,16 +1,14 @@
-﻿# TSELS
+# pto.tsels
 
-## 指令示意图
+`pto.tsels` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TSELS tile operation](../../../../figures/isa/TSELS.svg)
+## 概述
 
-## 简介
+使用 mask tile 在源 tile 和标量之间做逐元素选择，结果写入目标 tile。
 
-使用 mask tile 在源 Tile 和标量之间进行逐元素选择。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$
 \mathrm{dst}_{i,j} =
@@ -20,11 +18,11 @@ $$
 \end{cases}
 $$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。掩码 tile 被解释为目标定义布局中的打包谓词位。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tsels %mask, %src, %scalar : !pto.tile<...>
@@ -32,47 +30,72 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataMask, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TSELS(TileDataDst &dst, TileDataMask &mask, TileDataSrc &src, TileDataTmp &tmp, typename TileDataSrc::DType scalar, WaitEvents &... events);
+PTO_INST RecordEvent TSELS(TileDataDst &dst, TileDataMask &mask, TileDataSrc &src, TileDataTmp &tmp, typename TileDataSrc::DType scalar, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%mask` | 掩码 tile | 控制每个位置选择源还是标量 |
+| `%src` | 源 tile | 条件为真时的输出值 |
+| `%scalar` | 标量 | 条件为假时的输出值 |
+| `%tmp` | 临时 tile | 临时缓冲区 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素由 mask 决定选择 src 或 scalar |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `sizeof(TileDataDst::DType)` 必须是 `2` 或 `4` 字节。
-    - 支持的数据类型为 `half`、`float16_t`、`float` 和 `float32_t`。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是行主序。
-    - 运行时：`src.GetValidRow()/GetValidCol()` 必须与 `dst.GetValidRow()/GetValidCol()` 一致。
-- **实现检查 (A5)**:
-    - `sizeof(TileDataDst::DType)` 可以是 `1`、`2` 或 `4` 字节。
-    - 支持的数据类型为 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half` 和 `float`。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst`、`mask` 和 `src` 必须是行主序。
-    - 运行时：`src.GetValidRow()/GetValidCol()` 必须与 `dst.GetValidRow()/GetValidCol()` 一致。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **掩码编码**:
-    - 掩码 Tile 被解释为目标定义布局中的打包谓词位。
+- `dst` 和 `src` 必须使用相同的元素类型。
+- A2/A3：`sizeof(TileDataDst::DType)` 必须是 `2` 或 `4` 字节，支持 `half`、`float16_t`、`float` 和 `float32_t`。
+- A5：`sizeof(TileDataDst::DType)` 可以是 `1`、`2` 或 `4` 字节，支持 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half` 和 `float`。
+- `dst`、`mask` 和 `src` 必须是行主序。
+- 运行时：`src.GetValidRow()/GetValidCol()` 必须与 `dst.GetValidRow()/GetValidCol()` 一致。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+
+## 异常与非法情形
+
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| `i32 / u32` | Simulated | No | Supported |
+| `i16 / u16` | Simulated | No | Supported |
+| `i8 / u8` | Simulated | No | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -93,7 +116,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -118,29 +141,14 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = tsels %mask, %src, %scalar : !pto.tile<...>
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = tsels %mask, %src, %scalar : !pto.tile<...>
-# AS Level 2 (DPS)
 pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
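
Reviewer note on `tsels`: the page says the mask tile is interpreted as packed predicate bits in a target-defined layout, and the input table maps a true predicate to `src` and a false one to `scalar`. The sketch below shows that selection rule under one assumed encoding: one bit per element, in row-major element order, least significant bit first within each byte. That encoding is an assumption made purely for illustration; real targets may pack the mask differently, and `ref_tsels` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Select src[k] where the k-th mask bit is set, otherwise the scalar.
// Assumed encoding: bit k lives in maskBits[k / 8] at position k % 8 (LSB first).
std::vector<float> ref_tsels(const std::vector<std::uint8_t> &maskBits,
                             const std::vector<float> &src, float scalar) {
    assert(maskBits.size() * 8 >= src.size());  // enough predicate bits
    std::vector<float> dst(src.size());
    for (std::size_t k = 0; k < src.size(); ++k) {
        const bool take = ((maskBits[k / 8] >> (k % 8)) & 1u) != 0;
        dst[k] = take ? src[k] : scalar;
    }
    return dst;
}
```

For example, with the mask byte `0x05` (bits 0 and 2 set), elements 0 and 2 come from `src` and the rest are filled with the scalar.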
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md
index d3bbfd44..8e4a257e 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md
@@ -1,24 +1,22 @@
-﻿# TSUBS
+# pto.tsubs
 
-## 指令示意图
+`pto.tsubs` 属于[逐元素 Tile-to-Tile](../../elementwise-tile-tile_zh.md)指令集。
 
-![TSUBS tile operation](../../../../figures/isa/TSUBS.svg)
+## 概述
 
-## 简介
+`TSUBS` 从 Tile 中逐元素减去一个标量，结果写入目标 Tile。对有效区域内每个元素 `(i, j)`，执行 `dst[i,j] = src[i,j] - scalar`。
 
-从 Tile 中逐元素减去一个标量。
-
-## 数学语义
+## 机制
 
 对每个元素 `(i, j)` 在有效区域内：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} - \mathrm{scalar} $$
 
-## 汇编语法
+该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tsubs %src, %scalar : !pto.tile<...>, f32
@@ -26,13 +24,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -45,25 +43,49 @@ template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
 PTO_INST RecordEvent TSUBS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile |
+| `src` | 输入 | 源 Tile |
+| `scalar` | 标量 | 要减去的标量值 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 逐元素相减结果 |
+
+## 副作用
+
+该指令在执行向量流水线操作前可能隐式插入同步屏障。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
-- **实现检查 (A5)**:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
-    - Tile 位置必须是向量（`TileDataDst::Loc == TileType::Vec` 且 `TileDataSrc::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileDataDst::ValidRow <= TileDataDst::Rows`、`TileDataDst::ValidCol <= TileDataDst::Cols`、`TileDataSrc::ValidRow <= TileDataSrc::Rows`，且 `TileDataSrc::ValidCol <= TileDataSrc::Cols`。
-    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
-- **通用约束**:
-    - `dst` 和 `src0` 必须使用相同的元素类型。
-    - 标量类型必须与 `TileDataSrc::DType` 一致。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
+- Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
+- 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
+- `dst` 和 `src0` 必须使用相同的元素类型。
+- 标量类型必须与 `TileDataSrc::DType` 一致。
+- 静态有效边界检查（A5）：`TileDataDst::ValidRow <= TileDataDst::Rows`、`TileDataDst::ValidCol <= TileDataDst::Cols`、`TileDataSrc::ValidRow <= TileDataSrc::Rows`，且 `TileDataSrc::ValidCol <= TileDataSrc::Cols`。
+
+## 异常与非法情形
+
+- 当 `TileData::DType` 不属于支持的数据类型列表时，行为未定义。
+- 当 `src` 和 `dst` 的有效区域不匹配时，行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32`、`f16`、`i32`、`i16` | Simulated | Supported | Supported |
+| 向量位置要求 | - | 是 | 是 |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -76,29 +98,19 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = tsubs %src, %scalar : !pto.tile<...>, f32
 ```
 
-### 手动模式
+### AS Level 2（DPS）
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```mlir
+pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = tsubs %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[逐元素 Tile-to-Tile](../../elementwise-tile-tile_zh.md)
+- [TSUBS](../tile-scalar-and-immediate/tsubs_zh.md)
diff --git a/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md
index 5090505c..92827fc3 100644
--- a/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md
+++ b/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md
@@ -1,24 +1,22 @@
-﻿# TXORS
+﻿# pto.txors
 
-## 指令示意图
+`pto.txors` 属于[逐元素 Tile-标量](../../elementwise-tile-tile_zh.md)指令集。
 
-![TXORS tile operation](../../../../figures/isa/TXORS.svg)
+## 概述
 
-## 简介
+对 Tile 与标量做逐元素按位异或，结果写入目标 tile。
 
-Tile 与标量的逐元素按位异或。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \oplus \mathrm{scalar} $$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = txors %src, %scalar : !pto.tile<...>, i32
@@ -26,40 +24,71 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
 PTO_INST RecordEvent TXORS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 逐元素异或的左操作数 |
+| `%scalar` | 标量 | 逐元素异或的右操作数 |
+| `%tmp` | 临时 tile | 临时缓冲区 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `src ⊕ scalar` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t` 和 `int16_t`。
-    - `dst`、`src` 和 `tmp` 必须使用相同的元素类型。
-    - 在手动模式下，源、目标和临时存储的内存区域不得重叠。
-- **实现检查 (A5)**:
-    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
-    - `dst` 和 `src` 的元素类型必须一致。
-    - `src.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- `dst`、`src` 和 `tmp` 必须使用相同的元素类型。
+- 布局必须彼此兼容。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+- A2/A3 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t` 和 `int16_t`。
+- A5 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
+- A5 要求 `src.GetValidRow()/GetValidCol()` 与 `dst` 一致。
+- 在手动模式下，源、目标和临时存储的内存区域不得重叠。
+
+## 异常与非法情形
+
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `i8 / u8` | Simulated | Supported | Supported |
+| `i16 / u16` | Simulated | Supported | Supported |
+| `i32 / u32` | Simulated | No | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -76,29 +105,14 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = txors %src, %scalar : !pto.tile<...>, i32
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = txors %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
 pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
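`pto.txors` 页面描述的逐元素语义可以用下面的主机端草图来核对。这只是在若干假设下的示意，并非 PTO 库的实现：`RefTile` 与 `RefTxors` 是为说明而虚构的名称，存储假定为行主序，且按 A5 的约束假定 `src` 与 `dst` 的有效区域一致。

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side tile model for illustration; not a PTO library type.
template <typename T>
struct RefTile {
    std::size_t rows, cols;          // physical shape
    std::size_t validRow, validCol;  // valid region, must fit inside the shape
    std::vector<T> data;             // row-major storage (assumption)

    RefTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc)
        : rows(r), cols(c), validRow(vr), validCol(vc), data(r * c, T{}) {
        assert(vr <= r && vc <= c);
    }
    T &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const T &at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Reference semantics: dst(i, j) = src(i, j) XOR scalar over dst's valid region.
template <typename T>
void RefTxors(RefTile<T> &dst, const RefTile<T> &src, T scalar) {
    // Mirrors the A5 rule that src and dst share the same valid region.
    assert(src.validRow == dst.validRow && src.validCol == dst.validCol);
    for (std::size_t i = 0; i < dst.validRow; ++i) {
        for (std::size_t j = 0; j < dst.validCol; ++j) {
            // Integer promotion widens small types, so cast back to T.
            dst.at(i, j) = static_cast<T>(src.at(i, j) ^ scalar);
        }
    }
}
```

有效区域之外的元素在这个草图里保持不变；页面只要求程序不得依赖这些值，并未规定它们的内容。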
diff --git a/docs/isa/tile/ops/irregular-and-complex/tci_zh.md b/docs/isa/tile/ops/irregular-and-complex/tci_zh.md
index 4d596245..9b543157 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tci_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tci_zh.md
@@ -1,32 +1,14 @@
-# TCI
+# pto.tci
 
-## 指令示意图
+`pto.tci` 属于[不规则与复杂指令](../../irregular-and-complex_zh.md)集。
 
-![TCI tile operation](../../../../figures/isa/TCI.svg)
+## 概述
 
-## 简介
+TCI 在目标 Tile 的有效区域内生成连续整数序列，序列的起始值由标量参数 `S` 指定。当 `descending = false` 时序列递增，当 `descending = true` 时序列递减。对于线性化索引 `k`，升序时 $\mathrm{dst}_{k} = S + k$，降序时 $\mathrm{dst}_{k} = S - k$。线性化顺序取决于 Tile 的布局（实现定义）。
 
-生成连续整数序列到目标 Tile 中。
+## 语法
 
-## 数学语义
-
-For a linearized index `k` over the valid elements:
-
-- Ascending:
-
-  $$ \mathrm{dst}_{k} = S + k $$
-
-- Descending:
-
-  $$ \mathrm{dst}_{k} = S - k $$
-
-The linearization order depends on the tile layout (implementation-defined).
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md).
-
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tci %S {descending = false} : !pto.tile<...>
@@ -34,13 +16,13 @@ PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md)
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -53,18 +35,38 @@ template <typename TileData, typename T, int descending, typename... WaitEvents>
 PTO_INST RecordEvent TCI(TileData &dst, T start, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile，接收生成的整数序列 |
+| `S` | 标量输入 | 序列的起始值 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 包含连续整数序列的 Tile |
+
+## 副作用
+
+该指令可能会读取或写入 Tile 的有效区域标记。
+
 ## 约束
 
-- **实现检查 (A2A3/A5)**:
-    - `TileData::DType` must be exactly the same type as the scalar template parameter `T`.
-    - `dst/scalar` element types must be identical, and must be one of: `int32_t`, `uint32_t`, `int16_t`, `uint16_t`.
-    - `TileData::Cols != 1` (this is the condition enforced by the implementation).
-- **有效区域**:
-    - The implementation uses `dst.GetValidCol()` as the sequence length and does not consult `dst.GetValidRow()`.
+- `TileData::DType` 必须与标量模板参数 `T` 类型完全一致。
+- `dst`/`scalar` 元素类型必须相同，且必须是以下类型之一：`int32_t`、`uint32_t`、`int16_t`、`uint16_t`。
+- `TileData::Cols != 1`（这是实现强制执行的条件）。
+- 实现使用 `dst.GetValidCol()` 作为序列长度，不使用 `dst.GetValidRow()`。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -78,7 +80,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -92,3 +94,7 @@ void example_manual() {
   TCI<TileT, int32_t, /*descending=*/1>(dst, /*S=*/100);
 }
 ```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂指令](../../irregular-and-complex_zh.md)
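`pto.tci` 页面给出的序列定义（升序 `S + k`、降序 `S - k`，序列长度取 `dst.GetValidCol()`）可以用下面的草图来验证。`RefTci` 是为说明而假设的函数名，不是 PTO 接口；线性化顺序在页面中由实现定义，这里只假定最简单的一维情形。

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference model: generates validCol consecutive integers
// starting at `start`, ascending or descending.
template <typename T, bool Descending>
std::vector<T> RefTci(T start, std::size_t validCol) {
    std::vector<T> out(validCol);
    for (std::size_t k = 0; k < validCol; ++k) {
        const T step = static_cast<T>(k);
        out[k] = Descending ? static_cast<T>(start - step)
                            : static_cast<T>(start + step);
    }
    return out;
}
```

例如 `RefTci<int32_t, true>(100, 4)` 对应页面手动模式示例中 `descending = 1`、`S = 100` 的情形。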
diff --git a/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md b/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md
index 6f49a253..08b12580 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md
@@ -1,28 +1,24 @@
-﻿# TGATHER
+﻿# pto.tgather
 
-## 指令示意图
+`pto.tgather` 属于[不规则与复杂](../../irregular-and-complex_zh.md)指令集。
 
-![TGATHER tile operation](../../../../figures/isa/TGATHER.svg)
-
-## 简介
+## 概述
 
-使用索引 Tile 或编译时掩码模式来收集/选择元素。
+使用索引 Tile 或编译时掩码模式来收集/选择元素。基于索引的 gather：设 `R = dst.GetValidRow()`、`C = dst.GetValidCol()`，对于 `0 <= i < R` 且 `0 <= j < C`，有 `dst_{i,j} = src0[indices_{i,j}]`；确切的索引解释和边界行为由实现定义。基于掩码模式的 gather 是由 `pto::MaskPattern` 控制的、由实现定义的选择/归约操作。
 
-## 数学语义
+## 机制
 
-基于索引的 gather（概念性定义）：
-
-设 `R = dst.GetValidRow()`，`C = dst.GetValidCol()`。对于 `0 <= i < R` 且 `0 <= j < C`：
+基于索引的 gather（概念性定义）：设 `R = dst.GetValidRow()`，`C = dst.GetValidCol()`。对于 `0 <= i < R` 且 `0 <= j < C`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src0}\!\left[\mathrm{indices}_{i,j}\right] $$
 
-确切的索引解释和边界行为由实现定义。
+确切的索引解释和边界行为由实现定义。基于掩码模式的 gather 是由 `pto::MaskPattern` 控制的实现定义的选择/归约操作。
 
-基于掩码模式的 gather 是由 `pto::MaskPattern` 控制的实现定义的选择/归约操作。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 基于索引的 gather：
 
@@ -38,14 +34,14 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 %dst = pto.tgather %src {maskPattern = #pto.mask_pattern<P0101>}: !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 pto.tgather ins(%src, {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
@@ -56,40 +52,52 @@ pto.tgather ins(%src, {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile_buf<.
 
 ```cpp
 template <typename TileDataD, typename TileDataS0, typename TileDataS1, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(TileDataD &dst, TileDataS0 &src0, TileDataS1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+PTO_INST RecordEvent TGATHER(TileDataD &dst, TileDataS0 &src0, TileDataS1 &src1, TileDataTmp &tmp, WaitEvents & ... events);
 
 template <typename DstTileData, typename SrcTileData, MaskPattern maskPattern, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+PTO_INST RecordEvent TGATHER(DstTileData &dst, SrcTileData &src, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 Tile | 目标 Tile |
+| `src0` | 输入 Tile | 数据源 Tile |
+| `src1` / `indices` | 输入 Tile | 索引 Tile |
+| `tmp` | 临时 Tile | 临时工作空间 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 按索引或掩码模式收集后的 Tile |
+
+## 副作用
+
+索引边界不通过显式运行时断言进行验证；超出范围的索引行为由目标定义。
+
 ## 约束
 
-- **基于索引的 gather：实现检查 (A2A3)**:
-    - `sizeof(DstTileData::DType)` 对应类型必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`float` 之一。
-    - `sizeof(Src1TileData::DType)` 对应类型必须是 `int32_t`、`uint32_t` 之一。
-    - `DstTileData::DType` 必须与 `Src0TileData::DType` 类型相同。
-    - `src1.GetValidCol() == Src1TileData::Cols` 且 `dst.GetValidCol() == DstTileData::Cols`。
-- **基于索引的 gather：实现检查 (A5)**:
-    - `sizeof(DstTileData::DType)` 对应类型必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`float` 之一。
-    - `sizeof(Src1TileData::DType)` 对应类型必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t` 之一。
-    - `DstTileData::DType` 必须与 `Src0TileData::DType` 类型相同。
-    - `src1.GetValidCol() == Src1TileData::Cols` 且 `dst.GetValidCol() == DstTileData::Cols`。
-- **基于掩码模式的 gather：实现检查 (A2A3)**:
-    - 源元素大小必须是 `2` 或 `4` 字节。
-    - `SrcTileData::DType`/`DstTileData::DType` 必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t` 或 `float` 之一。
-    - `dst` 和 `src` 必须都是 `TileType::Vec` 且行主序。
-    - `sizeof(dst element) == sizeof(src element)` 且 `dst.GetValidCol() == DstTileData::Cols`（连续的目标存储）。
-- **基于掩码模式的 gather：实现检查 (A5)**:
-    - 源元素大小必须是 `1`、`2` 或 `4` 字节。
-    - `dst` 和 `src` 必须都是 `TileType::Vec` 且行主序。
-    - `SrcTileData::DType`/`DstTileData::DType` 必须是 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`、`float8_e4m3_t`、`float8_e5m2_t` 或 `hifloat8_t` 之一。
-    - 支持的数据类型限制为目标定义的集合（通过实现中的 `static_assert` 强制执行），且 `sizeof(dst element) == sizeof(src element)`，`dst.GetValidCol() == DstTileData::Cols`（连续的目标存储）。
-- **边界 / 有效性**:
-    - 索引边界不通过显式运行时断言进行验证；超出范围的索引行为由目标定义。
+基于索引的 gather（A2/A3）：`sizeof(DstTileData::DType)` 对应类型必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`float` 之一；`sizeof(Src1TileData::DType)` 对应类型必须是 `int32_t`、`uint32_t` 之一；`DstTileData::DType` 必须与 `Src0TileData::DType` 类型相同；`src1.GetValidCol() == Src1TileData::Cols` 且 `dst.GetValidCol() == DstTileData::Cols`。
+
+基于索引的 gather（A5）：`sizeof(DstTileData::DType)` 对应类型必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`float` 之一；`sizeof(Src1TileData::DType)` 对应类型必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t` 之一；`DstTileData::DType` 必须与 `Src0TileData::DType` 类型相同；`src1.GetValidCol() == Src1TileData::Cols` 且 `dst.GetValidCol() == DstTileData::Cols`。
+
+基于掩码模式的 gather（A2/A3）：源元素大小必须是 `2` 或 `4` 字节；`SrcTileData::DType`/`DstTileData::DType` 必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t` 或 `float` 之一；`dst` 和 `src` 必须都是 `TileType::Vec` 且行主序；`sizeof(dst element) == sizeof(src element)` 且 `dst.GetValidCol() == DstTileData::Cols`（连续的目标存储）。
+
+基于掩码模式的 gather（A5）：源元素大小必须是 `1`、`2` 或 `4` 字节；`dst` 和 `src` 必须都是 `TileType::Vec` 且行主序；`SrcTileData::DType`/`DstTileData::DType` 必须是 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`、`float8_e4m3_t`、`float8_e5m2_t` 或 `hifloat8_t` 之一；支持的数据类型限制为目标定义的集合（通过实现中的 `static_assert` 强制执行），且 `sizeof(dst element) == sizeof(src element)`，`dst.GetValidCol() == DstTileData::Cols`（连续的目标存储）。
+
+索引边界不通过显式运行时断言进行验证；超出范围的索引行为由目标定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 基于索引的 gather 索引类型 | - | `int32_t`、`uint32_t` | `int16_t`、`uint16_t`、`int32_t`、`uint32_t` |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -107,7 +115,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -125,29 +133,23 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
 # 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
 # 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
 # pto.tassign %arg0, @tile(0x1000)
 # pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- [不规则与复杂指令集](../../irregular-and-complex_zh.md)
+- [TGATHERB](./tgatherb_zh.md)
+- [TSCATTER](./tscatter_zh.md)
+
+![TGATHER tile operation](../../../../figures/isa/TGATHER.svg)
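`pto.tgather` 页面中基于索引的概念性定义 `dst_{i,j} = src0[indices_{i,j}]` 可以用下面的草图来核对。页面说明索引的确切解释由实现定义，因此这里做了一个明确的假设：每个索引是 `src0` 存储中的扁平元素下标。`RefGather` 是虚构的函数名，越界在草图里用 `assert` 拦截，而真实目标上的越界行为由目标定义。

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference model of index-based gather.
// Assumption: each index is a flat element offset into src0's storage.
template <typename T, typename I>
std::vector<T> RefGather(const std::vector<T> &src0,
                         const std::vector<I> &indices,
                         std::size_t validRow, std::size_t validCol) {
    assert(indices.size() == validRow * validCol);
    std::vector<T> dst(validRow * validCol);
    for (std::size_t k = 0; k < dst.size(); ++k) {
        const std::size_t idx = static_cast<std::size_t>(indices[k]);
        // Real targets do not assert here; out-of-range is target-defined.
        assert(idx < src0.size());
        dst[k] = src0[idx];
    }
    return dst;
}
```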
diff --git a/docs/isa/tile/ops/irregular-and-complex/tgatherb_zh.md b/docs/isa/tile/ops/irregular-and-complex/tgatherb_zh.md
index 778d0de1..dcad9c33 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tgatherb_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tgatherb_zh.md
@@ -1,24 +1,24 @@
-# TGATHERB
+# pto.tgatherb
 
-## 指令示意图
+`pto.tgatherb` 属于[不规则与复杂](../../irregular-and-complex_zh.md)指令集。
 
-![TGATHERB tile operation](../../../../figures/isa/TGATHERB.svg)
-
-## 简介
+## 概述
 
-使用字节偏移量收集元素。
+使用字节偏移量收集元素。对有效区域内的每个元素，满足 `dst_{i,j} = *(srcBase + offset_{i,j})`。确切的边界行为由实现定义。
 
-## 数学语义
+## 机制
 
-对每个元素 在有效区域内：
+对有效区域内的每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = *\left(\mathrm{srcBase} + \mathrm{offset}_{i,j}\right) $$
 
-Exact bounds behavior is implementation-defined.
+确切的边界行为由实现定义。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md).
+### PTO-AS
+
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -28,13 +28,13 @@ PTO-AS 形式：参见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md)
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -44,25 +44,44 @@ pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%
 
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename TileDataOffset, typename... WaitEvents>
-PTO_INST RecordEvent TGATHERB(TileDataDst &dst, TileDataSrc &src, TileDataOffset &offset, WaitEvents &... events);
+PTO_INST RecordEvent TGATHERB(TileDataDst &dst, TileDataSrc &src, TileDataOffset &offset, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 Tile | 目标 Tile |
+| `src` | 输入 Tile | 数据源 Tile（基地址） |
+| `offset` | 输入 Tile | 字节偏移量 Tile，元素类型为 `uint32_t` |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 按字节偏移收集后的 Tile |
+
+## 副作用
+
+偏移量边界不通过显式运行时断言进行验证；超出范围的偏移行为由目标定义。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - Destination layout must be row-major (`TileDataDst::isRowMajor`).
-    - Destination element size must be `1`, `2`, or `4` bytes (enforced via `static_assert` in the helper).
-    - `SrcTileData::DType`/`DstTileData::DType` must be `int8_t` or `uint8_t` or `int16_t` or `uint16_t` or `int32_t` or `uint32_t` or `half` or `bfloat16_t` or `float`.
-- **实现检查 (A5)**:
-    - Destination element size must be `1`, `2`, or `4` bytes.
-    - `SrcTileData::DType`/`DstTileData::DType` must be `int8_t` or `uint8_t` or `int16_t` or `uint16_t` or `int32_t` or `uint32_t` or `half` or `bfloat16_t` or `float`.
-- **Offset interpretation**:
-    - Offsets are interpreted as `uint32_t` values (byte offsets) by the implementation.
-    - Offset bounds are not validated by explicit runtime assertions; out-of-range offsets are target-defined.
+A2/A3：目标布局必须为行主序（`TileDataDst::isRowMajor`），目标元素大小必须是 `1`、`2` 或 `4` 字节，`SrcTileData::DType`/`DstTileData::DType` 必须是 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t` 或 `float` 之一。
+
+A5：目标元素大小必须是 `1`、`2` 或 `4` 字节，`SrcTileData::DType`/`DstTileData::DType` 必须是 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t` 或 `float` 之一。
+
+偏移解释：偏移量被实现解释为 `uint32_t` 值（字节偏移），偏移量边界不通过显式运行时断言进行验证；超出范围的偏移由目标定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 目标布局要求 | - | 行主序 | - |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -80,7 +99,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -100,3 +119,23 @@ void example_manual() {
   TGATHERB(dst, src, off);
 }
 ```
+
+### PTO-AS
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# 手动模式：先显式绑定资源，再发射指令。
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## 相关页面
+
+- [不规则与复杂指令集](../../irregular-and-complex_zh.md)
+- [TGATHER](./tgather_zh.md)
+
+![TGATHERB tile operation](../../../../figures/isa/TGATHERB.svg)
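`pto.tgatherb` 页面的公式 `dst_{i,j} = *(srcBase + offset_{i,j})` 与“偏移量按 `uint32_t` 字节偏移解释”的说明，可以用下面的草图来对照。`RefGatherB` 是为说明而假设的名称；草图按公式逐元素读取，并用 `assert` 拦截越界偏移，真实目标上越界偏移的行为由目标定义。

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical reference model: reads one element of type T at each byte
// offset measured from the start of src's storage.
template <typename T>
std::vector<T> RefGatherB(const std::vector<T> &src,
                          const std::vector<std::uint32_t> &byteOffsets) {
    const unsigned char *base =
        reinterpret_cast<const unsigned char *>(src.data());
    const std::size_t srcBytes = src.size() * sizeof(T);
    std::vector<T> dst(byteOffsets.size());
    for (std::size_t k = 0; k < byteOffsets.size(); ++k) {
        const std::size_t off = byteOffsets[k];
        // Real targets do not assert here; out-of-range is target-defined.
        assert(off + sizeof(T) <= srcBytes);
        // memcpy avoids alignment and aliasing problems on the host.
        std::memcpy(&dst[k], base + off, sizeof(T));
    }
    return dst;
}
```

与 `pto.tgather` 的元素下标不同，这里的偏移以字节为单位：对 2 字节元素，偏移 `6` 对应第 4 个元素。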
diff --git a/docs/isa/tile/ops/irregular-and-complex/tmrgsort_zh.md b/docs/isa/tile/ops/irregular-and-complex/tmrgsort_zh.md
index bfb8062b..ea59659c 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tmrgsort_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tmrgsort_zh.md
@@ -1,41 +1,22 @@
-# TMRGSORT
+# pto.tmrgsort
 
-## 指令示意图
+`pto.tmrgsort` 属于[不规则与复杂指令](../../irregular-and-complex_zh.md)集。
 
-![TMRGSORT tile operation](../../../../figures/isa/TMRGSORT.svg)
+## 概述
 
-## 简介
+TMRGSORT 用于把多个已经排好序的列表按目标定义的键顺序做归并。它不是对一个无序 Tile 排序，而是归并多个有序输入。该指令支持两类接口：多列表归并（2/3/4 路输入）和单列表块归并（将源 Tile 中连续放置的 4 个已排序块归并成一个更大的有序结果）。
 
-`TMRGSORT` 用于把多个已经排好序的列表按目标定义的键顺序做归并。它不是“对一个无序 Tile 排序”，而是“归并多个有序输入”。
+TMRGSORT 的输入并不是任意数组，而是按固定结构组织的记录流。当前实现里，一个记录按 8 字节结构处理：float 类型时每条记录占 2 个元素，half 类型时占 4 个元素。CPU 模拟器按每条结构的第一个元素作为排序键，并优先选更大的键值；NPU backend 通过 `vmrgsort4` 完成硬件归并。
 
-这条指令在仓库里有两类接口：
+## 语法
 
-- 多列表归并：2 路 / 3 路 / 4 路输入
-- 单列表块归并：把一个源 Tile 中连续放置的 4 个已排序块再归并成一个更大的有序结果
+### PTO-AS
 
-## 机制
-
-`TMRGSORT` 的输入并不是“任意数组”，而是按固定结构组织的记录流。当前实现里，一个记录按 8 字节结构处理：
-
-- `float` 类型时，每条记录通常占 2 个元素
-- `half` 类型时，每条记录通常占 4 个元素
-
-CPU 模拟器会按每条结构的第一个元素作为排序键，并优先选更大的键值；NPU backend 则通过 `vmrgsort4` 完成硬件归并。也就是说，记录格式和精确比较规则仍然和目标实现强相关。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
-
-示意形式：
-
-```text
-%dst, %executed = tmrgsort %src0, %src1 {exhausted = false}
-    : !pto.tile<...>, !pto.tile<...> -> (!pto.tile<...>, vector<4xi16>)
-```
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
 %dst, %executed = pto.tmrgsort %src0, %src1, %src2, %src3 {exhausted = false}
  : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, vector<4xi16>)
@@ -43,7 +24,7 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 pto.tmrgsort ins(%src0, %src1, %src2, %src3 {exhausted = false} : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
 outs(%dst, %executed : !pto.tile_buf<...>, vector<4xi16>)
@@ -64,44 +45,64 @@ template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
 PTO_INST RecordEvent TMRGSORT(DstTileData &dst, SrcTileData &src, uint32_t blockLen, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 归并结果 Tile |
+| `src` | 输入 | 单列表块归并模式下的源 Tile |
+| `src0`～`src3` | 输入 | 多列表归并模式下的源 Tile（最多 4 路） |
+| `tmp` | 临时 | 多列表归并所需的临时 Tile |
+| `blockLen` | 标量输入 | 单列表块归并中每个块的长度 |
+| `executedNumList` | 输出 | 每个输入列表实际消费的记录数 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 归并后的有序结果 |
+| `executed` | `vector<4xi16>` | 每路输入消费的记录数（多列表归并） |
+
+## 副作用
+
+该指令可能会读写 Tile 的有效区域标记，并使用临时存储。
+
 ## 约束
 
 ### 通用约束
 
-- 所有参与 Tile 都必须是：
-  - `TileType::Vec`
-  - `Rows == 1`
-  - `BLayout::RowMajor`
-- 支持的数据类型是 `half` 或 `float`，并且 `dst/tmp/src*` 的元素类型必须一致。
+- 所有参与的 Tile 都必须满足：位置为 `TileType::Vec`、`Rows == 1`、布局为 `BLayout::RowMajor`。
+- 支持的数据类型是 `half` 或 `float`，且 `dst/tmp/src*` 的元素类型必须一致。
 
 ### 多列表归并
 
-- 2 路 / 3 路 / 4 路版本都要求显式传入 `tmp`。
-- `executedNumList` 会返回每个输入列表实际消费了多少条记录。
-- 模板参数 `exhausted` 决定当某一路输入先耗尽时，是否提前挂起/停止归并：
-  - CPU 会按这个布尔值决定是否在任一路耗尽时提前退出
-  - NPU 会把它映射到底层 `vmrgsort4` 的 exhausted 配置位
-- UB 使用量必须满足各 backend 的限制；源码中对 `src* + tmp (+ dst)` 总体积都有检查。
+- 2/3/4 路版本都要求显式传入 `tmp`。
+- `executedNumList` 返回每个输入列表实际消费的记录数。
+- 模板参数 `exhausted` 决定某路输入先耗尽时是否提前停止归并。
+- UB 使用量必须满足各 backend 的限制。
 
 ### 单列表块归并
 
-- 这条接口假设 `src` 中顺序摆放了 4 个已排序块。
-- `blockLen` 表示每个块的长度，并且它本身包含记录值和索引/负载。
-- A2/A3 源码明确要求：
-  - `blockLen` 必须是 `64` 的倍数
-  - `src.GetValidCol()` 必须是 `blockLen * 4` 的整数倍
-  - `repeatTimes = src.GetValidCol() / (blockLen * 4)` 必须在 `[1, 255]`
-- A5 / Kirin9030 走的是同一类硬件归并路径，但这些约束在文档层仍然可以视为安全使用域。
+- 假设 `src` 中顺序摆放了 4 个已排序块。
+- `blockLen` 表示每个块的长度，包含记录值和索引/负载。
+- `blockLen` 必须是 64 的倍数。
+- `src.GetValidCol()` 必须是 `blockLen * 4` 的整数倍。
+- `repeatTimes = src.GetValidCol() / (blockLen * 4)` 必须在 [1, 255] 范围内。
+
+## 异常与非法情形
 
-### 目标说明
+- 输入 Tile 类型不是 `TileType::Vec` 时行为未定义。
+- `blockLen` 不是 64 的倍数时行为未定义。
+- `repeatTimes` 超出 [1, 255] 范围时行为未定义。
 
-- CPU 使用显式归并逻辑。
-- A2/A3 与 A5 使用 `vmrgsort4`。
-- Kirin9030 复用 A5 的 `TMRGSORT` 路径，只在末尾 UB->UB 搬运上用了一层适配。
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-### 单列表块归并
+### C++ 单列表块归并
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -117,7 +118,7 @@ void example_auto() {
 }
 ```
 
-### 双列表归并
+### C++ 双列表归并
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -134,5 +135,5 @@ void example_merge2() {
 
 ## 相关页面
 
-- [TSORT32](./tsort32_zh.md)
-- [不规则与复杂指令集](../../irregular-and-complex_zh.md)
+- 指令集总览：[不规则与复杂指令](../../irregular-and-complex_zh.md)
+- 相关指令：[TSORT32](./tsort32_zh.md)
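`pto.tmrgsort` 页面中单列表块归并的三条数值约束（`blockLen` 为 64 的倍数、`src.GetValidCol()` 为 `blockLen * 4` 的整数倍、`repeatTimes` 落在 [1, 255]）可以在调用前用类似下面的草图预先检查。`CheckBlockMergeArgs` 与 `ElementsPerRecord` 都是为说明而假设的辅助函数，不属于 PTO 库；后者按页面所述的 8 字节记录结构换算每条记录的元素数。

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: checks the single-list block-merge rules from the page.
// On success returns true and stores repeatTimes; otherwise returns false.
inline bool CheckBlockMergeArgs(std::uint32_t validCol, std::uint32_t blockLen,
                                std::uint32_t &repeatTimes) {
    if (blockLen == 0 || blockLen % 64 != 0) return false;
    // Four sorted blocks are merged per repeat; widen to avoid overflow.
    const std::uint64_t group = static_cast<std::uint64_t>(blockLen) * 4;
    if (validCol % group != 0) return false;
    const std::uint64_t times = validCol / group;
    if (times < 1 || times > 255) return false;
    repeatTimes = static_cast<std::uint32_t>(times);
    return true;
}

// Elements per 8-byte record: 2 for a 4-byte float, 4 for a 2-byte half type.
template <typename T>
constexpr std::size_t ElementsPerRecord() {
    return 8 / sizeof(T);
}
```

把检查放在主机侧的好处是：页面把违反这些约束的情形列为未定义行为，提前拒绝可以避免把非法参数交给后端。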
diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md b/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md
index 68dffa01..bfdd4467 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md
@@ -1,31 +1,16 @@
-﻿# TPARTADD
+﻿# pto.tpartadd
 
-## 指令示意图
+`pto.tpartadd` 属于[不规则与复杂指令](../../irregular-and-complex_zh.md)集。
 
-![TPARTADD tile operation](../../../../figures/isa/TPARTADD.svg)
-
-## 简介
+## 概述
 
 在目标有效区域内执行逐元素加法。若某个位置上 `src0` 和 `src1` 都有效，则结果为两者之和；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
 
-## 数学语义
-
-对目标有效区域内的每个元素 `(i, j)`：
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\
-\mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
-\end{cases}
-$$
-
-## 汇编语法
+对目标有效区域内的每个元素 `(i, j)`：若两个输入都有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j}$；若仅 src0 有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j}$；若仅 src1 有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src1}_{i,j}$。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
@@ -33,13 +18,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -52,31 +37,48 @@ template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, ty
 PTO_INST RecordEvent TPARTADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
 
-### 通用约束或检查
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile |
+| `src0` | 输入 | 第一个源 Tile |
+| `src1` | 输入 | 第二个源 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 逐元素加法结果 |
+
+## 副作用
+
+该指令可能会写入 Tile 的有效区域标记。
+
+## 约束
 
 - `dst`、`src0` 和 `src1` 的元素类型必须一致。
 - 目标有效区域定义结果的计算范围。
-- 对目标有效区域内的每个元素：
-  - 若两个输入都有效，则执行该指令对应的逐元素运算；
-  - 若只有一个输入有效，则结果直接取该输入的值。
+- 对目标有效区域内的每个元素：若两个输入都有效，则执行逐元素加法；若只有一个输入有效，则结果直接取该输入的值。
 - 若 `dst` 的有效区域为零，指令直接返回。
 - 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
 - 上述范围之外的有效区域组合，其行为均由具体实现定义。
+- A2/A3：支持 `int32_t`、`int16_t`、`half`、`float`，且 `dst`/`src0`/`src1` 必须全部为行主序。
+- A5：支持 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
+- `dst`、`src0` 和 `src1` 的元素类型不一致时行为未定义。
+- 超出支持的有效区域模式组合时行为由实现定义。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -90,7 +92,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -107,29 +109,14 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂指令](../../irregular-and-complex_zh.md)
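`pto.tpartadd` 页面描述的部分有效区域语义（两个输入都有效时求和，仅一个有效时直接取该输入）可以用下面的草图来核对。`RefTile` 与 `RefTpartadd` 是为说明而虚构的名称，存储假定为行主序；草图只覆盖页面列出的受支持模式，即至少一个源 Tile 的有效区域与 `dst` 完全一致，另一个在两个维度上都不超过 `dst`。

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side tile model for illustration; not a PTO library type.
template <typename T>
struct RefTile {
    std::size_t cols;                // physical column count
    std::size_t validRow, validCol;  // valid region
    std::vector<T> data;             // row-major storage (assumption)

    RefTile(std::size_t rows, std::size_t c, std::size_t vr, std::size_t vc,
            T fill)
        : cols(c), validRow(vr), validCol(vc), data(rows * c, fill) {
        assert(vr <= rows && vc <= c);
    }
    T &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const T &at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
    bool covers(std::size_t i, std::size_t j) const {
        return i < validRow && j < validCol;
    }
};

// Reference semantics of the partial add over dst's valid region.
template <typename T>
void RefTpartadd(RefTile<T> &dst, const RefTile<T> &a, const RefTile<T> &b) {
    const bool aFull = a.validRow == dst.validRow && a.validCol == dst.validCol;
    const bool bFull = b.validRow == dst.validRow && b.validCol == dst.validCol;
    // Supported pattern: one source matches dst, the other does not exceed it.
    assert(aFull || bFull);
    assert(a.validRow <= dst.validRow && a.validCol <= dst.validCol);
    assert(b.validRow <= dst.validRow && b.validCol <= dst.validCol);
    for (std::size_t i = 0; i < dst.validRow; ++i) {
        for (std::size_t j = 0; j < dst.validCol; ++j) {
            const bool inA = a.covers(i, j);
            const bool inB = b.covers(i, j);
            if (inA && inB) {
                dst.at(i, j) = a.at(i, j) + b.at(i, j);
            } else {
                // Exactly one source is valid here under the asserted pattern.
                dst.at(i, j) = inA ? a.at(i, j) : b.at(i, j);
            }
        }
    }
}
```

把加法换成 `max` 或 `min` 即可得到 `pto.tpartmax` 与 `pto.tpartmin` 的同类草图，三条指令在有效区域处理上的描述是一致的。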
diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md b/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md
index c2898d24..1b861ebc 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md
@@ -1,31 +1,16 @@
-﻿# TPARTMAX
+﻿# pto.tpartmax
 
-## 指令示意图
+`pto.tpartmax` 属于[不规则与复杂指令](../../irregular-and-complex_zh.md)集。
 
-![TPARTMAX tile operation](../../../../figures/isa/TPARTMAX.svg)
-
-## 简介
+## 概述
 
 在目标有效区域内执行逐元素最大值选择。若某个位置上 `src0` 和 `src1` 都有效，则结果为 `max(src0, src1)`；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
 
-## 数学语义
-
-对目标有效区域内的每个元素 `(i, j)`：
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\\\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\\\
-\mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
-\end{cases}
-$$
-
-## 汇编语法
+对目标有效区域内的每个元素 `(i, j)`：若两个输入都有定义，则 $\mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j})$；若仅 src0 有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j}$；若仅 src1 有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src1}_{i,j}$。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
@@ -33,13 +18,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -52,31 +37,48 @@ template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, ty
 PTO_INST RecordEvent TPARTMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
 
-### 通用约束或检查
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile |
+| `src0` | 输入 | 第一个源 Tile |
+| `src1` | 输入 | 第二个源 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 逐元素最大值结果 |
+
+## 副作用
+
+该指令可能会写入 Tile 的有效区域标记。
+
+## 约束
 
 - `dst`、`src0` 和 `src1` 的元素类型必须一致。
 - 目标有效区域定义结果的计算范围。
-- 对目标有效区域内的每个元素：
-  - 若两个输入都有效，则执行逐元素最大值运算；
-  - 若只有一个输入有效，则结果直接取该输入的值。
+- 对目标有效区域内的每个元素：若两个输入都有效，则执行逐元素最大值运算；若只有一个输入有效，则结果直接取该输入的值。
 - 若 `dst` 的有效区域为零，指令直接返回。
 - 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
 - 上述范围之外的有效区域组合，其行为均由具体实现定义。
+- A2/A3：支持 `int32_t`、`int16_t`、`half`、`float`，且 `dst`/`src0`/`src1` 必须全部为行主序。
+- A5：支持 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
+- `dst`、`src0` 和 `src1` 的元素类型不一致时行为未定义。
+- 超出支持的有效区域模式组合时行为由实现定义。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -90,7 +92,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -107,29 +109,14 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂指令](../../irregular-and-complex_zh.md)
diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md b/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md
index ee1ddcbc..02f757c9 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md
@@ -1,31 +1,16 @@
-﻿# TPARTMIN
+﻿# pto.tpartmin
 
-## 指令示意图
+`pto.tpartmin` 属于[不规则与复杂指令](../../irregular-and-complex_zh.md)集。
 
-![TPARTMIN tile operation](../../../../figures/isa/TPARTMIN.svg)
-
-## 简介
+## 概述
 
 在目标有效区域内执行逐元素最小值选择。若某个位置上 `src0` 和 `src1` 都有效，则结果为 `min(src0, src1)`；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
 
-## 数学语义
-
-对目标有效区域内的每个元素 `(i, j)`：
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\\\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\\\
-\mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
-\end{cases}
-$$
-
-## 汇编语法
+对目标有效区域内的每个元素 `(i, j)`：若两个输入都有定义，则 $\mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j})$；若仅 `src0` 有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j}$；若仅 `src1` 有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src1}_{i,j}$。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
@@ -33,13 +18,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -52,31 +37,48 @@ template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, ty
 PTO_INST RecordEvent TPARTMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
 
-### 通用约束或检查
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile |
+| `src0` | 输入 | 第一个源 Tile |
+| `src1` | 输入 | 第二个源 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 逐元素最小值结果 |
+
+## 副作用
+
+该指令可能会写入 Tile 的有效区域标记。
+
+## 约束
 
 - `dst`、`src0` 和 `src1` 的元素类型必须一致。
 - 目标有效区域定义结果的计算范围。
-- 对目标有效区域内的每个元素：
-  - 若两个输入都有效，则执行逐元素最小值运算；
-  - 若只有一个输入有效，则结果直接取该输入的值。
+- 对目标有效区域内的每个元素：若两个输入都有效，则执行逐元素最小值运算；若只有一个输入有效，则结果直接取该输入的值。
 - 若 `dst` 的有效区域为零，指令直接返回。
 - 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
 - 上述范围之外的有效区域组合，其行为均由具体实现定义。
+- A2/A3：支持 `int32_t`、`int16_t`、`half`、`float`，且 `dst`/`src0`/`src1` 必须全部为行主序。
+- A5：支持 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
+- `dst`、`src0`、`src1` 的元素类型不一致时行为未定义。
+- 超出支持的有效区域模式组合时行为由实现定义。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -90,7 +92,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -107,29 +109,14 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂指令](../../irregular-and-complex_zh.md)
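
The constraint on supported partial-valid-region patterns (one source matches `dst` exactly, the other does not exceed `dst` in either dimension) is easy to check before issuing the instruction. The following is a small sketch under stated assumptions: `ValidRegion` and `isSupportedPartialPattern` are hypothetical helpers, not part of the PTO headers, and an empty `dst` region is treated as trivially supported because the page says the instruction returns immediately in that case.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a tile's valid region; not a PTO type.
struct ValidRegion {
    std::size_t rows;
    std::size_t cols;
};

inline bool sameRegion(const ValidRegion &a, const ValidRegion &b) {
    return a.rows == b.rows && a.cols == b.cols;
}

inline bool fitsInside(const ValidRegion &inner, const ValidRegion &outer) {
    return inner.rows <= outer.rows && inner.cols <= outer.cols;
}

// True when the combination matches the documented supported pattern.
// Any other combination is implementation-defined according to the page.
bool isSupportedPartialPattern(const ValidRegion &dst, const ValidRegion &src0, const ValidRegion &src1) {
    if (dst.rows == 0 || dst.cols == 0) {
        return true;  // empty dst: the instruction returns immediately
    }
    return (sameRegion(src0, dst) && fitsInside(src1, dst)) ||
           (sameRegion(src1, dst) && fitsInside(src0, dst));
}
```

A check like this could sit in a debug-only assertion in front of `TPARTMAX`, `TPARTMIN`, or `TPARTMUL` calls, so that implementation-defined combinations are caught on the host rather than observed as backend-specific results.
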
diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md b/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md
index a3032a62..6bcaac4c 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md
@@ -1,31 +1,16 @@
-﻿# TPARTMUL
+﻿# pto.tpartmul
 
-## 指令示意图
+`pto.tpartmul` 属于[不规则与复杂指令](../../irregular-and-complex_zh.md)集。
 
-![TPARTMUL tile operation](../../../../figures/isa/TPARTMUL.svg)
-
-## 简介
+## 概述
 
 在目标有效区域内执行逐元素乘法。若某个位置上 `src0` 和 `src1` 都有效，则结果为两者之积；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
 
-## 数学语义
-
-对目标有效区域内的每个元素 `(i, j)`：
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\\\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\\\
-\mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
-\end{cases}
-$$
-
-## 汇编语法
+对目标有效区域内的每个元素 `(i, j)`：若两个输入都有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j}$；若仅 `src0` 有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j}$；若仅 `src1` 有定义，则 $\mathrm{dst}_{i,j} = \mathrm{src1}_{i,j}$。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
@@ -33,13 +18,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -52,31 +37,48 @@ template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, ty
 PTO_INST RecordEvent TPARTMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
 
-### 通用约束或检查
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile |
+| `src0` | 输入 | 第一个源 Tile |
+| `src1` | 输入 | 第二个源 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 逐元素乘法结果 |
+
+## 副作用
+
+该指令可能会写入 Tile 的有效区域标记。
+
+## 约束
 
 - `dst`、`src0` 和 `src1` 的元素类型必须一致。
 - 目标有效区域定义结果的计算范围。
-- 对目标有效区域内的每个元素：
-  - 若两个输入都有效，则执行该指令对应的逐元素运算；
-  - 若只有一个输入有效，则结果直接取该输入的值。
+- 对目标有效区域内的每个元素：若两个输入都有效，则执行逐元素乘法；若只有一个输入有效，则结果直接取该输入的值。
 - 若 `dst` 的有效区域为零，指令直接返回。
 - 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
 - 上述范围之外的有效区域组合，其行为均由具体实现定义。
+- A2/A3：支持 `int32_t`、`int16_t`、`half`、`float`，且 `dst`/`src0`/`src1` 必须全部为行主序。
+- A5：支持 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
 
-### A2A3 实现检查
+## 异常与非法情形
 
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
+- `dst`、`src0`、`src1` 的元素类型不一致时行为未定义。
+- 超出支持的有效区域模式组合时行为由实现定义。
 
-### A5 实现检查
+## Target-Profile 限制
 
-- 支持的元素类型：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -89,7 +91,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -105,29 +107,14 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂指令](../../irregular-and-complex_zh.md)
diff --git a/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md b/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md
index 5c5afd63..94579d16 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md
@@ -1,30 +1,26 @@
-﻿# TPRINT
+﻿# pto.tprint
 
-## 指令示意图
+`pto.tprint` 属于[不规则与复杂](../../irregular-and-complex_zh.md)指令集。
 
-![TPRINT tile operation](../../../../figures/isa/TPRINT.svg)
-
-## 简介
-
-调试/打印 Tile 中的元素（实现定义）。
+## 概述
 
-从设备代码直接打印 Tile 或 GlobalTensor 的内容以用于调试目的。
+`TPRINT` 用于调试：从设备代码直接打印 Tile 或 GlobalTensor 的内容，输出的是其中数据的逻辑视图（具体输出格式由实现定义）。它支持常见的数据类型（例如 `float`、`half`、`int8_t`、`uint32_t`）和多种内存布局（GlobalTensor 的 `ND`、`DN`、`NZ`；片上缓冲区的向量 Tile）。
 
-`TPRINT` 指令输出存储在 Tile 或 GlobalTensor 中的数据的逻辑视图。它支持常见的数据类型（例如 `float`、`half`、`int8`、`uint32`）和多种内存布局（GlobalTensor 的 `ND`、`DN`、`NZ`；片上缓冲区的向量 tiles）。
-
-> **重要**:
+> **重要**：
 > - 此指令**仅用于开发和调试**。
 > - 它会产生**显著的运行时开销**，**不得在生产 kernel 中使用**。
-> - 如果输出超过内部打印缓冲区，可能会被**截断**。可以通过在编译选项中添加`-DCCEBlockMaxSize=16384`来修改打印缓冲区，默认为16KB。
-> - **需要 CCE 编译选项 `-D_DEBUG --cce-enable-print`**（参见 [行为](#behavior)）。
+> - 如果输出超过内部打印缓冲区，可能会被**截断**。可以通过编译选项 `-DCCEBlockMaxSize=16384` 调整打印缓冲区大小；默认值为 16 KB。
+> - **需要 CCE 编译选项 `-D_DEBUG --cce-enable-print`**。
 
-## 数学语义
+## 机制
 
 除非另有说明，语义在有效区域上定义，目标相关的行为标记为实现定义。
 
-## 汇编语法
+## 语法
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ```text
 tprint %src : !pto.tile<...> | !pto.global<...>
@@ -32,47 +28,72 @@ tprint %src : !pto.tile<...> | !pto.global<...>
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
 ```
 
 ## C++ 内建接口
 
 声明于 `include/pto/common/pto_instr.hpp`：
+
 ```cpp
-// 适用于打印GlobalTensor或Vec类型Tile
+// 适用于打印 GlobalTensor 或 Vec 类型 Tile
 template <typename TileData>
 PTO_INST void TPRINT(TileData &src);
 
-// 适用于打印Acc类型Tile和Mat类型Tile(Mat打印仅适用于A3，A5暂不支持)
+// 适用于打印 Acc 类型 Tile 和 Mat 类型 Tile（Mat 打印仅适用于 A3，A5 暂不支持）
 template <typename TileData, typename GlobalData>
 PTO_INTERNAL void TPRINT(TileData &src, GlobalData &tmp);
 ```
 
-### 支持的 T 类型
-- **Tile**：TileType必须是`Vec`、`Acc`、`Mat(仅A3支持)`，并具有支持的元素类型。
-- **GlobalTensor**：必须使用布局 `ND`、`DN` 或 `NZ`，并具有支持的元素类型。
+支持的操作数类型：Tile 的 `TileType` 必须是 `Vec`、`Acc` 或 `Mat`（`Mat` 仅 A3 支持），并具有支持的元素类型；GlobalTensor 必须使用布局 `ND`、`DN` 或 `NZ`，并具有支持的元素类型。
+
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| src | 输入 Tile/GlobalTensor | 要打印的 Tile 或 GlobalTensor |
+| tmp | 临时空间（仅 Mat/Acc） | 打印 Mat 或 Acc 类型 Tile 时需要传入 GM 上的临时空间 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| - | 控制台输出 | Tile 或 GlobalTensor 数据的逻辑视图 |
+
+## 副作用
+
+将 Tile 或 GlobalTensor 中的数据输出到控制台，并带来显著的运行时开销。
 
 ## 约束
 
-- **支持的元素类型**:
-    - 浮点数：`float`、`half`
-    - 有符号整数：`int8_t`、`int16_t`、`int32_t`
-    - 无符号整数：`uint8_t`、`uint16_t`、`uint32_t`
-- **对于 GlobalTensor**：布局必须是 `Layout::ND`、`Layout::DN` 或 `Layout::NZ` 之一。
-- **对于 临时空间**：打印`TileType`为`Mat`或`Acc`的Tile时需要传入gm上的临时空间，临时空间不得小于`TileData::Numel * sizeof(T)`。
-- A5暂不支持`TileType`为`Mat`的Tile打印。
-- **回显信息**: `TileType`为`Mat`时，布局将按照`Layout::ND`进行打印，其他布局可能会导致信息错位。
+支持的元素类型：浮点数为 `float`、`half`；有符号整数为 `int8_t`、`int16_t`、`int32_t`；无符号整数为 `uint8_t`、`uint16_t`、`uint32_t`。
+
+对于 GlobalTensor：布局必须是 `Layout::ND`、`Layout::DN` 或 `Layout::NZ` 之一。
+
+对于临时空间：打印 `TileType` 为 `Mat` 或 `Acc` 的 Tile 时需要传入 GM 上的临时空间，临时空间不得小于 `TileData::Numel * sizeof(T)`。
+
+A5 暂不支持 `TileType` 为 `Mat` 的 Tile 打印。
+
+回显信息：`TileType` 为 `Mat` 时，布局将按照 `Layout::ND` 进行打印，其他布局可能会导致信息错位。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| Vec Tile 打印 | 支持 | 支持 | 支持 |
+| Acc Tile 打印 | 支持 | 支持 | 支持 |
+| Mat Tile 打印 | 支持 | 支持 | 不支持 |
 
 ## 示例
 
-### Print a Tile
+### C++ 打印 Tile
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -92,7 +113,7 @@ PTO_INTERNAL void DebugTile(__gm__ float *src) {
 }
 ```
 
-### Print a GlobalTensor
+### C++ 打印 GlobalTensor
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -107,29 +128,21 @@ PTO_INTERNAL void DebugGlobalTensor(__gm__ float *src) {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
 # 自动模式：由编译器/运行时负责资源放置与调度。
 pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### 手动模式
-
-```text
 # 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
 # pto.tassign %arg0, @tile(0x1000)
 # pto.tassign %arg1, @tile(0x2000)
 pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### PTO 汇编形式
-
-```text
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
 # AS Level 2 (DPS)
 pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
 ```
+
+## 相关页面
+
+- [不规则与复杂指令集](../../irregular-and-complex_zh.md)
+
+![TPRINT tile operation](../../../../figures/isa/TPRINT.svg)
diff --git a/docs/isa/tile/ops/irregular-and-complex/tquant_zh.md b/docs/isa/tile/ops/irregular-and-complex/tquant_zh.md
index 02c4d55f..4614afdf 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tquant_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tquant_zh.md
@@ -1,19 +1,12 @@
-# TQUANT
+# pto.tquant
 
-## 指令示意图
+`pto.tquant` 属于[不规则与复杂指令](../../irregular-and-complex_zh.md)集。
 
-![TQUANT tile operation](../../../../figures/isa/TQUANT.svg)
+## 概述
 
-## 简介
+`TQUANT` 把高精度 Tile 量化成较低精度表示，并在需要时同时产出量化元数据。它不是一条单一模式的指令，而是一组按模板参数分化出来的量化接口。当前仓库里最重要的两类路径是 `INT8_SYM / INT8_ASYM` 和 `MXFP8`。
 
-`TQUANT` 把高精度 Tile 量化成较低精度表示，并在需要时同时产出量化元数据。它不是一条单一模式的指令，而是一组按模板参数分化出来的量化接口。
-
-当前仓库里最重要的两类路径是：
-
-- `INT8_SYM / INT8_ASYM`
-- `MXFP8`
-
-## 模式
+## 机制
 
 ### INT8_SYM
 
@@ -35,25 +28,27 @@ $$ q = \mathrm{round}(x \cdot scale + offset) $$
 
 按组计算共享指数与缩放信息，再生成低精度输出，同时产出辅助元数据：
 
-- `exp`
-- `max`
-- `scaling`
+- `exp`：共享指数
+- `max`：每组绝对值最大值
+- `scaling`：每元素缩放值
 
 在 CPU 模拟器里，`MXFP8` 还额外支持一条 NZ 辅助重排接口，用于生成 `exp_zz`。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+### PTO-AS
+
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -78,52 +73,90 @@ PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataPara &sc
                             TileDataPara *offset = nullptr, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | 源 Tile（通常为 float32） |
+| `scale` | 输入/参数 | 缩放因子 |
+| `offset` | 输入/参数（可选） | 非对称量化的偏移量 |
+| `exp` | 输出（MXFP8） | 共享指数 |
+| `max` | 输出（MXFP8） | 每组绝对值最大值 |
+| `scaling` | 输出（MXFP8） | 每元素缩放值 |
+| `exp_zz` | 输出（可选） | ZZ 形式的指数 |
+| `vgather_idx` | 输出（可选） | VGather 索引 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 量化后的低精度结果 |
+
+## 副作用
+
+可能修改输出 Tile 的有效区域。
+
 ## 约束
 
 ### A2/A3 实现
 
 - A2/A3 当前只实现：
-  - `INT8_SYM`
-  - `INT8_ASYM`
-- 输入类型必须是 `float32_t`。
-- `INT8_SYM` 输出必须是 `int8_t`。
-- `INT8_ASYM` 输出必须是 `uint8_t`。
-- A2/A3 的实现会先对输入做按行扩展乘法/加法，再通过中间 `TRESHAPE/TCVT` 路径完成量化。
+    - `INT8_SYM`
+    - `INT8_ASYM`
+- 输入类型必须是 `float32_t`
+- `INT8_SYM` 输出必须是 `int8_t`
+- `INT8_ASYM` 输出必须是 `uint8_t`
+- A2/A3 的实现会先对输入做按行扩展乘法/加法，再通过中间 `TRESHAPE/TCVT` 路径完成量化
 
 ### A5 实现
 
 - A5 实现：
-  - `INT8_SYM`
-  - `INT8_ASYM`
-  - `MXFP8`
-- `INT8_*` 路径输入同样要求 `float32_t`。
+    - `INT8_SYM`
+    - `INT8_ASYM`
+    - `MXFP8`
+- `INT8_*` 路径输入同样要求 `float32_t`
 - `MXFP8` 路径会输出：
-  - 量化后的低精度结果 `dst`
-  - 共享指数 `exp`
-  - 每组绝对值最大值 `max`
-  - 每元素缩放值 `scaling`
+    - 量化后的低精度结果 `dst`
+    - 共享指数 `exp`
+    - 每组绝对值最大值 `max`
+    - 每元素缩放值 `scaling`
 - A5 源码里还明确写了：
-  - `E8M0` 指数默认按 ND 形式输出
-  - 如果需要 ZZ 形式指数，应再借助 `TMOV` 等路径做后续转换
+    - `E8M0` 指数默认按 ND 形式输出
+    - 如果需要 ZZ 形式指数，应再借助 `TMOV` 等路径做后续转换
 
 ### CPU 模拟器
 
 - CPU 模拟器支持的模拟面比 NPU backend 更宽：
-  - `INT8_SYM`
-  - `INT8_ASYM`
-  - `MXFP8`
-  - 以及 `MXFP8 + NZ` 辅助重排接口
-- 但 CPU 上能跑通，并不等于所有 NPU backend 都支持同一模式。
+    - `INT8_SYM`
+    - `INT8_ASYM`
+    - `MXFP8`
+    - 以及 `MXFP8 + NZ` 辅助重排接口
+- 但 CPU 上能跑通，并不等于所有 NPU backend 都支持同一模式
 
 ### 使用建议
 
-- 如果目标是 A2/A3，可把 `TQUANT` 理解为“INT8 量化指令族”。
-- 如果目标是 A5，才应把 `MXFP8` 当成主路径之一。
-- 如果你依赖 `exp_zz` 或 `vgather_idx` 这类接口，先确认目标 backend 是否真的实现了它，而不要只看通用 C++ 声明。
+- 如果目标是 A2/A3，可把 `TQUANT` 理解为“INT8 量化指令族”
+- 如果目标是 A5，才应把 `MXFP8` 当成主路径之一
+- 如果你依赖 `exp_zz` 或 `vgather_idx` 这类接口，先确认目标 backend 是否真的实现了它
+
+## 异常与非法情形
+
+- 若输入类型不是 `float32_t` 且目标为 `INT8_*`，行为未定义
+- 若输出类型与量化模式不匹配，编译失败
+- 若 `MXFP8` 在不支持的 profile 上使用，行为未定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| INT8_SYM | 是 | 是 | 是 |
+| INT8_ASYM | 是 | 是 | 是 |
+| MXFP8 | 是 | 否 | 是 |
+| MXFP8 + NZ | 是 | 否 | 否 |
 
 ## 示例
 
-### 对称 INT8
+### C++ 对称 INT8
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -142,7 +175,7 @@ void example_int8_sym() {
 }
 ```
 
-### MXFP8
+### C++ MXFP8
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -163,8 +196,21 @@ void example_mxfp8() {
 }
 ```
 
+### PTO-AS
+
+```text
+# INT8_SYM 量化
+%dst = pto.tquant %src, %scale : (!pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 1>) -> !pto.tile<i8, 16, 16>
+
+# INT8_ASYM 量化
+%dst = pto.tquant %src, %scale, %offset : (!pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 1>, !pto.tile<f32, 16, 1>) -> !pto.tile<u8, 16, 16>
+
+# MXFP8 量化
+%dst, %exp, %max, %scaling = pto.tquant %src : (!pto.tile<f32, 16, 32>) -> (!pto.tile<i8, 16, 32>, !pto.tile<u8, 1, 16>, !pto.tile<f32, 1, 16>, !pto.tile<f32, 16, 32>)
+```
+
 ## 相关页面
 
+- 指令集总览：[不规则与复杂指令](../../irregular-and-complex_zh.md)
 - [TMOV](../layout-and-rearrangement/tmov_zh.md)
 - [TMOV_FP](../layout-and-rearrangement/tmov-fp_zh.md)
-- [不规则与复杂指令集](../../irregular-and-complex_zh.md)
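
The two INT8 paths on the `pto.tquant` page can be pictured with a scalar reference. This is only a sketch under assumptions that the page does not state: one scale (and offset) per row, saturation to the output range, and `std::round` tie handling. `refQuantInt8Sym` and `refQuantInt8Asym` are hypothetical names; the symmetric form is assumed to be the documented asymmetric formula $q = \mathrm{round}(x \cdot scale + offset)$ with the offset dropped.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round to nearest and clamp into [lo, hi]; saturation is an assumption of this sketch.
inline float roundAndSaturate(float x, float lo, float hi) {
    return std::min(std::max(std::round(x), lo), hi);
}

// INT8_SYM sketch: float32 in, int8_t out, q = round(x * scale).
std::vector<std::int8_t> refQuantInt8Sym(const std::vector<float> &src, std::size_t rows, std::size_t cols,
                                         const std::vector<float> &rowScale) {
    std::vector<std::int8_t> dst(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            const float q = roundAndSaturate(src[i * cols + j] * rowScale[i], -128.0f, 127.0f);
            dst[i * cols + j] = static_cast<std::int8_t>(q);
        }
    }
    return dst;
}

// INT8_ASYM sketch: float32 in, uint8_t out, q = round(x * scale + offset).
std::vector<std::uint8_t> refQuantInt8Asym(const std::vector<float> &src, std::size_t rows, std::size_t cols,
                                           const std::vector<float> &rowScale,
                                           const std::vector<float> &rowOffset) {
    std::vector<std::uint8_t> dst(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            const float q = roundAndSaturate(src[i * cols + j] * rowScale[i] + rowOffset[i], 0.0f, 255.0f);
            dst[i * cols + j] = static_cast<std::uint8_t>(q);
        }
    }
    return dst;
}
```

The output types mirror the documented constraint that `INT8_SYM` produces `int8_t` and `INT8_ASYM` produces `uint8_t`; the `MXFP8` path is not modeled here.
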
diff --git a/docs/isa/tile/ops/irregular-and-complex/tscatter_zh.md b/docs/isa/tile/ops/irregular-and-complex/tscatter_zh.md
index e22afb5f..f564c9c5 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tscatter_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tscatter_zh.md
@@ -1,37 +1,26 @@
-# TSCATTER
+# pto.tscatter
 
-## 指令示意图
+`pto.tscatter` 属于[不规则与复杂](../../irregular-and-complex_zh.md)指令集。
 
-![TSCATTER tile operation](../../../../figures/isa/TSCATTER.svg)
-
-## 简介
-
-`TSCATTER` 按索引 Tile 给出的目标偏移，把源 Tile 中的元素分散写入目标 Tile。它适合表达“不规则写回到本地 Tile”这类模式：数据仍然留在 tile 空间里，但目的位置不再由规则的行列映射决定。
-
-和规则搬运不同，`TSCATTER` 的关键输入不是另一个 shape，而是 `indexes`。每个索引元素都表示目标 Tile 在线性存储视角下的一个元素偏移。
+## 概述
 
-## 数学语义
+`TSCATTER` 按索引 Tile 给出的目标偏移，把源 Tile 中的元素分散写入目标 Tile。它适合表达“不规则写回到本地 Tile”这类模式：数据仍然留在 tile 空间里，但目的位置不再由规则的行列映射决定。和规则搬运不同，`TSCATTER` 的关键输入不是另一个 shape，而是 `indexes`。每个索引元素都表示目标 Tile 在线性存储视角下的一个元素偏移。
 
-设：
+## 机制
 
-- `R = indexes.GetValidRow()`
-- `C = indexes.GetValidCol()`
-
-当前实现会先把整个 `dst` Tile 清零，然后对 `0 <= i < R`、`0 <= j < C` 执行：
+设 `R = indexes.GetValidRow()`，`C = indexes.GetValidCol()`。当前实现会先把整个 `dst` Tile 清零，然后对 `0 <= i < R`、`0 <= j < C` 执行：
 
 $$ \mathrm{dst\_flat}_{\mathrm{indexes}_{i,j}} = \mathrm{src}_{i,j} $$
 
 其中 `dst_flat` 表示把目标 Tile 按其存储顺序看成一段线性序列。对标准 row-major Tile 来说，这就是“写到扁平化后的第 `k` 个元素”。
 
-这条语义有三个读者容易忽略的点：
+有三个容易被忽略的语义点：（1）`TSCATTER` 按 `indexes` 的 valid region 遍历，而不是按 `dst` 的 valid region 遍历；（2）没有被任何索引命中的目标位置，在当前实现里保持为零值，而不是保留调用前的 `dst` 内容；（3）如果多个源元素落到同一个目标位置，最终值属于实现定义行为，当前实现按行优先顺序遍历，因此后写入的元素会覆盖前值。
 
-- `TSCATTER` 按 `indexes` 的 valid region 遍历，而不是按 `dst` 的 valid region 遍历。
-- 没有被任何索引命中的目标位置，在当前实现里保持为零值，而不是保留调用前的 `dst` 内容。
-- 如果多个源元素落到同一个目标位置，最终值属于实现定义行为；当前实现按行优先顺序遍历，因此后写入的元素会覆盖前值。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -41,13 +30,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -57,45 +46,45 @@ pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst
 
 ```cpp
 template <typename TileDataD, typename TileDataS, typename TileDataI, typename... WaitEvents>
-PTO_INST RecordEvent TSCATTER(TileDataD &dst, TileDataS &src, TileDataI &indexes, WaitEvents &... events);
+PTO_INST RecordEvent TSCATTER(TileDataD &dst, TileDataS &src, TileDataI &indexes, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| dst | 输出 Tile | 目标 Tile |
+| src | 输入 Tile | 源 Tile |
+| indexes | 输入 Tile | 索引 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| dst | Tile | 按索引分散写入后的 Tile，先被清零 |
+
+## 副作用
+
+当前实现会先把整个 `dst` Tile 清零，未被任何索引命中的位置不保留调用前的内容。当前 backend 不会对索引做越界检查，超出目标 Tile 线性范围的索引不属于合法使用域。
+
 ## 约束
 
-### 通用约束
-
-- `dst`、`src`、`indexes` 都必须是 `TileType::Vec`。
-- `dst` 与 `src` 的元素类型必须完全一致。
-- `indexes` 必须是整型 Tile，且索引元素宽度必须和数据元素宽度匹配：
-  - 数据为 4 字节时，索引也必须是 4 字节；
-  - 数据为 2 字节时，索引也必须是 2 字节；
-  - 数据为 1 字节时，索引必须是 2 字节。
-- 实现按 `indexes.GetValidRow()` / `indexes.GetValidCol()` 遍历。可移植代码应保证 `src` 在同一坐标域内可读。
-- 当前 backend 不会对索引做越界检查。超出目标 Tile 线性范围的索引不属于合法使用域。
-- 编译期 valid bounds 必须满足：
-  - `TileDataD::ValidRow <= TileDataD::Rows`
-  - `TileDataD::ValidCol <= TileDataD::Cols`
-  - `TileDataS::ValidRow <= TileDataS::Rows`
-  - `TileDataS::ValidCol <= TileDataS::Cols`
-  - `TileDataI::ValidRow <= TileDataI::Rows`
-  - `TileDataI::ValidCol <= TileDataI::Cols`
-
-### A2/A3 实现检查
-
-- `dst/src` 的元素类型必须属于：
-  `int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t`、`uint8_t`、`half`、`float32_t`、`bfloat16_t`。
-- `indexes` 的元素类型必须属于：
-  `int16_t`、`int32_t`、`uint16_t`、`uint32_t`。
-
-### A5 实现检查
-
-- A5 上的类型限制与 A2/A3 相同：
-  - `dst/src` 支持 `int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t`、`uint8_t`、`half`、`float32_t`、`bfloat16_t`
-  - `indexes` 支持 `int16_t`、`int32_t`、`uint16_t`、`uint32_t`
+通用约束：（1）`dst`、`src`、`indexes` 都必须是 `TileType::Vec`；（2）`dst` 与 `src` 的元素类型必须完全一致；（3）`indexes` 必须是整型 Tile，且索引元素宽度必须和数据元素宽度匹配：数据为 4 字节时索引必须是 4 字节，数据为 2 字节时索引必须是 2 字节，数据为 1 字节时索引必须是 2 字节；（4）实现按 `indexes.GetValidRow()` / `indexes.GetValidCol()` 遍历，可移植代码应保证 `src` 在同一坐标域内可读；（5）当前 backend 不会对索引做越界检查，超出目标 Tile 线性范围的索引不属于合法使用域；（6）编译期 valid bounds 必须满足 `TileDataD::ValidRow <= TileDataD::Rows`、`TileDataD::ValidCol <= TileDataD::Cols`、`TileDataS::ValidRow <= TileDataS::Rows`、`TileDataS::ValidCol <= TileDataS::Cols`、`TileDataI::ValidRow <= TileDataI::Rows`、`TileDataI::ValidCol <= TileDataI::Cols`。
+
+A2/A3 实现检查：`dst/src` 的元素类型必须属于 `int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t`、`uint8_t`、`half`、`float32_t`、`bfloat16_t`；`indexes` 的元素类型必须属于 `int16_t`、`int32_t`、`uint16_t`、`uint32_t`。
+
+A5 实现检查：A5 上的类型限制与 A2/A3 相同，`dst/src` 支持 `int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t`、`uint8_t`、`half`、`float32_t`、`bfloat16_t`，`indexes` 支持 `int16_t`、`int32_t`、`uint16_t`、`uint32_t`。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| dst/src 数据类型 | - | int32_t、int16_t、int8_t、uint32_t、uint16_t、uint8_t、half、float32_t、bfloat16_t | 同 A2/A3 |
+| indexes 数据类型 | - | int16_t、int32_t、uint16_t、uint32_t | 同 A2/A3 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -111,7 +100,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -130,8 +119,23 @@ void example_manual() {
 }
 ```
 
+### PTO-AS
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# 手动模式：先显式绑定资源，再发射指令。
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
 ## 相关页面
 
 - [不规则与复杂指令集](../../irregular-and-complex_zh.md)
 - [布局参考](../../../state-and-types/layout_zh.md)
 - [数据格式](../../../state-and-types/data-format_zh.md)
+
+![TSCATTER tile operation](../../../../figures/isa/TSCATTER.svg)
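
The `pto.tscatter` mechanism (clear `dst`, walk the valid region of `indexes` in row-major order, write into the flattened destination) can be written down as a short scalar model. This is a sketch with assumptions: tiles are plain row-major vectors, `cols` is a shared row pitch for `src` and `indexes`, and out-of-range indexes are skipped here, whereas the page says the backend performs no bounds check and treats such indexes as outside the legal domain. `refScatter` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of the documented behavior of the current implementation.
std::vector<float> refScatter(std::size_t dstNumel, const std::vector<float> &src,
                              const std::vector<std::uint32_t> &indexes, std::size_t validRows,
                              std::size_t validCols, std::size_t cols) {
    std::vector<float> dst(dstNumel, 0.0f);  // the whole destination is zeroed first
    for (std::size_t i = 0; i < validRows; ++i) {
        for (std::size_t j = 0; j < validCols; ++j) {
            const std::uint32_t k = indexes[i * cols + j];
            // Sketch-only guard: skip indexes outside the flattened destination.
            if (k < dstNumel) {
                dst[k] = src[i * cols + j];  // later writes overwrite earlier ones
            }
        }
    }
    return dst;
}
```

With `indexes = {3, 0, 3, 5}` over a 2x2 source `{1, 2, 3, 4}` and a six-element destination, position 3 receives two writes; the row-major traversal means the later source element (3) is the one that remains, and positions never hit by an index stay zero.
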
diff --git a/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md b/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md
index 823b9fa6..0ef091ee 100644
--- a/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md
@@ -1,42 +1,16 @@
-﻿# TSORT32
+﻿# pto.tsort32
 
-## 指令示意图
+`pto.tsort32` 属于[不规则与复杂指令](../../irregular-and-complex_zh.md)集。
 
-![TSORT32 tile operation](../../../../figures/isa/TSORT32.svg)
+## 概述
 
-## 简介
+对 `src` 的每个 32 元素块，与 `idx` 中对应的索引一起进行排序，并将排序后的值-索引对写入 `dst`。`idx` 是输入 Tile 而非输出 Tile，提供与 `src` 一起参与重排的索引。`dst` 保存的是排序后的值-索引对，而不只是排序后的值。在 CPU 仿真实现中，按值降序排序；当值相同时，索引较小者优先。
 
-对 `src` 的每个 32 元素块，与 `idx` 中对应的索引一起进行排序，并将排序后的值-索引对写入 `dst`。
+对每一行，`TSORT32` 按独立的 32 元素块处理 `src`。设 `C = src.GetValidCol()`，第 `b` 个块覆盖列 `32b ... 32b+31`，该块的有效元素数为 `n_b = min(32, C - 32b)`。对于块中的每个有效元素，先构造二元组 $(v_k, i_k) = (\mathrm{src}_{r,32b+k}, \mathrm{idx}_{r,32b+k})$（$0 \le k < n_b$），然后按值对这些二元组排序。
 
-## 数学语义
+## 语法
 
-对每一行，`TSORT32` 会按独立的 32 元素块处理 `src`。设第 `b` 个块覆盖列 `32b ... 32b+31`，该块的有效元素数为 `n_b = min(32, C - 32b)`。
-
-对于块中的每个有效元素，先构造一个二元组：
-
-$$
-(v_k, i_k) = (\mathrm{src}_{r,32b+k}, \mathrm{idx}_{r,32b+k}), \quad 0 \le k < n_b
-$$
-
-然后按值对这些二元组排序，并将排序后的值-索引对写入 `dst`。`dst` 中的具体打包布局由目标实现定义，但从语义上看，每个块的输出可表示为：
-
-$$
-[(v_{\pi(0)}, i_{\pi(0)}), (v_{\pi(1)}, i_{\pi(1)}), \ldots, (v_{\pi(n_b-1)}, i_{\pi(n_b-1)})]
-$$
-
-其中 `π` 是该 32 元素块对应的排序置换。
-
-说明：
-
-- `idx` 是输入 Tile，不是输出 Tile。
-- `dst` 保存的是排序后的值-索引对，而不只是排序后的值。
-- 在 CPU 仿真实现中，按值降序排序；当值相同时，索引较小者优先。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
-
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
@@ -44,13 +18,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -66,23 +40,49 @@ template <typename DstTileData, typename SrcTileData, typename IdxTileData, type
 PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx, TmpTileData &tmp);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile，接收排序后的值-索引对 |
+| `src` | 输入 | 源 Tile，包含待排序的值 |
+| `idx` | 输入 | 索引 Tile，提供与 src 一起参与重排的索引 |
+| `tmp` | 临时 | 可选，用于支持非 32 对齐尾块 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 排序后的值-索引对 |
+
+## 副作用
+
+该指令可能会写入 Tile 的有效区域标记。
+
 ## 约束
 
-- `TSORT32` 不接受 `WaitEvents&...` 参数，也不在内部调用 `TSYNC(...)`；如有需要请显式同步。
-- `idx` 在两个重载中都是必需的输入操作数；它提供与 `src` 一起参与重排的索引。
-- **实现检查 (A2A3/A5)**:
-    - `DstTileData::DType` 必须是 `half` 或 `float`。
-    - `SrcTileData::DType` 必须与 `DstTileData::DType` 匹配。
-    - `IdxTileData::DType` 必须是 `uint32_t`。
-    - `dst`/`src`/`idx` Tile 位置必须是 `TileType::Vec`，且都必须是行主序（`isRowMajor`）。
-- **有效区域**:
-    - 实现使用 `dst.GetValidRow()` 作为行数。
-    - 实现使用 `src.GetValidCol()` 确定每行参与排序的元素数量。
-    - 排序按独立的 32 元素块进行；4 参数重载额外通过 `tmp` 支持非 32 对齐尾块。
+- TSORT32 不接受 `WaitEvents&...` 参数，也不在内部调用 `TSYNC(...)`；如有需要请显式同步。
+- `idx` 在两个重载中都是必需的输入操作数，提供与 `src` 一起参与重排的索引。
+- `DstTileData::DType` 必须是 `half` 或 `float`。
+- `SrcTileData::DType` 必须与 `DstTileData::DType` 匹配。
+- `IdxTileData::DType` 必须是 `uint32_t`。
+- `dst`/`src`/`idx` Tile 类型必须是 `TileType::Vec`，且都必须是行主序（`isRowMajor`）。
+- 实现使用 `dst.GetValidRow()` 作为行数，使用 `src.GetValidCol()` 确定每行参与排序的元素数量。
+- 排序按独立的 32 元素块进行；4 参数重载额外通过 `tmp` 支持非 32 对齐尾块。
+
+## 异常与非法情形
+
+- 输入 Tile 类型不是 `TileType::Vec` 时行为未定义。
+- 元素类型不匹配时行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -100,7 +100,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -121,30 +121,15 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-# pto.tassign %arg2, @tile(0x3000)
-%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[不规则与复杂指令](../../irregular-and-complex_zh.md)
+- 相关指令：[TMRGSORT](./tmrgsort_zh.md)
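
The block-wise behavior of `pto.tsort32` can be illustrated for a single row. The sketch below follows the ordering the page attributes to the CPU simulator (descending by value, smaller index first on ties) and returns a flat list of value-index pairs. The real packing of pairs inside `dst` is target-defined and is not modeled. `refSort32Row` and `ValueIndex` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using ValueIndex = std::pair<float, std::uint32_t>;

// Sorts one row in independent 32-element blocks; a trailing block may be shorter.
std::vector<ValueIndex> refSort32Row(const std::vector<float> &src, const std::vector<std::uint32_t> &idx) {
    constexpr std::size_t kBlock = 32;
    const std::size_t validCols = src.size();
    std::vector<ValueIndex> out;
    out.reserve(validCols);
    for (std::size_t base = 0; base < validCols; base += kBlock) {
        const std::size_t n = std::min(kBlock, validCols - base);  // n_b = min(32, C - 32b)
        std::vector<ValueIndex> block;
        for (std::size_t k = 0; k < n; ++k) {
            block.emplace_back(src[base + k], idx[base + k]);
        }
        std::sort(block.begin(), block.end(), [](const ValueIndex &a, const ValueIndex &b) {
            if (a.first != b.first) {
                return a.first > b.first;  // larger value first
            }
            return a.second < b.second;    // smaller index first on ties
        });
        out.insert(out.end(), block.begin(), block.end());
    }
    return out;
}
```

Each block is sorted on its own, so the output is not globally ordered across blocks; the page lists `TMRGSORT` as the related instruction.
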
diff --git a/docs/isa/tile/ops/irregular-and-complex/ttri_zh.md b/docs/isa/tile/ops/irregular-and-complex/ttri_zh.md
index ebb3c89a..f45b38cf 100644
--- a/docs/isa/tile/ops/irregular-and-complex/ttri_zh.md
+++ b/docs/isa/tile/ops/irregular-and-complex/ttri_zh.md
@@ -1,54 +1,28 @@
-# TTRI
+# pto.ttri
 
-## 指令示意图
+`pto.ttri` 属于[不规则与复杂指令](../../irregular-and-complex_zh.md)集。
 
-![TTRI tile operation](../../../../figures/isa/TTRI.svg)
+## 概述
 
-## 简介
+TTRI 生成一个三角掩码 Tile。它不读取源 Tile，而是根据目标 Tile 的有效形状和 `diagonal` 参数直接在 `dst` 里写出上三角或下三角的 0/1 模式，常用于注意力 mask、三角区域约束或后续按位/乘法掩码场景。
 
-`TTRI` 生成一个三角掩码 Tile。它不读取源 Tile，而是根据目标 Tile 的有效形状和 `diagonal` 参数直接在 `dst` 里写出上三角或下三角的 0/1 模式。
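+设 `R = dst.GetValidRow()`、`C = dst.GetValidCol()`、`d = diagonal`，对 `0 <= i < R`、`0 <= j < C`：当 `isUpperOrLower = 0`（下三角）时，若 $j \le i + d$ 则 $\mathrm{dst}_{i,j} = 1$，否则为 0；当 `isUpperOrLower = 1`（上三角）时，若 $j < i + d$ 则 $\mathrm{dst}_{i,j} = 0$，否则为 1。`diagonal = 0` 表示主对角线；`d` 增大时分界对角线向右平移，因此下三角的保留（为 1）区域扩大，上三角的保留区域缩小；`d` 减小时相反。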
 
-这条指令常用于注意力 mask、三角区域约束或后续按位/乘法掩码场景。
+## 语法
 
-## 数学语义
+### PTO-AS
 
-设 `R = dst.GetValidRow()`、`C = dst.GetValidCol()`，`d = diagonal`。
-
-### 下三角形式 `isUpperOrLower = 0`
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-1 & j \le i + d \\
-0 & \text{否则}
-\end{cases}
-$$
-
-### 上三角形式 `isUpperOrLower = 1`
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-0 & j < i + d \\
-1 & \text{否则}
-\end{cases}
-$$
-
-`diagonal = 0` 表示主对角线；正值会把保留区域向右扩展，负值则会收缩。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -61,18 +35,43 @@ template <typename TileData, int isUpperOrLower, typename... WaitEvents>
 PTO_INST RecordEvent TTRI(TileData &dst, int diagonal, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile，接收生成的三角掩码 |
+| `diagonal` | 标量输入 | 对角线偏移量 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 包含上三角或下三角掩码的 Tile |
+
+## 副作用
+
+该指令可能会写入 Tile 的有效区域标记。
+
 ## 约束
 
-- `isUpperOrLower` 只能是：
-  - `0`：下三角
-  - `1`：上三角
+- `isUpperOrLower` 只能是 0（下三角）或 1（上三角）。
 - `dst` 必须是 row-major Tile。
-- 支持的数据类型随目标略有差异：
-  - CPU / A2A3：`int32_t`、`int16_t`、`uint32_t`、`uint16_t`、`half`、`float` 等
-  - A5：额外覆盖 `int8_t`、`uint8_t`、`bfloat16_t`
+- CPU 与 A2/A3 支持的数据类型：`int32_t`、`int16_t`、`uint32_t`、`uint16_t`、`half`、`float` 等。
+- A5 额外支持 `int8_t`、`uint8_t`、`bfloat16_t`。
+
+## 异常与非法情形
+
+- `isUpperOrLower` 值不为 0 或 1 时行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
+### C++ 示例
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -87,5 +86,4 @@ void example() {
 
 ## 相关页面
 
-- [TCMP](../../../TCMP_zh.md)
-- [不规则与复杂指令集](../../irregular-and-complex_zh.md)
+- 指令集总览：[不规则与复杂指令](../../irregular-and-complex_zh.md)
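
The two mask formulas on the `pto.ttri` page translate directly into a small generator, which is handy for checking what a given `diagonal` does before using the mask in a kernel. This is a host-side sketch only: `refTri` is a hypothetical name, the mask is returned as a flat row-major `std::vector<int>`, and the template parameter mirrors `isUpperOrLower` from the documented C++ signature.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// isUpperOrLower == 0: lower triangle, 1 where j <= i + d.
// isUpperOrLower == 1: upper triangle, 1 where j >= i + d.
template <int isUpperOrLower>
std::vector<int> refTri(std::size_t validRows, std::size_t validCols, int diagonal) {
    static_assert(isUpperOrLower == 0 || isUpperOrLower == 1, "isUpperOrLower must be 0 or 1");
    std::vector<int> mask(validRows * validCols, 0);
    for (std::size_t i = 0; i < validRows; ++i) {
        for (std::size_t j = 0; j < validCols; ++j) {
            // Signed arithmetic so that negative diagonals behave as expected.
            const long long bound = static_cast<long long>(i) + diagonal;
            const long long col = static_cast<long long>(j);
            const bool keep = (isUpperOrLower == 0) ? (col <= bound) : (col >= bound);
            mask[i * validCols + j] = keep ? 1 : 0;
        }
    }
    return mask;
}
```

For a 3x3 tile, `refTri<0>(3, 3, 1)` keeps one extra diagonal compared with `diagonal = 0`, while `refTri<1>(3, 3, 1)` drops the main diagonal, which matches the observation that increasing `diagonal` grows the lower-triangle mask and shrinks the upper-triangle one.
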
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/textract-fp_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/textract-fp_zh.md
index d764a812..73c5009b 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/textract-fp_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/textract-fp_zh.md
@@ -1,36 +1,34 @@
-# TEXTRACT_FP
+# pto.textract.fp
 
-## 指令示意图
+`pto.textract.fp` belongs to the [Layout and Rearrangement](../../layout-and-rearrangement_zh.md) instruction set.
 
-![TEXTRACT_FP tile operation](../../../../figures/isa/TEXTRACT_FP.svg)
+## Overview
 
-## 简介
-
-`TEXTRACT_FP` 是 `TEXTRACT` 的向量量化版本：它从一个较大的源 Tile 中抽取子块，并结合 `fp` Tile 提供的量化参数，把结果写到目标 Tile。
-
-和 `TINSERT_FP` 一样，当前真实 backend 主要把它用于“从 `Acc` Tile 的某个窗口提取结果，并按量化规则输出到 `Mat` Tile”。
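+`TEXTRACT_FP` is the vector-quantized variant of `TEXTRACT`: it extracts a sub-block from a larger source tile and, using the quantization parameters supplied by the `fp` tile, writes the result to the destination tile. Current real backends mainly use it to extract a window of an `Acc` tile and write it to a `Mat` tile according to the quantization rules.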
 
 ## Mechanism
 
 Considering only the positional relationship, it can be conceptualized as:
 
-$$ \mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{indexRow + i,\; indexCol + j};\ \mathrm{fp}\right) $$
+$$\mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{indexRow + i,\; indexCol + j};\ \mathrm{fp}\right)$$
 
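 Here `indexRow/indexCol` determine which sub-window of the source tile the extraction starts from. On the `fp` path, `indexCol` also affects which slice of the quantization parameters is in use: the backend uses it to offset the configuration address that `fp` points to.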
 
-## 汇编语法
+## Syntax
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+See the [PTO-AS specification](../../../../assembly/PTO-AS_zh.md).
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -45,41 +43,50 @@ PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData
                                  uint16_t indexCol, WaitEvents &... events);
 ```
 
-## 约束
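+## Inputs
 
-### 通用约束
+| Operand | Role | Description |
+| --- | --- | --- |
+| `src` | Input tile | Source tile, coming from `Acc` |
+| `fp` | Quantization-parameter tile | Provides the scaling information needed for vector quantization; should be `TileType::Scaling` |
+| `indexRow` | Starting row index | Row of the source tile at which extraction starts |
+| `indexCol` | Starting column index | Column of the source tile at which extraction starts |
+| `dst` | Output tile | Destination tile |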
 
-- `indexRow + dst.Rows` 不能超过 `src.Rows`。
-- `indexCol + dst.Cols` 不能超过 `src.Cols`。
-- `fp` 的设计意图是承载缩放/量化参数，可移植代码应把它建成 `TileType::Scaling`。
+## Expected Outputs
 
-### A2/A3 实现
+| Result | Type | Description |
+| --- | --- | --- |
+| `dst` | Tile | Sub-tile extracted from the given position of the source tile and quantized |
 
-- 这条路径基于 `CheckTMovAccToMat(...)`，因此：
-  - `src` 必须来自 `Acc`
-  - `dst` 必须是 `Mat`
-  - `dst` fractal size 必须是 `512`
-  - `dst` 列宽字节数必须是 `32` 的倍数
-- `FpTileData::Loc` 必须是 `TileType::Scaling`。
-- backend 会按 `indexCol` 对 `fp` 地址做偏移，再设置 FPC。
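+## Side Effects
 
-### A5 实现
+The CPU simulator currently accepts the `TEXTRACT_FP` interface but ignores the `fp` argument, degrading to a plain `TEXTRACT`. For quantization behavior that depends on `fp` values, the NPU backend is authoritative.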
 
-- A5 也把这条指令实现为 `Acc -> Mat` 的量化提取。
-- 额外要求：
-  - `dst` 必须是 `TileType::Mat`
-  - `dst` 必须使用 `BLayout::ColMajor + SLayout::RowMajor`
-  - `src` 必须是 `float` 或 `int32_t` 的 `Acc`
-- `FpTileData::Loc` 必须是 `TileType::Scaling`。
-- backend 同样会依据 `indexCol` 偏移 `fp` 地址。
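+## Constraints
 
-### CPU 模拟器
+- `indexRow + dst.Rows` must not exceed `src.Rows`
+- `indexCol + dst.Cols` must not exceed `src.Cols`
+- `fp` is intended to carry scaling/quantization parameters; portable code should declare it as `TileType::Scaling`
+- On A2/A3 this path is based on `CheckTMovAccToMat(...)`, so `src` must come from `Acc`, `dst` must be `Mat`, the `dst` fractal size must be `512`, and the `dst` column width in bytes must be a multiple of `32`
+- `FpTileData::Loc` must be `TileType::Scaling`; the backend offsets the `fp` address by `indexCol` before setting the FPC
+- A5 also implements this instruction as a quantized `Acc -> Mat` extraction: `dst` must be `TileType::Mat` and use `BLayout::ColMajor + SLayout::RowMajor`, and `src` must be an `Acc` of `float` or `int32_t`
+- On A5, `FpTileData::Loc` must be `TileType::Scaling`, and the backend likewise offsets the `fp` address according to `indexCol`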
 
-- CPU 模拟器当前接受 `TEXTRACT_FP` 接口，但会忽略 `fp` 参数，退化为普通 `TEXTRACT`。
-- 因此，依赖 `fp` 数值的量化行为应以 NPU backend 为准。
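+## Exceptions and Illegal Cases
+
+- If the extraction region exceeds the bounds of the source tile, the behavior is undefined
+- If the quantization parameters are malformed, the backend may report an error
+
+## Target-Profile Restrictions
+
+| Feature | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |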
 
 ## Examples
 
+### C++ Auto Mode
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -102,3 +109,4 @@ void example() {
 - [TEXTRACT](./textract_zh.md)
 - [TINSERT_FP](./tinsert-fp_zh.md)
 - [TMOV_FP](./tmov-fp_zh.md)
+- 指令集总览：[布局与重排](../../layout-and-rearrangement_zh.md)
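
Reviewer note (not part of the patch): the positional formula in the `textract-fp_zh.md` hunks above, `dst[i][j] = Convert(src[indexRow + i][indexCol + j]; fp)`, can be illustrated on plain row-major arrays. This is only a sketch under stated assumptions. `Matrix`, `QuantizeToInt8`, and `ExtractQuantized` are hypothetical names, not PTO APIs, and the choice of `Convert` (a per-column scale followed by saturating rounding to `int8_t`) is assumed purely for illustration. The document only states that `indexCol` offsets the `fp` parameter slice and leaves the actual conversion to the backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row-major float matrix; stands in for an Acc tile.
struct Matrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> data;  // rows * cols elements

  float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Assumed Convert: scale, round to nearest, saturate to the int8_t range.
inline std::int8_t QuantizeToInt8(float value, float scale) {
  const long rounded = std::lround(value * scale);
  return static_cast<std::int8_t>(std::max(-128L, std::min(127L, rounded)));
}

// dst[i][j] = Convert(src[indexRow + i][indexCol + j]; fp). The scale for
// output column j is read from fp[indexCol + j] (assumed per-column slice).
inline std::vector<std::int8_t> ExtractQuantized(const Matrix &src,
                                                 const std::vector<float> &fp,
                                                 std::size_t indexRow,
                                                 std::size_t indexCol,
                                                 std::size_t dstRows,
                                                 std::size_t dstCols) {
  // Mirrors the documented bounds constraints.
  assert(indexRow + dstRows <= src.rows);
  assert(indexCol + dstCols <= src.cols);
  assert(fp.size() == src.cols);

  std::vector<std::int8_t> dst(dstRows * dstCols);
  for (std::size_t i = 0; i < dstRows; ++i) {
    for (std::size_t j = 0; j < dstCols; ++j) {
      dst[i * dstCols + j] =
          QuantizeToInt8(src.at(indexRow + i, indexCol + j), fp[indexCol + j]);
    }
  }
  return dst;
}
```

Because the scale index follows `indexCol + j`, moving the window horizontally changes which quantization parameters apply, which is the behavior the mechanism paragraph describes.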
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md
index bed32e9d..ff94b1b3 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md
@@ -1,26 +1,20 @@
-﻿# TEXTRACT
+# pto.textract
 
-## 指令示意图
+`pto.textract` belongs to the [Layout and Rearrangement](../../layout-and-rearrangement_zh.md) instruction set.
 
-![TEXTRACT tile operation](../../../../figures/isa/TEXTRACT.svg)
+## Overview
 
-## 简介
+`TEXTRACT` extracts a smaller sub-tile from a larger source tile: conceptually, it copies a smaller window of `src`, starting at `(indexRow, indexCol)`, into `dst`. The exact mapping depends on the tile layout.
 
-从较大的源 Tile 中提取较小的子 Tile。
-
-## 数学语义
-
-概念上从较大的 `src` Tile 中，以 `(indexRow, indexCol)` 为起点复制一个较小窗口到 `dst`。确切的映射取决于 tile 布局。
+## Mechanism
 
 Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
 
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{indexRow}+i,\; \mathrm{indexCol}+j} $$
-
-## 汇编语法
+$$\mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{indexRow}+i,\; \mathrm{indexCol}+j}$$
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## Syntax
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
@@ -28,13 +22,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -58,38 +52,50 @@ template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluP
 PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
 ```
 
-## 约束
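+## Inputs
 
-### 通用约束或检查
+| Operand | Role | Description |
+| --- | --- | --- |
+| `src` | Input tile | Source tile |
+| `indexRow` | Starting row index | Row of the source tile at which extraction starts |
+| `indexCol` | Starting column index | Column of the source tile at which extraction starts |
+| `dst` | Output tile | Destination tile |
+
+## Expected Outputs
+
+| Result | Type | Description |
+| --- | --- | --- |
+| `dst` | Tile | Sub-tile extracted from the given position of the source tile |
+
+## Side Effects
+
+The extracted sub-tile is written into `dst`, overwriting its previous contents.
+
+## Constraints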
 
-- `DstTileData::DType` 必须等于 `SrcTileData::DType`。
-- 运行时边界检查：
-  - `indexRow + DstTileData::Rows <= SrcTileData::Rows`
-  - `indexCol + DstTileData::Cols <= SrcTileData::Cols`
+- `DstTileData::DType` must equal `SrcTileData::DType`
+- Runtime bounds check: `indexRow + DstTileData::Rows <= SrcTileData::Rows`
+- Runtime bounds check: `indexCol + DstTileData::Cols <= SrcTileData::Cols`
+- A2/A3 supports the element types `int8_t`, `half`, `bfloat16_t`, `float`
+- On A2/A3 the source layout must satisfy `(SFractal == ColMajor && isRowMajor)` or `(SFractal == RowMajor && !isRowMajor)`; in GEMV scenarios targeting `TileType::Left`, `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)` is also allowed
+- On A2/A3 the destination must be `TileType::Left` or `TileType::Right`
+- A5 supports the element types `int8_t`, `hifloat8_t`, `float8_e5m2_t`, `float8_e4m3_t`, `half`, `bfloat16_t`, `float`, `float4_e2m1x2_t`, `float4_e1m2x2_t`, `float8_e8m0_t`
+- On A5 the source layout must satisfy `(SFractal == ColMajor && isRowMajor)` or `(SFractal == RowMajor && !isRowMajor)` for `Left` / `Right`, `(SFractal == RowMajor && isRowMajor)` for `ScaleLeft`, and `(SFractal == ColMajor && !isRowMajor)` for `ScaleRight`
+- On A5 the supported paths are `TileType::Mat -> TileType::Left/Right/Scale`, `TileType::Acc -> TileType::Mat` (including the relu, scalar-quantization, and vector-quantization forms), and specific `TileType::Vec -> TileType::Mat` extraction paths
 
-### A2A3 实现检查
+## Exceptions and Illegal Cases
 
-- 支持的元素类型：`int8_t`、`half`、`bfloat16_t`、`float`。
-- 源布局必须满足以下已检查到的 A2A3 提取布局之一：
-  - `(SFractal == ColMajor && isRowMajor)`，或
-  - `(SFractal == RowMajor && !isRowMajor)`。
-- 在以 `TileType::Left` 为目标的 GEMV 场景中，已检查到的源布局还允许 `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)`。
-- 目标必须是 `TileType::Left` 或 `TileType::Right`，并具有目标支持的布局配置。
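+- If the extraction region exceeds the bounds of the source tile, the behavior is undefined
+- If the source and destination layouts are incompatible, the backend may report an error
 
-### A5 实现检查
+## Target-Profile Restrictions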
 
-- 支持的元素类型：`int8_t`、`hifloat8_t`、`float8_e5m2_t`、`float8_e4m3_t`、`half`、`bfloat16_t`、`float`、`float4_e2m1x2_t`、`float4_e1m2x2_t`、`float8_e8m0_t`。
-- 源布局必须满足以下已检查到的 A5 提取布局之一：
-  - 对于 `Left` / `Right`：`(SFractal == ColMajor && isRowMajor)` 或 `(SFractal == RowMajor && !isRowMajor)`
-  - 对于 `ScaleLeft`：`(SFractal == RowMajor && isRowMajor)`
-  - 对于 `ScaleRight`：`(SFractal == ColMajor && !isRowMajor)`
-- 在以 `Left` 为目标的 GEMV 场景中，已检查到的源布局还允许 `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)`。
-- 目标支持 `TileType::Mat -> TileType::Left/Right/Scale`、`TileType::Acc -> TileType::Mat`（含 relu、标量量化、向量量化形式），以及特定的 `TileType::Vec -> TileType::Mat` 提取路径。
-- 向量量化形式额外要求提供 `FpTileData` 缩放操作数，对应 `TEXTRACT_FP(...)` 接口。
+| Feature | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## Examples
 
-### 自动（Auto）
+### C++ Auto Mode
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -105,7 +111,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ Manual Mode
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -123,29 +129,13 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## Related Pages
 
-```text
-%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
+- [TEXTRACT_FP](./textract-fp_zh.md)
+- 指令集总览：[布局与重排](../../layout-and-rearrangement_zh.md)
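
Reviewer note (not part of the patch): the `TEXTRACT` mechanism in the hunks above, `dst[i][j] = src[indexRow + i][indexCol + j]`, amounts to a bounds-checked window copy. A minimal sketch on plain row-major storage follows. `Matrix` and `ExtractWindow` are hypothetical names used only here; real tiles carry layout information (fractal layout, `TileType`) that this sketch deliberately ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix used only for this sketch; not a PTO type.
struct Matrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> data;  // rows * cols elements

  float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
  float &at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
};

// Copies the dst-sized window of src that starts at (indexRow, indexCol).
inline void ExtractWindow(Matrix &dst, const Matrix &src,
                          std::size_t indexRow, std::size_t indexCol) {
  // Mirrors the documented runtime bounds checks.
  assert(indexRow + dst.rows <= src.rows);
  assert(indexCol + dst.cols <= src.cols);

  for (std::size_t i = 0; i < dst.rows; ++i) {
    for (std::size_t j = 0; j < dst.cols; ++j) {
      dst.at(i, j) = src.at(indexRow + i, indexCol + j);
    }
  }
}
```

Every element of `dst` is overwritten, which matches the side effect stated in the new "副作用" section of this file.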
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand_zh.md
index afd5f19d..59714d99 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand_zh.md
@@ -1,23 +1,14 @@
-# TFILLPAD_EXPAND
+# pto.tfillpad_expand
 
-## 指令示意图
+`pto.tfillpad_expand` belongs to the [Layout and Rearrangement](../../layout-and-rearrangement_zh.md) instruction set.
 
-![TFILLPAD_EXPAND tile operation](../../../../figures/isa/TFILLPAD_EXPAND.svg)
-
-## 简介
-
-`TFILLPAD_EXPAND` 是 `TFILLPAD` 的扩展尺寸版本。它和 `TFILLPAD` 做的是同一件事：复制源 Tile 的有效区域，并把剩余位置填成确定 pad 值；不同之处在于，这里允许 `dst` 的静态尺寸大于 `src`。
-
-当你需要把一个较小 Tile 嵌进更大的工作 Tile，再把外围补成统一边界值时，用的就是这条指令。
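+## Overview
 
-## 数学语义
+`TFILLPAD_EXPAND` is the expanded-size variant of `TFILLPAD`. It does the same thing as `TFILLPAD`: it copies the valid region of the source tile and fills the remaining positions with a fixed pad value. The difference is that `dst` may have a larger static size than `src`. Use this instruction when you need to embed a smaller tile in a larger working tile and fill the surrounding area with a uniform boundary value.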
 
-设：
+## Mechanism
 
-- `VR = src.GetValidRow()`
-- `VC = src.GetValidCol()`
-
-对 `dst` 的每个元素 `(i, j)`：
+Let `VR = src.GetValidRow()` and `VC = src.GetValidCol()`. For each element `(i, j)` of `dst`:
 
 $$
 \mathrm{dst}_{i,j} =
@@ -27,23 +18,23 @@ $$
 \end{cases}
 $$
 
-其中 `pad` 由 `TileDataDst::PadVal` 决定。
+Here `pad` is determined by `TileDataDst::PadVal`. Compared with plain `TFILLPAD`, the only semantic difference is that `dst.Rows/Cols` may be larger than `src.Rows/Cols`.
 
-与普通 `TFILLPAD` 相比，唯一的语义差别是：这里允许 `dst.Rows/Cols` 大于 `src.Rows/Cols`。
+## Syntax
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+See the [PTO-AS specification](../../../../assembly/PTO-AS_zh.md).
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -56,9 +47,24 @@ template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
 PTO_INST RecordEvent TFILLPAD_EXPAND(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
 ```
 
-## 约束
+## Inputs
+
+| Operand | Role | Description |
+| --- | --- | --- |
+| `dst` | Output tile | Destination tile; must satisfy `dst.Rows >= src.Rows` and `dst.Cols >= src.Cols` |
+| `src` | Input tile | Source tile |
+
+## Expected Outputs
 
-### 通用约束
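+| Result | Type | Description |
+| --- | --- | --- |
+| `dst` | Tile | Valid region of the source tile copied to the top-left corner; remaining positions filled with `TileDataDst::PadVal` |
+
+## Side Effects
+
+A2/A3, A5, and the CPU simulator all implement it with the semantics "copy the source valid region, then fill the rest of the destination with the pad value". The instruction itself introduces no new pad rules; `PadValue` is interpreted the same way as in `TFILLPAD`.
+
+## Constraints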
 
 - `dst.Rows >= src.Rows`
 - `dst.Cols >= src.Cols`
@@ -66,13 +72,15 @@ PTO_INST RecordEvent TFILLPAD_EXPAND(DstTileData &dst, SrcTileData &src, WaitEve
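 - `src` and `dst` must have the same element size, and the current implementation accepts only `1`-, `2`-, or `4`-byte elements
 - If `dst.GetValidRow() == 0` or `dst.GetValidCol() == 0`, the backend returns immediately
 
-### Backend 说明
+## Target-Profile Restrictions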
 
-- A2/A3、A5 和 CPU 模拟器都把它实现成“复制源有效区域，然后对目标剩余区域补 pad 值”的语义。
-- 这条指令本身不引入新的 pad 规则；`PadValue` 的解释与 `TFILLPAD` 保持一致。
+| Feature | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## Examples
 
+### C++ Auto Mode
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -95,3 +103,5 @@ void example() {
 - [TFILLPAD](./tfillpad_zh.md)
 - [TFILLPAD_INPLACE](./tfillpad-inplace_zh.md)
 - [Layout and Rearrangement instruction set](../../layout-and-rearrangement_zh.md)
+
+![TFILLPAD_EXPAND tile operation](../../../../figures/isa/TFILLPAD_EXPAND.svg)
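
Reviewer note (not part of the patch): the `TFILLPAD_EXPAND` semantics described above (copy the source valid region into the top-left corner of a possibly larger `dst`, then fill everything else with the pad value) can be sketched on plain arrays. `PaddedTile` and `FillPadExpand` are hypothetical names for this illustration. The pad value is passed as a runtime argument here, whereas the document takes it from the compile-time `TileDataDst::PadVal`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile stand-in: a static shape plus a runtime valid region.
struct PaddedTile {
  std::size_t rows;
  std::size_t cols;
  std::size_t validRows;
  std::size_t validCols;
  std::vector<float> data;  // rows * cols elements, row-major
};

// dst[i][j] = src[i][j] when (i, j) lies inside the source valid region,
// and pad everywhere else, including the area beyond the source shape.
inline void FillPadExpand(PaddedTile &dst, const PaddedTile &src, float pad) {
  // Mirrors the documented shape constraints.
  assert(dst.rows >= src.rows);
  assert(dst.cols >= src.cols);

  for (std::size_t i = 0; i < dst.rows; ++i) {
    for (std::size_t j = 0; j < dst.cols; ++j) {
      const bool insideValid = i < src.validRows && j < src.validCols;
      dst.data[i * dst.cols + j] = insideValid ? src.data[i * src.cols + j] : pad;
    }
  }
}
```

Note that source elements outside the valid region are never read, so whatever they contain cannot leak into `dst`.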
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace_zh.md
index 1b352495..2fca05d6 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace_zh.md
@@ -1,23 +1,14 @@
-# TFILLPAD_INPLACE
+# pto.tfillpad_inplace
 
-## 指令示意图
+`pto.tfillpad_inplace` belongs to the [Layout and Rearrangement](../../layout-and-rearrangement_zh.md) instruction set.
 
-![TFILLPAD_INPLACE tile operation](../../../../figures/isa/TFILLPAD_INPLACE.svg)
-
-## 简介
-
-`TFILLPAD_INPLACE` 是 `TFILLPAD` 的原位变体。它的结果语义与普通 `TFILLPAD` 相同，都是“保留源 valid region，剩余区域写入 pad 值”；区别在于这条接口面向“把目标 Tile 当成被原地修补的对象”的使用方式。
-
-如果你只看最终结果，可以把它理解成“同尺寸 `TFILLPAD`”；如果你关心实现路径，它表达的是“允许 backend 走原位修补式填充”。
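+## Overview
 
-## 数学语义
+`TFILLPAD_INPLACE` is the in-place variant of `TFILLPAD`. Its result semantics match plain `TFILLPAD`: the source valid region is kept and the remaining area is written with the pad value. The difference is that this interface targets usage where the destination tile is treated as the object being patched in place. If you only care about the final result, think of it as a same-size `TFILLPAD`; if you care about the implementation path, it expresses that the backend is allowed to use an in-place, patch-style fill.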
 
-设：
+## Mechanism
 
-- `VR = src.GetValidRow()`
-- `VC = src.GetValidCol()`
-
-对 `dst` 的每个元素 `(i, j)`：
+Let `VR = src.GetValidRow()` and `VC = src.GetValidCol()`. For each element `(i, j)` of `dst`:
 
 $$
 \mathrm{dst}_{i,j} =
@@ -29,19 +20,21 @@ $$
 
 Here `pad` is determined by `TileDataDst::PadVal`.
 
-## 汇编语法
+## Syntax
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+See the [PTO-AS specification](../../../../assembly/PTO-AS_zh.md).
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -54,9 +47,24 @@ template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
 PTO_INST RecordEvent TFILLPAD_INPLACE(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
 ```
 
-## 约束
+## Inputs
+
+| Operand | Role | Description |
+| --- | --- | --- |
+| `dst` | Output tile | Destination tile; must satisfy `dst.Rows == src.Rows` and `dst.Cols == src.Cols` |
+| `src` | Input tile | Source tile |
 
-### 通用约束
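+## Expected Outputs
+
+| Result | Type | Description |
+| --- | --- | --- |
+| `dst` | Tile | Valid region copied from `src`; remaining positions filled with `TileDataDst::PadVal` |
+
+## Side Effects
+
+A2/A3, A5, and the CPU simulator all implement this instruction with result semantics equivalent to a same-size `TFILLPAD`. "In-place" is more a matter of the interface and implementation path than a separate mathematical definition.
+
+## Constraints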
 
 - `dst.Rows == src.Rows`
 - `dst.Cols == src.Cols`
@@ -64,13 +72,15 @@ PTO_INST RecordEvent TFILLPAD_INPLACE(DstTileData &dst, SrcTileData &src, WaitEv
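 - `src` and `dst` must have the same element size, and the current implementation accepts only `1`-, `2`-, or `4`-byte elements
 - If `dst.GetValidRow() == 0` or `dst.GetValidCol() == 0`, the backend returns immediately
 
-### Backend 说明
+## Target-Profile Restrictions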
 
-- A2/A3、A5 和 CPU 模拟器都把这条指令实现成与同尺寸 `TFILLPAD` 等价的结果语义。
-- “原位”更多是接口和实现路径上的语义，而不是另一套数学定义。
+| Feature | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## Examples
 
+### C++ Auto Mode
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -92,3 +102,5 @@ void example() {
 - [TFILLPAD](./tfillpad_zh.md)
 - [TFILLPAD_EXPAND](./tfillpad-expand_zh.md)
 - [Layout and Rearrangement instruction set](../../layout-and-rearrangement_zh.md)
+
+![TFILLPAD_INPLACE tile operation](../../../../figures/isa/TFILLPAD_INPLACE.svg)
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad_zh.md
index fedde3f6..eda3d864 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad_zh.md
@@ -1,46 +1,22 @@
-# TFILLPAD
+# pto.tfillpad
 
-## 指令示意图
+`pto.tfillpad` belongs to the [Layout and Rearrangement](../../layout-and-rearrangement_zh.md) instruction set.
 
-![TFILLPAD tile operation](../../../../figures/isa/TFILLPAD.svg)
+## Overview
 
-## 简介
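+`TFILLPAD` copies the source tile and fills everything outside the source valid region with a compile-time pad value. Its most common use is to expand a runtime-valid rectangle into a complete static tile that can safely take part in subsequent computation. If later operations do not want to handle boundaries explicitly, something has to turn the positions outside the boundary into well-defined values first, and that is exactly what `TFILLPAD` does.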
 
-`TFILLPAD` 复制源 Tile，并把源 valid region 之外的部分用一个**编译期确定的 pad 值**补满。它最常见的用途，是把“运行时有效矩形”扩成“可安全继续参与后续计算的完整静态 Tile”。
+## Mechanism
 
-如果后续操作不想显式处理边界，就需要有人把边界外的位置先变成确定值。`TFILLPAD` 做的正是这件事。
+Let `VR = src.GetValidRow()` and `VC = src.GetValidCol()`. For each element `(i, j)` of `dst`:
 
-## 数学语义
+$$\mathrm{dst}_{i,j} =\begin{cases}\mathrm{src}_{i,j} & \text{if } i < VR \text{ and } j < VC \\ \mathrm{pad} & \text{otherwise}\end{cases}$$
 
-设：
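+Here `pad` comes from `TileDataDst::PadVal`. Common values are `PadValue::Zero`, `PadValue::Min`, `PadValue::Max`, and a custom constant specified via `PadValueCustom(...)`. For floating-point types, `Min/Max` typically map to values such as `-inf/+inf` that suit min/max reductions; for integer types they map to the minimum / maximum value of the type.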
 
-- `VR = src.GetValidRow()`
-- `VC = src.GetValidCol()`
+## Syntax
 
-对 `dst` 的每个元素 `(i, j)`：
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src}_{i,j} & \text{当 } i < VR \text{ 且 } j < VC \\
-\mathrm{pad} & \text{否则}
-\end{cases}
-$$
-
-其中 `pad` 来自 `TileDataDst::PadVal`。常见取值有：
-
-- `PadValue::Zero`
-- `PadValue::Min`
-- `PadValue::Max`
-- 通过 `PadValueCustom(...)` 指定的自定义常量
-
-对浮点类型，`Min/Max` 往往会映射到 `-inf/+inf` 一类“适合做极值归约”的值；对整数类型则映射到对应类型的最小值 / 最大值。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
-
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tfillpad %src : !pto.tile<...> -> !pto.tile<...>
@@ -48,13 +24,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -70,32 +46,43 @@ template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
 PTO_INST RecordEvent TFILLPAD(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
 ```
 
-相关的同族接口还有：
+Related interfaces in the same family are `TFILLPAD_INPLACE(dst, src)`, for in-place filling, and `TFILLPAD_EXPAND(dst, src)`, which allows `dst` to be larger than `src`.
 
-- `TFILLPAD_INPLACE(dst, src)`：原位填充
-- `TFILLPAD_EXPAND(dst, src)`：允许 `dst` 比 `src` 更大
+## Inputs
 
-## 约束
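+| Operand | Role | Description |
+| --- | --- | --- |
+| `src` | Input tile | Source tile |
+| `dst` | Output tile | Destination tile after padding |
+
+## Expected Outputs
+
+| Result | Type | Description |
+| --- | --- | --- |
+| `dst` | Tile | Tile holding the source data inside the valid region and the pad value outside it |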
 
-### 通用约束
+## Side Effects
+
+The contents of `dst` outside the valid region are overwritten with the pad value.
+
+## Constraints
 
-- Vec Tile 版本要求 `TileDataDst::PadVal != PadValue::Null`。
-- `src` 和 `dst` 的元素大小必须一致，并且当前实现只接受 `1`、`2` 或 `4` 字节元素。
-- 如果 `dst.GetValidRow() == 0` 或 `dst.GetValidCol() == 0`，当前 backend 会直接返回，不执行填充。
+- The Vec tile version requires `TileDataDst::PadVal != PadValue::Null`
+- `src` and `dst` must have the same element size, and the current implementation accepts only `1`-, `2`-, or `4`-byte elements
+- If `dst.GetValidRow() == 0` or `dst.GetValidCol() == 0`, the current backend returns immediately without filling
+- `TFILLPAD(dst, src)` and `TFILLPAD_INPLACE(dst, src)` require `dst.Rows/Cols` to equal `src.Rows/Cols`
+- `TFILLPAD_EXPAND(dst, src)` requires `dst.Rows >= src.Rows` and `dst.Cols >= src.Cols`
+- The single-type overload `TFILLPAD(TileData &dst, TileData &src)` also supports a Mat tile specialization, which currently supports only NZ-format Mat tiles (non-row-major, `SLayout::RowMajor`) with `TileData::PadVal` equal to `PadValue::Zero` or `PadValue::Null`
 
-### 形状约束
+## Exceptions and Illegal Cases
 
-- `TFILLPAD(dst, src)`：`dst.Rows/Cols` 必须与 `src.Rows/Cols` 相同。
-- `TFILLPAD_INPLACE(dst, src)`：`dst.Rows/Cols` 也必须与 `src.Rows/Cols` 相同。
-- `TFILLPAD_EXPAND(dst, src)`：`dst.Rows >= src.Rows` 且 `dst.Cols >= src.Cols`。
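+- If `PadVal` is set to `Null` and the Vec tile version is used, the behavior is undefined
+- If the element size does not match (is not one of 1, 2, or 4 bytes), the backend may report an error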
 
-### Mat Tile 特化
+## Target-Profile Restrictions
 
-- 单类型重载 `TFILLPAD(TileData &dst, TileData &src)` 还支持一条 Mat Tile 特化路径。
-- 这条路径当前只支持：
-  - NZ 形态的 Mat Tile（非 row-major，`SLayout::RowMajor`）
-  - `TileData::PadVal` 为 `PadValue::Zero` 或 `PadValue::Null`
-- 这条 Mat 特化更像“把矩阵 Tile 的未覆盖区域置成可接受的默认值”，而不是通用的 Vec copy+pad。
+| Feature | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## Examples
 
@@ -137,6 +124,6 @@ void example_mat() {
 
 ## Related Pages
 
-- [布局与重排指令集](../../layout-and-rearrangement_zh.md)
+- Instruction set overview: [Layout and Rearrangement](../../layout-and-rearrangement_zh.md)
 - [Layout reference](../../../state-and-types/layout_zh.md)
 - [Tiles 与有效区域](../../../programming-model/tiles-and-valid-regions_zh.md)
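
Reviewer note (not part of the patch): the remark above that `Min/Max` pad values map to `-inf/+inf` because they suit min/max reductions can be made concrete. In the sketch below, a same-size fill with `-inf` lets a row-wise maximum run over the full static width without any boundary handling. `FillPad` and `RowMax` are hypothetical helpers on plain row-major vectors, not PTO APIs, and the pad is a runtime argument rather than the compile-time `TileDataDst::PadVal`.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Same-size fill: keep src inside the valid region, write pad elsewhere.
inline std::vector<float> FillPad(const std::vector<float> &src,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t validRows, std::size_t validCols,
                                  float pad) {
  std::vector<float> dst(rows * cols);
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      const bool insideValid = i < validRows && j < validCols;
      dst[i * cols + j] = insideValid ? src[i * cols + j] : pad;
    }
  }
  return dst;
}

// Row-wise maximum over the full static width (cols must be at least 1).
inline std::vector<float> RowMax(const std::vector<float> &tile,
                                 std::size_t rows, std::size_t cols) {
  std::vector<float> result(rows);
  for (std::size_t i = 0; i < rows; ++i) {
    float best = tile[i * cols];
    for (std::size_t j = 1; j < cols; ++j) {
      best = std::max(best, tile[i * cols + j]);
    }
    result[i] = best;
  }
  return result;
}
```

With a zero pad, an all-negative valid region would produce a row maximum of 0, which is why the identity element of the reduction (`-inf` for max, `+inf` for min) is the natural pad choice.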
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/timg2col_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/timg2col_zh.md
index 094dc342..8c4ba5be 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/timg2col_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/timg2col_zh.md
@@ -1,51 +1,32 @@
-# TIMG2COL
+# pto.timg2col
 
-## 指令示意图
+`pto.timg2col` belongs to the [Layout and Rearrangement](../../layout-and-rearrangement_zh.md) instruction set.
 
-![TIMG2COL tile operation](../../../../figures/isa/TIMG2COL.svg)
-
-## 简介
-
-`TIMG2COL` 把输入特征图 Tile 重排成卷积友好的列矩阵形式，是 PTO 里连接卷积样式输入布局与矩阵乘法路径的关键桥梁。
-
-这条指令不只是简单提取窗口。它同时综合了：
-
-- 输入特征图几何信息
-- kernel 大小
-- stride / dilation
-- padding
-- channel 打包方式
-- 当前在逻辑 im2col 矩阵中的起始位置 `posM / posK`
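+## Overview
 
-## 数学语义
+`TIMG2COL` rearranges an input feature-map tile into a convolution-friendly column-matrix form; in PTO it is the key bridge between convolution-style input layouts and the matrix-multiplication path. The instruction does more than extract windows: it also takes into account the input feature-map geometry, kernel size, stride / dilation, padding, the channel packing scheme, and the current starting position `posM / posK` in the logical im2col matrix.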
 
-把卷积输入展开成矩阵时，可以把输出矩阵看成按 `(m, k)` 编址：
+## Mechanism
 
-- `m` 选择输出空间位置
-- `k` 选择卷积核内的通道与空间偏移
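+When a convolution input is unfolded into a matrix, the output matrix can be viewed as addressed by `(m, k)`: `m` selects the output spatial position, and `k` selects the channel and spatial offset within the kernel. `TIMG2COL` writes the convolution-window elements of the source feature map that correspond to `(posM, posK)` into the destination Left tile. If the window crosses the input boundary, the pad value is written.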
 
-`TIMG2COL` 会把源特征图中与 `(posM, posK)` 对应的卷积窗口元素，写到目标 Left Tile 中。若窗口越过输入边界，则写入 pad value。
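+The explicit computation in the CPU simulator is: first derive the output position `(outRow, outCol)` from `stride / dilation / filter / pad`, then derive `(channelIndex, kernelH, kernelW)` from `kIndex`. If the `(inputH, inputW)` obtained by mapping back to the input is out of bounds, `padValue` is written; otherwise the corresponding element of the source feature map is read.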
 
-CPU 模拟器里的显式计算逻辑是：
+## Syntax
 
-- 先根据 `stride / dilation / filter / pad` 推出输出位置 `(outRow, outCol)`
-- 再根据 `kIndex` 推出 `(channelIndex, kernelH, kernelW)`
-- 若映射回输入后的 `(inputH, inputW)` 越界，则写 `padValue`
-- 否则读取源特征图相应元素
+### PTO-AS
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+See the [PTO-AS specification](../../../../assembly/PTO-AS_zh.md).
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -60,53 +41,51 @@ PTO_INST RecordEvent TIMG2COL(TileData &dst, ConvTileData &src, uint16_t posM =
                               WaitEvents&... events);
 ```
 
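+## Inputs
+
+| Operand | Role | Description |
+| --- | --- | --- |
+| `dst` | Output tile | Destination Left tile, of type `TileType::Left` |
+| `src` | Input tile | Source convolution-configuration/feature-map tile, of type `TileType::Mat` |
+| `posM` | Immediate | Starting row offset in the logical im2col matrix |
+| `posK` | Immediate | Starting column offset in the logical im2col matrix |
+
+## Expected Outputs
+
+| Result | Type | Description |
+| --- | --- | --- |
+| `dst` | Tile | Column-matrix tile in im2col format |
+
+## Side Effects
+
+When `FmatrixMode` is `FMATRIX_A_AUTO` or `FMATRIX_B_AUTO`, A2/A3 automatically sets FMATRIX from the `fmapH / fmapW / padList` of `src`; A5 additionally sets the repeat and padding state.
+
 ## Constraints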
 
-### 通用约束
-
-- `src` 必须是卷积配置/特征图 Tile，位置类型为 `TileType::Mat`。
-- 输入布局必须是 `NC1HWC0` 或 `NDC1HWC0`。
-- `dst` 必须是 `TileType::Left`。
-- `src` 与 `dst` 的元素类型必须一致。
-- `posM / posK` 不是像素坐标，而是逻辑 im2col 矩阵中的起始偏移。
-
-### A2/A3 实现
-
-- 支持的数据类型是：
-  `int8_t`、`half`、`bfloat16_t`、`float`。
-- A2/A3 的 `Left` 目标约束是：
-  - `dst.SFractal == SLayout::RowMajor`
-  - `dst.isRowMajor == true`
-- 当 `FmatrixMode` 为 `FMATRIX_A_AUTO` 或 `FMATRIX_B_AUTO` 时，A2/A3 会自动根据 `src` 的：
-  - `fmapH / fmapW`
-  - `padList`
-  来设置 FMATRIX。
-- A2/A3 的 `TIMG2COL` auto 路径**不会**顺手设置 repeat 和 padding 寄存器；如果后续路径依赖这些状态，应显式使用对应的 `TSET_*` 指令。
-
-### A5 实现
-
-- 支持的数据类型更宽，除 `int8_t/half/bfloat16_t/float` 外，还覆盖若干 `uint*` / `int*` 类型。
-- A5 的 `Left` 目标约束是：
-  - `dst.SFractal == SLayout::RowMajor`
-  - `dst.isRowMajor == false`
-- 当 `FmatrixMode` 为 `FMATRIX_A_AUTO` 或 `FMATRIX_B_AUTO` 时，A5 会自动根据 `src` 的：
-  - `fmapH / fmapW`
-  - `padList`
-  - `repeatStride / repeatTime / repeatMode / dstStride / dstMposition`
-  - `padValue`
-  一并设置 FMATRIX、repeat 和 padding 状态。
-
-### CPU 模拟器
-
-- CPU 使用显式公式直接完成 im2col 展开。
-- CPU 目前沿用与 A5 相同的 `Left` 目标布局约束：
-  - `dst.SFractal == SLayout::RowMajor`
-  - `dst.isRowMajor == false`
-
-这意味着 `TIMG2COL` 的 Left Tile 细节并不是所有目标完全一致，文档里必须按目标区分。
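+- `src` must be a convolution-configuration/feature-map tile with location type `TileType::Mat`
+- The input layout must be `NC1HWC0` or `NDC1HWC0`
+- `dst` must be `TileType::Left`
+- `src` and `dst` must have the same element type
+- `posM / posK` are not pixel coordinates; they are starting offsets in the logical im2col matrix
+- Left destination constraints on A2/A3: `dst.SFractal == SLayout::RowMajor`, `dst.isRowMajor == true`
+- Left destination constraints on A5: `dst.SFractal == SLayout::RowMajor`, `dst.isRowMajor == false`
+- The CPU currently follows the same Left destination layout constraints as A5
+
+## Exceptions and Illegal Cases
+
+- The A2/A3 `TIMG2COL` auto path does not set the repeat and padding registers as a side effect; if a later path depends on that state, use the corresponding `TSET_*` instructions explicitly
+
+## Target-Profile Restrictions
+
+| Feature | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| Supported data types | - | `int8_t`, `half`, `bfloat16_t`, `float` | `int8_t`, `half`, `bfloat16_t`, `float`, plus several `uint*` / `int*` types |
+| Left tile layout | `isRowMajor=false` | `isRowMajor=true` | `isRowMajor=false` |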
 
 ## Examples
 
+### C++ Auto Mode
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -123,3 +102,6 @@ void example(LeftTile& dst, ConvTile& src) {
 - [TSETFMATRIX](../../../scalar/ops/control-and-configuration/tsetfmatrix_zh.md)
 - [TSET_IMG2COL_RPT](../sync-and-config/tset-img2col-rpt_zh.md)
 - [TSET_IMG2COL_PADDING](../sync-and-config/tset-img2col-padding_zh.md)
+- [Layout and Rearrangement instruction set](../../layout-and-rearrangement_zh.md)
+
+![TIMG2COL tile operation](../../../../figures/isa/TIMG2COL.svg)
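
Reviewer note (not part of the patch): the CPU-simulator logic described in the `timg2col_zh.md` mechanism section (derive `(outRow, outCol)` from `m`, derive `(channelIndex, kernelH, kernelW)` from `k`, map back to input coordinates, write the pad value when out of bounds) can be sketched for a single `(m, k)` element. Several things here are assumptions for illustration: `ConvGeometry`, `OutputWidth`, and `Im2ColElement` are hypothetical names; the feature map is a plain CHW array rather than the `NC1HWC0` / `NDC1HWC0` layouts the instruction requires; and the decomposition order of `k` (channel, then kernel row, then kernel column) is assumed, not taken from the source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical convolution geometry for a plain CHW feature map.
struct ConvGeometry {
  int channels;
  int fmapH;
  int fmapW;
  int kernelH;
  int kernelW;
  int strideH;
  int strideW;
  int dilationH;
  int dilationW;
  int padTop;
  int padLeft;
  int padRight;
};

// Number of output columns produced by sliding the kernel along the width.
inline int OutputWidth(const ConvGeometry &g) {
  return (g.fmapW + g.padLeft + g.padRight - g.dilationW * (g.kernelW - 1) - 1) /
             g.strideW + 1;
}

// Element (m, k) of the logical im2col matrix.
inline float Im2ColElement(const std::vector<float> &fmap, const ConvGeometry &g,
                           int m, int k, float padValue) {
  // m selects the output spatial position.
  const int outW = OutputWidth(g);
  const int outRow = m / outW;
  const int outCol = m % outW;

  // k selects the channel and the offset inside the kernel (assumed order).
  const int kernelCol = k % g.kernelW;
  const int kernelRow = (k / g.kernelW) % g.kernelH;
  const int channel = k / (g.kernelW * g.kernelH);
  assert(channel < g.channels);

  // Map back to input coordinates; positions in the padding read padValue.
  const int inputH = outRow * g.strideH + kernelRow * g.dilationH - g.padTop;
  const int inputW = outCol * g.strideW + kernelCol * g.dilationW - g.padLeft;
  if (inputH < 0 || inputH >= g.fmapH || inputW < 0 || inputW >= g.fmapW) {
    return padValue;
  }
  return fmap[static_cast<std::size_t>((channel * g.fmapH + inputH) * g.fmapW + inputW)];
}
```

Under this reading, element `(i, j)` of the destination tile would correspond to `Im2ColElement(fmap, g, posM + i, posK + j, padValue)`, since `posM / posK` are offsets in the logical matrix rather than pixel coordinates.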
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp_zh.md
index dbeef3d9..6116e5b3 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp_zh.md
@@ -1,42 +1,34 @@
-# TINSERT_FP
+# pto.tinsert.fp
 
-## 指令示意图
+`pto.tinsert.fp` belongs to the [Layout and Rearrangement](../../layout-and-rearrangement_zh.md) instruction set.
 
-![TINSERT_FP tile operation](../../../../figures/isa/TINSERT_FP.svg)
+## Overview
 
-## 简介
-
-`TINSERT_FP` 是 `TINSERT` 的向量量化版本：它把一个源 Tile 插入到目标 Tile 的某个子区域里，同时使用额外的 `fp` Tile 提供量化参数。
-
-这条指令不是通用“任意 Tile 插入”。当前真实 backend 路径主要面向“把 `Acc` Tile 的一块结果，按向量量化规则插入到 `Mat` Tile 的指定位置”。
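+`TINSERT_FP` is the vector-quantized variant of `TINSERT`: it inserts a source tile into a sub-region of the destination tile while using an additional `fp` tile to supply quantization parameters. The current real backend path mainly targets inserting a block of results from an `Acc` tile into a given position of a `Mat` tile according to the vector-quantization rules.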
 
 ## Mechanism
 
-对最直观的结果可以这样理解：
-
-1. 从 `src` 读取一个有效子块。
-2. 依据 `fp` 提供的量化参数，把这块数据做向量量化或相关转换。
-3. 把结果写到 `dst` 的 `(indexRow, indexCol)` 起始位置。
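+The most intuitive way to understand the result: first read a valid sub-block from `src`, then apply vector quantization or a related conversion to that data according to the quantization parameters supplied by `fp`, and finally write the result at the starting position `(indexRow, indexCol)` of `dst`. Considering only the write positions, this corresponds to: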
 
-若只看写入位置关系，它对应的是：
-
-$$ \mathrm{dst}_{indexRow + i,\; indexCol + j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
+$$\mathrm{dst}_{indexRow + i,\; indexCol + j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right)$$
 
 The concrete form of `Convert` is determined by the vector-quantization mode on the backend.
 
-## 汇编语法
+## Syntax
+
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+See the [PTO-AS specification](../../../../assembly/PTO-AS_zh.md).
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -51,40 +43,51 @@ PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &
                                 uint16_t indexCol, WaitEvents &... events);
 ```
 
-## 约束
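+## Inputs
 
-### 通用约束
+| Operand | Role | Description |
+| --- | --- | --- |
+| `src` | Input tile | Source tile, coming from `Acc` |
+| `fp` | Quantization-parameter tile | Provides the scaling information needed for vector quantization; should be `TileType::Scaling` |
+| `indexRow` | Starting row index | Starting row of the insertion in the destination tile |
+| `indexCol` | Starting column index | Starting column of the insertion in the destination tile |
+| `dst` | Output tile | Destination tile |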
 
-- `indexRow + src.Rows` 不能超过 `dst.Rows`。
-- `indexCol + src.Cols` 不能超过 `dst.Cols`。
-- `fp` 的设计意图是承载缩放/量化参数，可移植代码应把它建成 `TileType::Scaling`。
+## Expected Outputs
 
-### A2/A3 实现
+| Result | Type | Description |
+| --- | --- | --- |
+| `dst` | Tile | Destination tile with the quantized source tile inserted |
 
-- 这条路径基于 `CheckTMovAccToMat(...)`，因此：
-  - `src` 必须来自 `Acc`
-  - `dst` 必须是 `Mat`
-  - `dst` fractal size 必须是 `512`
-  - `dst` 列宽字节数必须是 `32` 的倍数
-- `TINSERT_FP` 走的是向量量化版本，因此使用 `GetVectorPreQuantMode(...)`。
-- `FpTileData::Loc` 必须是 `TileType::Scaling`。
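+## Side Effects
 
-### A5 实现
+The CPU simulator currently accepts the `TINSERT_FP` interface but ignores the `fp` argument, degrading to a plain `TINSERT`. For quantization behavior that depends on `fp` values, the NPU backend is authoritative.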
 
-- A5 也把这条指令实现为 `Acc -> Mat` 的量化插入。
-- 额外要求：
-  - `dst` 必须是 `TileType::Mat`
-  - `dst` 必须使用 `BLayout::ColMajor + SLayout::RowMajor`
-  - `src` 必须是 `float` 或 `int32_t` 的 `Acc`
-- `FpTileData::Loc` 必须是 `TileType::Scaling`。
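+## Constraints
+
+- `indexRow + src.Rows` must not exceed `dst.Rows`
+- `indexCol + src.Cols` must not exceed `dst.Cols`
+- `fp` is intended to carry scaling/quantization parameters; portable code should declare it as `TileType::Scaling`
+- On A2/A3 this path is based on `CheckTMovAccToMat(...)`, so `src` must come from `Acc`, `dst` must be `Mat`, the `dst` fractal size must be `512`, and the `dst` column width in bytes must be a multiple of `32`
+- `TINSERT_FP` takes the vector-quantization path and therefore uses `GetVectorPreQuantMode(...)`
+- `FpTileData::Loc` must be `TileType::Scaling`
+- A5 also implements this instruction as a quantized `Acc -> Mat` insertion: `dst` must be `TileType::Mat` and use `BLayout::ColMajor + SLayout::RowMajor`, and `src` must be an `Acc` of `float` or `int32_t`
+- On A5, `FpTileData::Loc` must be `TileType::Scaling`
+
+## Exceptions and Illegal Cases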
 
-### CPU 模拟器
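+- If the insertion region exceeds the bounds of the destination tile, the behavior is undefined
+- If the quantization parameters are malformed, the backend may report an error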
 
-- CPU 模拟器当前接受 `TINSERT_FP` 接口，但会忽略 `fp` 参数，退化为普通 `TINSERT`。
-- 因此，依赖 `fp` 数值的量化行为应以 NPU backend 为准。
+## Target-Profile Restrictions
+
+| Feature | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## Examples
 
+### C++ Auto Mode
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -107,3 +110,4 @@ void example() {
 - [TINSERT](./tinsert_zh.md)
 - [TEXTRACT_FP](./textract-fp_zh.md)
 - [TMOV_FP](./tmov-fp_zh.md)
+- Instruction set overview: [Layout and Rearrangement](../../layout-and-rearrangement_zh.md)
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md
index 40ead58a..40509fc5 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md
@@ -1,26 +1,20 @@
-﻿# TINSERT
+# pto.tinsert
 
-## 指令示意图
+`pto.tinsert` belongs to the [Layout and Rearrangement](../../layout-and-rearrangement_zh.md) instruction set.
 
-![TINSERT tile operation](../../../../figures/isa/TINSERT.svg)
+## Overview
 
-## 简介
+`TINSERT` inserts a sub-tile into the destination tile at the offset `(indexRow, indexCol)`.
 
-在 (indexRow, indexCol) 偏移处将子 Tile 插入到目标 Tile 中。
-
-## 数学语义
+## Mechanism
 
 Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. Conceptually, for `0 <= i < R` and `0 <= j < C`:
 
-$$
-\mathrm{dst}_{\mathrm{indexRow}+i,\;\mathrm{indexCol}+j} = \mathrm{src}_{i,j}
-$$
-
-## 汇编语法
+$$\mathrm{dst}_{\mathrm{indexRow}+i,\;\mathrm{indexCol}+j} = \mathrm{src}_{i,j}$$
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## Syntax
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
@@ -28,13 +22,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -63,44 +57,69 @@ PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint32_t indexR
 #endif
 ```
 
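+## Inputs
+
+| Operand | Role | Description |
+| --- | --- | --- |
+| `src` | Input tile | Source tile |
+| `indexRow` | Starting row index | Starting row of the insertion in the destination tile |
+| `indexCol` | Starting column index | Starting column of the insertion in the destination tile |
+| `dst` | Output tile | Destination tile; partially overwritten |
+
+## Expected Outputs
+
+| Result | Type | Description |
+| --- | --- | --- |
+| `dst` | Tile | Destination tile with the source tile inserted |
+
+## Side Effects
+
+The valid region of the source tile is written to the given position in `dst`; the contents of `dst` elsewhere are left unchanged.
+
 ## Constraints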
 
-- **A2/A3**:
-    - 文档中列出的这些重载对应 `Acc -> Mat` 插入路径，包括普通形式、`reluMode` 形式、标量预量化形式以及向量预量化（`TINSERT_FP`）形式。
-    - 运行时边界必须满足 `indexRow + src.Rows <= dst.Rows` 且 `indexCol + src.Cols <= dst.Cols`。
-- **A5**:
-    - 除了上面的 `Acc -> Mat` 插入路径外，A5 还额外提供 `template <TInsertMode mode, ...> TINSERT(...)`，用于 `Vec -> Mat` 与 `Vec -> Vec` 插入变体。
-    - `mode == TInsertMode::ND` 要求源向量 tile 为行优先，并以 ND 布局插入到矩阵 tile。
-    - `mode == TInsertMode::ND_VEC` 要求源和目的都为行优先向量 tile。
-    - NZ 系列模式（`NZ`、`NZ_PLUS_1`、`SPLIT2_NZ_PLUS_1`、`SPLIT4_NZ_PLUS_1`）要求源向量 tile 为 NZ 格式，目的为矩阵 tile。
+- 运行时边界必须满足 `indexRow + src.Rows <= dst.Rows` 且 `indexCol + src.Cols <= dst.Cols`
+- A2/A3 的重载对应 `Acc -> Mat` 插入路径，包括普通形式、`reluMode` 形式、标量预量化形式以及向量预量化（`TINSERT_FP`）形式
+- A5 除了上面的 `Acc -> Mat` 插入路径外，还额外提供 `template <TInsertMode mode, ...> TINSERT(...)`，用于 `Vec -> Mat` 与 `Vec -> Vec` 插入变体
+- `mode == TInsertMode::ND` 要求源向量 tile 为行优先，并以 ND 布局插入到矩阵 tile
+- `mode == TInsertMode::ND_VEC` 要求源和目的都为行优先向量 tile
+- NZ 系列模式（`NZ`、`NZ_PLUS_1`、`SPLIT2_NZ_PLUS_1`、`SPLIT4_NZ_PLUS_1`）要求源向量 tile 为 NZ 格式，目的为矩阵 tile
 
-## 示例
+## 异常与非法情形
 
-参见 `docs/isa/` 和 `docs/coding/tutorials/` 中的相关示例。
+- 如果插入区域超出目标 Tile 范围，行为未定义
+- 如果源/目标类型组合不被支持，backend 可能报错
 
-## 汇编示例（ASM）
+## Target-Profile 限制
 
-### 自动模式
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
+## 示例
 
-### 手动模式
+### C++ 自动模式
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using SrcT = TileAcc<float, 16, 16>;
+  using DstT = Tile<TileType::Mat, float, 32, 32>;
+  SrcT src;
+  DstT dst;
+  TINSERT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
+}
 ```
 
-### PTO 汇编形式
+### PTO-AS
 
 ```text
 %dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- [TINSERT_FP](./tinsert-fp_zh.md)
+- 指令集总览：[布局与重排](../../layout-and-rearrangement_zh.md)
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp_zh.md
index a775b093..6889aee6 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp_zh.md
@@ -1,32 +1,22 @@
-# TMOV_FP
+# pto.tmov.fp
 
-## 指令示意图
+`pto.tmov.fp` 属于[布局与重排](../../layout-and-rearrangement_zh.md)指令集。
 
-![TMOV_FP tile operation](../../../../figures/isa/TMOV_FP.svg)
+## 概述
 
-## 简介
+`TMOV_FP` 是 `TMOV` 家族里的向量量化版本：它从累加器 Tile 读取源数据，同时使用额外的 `fp` Tile 提供量化参数，把结果移动到目标 Tile。这条指令解决的是"单纯按 dtype cast 不够"的问题。很多量化路径不仅需要源/目标类型，还需要一组外部缩放参数。`TMOV_FP` 把这组参数显式建模成一个 Tile 操作数。
 
-`TMOV_FP` 是 `TMOV` 家族里的向量量化版本：它从累加器 Tile 读取源数据，同时使用额外的 `fp` Tile 提供量化参数，把结果移动到目标 Tile。
-
-这条指令解决的是“单纯按 dtype cast 不够”的问题。很多量化路径不仅需要源/目标类型，还需要一组外部缩放参数。`TMOV_FP` 把这组参数显式建模成一个 Tile 操作数。
-
-## 数学语义
+## 机制
 
 对有效区域中的每个元素，`TMOV_FP` 可以概念化为：
 
-$$ \mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
-
-这里的 `Convert` 并不是单纯的 C++ 类型转换，而是“由 `fp` 指定量化 / 反量化参数”的目标相关转换。架构层可见的合同是：
-
-- `src` 来自 `Acc` Tile；
-- `fp` 提供向量量化所需的缩放信息；
-- `dst` 接收量化或反量化后的结果。
+$$\mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right)$$
 
-## 汇编语法
+这里的 `Convert` 并不是单纯的 C++ 类型转换，而是"由 `fp` 指定量化 / 反量化参数"的目标相关转换。架构层可见的合同是：`src` 来自 `Acc` Tile，`fp` 提供向量量化所需的缩放信息，`dst` 接收量化或反量化后的结果。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
@@ -34,13 +24,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -56,49 +46,49 @@ PTO_INST RecordEvent TMOV_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp,
 
 如果目标是 Vec Tile，并且还需要显式选择 `AccToVecMode`，则使用 `TMOV` 的同族模板重载而不是这个命名接口。
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 Tile | 来自 `Acc` Tile 的源数据 |
+| `fp` | 量化参数 Tile | 提供向量量化所需的缩放信息，应为 `TileType::Scaling` |
+| `dst` | 输出 Tile | 量化或反量化后的结果 |
 
-### 通用约束
+## 预期输出
 
-- `src` 必须来自 `Acc` Tile。
-- `fp` 的设计意图是承载量化参数；可移植代码应将它建成 `TileType::Scaling`。
-- 目标 dtype 的支持集由具体 backend 上的向量量化模式决定，而不是由 `TMOV_FP` 单独定义一套新规则。
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 经过向量量化转换后的 Tile |
 
-### A2/A3 实现检查
+## 副作用
 
-- A2/A3 的 `TMOV_FP` 实际对应 `Acc -> Mat` 路径，不支持 `Acc -> Vec`。
-- `FpTileData::Loc` 必须是 `TileType::Scaling`。
-- 源必须是 `Acc`，目标必须是 `Mat`，并且满足 `CheckTMovAccToMat(...)`：
-  - 目标 fractal size 必须是 `512`
-  - 目标列宽字节数必须是 `32` 的倍数
-  - 源 dtype 必须是 `float` 或 `int32_t`
-- 量化支持集为：
-  - `float Acc -> int8_t Mat`
-  - `int32_t Acc -> int8_t / uint8_t / half / int16_t Mat`
+CPU 模拟器当前会接受 `TMOV_FP` 接口，但不会真正消费 `fp` 参数，而是退化为普通 `TMOV` 路径。依赖 `fp` 具体数值的行为应以 NPU backend 和目标 profile 为准。
 
-### A5 实现检查
+## 约束
+
+- `src` 必须来自 `Acc` Tile
+- `fp` 的设计意图是承载量化参数，可移植代码应将它建成 `TileType::Scaling`
+- 目标 dtype 的支持集由具体 backend 上的向量量化模式决定
+- A2/A3 的 `TMOV_FP` 实际对应 `Acc -> Mat` 路径，不支持 `Acc -> Vec`
+- A2/A3 要求 `FpTileData::Loc` 必须是 `TileType::Scaling`，源必须是 `Acc`，目标必须是 `Mat`，并且满足 `CheckTMovAccToMat(...)`：目标 fractal size 必须是 `512`、目标列宽字节数必须是 `32` 的倍数、源 dtype 必须是 `float` 或 `int32_t`
+- A2/A3 量化支持集为 `float Acc -> int8_t Mat` 和 `int32_t Acc -> int8_t / uint8_t / half / int16_t Mat`
+- A5 的 `TMOV_FP` 由 `CheckTMovAccValid(..., true)` 约束，`FpTileData::Loc` 必须是 `TileType::Scaling`
+- A5 命名接口 `TMOV_FP(dst, src, fp)` 支持 `Acc -> Vec` 和 `Acc -> Mat`
+- 对 `Acc -> Vec/Mat`，源 dtype 必须是 `float` 或 `int32_t`，目标布局只允许 `nz2nz`、`nz2nd`、`nz2dn`，目标 stride 对应字节数必须是 `32` 的倍数
+- A5 量化支持集为 `float Acc -> int8_t / uint8_t / hifloat8_t / half / bfloat16_t / float8_e4m3_t / float` 和 `int32_t Acc -> int8_t / uint8_t / half / bfloat16_t`
 
-- A5 的 `TMOV_FP` 由 `CheckTMovAccValid(..., true)` 约束。
-- `FpTileData::Loc` 必须是 `TileType::Scaling`。
-- 命名接口 `TMOV_FP(dst, src, fp)` 支持：
-  - `Acc -> Vec`
-  - `Acc -> Mat`
-- 对 `Acc -> Vec/Mat`：
-  - 源 dtype 必须是 `float` 或 `int32_t`
-  - 目标布局只允许 `nz2nz`、`nz2nd`、`nz2dn`
-  - 目标 stride 对应字节数必须是 `32` 的倍数
-- 量化支持集为：
-  - `float Acc -> int8_t / uint8_t / hifloat8_t / half / bfloat16_t / float8_e4m3_t / float`
-  - `int32_t Acc -> int8_t / uint8_t / half / bfloat16_t`
+## 异常与非法情形
 
-### CPU 模拟器说明
+- 如果目标 Tile 类型不在当前 backend 支持的路径内（A2/A3 仅支持 `Mat`，A5 支持 `Vec` 和 `Mat`），或 `fp` 不是 `TileType::Scaling`，backend 可能报错
 
-- 当前 CPU 模拟器会接受 `TMOV_FP` 接口，但不会真正消费 `fp` 参数，而是退化为普通 `TMOV` 路径。
-- 因此，依赖 `fp` 具体数值的行为应以 NPU backend 和目标 profile 为准。
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -117,7 +107,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -139,8 +129,14 @@ void example_manual() {
 }
 ```
 
+### PTO-AS
+
+```text
+%dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
 ## 相关页面
 
 - [TMOV](./tmov_zh.md)
 - [TSTORE_FP](../memory-and-data-movement/tstore-fp_zh.md)
-- [布局与重排指令集](../../layout-and-rearrangement_zh.md)
+- 指令集总览：[布局与重排](../../layout-and-rearrangement_zh.md)
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tmov_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/tmov_zh.md
index 7e94a380..8ea6bc14 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/tmov_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/tmov_zh.md
@@ -1,38 +1,24 @@
-# TMOV
+# pto.tmov
 
-## 指令示意图
+`pto.tmov` 属于[布局与重排](../../layout-and-rearrangement_zh.md)指令集。
 
-![TMOV tile operation](../../../../figures/isa/TMOV.svg)
+## 概述
 
-## 简介
-
-`TMOV` 是 PTO tile 空间里的“位置变换与局部搬运”总入口。它不在 GM 和 Tile 之间传输数据，而是在不同 Tile 位置、不同布局或不同后端缓冲语义之间移动数据。
-
-这条指令存在的理由很直接：`TLOAD` / `TSTORE` 负责进出 `GlobalTensor`，但很多后端计算单元还要求更具体的本地表示。例如 cube 路径要用 `Left` / `Right` / `Acc`，偏置和量化又要用 `Bias` / `Scaling` / `ScaleLeft` / `ScaleRight`。`TMOV` 正是这些本地表示之间的桥。
+`TMOV` 是 PTO tile 空间里的"位置变换与局部搬运"总入口。它不在 GM 和 Tile 之间传输数据，而是在不同 Tile 位置、不同布局或不同后端缓冲语义之间移动数据。`TMOV` 不是单一的数据通路，而是一组同名重载，实际语义由源 Tile、目标 Tile 以及所选重载共同决定。常见路径包括：`Vec -> Vec`，在两个向量 Tile 之间复制有效区域；`Mat -> Left/Right`，把一般矩阵 Tile 重排成 cube 乘法所需的形态；`Mat -> Bias/Scaling`，把单行矩阵 Tile 送入偏置表或 fixpipe buffer；`Acc -> Vec/Mat`，把累加器内容取出到普通 Tile，并可选附带 ReLU、标量量化或向量量化。
 
 ## 机制
 
-`TMOV` 不是单一的数据通路，而是一组同名重载。实际语义由源 Tile、目标 Tile 以及所选重载共同决定。
-
-最常见的几类路径是：
-
-- `Vec -> Vec`：在两个向量 Tile 之间复制有效区域。
-- `Mat -> Left/Right`：把一般矩阵 Tile 重排成 cube 乘法所需的 `Left` / `Right` 形态。
-- `Mat -> Bias/Scaling`：把单行矩阵 Tile 送入偏置表或 fixpipe buffer。
-- `Acc -> Vec/Mat`：把累加器内容取出到普通 Tile，并可选附带 ReLU、标量量化或向量量化。
-- 某些 A5 特化路径还支持 `Vec -> Mat`、`Mat -> ScaleLeft/ScaleRight`，以及带 `tmp` 的 ND->ZZ 打包。
-
-对最简单的纯复制场景，可以把它看成：
+对最简单的纯复制场景，`TMOV` 可以看成：
 
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} $$
+$$\mathrm{dst}_{i,j} = \mathrm{src}_{i,j}$$
 
-但一旦跨位置或跨 layout，真正发生的就不只是“复制”，还包括重排、格式转换和目标缓冲映射。
+但一旦跨位置或跨 layout，真正发生的就不只是"复制"，还包括重排、格式转换和目标缓冲映射。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
-PTO-AS 设计上通常会把不同子路径拆成更明确的 spelling，例如：
+PTO-AS 设计上通常会把不同子路径拆成更明确的 spelling：
 
 ```text
 %left  = tmov.m2l %mat  : !pto.tile<...> -> !pto.tile<...>
@@ -45,13 +31,13 @@ PTO-AS 设计上通常会把不同子路径拆成更明确的 spelling，例如
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tmov.s2d %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -91,76 +77,59 @@ template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPre
 PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 Tile | 源 Tile |
+| `dst` | 输出 Tile | 目标 Tile |
+| `tmp` | 可选临时 Tile | 仅某些特化路径需要 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 经过位置变换或格式转换后的 Tile |
+
+## 副作用
+
+对纯 `Vec -> Vec` 复制，backend 复制的是双方 valid region 的交集，而不是强制把物理 Tile 全部搬满。
+
 ## 约束
 
-### 通用约束
-
-- 这是一组重载，不同路径的合法性差异很大。阅读 `TMOV` 时，先看“源位置 -> 目标位置”是哪一种。
-- `reluMode` 取值为 `ReluPreMode::{NoRelu, NormalRelu}`。
-- `mode` 取值为 `AccToVecMode::{SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN}`。
-- 对纯 `Vec -> Vec` 复制，backend 复制的是双方 valid region 的交集，而不是强制把物理 Tile 全部搬满。
-
-### A2/A3 实现检查
-
-- A2/A3 支持的主要路径是：
-  - `Mat -> Left/Right/Bias/Scaling`
-  - `Vec -> Vec`
-  - `Acc -> Mat`
-- 这些路径都要求源和目标的静态 `Rows/Cols` 相同。
-- `Mat -> Bias`：
-  - 仅支持 `int32_t -> int32_t`、`float -> float`、`half -> float`
-  - 源行数必须为 `1`
-  - `Cols * sizeof(srcType)` 必须按 `64` 字节对齐
-- `Mat -> Scaling`：
-  - 目标类型必须与源类型一致，且必须是 `uint64_t`
-  - 源行数必须为 `1`
-  - `Cols * sizeof(srcType)` 必须按 `128` 字节对齐
-- `Acc -> Mat` 普通 / ReLU 形式并不是通用 cast：
-  - 源必须是 `Acc`
-  - 目标必须是 `Mat`
-  - 目标 fractal size 固定为 `512`
-  - 目标列宽字节数必须是 `32` 的倍数
-  - 普通 / ReLU 形式只覆盖 `float Acc -> half/bfloat16 Mat`
-- `Acc -> Mat` 标量量化 / 向量量化路径的支持集更窄：
-  - `float Acc -> int8_t Mat`
-  - `int32_t Acc -> int8_t / uint8_t / half / int16_t Mat`
-- A2/A3 没有 `Acc -> Vec` 的 `TMOV` 路径；这点和 A5 不同。
-
-### A5 实现检查
-
-- A5 支持的路径更宽，主要包括：
-  - `Mat -> Left/Right/Bias/Scaling/ScaleLeft/ScaleRight`
-  - `Vec -> Vec/Mat`
-  - `Acc -> Vec/Mat`
-- 对普通 `Mat` / `Vec` 路径，源和目标通常要求同 dtype。支持的常见元素类型包括：
-  `int8_t`、`hifloat8_t`、`float8_e5m2_t`、`float8_e4m3_t`、`half`、`bfloat16_t`、`float`、
-  `float4_e2m1x2_t`、`float4_e1m2x2_t`。
-- MX scale 路径额外覆盖 `float8_e8m0_t`。
-- `Mat -> Bias`：
-  - 支持 `int32_t -> int32_t`、`float -> float`、`half -> float`、`bfloat16_t -> float`
-  - 源行数必须为 `1`
-  - 目标字节数必须按 `64` 字节对齐，且总占用不超过 `4096` 字节
-- `Mat -> Scaling`：
-  - 源行数必须为 `1`
-  - 目标字节数必须按 `128` 字节对齐，且总占用不超过 `4096` 字节
-- `Acc -> Vec/Mat`：
-  - 源必须是 `float` 或 `int32_t` 的 `Acc`
-  - 目标布局只允许 `nz2nz`、`nz2nd`、`nz2dn`
-  - 目标 stride 必须非零，且对应字节数必须是 `32` 的倍数
-- A5 的非量化 `Acc -> Vec/Mat` 支持：
-  - `float -> half / bfloat16 / float`
-  - `int32_t -> int32_t`
-- A5 的量化路径支持：
-  - `float -> int8_t / uint8_t / hifloat8_t / half / bfloat16_t / float8_e4m3_t / float`
-  - `int32_t -> int8_t / uint8_t / half / bfloat16_t`
-- `DualModeSplitM` / `DualModeSplitN` 仅适用于 `Acc -> Vec`，并且：
-  - 不能与量化同时使用
-  - 不支持 `nz2dn` 输出路径
-- A5 还提供一个带 `tmp` 的特化重载，用于 `uint8_t` 数据的 ND->ZZ 打包路径；它不是通用三操作数 `TMOV`，而是一个窄 backend 特化。
+- 这是一组重载，不同路径的合法性差异很大。阅读 `TMOV` 时，先看"源位置 -> 目标位置"是哪一种
+- `reluMode` 取值为 `ReluPreMode::{NoRelu, NormalRelu}`
+- `mode` 取值为 `AccToVecMode::{SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN}`
+- A2/A3 支持的主要路径是 `Mat -> Left/Right/Bias/Scaling`、`Vec -> Vec`、`Acc -> Mat`，这些路径都要求源和目标的静态 `Rows/Cols` 相同
+- `Mat -> Bias` 仅支持 `int32_t -> int32_t`、`float -> float`、`half -> float`，源行数必须为 `1`，且 `Cols * sizeof(srcType)` 必须按 `64` 字节对齐
+- `Mat -> Scaling` 目标类型必须与源类型一致且必须是 `uint64_t`，源行数必须为 `1`，且 `Cols * sizeof(srcType)` 必须按 `128` 字节对齐
+- `Acc -> Mat` 普通 / ReLU 形式只覆盖 `float Acc -> half/bfloat16 Mat`，源必须是 `Acc`、目标必须是 `Mat`、目标 fractal size 固定为 `512`、目标列宽字节数必须是 `32` 的倍数
+- `Acc -> Mat` 标量量化 / 向量量化路径支持 `float Acc -> int8_t Mat` 以及 `int32_t Acc -> int8_t / uint8_t / half / int16_t Mat`
+- A2/A3 没有 `Acc -> Vec` 的 `TMOV` 路径
+- A5 支持的路径更宽，主要包括 `Mat -> Left/Right/Bias/Scaling/ScaleLeft/ScaleRight`、`Vec -> Vec/Mat`、`Acc -> Vec/Mat`
+- 对普通 `Mat` / `Vec` 路径，源和目标通常要求同 dtype。支持的常见元素类型包括 `int8_t`、`hifloat8_t`、`float8_e5m2_t`、`float8_e4m3_t`、`half`、`bfloat16_t`、`float`、`float4_e2m1x2_t`、`float4_e1m2x2_t`
+- MX scale 路径额外覆盖 `float8_e8m0_t`
+- `Mat -> Bias` 支持 `int32_t -> int32_t`、`float -> float`、`half -> float`、`bfloat16_t -> float`，源行数必须为 `1`，目标字节数必须按 `64` 字节对齐且总占用不超过 `4096` 字节
+- `Mat -> Scaling` 源行数必须为 `1`，目标字节数必须按 `128` 字节对齐且总占用不超过 `4096` 字节
+- `Acc -> Vec/Mat` 源必须是 `float` 或 `int32_t` 的 `Acc`，目标布局只允许 `nz2nz`、`nz2nd`、`nz2dn`，目标 stride 必须非零且对应字节数必须是 `32` 的倍数
+- A5 的非量化 `Acc -> Vec/Mat` 支持 `float -> half / bfloat16 / float` 和 `int32_t -> int32_t`
+- A5 的量化路径支持 `float -> int8_t / uint8_t / hifloat8_t / half / bfloat16_t / float8_e4m3_t / float` 以及 `int32_t -> int8_t / uint8_t / half / bfloat16_t`
+- `DualModeSplitM` / `DualModeSplitN` 仅适用于 `Acc -> Vec`，不能与量化同时使用，且不支持 `nz2dn` 输出路径
+- A5 还提供一个带 `tmp` 的特化重载，用于 `uint8_t` 数据的 ND->ZZ 打包路径
+
+## 异常与非法情形
+
+- 如果路径组合不被支持，backend 会报错
+- A5 的非量化 `Acc -> Vec/Mat` 不支持 `int32_t` 到浮点类型的转换
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -174,7 +143,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -192,8 +161,16 @@ void example_manual() {
 }
 ```
 
+### PTO-AS
+
+```text
+%left  = tmov.m2l %mat  : !pto.tile<...> -> !pto.tile<...>
+%right = tmov.m2r %mat  : !pto.tile<...> -> !pto.tile<...>
+%vec   = tmov.a2v %acc  : !pto.tile<...> -> !pto.tile<...>
+```
+
 ## 相关页面
 
-- [布局与重排指令集](../../layout-and-rearrangement_zh.md)
+- 指令集总览：[布局与重排](../../layout-and-rearrangement_zh.md)
 - [TMATMUL](../matrix-and-matrix-vector/tmatmul_zh.md)
 - [布局参考](../../../state-and-types/layout_zh.md)
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/treshape_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/treshape_zh.md
index 1d659de0..8ac1a006 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/treshape_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/treshape_zh.md
@@ -1,27 +1,18 @@
-# TRESHAPE
+# pto.treshape
 
-## 指令示意图
+`pto.treshape` 属于[布局与重排](../../layout-and-rearrangement_zh.md)指令集。
 
-![TRESHAPE tile operation](../../../../figures/isa/TRESHAPE.svg)
+## 概述
 
-## 简介
-
-`TRESHAPE` 重新解释一个 Tile 的字节视图，而不改变底层字节内容。它不是数值转换，也不是数据搬运；它做的是“同一块数据，用另一种 Tile 形状/类型规则来看”。
-
-如果你需要真的改变值，应该找 `TCVT`、`TMOV` 或量化类指令；如果你只是想换一种兼容的 Tile 视图，才使用 `TRESHAPE`。
+`TRESHAPE` 重新解释一个 Tile 的字节视图，而不改变底层字节内容。它不是数值转换，也不是数据搬运；它做的是"同一块数据，用另一种 Tile 形状/类型规则来看"。因此它的核心前提不是"形状能不能算"，而是"总字节数和布局类别能不能兼容"。
 
 ## 机制
 
-从结果上看，可以把 `TRESHAPE` 理解成：
-
-- `src` 的字节序列保持不变
-- `dst` 只是用另一套 Tile 元数据去解释同一批字节
-
-因此它的核心前提不是“形状能不能算”，而是“总字节数和布局类别能不能兼容”。
+从结果上看，可以这样理解 `TRESHAPE`：`src` 的字节序列保持不变，`dst` 只是用另一套 Tile 元数据去解释同一批字节。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+### PTO-AS
 
 ```text
 %dst = treshape %src : !pto.tile<...>
@@ -29,13 +20,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -48,31 +39,45 @@ template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
 PTO_INST RecordEvent TRESHAPE(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
 
-### 所有 backend 都共享的硬约束
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 Tile | 源 Tile，保持字节内容不变 |
+| `dst` | 输出 Tile | 用新的 Tile 元数据解释同一批字节 |
 
-- `TileDataIn::Loc == TileDataOut::Loc`
-- `sizeof(InElem) * InNumel == sizeof(OutElem) * OutNumel`
-- 不能在 boxed layout 和 non-boxed layout 之间重解释
+## 预期输出
 
-### CPU 模拟器
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 与 `src` 字节内容相同但 Tile 元数据不同的 Tile |
 
-- CPU 还会额外检查元素类型兼容性：
-  - 同类型，或
-  - 都是浮点，或
-  - 都是整数
+## 副作用
 
-### A2/A3 / A5 / Kirin9030
+`TRESHAPE` 在 NPU 上实际更接近"受约束的别名/重解释"，而不是一次真实复制。
 
-- NPU 路径没有 CPU 那么强的“元素类别兼容”检查。
-- A2/A3 在非自动路径下会把 `dst` 直接别名到 `src` 的地址；自动路径用 `__cce_alias`。
-- A5 和 Kirin9030 复用 A2/A3 的 `TRESHAPE` 实现。
+## 约束
+
+- 所有 backend 都必须满足：`TileDataIn::Loc == TileDataOut::Loc`
+- 所有 backend 都必须满足：`sizeof(InElem) * InNumel == sizeof(OutElem) * OutNumel`
+- 不能在 boxed layout 和 non-boxed layout 之间重解释
+- CPU 模拟器还会额外检查元素类型兼容性：同类型，或都是浮点，或都是整数
+- A2/A3 在非自动路径下会把 `dst` 直接别名到 `src` 的地址；自动路径用 `__cce_alias`
+- A5 和 Kirin9030 复用 A2/A3 的 `TRESHAPE` 实现
+
+## 异常与非法情形
 
-这意味着 `TRESHAPE` 在 NPU 上更接近“受约束的别名/重解释”，而不是一次真实复制。
+- 如果违反上述约束，行为由具体 backend 决定
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
 
 ## 示例
 
+### C++ 自动模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -93,4 +98,4 @@ void example() {
 
 - [TALIAS](../../../TALIAS_zh.md)
 - [TMOV](./tmov_zh.md)
-- [布局与重排指令集](../../layout-and-rearrangement_zh.md)
+- 指令集总览：[布局与重排](../../layout-and-rearrangement_zh.md)
diff --git a/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md
index e53fd834..b55c890c 100644
--- a/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md
+++ b/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md
@@ -1,41 +1,42 @@
-﻿# TTRANS
+# pto.ttrans
 
-## 指令示意图
+`pto.ttrans` 属于[布局与重排](../../layout-and-rearrangement_zh.md)指令集。
 
-![TTRANS tile operation](../../../../figures/isa/TTRANS.svg)
-
-## 简介
+## 概述
 
-使用实现定义的临时 Tile 进行转置。
+`TTRANS` 对二维 Tile 做转置，转置过程使用实现定义的临时 Tile。
 
-## 数学语义
+## 机制
 
 对于二维 Tile，在有效转置域上：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{j,i} $$
 
-确切的形状/布局及转置域取决于目标硬件（参见约束）。
+确切的形状/布局及转置域取决于目标硬件。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+### PTO-AS
+
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
 ```text
 %dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
 ```
+
 降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -45,38 +46,47 @@ pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TTRANS(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
+PTO_INST RecordEvent TTRANS(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 Tile | 目标 Tile |
+| `src` | 输入 Tile | 源 Tile |
+| `tmp` | 临时 Tile | C++ API 需要 `tmp`，但某些实现可能不使用它 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 转置后的 Tile |
+
+## 副作用
+
+结果写入 `dst`。若实现使用了 `tmp`，其内容可能被改写，指令执行后不应依赖 `tmp` 中的数据。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`。
-    - 源布局必须是行主序（`TileDataSrc::isRowMajor`）。
-    - 元素大小必须是 `1`、`2` 或 `4` 字节。
-    - 支持的元素类型按元素宽度限制如下：
-    - 4 字节：`uint32_t`、`int32_t`、`float`
-    - 2 字节：`uint16_t`、`int16_t`、`half`、`bfloat16_t`
-    - 1 字节：`uint8_t`、`int8_t`
-    - 转置大小取自 `src.GetValidRow()` / `src.GetValidCol()`。
-- **实现检查 (A5)**:
-    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`。
-    - 对输入和输出的主维度强制执行 32 字节对齐约束（行主序检查 `Cols * sizeof(T) % 32 == 0`，列主序检查 `Rows * sizeof(T) % 32 == 0`）。
-    - 支持的元素类型按元素宽度限制如下：
-    - 4 字节：`uint32_t`、`int32_t`、`float`
-    - 2 字节：`uint16_t`、`int16_t`、`half`、`bfloat16_t`
-    - 1 字节：`uint8_t`、`int8_t`
-    - 实现在静态 Tile 形状（`TileDataSrc::Rows/Cols`）上运算，不参考 `GetValidRow/GetValidCol`。
-- **临时 Tile**:
-    - C++ API 需要 `tmp`，但某些实现可能不使用它。
-- **ConvTile**:
-    - 支持在`TileType::Vec`上的ConvTile的格式转换。其元素大小必须是 `1`、`2` 或 `4` 字节。元素类型限制为`uint32_t`、`int32_t`、`float`、`uint16_t`、`int16_t`、`half`、`bfloat16_t`、`uint8_t`、`int8_t`。
-    - 支持ConvTile从`NCHW`到`NC1HWC0`的变换，其中`C1 == (C + C0 - 1)/C0`，HW满足对齐要求，即`H*W*sizeof(T)==0`. C0对应`c0_size`, 即`C0 * sizeof(T) == 32`。C0也可以为4。
-    - 支持ConvTile从`NC1HWC0`到`FRACTAL_Z`的变换, 其中`N1 == (N + N0 - 1)/N0`。N0为16。
+A2/A3 实现检查：`sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`，源布局必须是行主序（`TileDataSrc::isRowMajor`），元素大小必须是 `1`、`2` 或 `4` 字节。支持的元素类型按元素宽度划分：4 字节为 `uint32_t`、`int32_t`、`float`；2 字节为 `uint16_t`、`int16_t`、`half`、`bfloat16_t`；1 字节为 `uint8_t`、`int8_t`。转置大小取自 `src.GetValidRow()` / `src.GetValidCol()`。
+
+A5 实现检查：`sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`，并对输入和输出的主维度强制执行 32 字节对齐约束（行主序检查 `Cols * sizeof(T) % 32 == 0`，列主序检查 `Rows * sizeof(T) % 32 == 0`）。支持的元素类型按元素宽度划分：4 字节为 `uint32_t`、`int32_t`、`float`；2 字节为 `uint16_t`、`int16_t`、`half`、`bfloat16_t`；1 字节为 `uint8_t`、`int8_t`。实现在静态 Tile 形状（`TileDataSrc::Rows/Cols`）上运算，不参考 `GetValidRow/GetValidCol`。
+
+临时 Tile：C++ API 需要 `tmp`，但某些实现可能不使用它。
+
+ConvTile：支持在 `TileType::Vec` 上的 ConvTile 格式转换，元素大小必须是 `1`、`2` 或 `4` 字节，元素类型限制为 `uint32_t`、`int32_t`、`float`、`uint16_t`、`int16_t`、`half`、`bfloat16_t`、`uint8_t`、`int8_t`。支持 ConvTile 从 `NCHW` 到 `NC1HWC0` 的变换，其中 `C1 == (C + C0 - 1)/C0`；`H*W*sizeof(T)` 必须满足对齐要求；C0 对应 `c0_size`，即 `C0 * sizeof(T) == 32`，C0 也可以为 4。支持 ConvTile 从 `NC1HWC0` 到 `FRACTAL_Z` 的变换，其中 `N1 == (N + N0 - 1)/N0`，N0 为 16。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 源布局要求 | - | 行主序 | - |
+| 32 字节对齐 | - | - | 是 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -94,7 +104,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -115,29 +125,21 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
 # 自动模式：由编译器/运行时负责资源放置与调度。
 %dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
 # 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
 # pto.tassign %arg0, @tile(0x1000)
 # pto.tassign %arg1, @tile(0x2000)
 %dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
 # AS Level 2 (DPS)
 pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
 ```
+
+## 相关页面
+
+- 指令集总览：[布局与重排](../../layout-and-rearrangement_zh.md)
+
+![TTRANS tile operation](../../../../figures/isa/TTRANS.svg)
diff --git a/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc_zh.md b/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc_zh.md
index ae0b5b42..568e85b4 100644
--- a/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc_zh.md
+++ b/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc_zh.md
@@ -1,36 +1,24 @@
-# TGEMV_ACC
+# pto.tgemv.acc
 
-## 指令示意图
+`pto.tgemv.acc` 属于[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)指令集。
 
-![TGEMV_ACC tile operation](../../../../figures/isa/TGEMV_ACC.svg)
+## 概述
 
-## 简介
+`TGEMV_ACC` 表示"在已有累加器上继续做一轮 GEMV 叠加"。它是 GEMV 的累加形式，对应 `TMATMUL_ACC` 在 `m = 1` 条件下的专门版本。
 
-`TGEMV_ACC` 表示“在已有累加器上继续做一轮 GEMV 叠加”。它是 GEMV 的累加形式，对应 `TMATMUL_ACC` 在 `m = 1` 条件下的专门版本。
-
-## 数学语义
-
-设：
-
-- `K = bMatrix.GetValidRow()`
-- `N = bMatrix.GetValidCol()`
-
-对 `0 <= j < N`：
+设 `K = bMatrix.GetValidRow()`，`N = bMatrix.GetValidCol()`，对 `0 <= j < N`：
 
 $$ \mathrm{C}_{\text{out}, 0,j} = \mathrm{C}_{\text{in}, 0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
 
 ## 机制
 
-这条指令仍然是 cube 路径合同，只是运行时固定 `m = 1`。和 `TMATMUL_ACC` 一样，需要把接口语义和当前实现现状区分开看：
+这条指令仍然是 cube 路径合同，只是运行时固定 `m = 1`。和 `TMATMUL_ACC` 一样，需要把接口语义和当前实现现状区分开看：CPU 模拟器会把 `cInMatrix` 作为显式输入累加器；当前 A2A3 / A5 实现会直接在 `cOutMatrix` 上继续累加，不会先把 `cInMatrix` 拷入 `cOutMatrix`。因此，若你显式传入不同的输入/输出累加器，不应默认所有后端都严格等价。最稳妥的写法，是让输入和输出共享同一块累加器 tile。
 
-- CPU 模拟器会把 `cInMatrix` 作为显式输入累加器；
-- 当前 A2A3 / A5 实现会直接在 `cOutMatrix` 上继续累加，不会先把 `cInMatrix` 拷入 `cOutMatrix`。
+## 语法
 
-因此，若你显式传入不同的输入 / 输出累加器，不应默认所有后端都严格等价。最稳妥的写法，是让输入和输出共享同一块累加器 tile。
+### PTO-AS
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -40,13 +28,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
 ```
 
@@ -64,30 +52,43 @@ PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft
                                WaitEvents &... events);
 ```
 
-## 输入与输出
+## 输入
 
-- `cInMatrix`：输入累加器。
-- `aMatrix`：左操作数 tile，必须是 `Left`。
-- `bMatrix`：右操作数 tile，必须是 `Right`。
-- `cOutMatrix`：输出累加器，必须是 `Acc`。
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `cInMatrix` | `Acc` | 输入累加器 |
+| `aMatrix` | `Left` | 左操作数 tile |
+| `bMatrix` | `Right` | 右操作数 tile |
+| `cOutMatrix` | `Acc` | 输出累加器 |
 
-## 约束
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cOutMatrix` | `Acc` | 在输入累加器值上叠加本次 GEMV 结果 |
 
-### 通用约束
+## 副作用
 
-- `TGEMV` 的角色、shape、dtype 和 target-profile 约束在这里全部成立；
-- 运行时固定 `m = 1`；
+在 `cOutMatrix` 上做原地累加写入。若 `cInMatrix` 与 `cOutMatrix` 指向不同 tile，当前后端行为可能不一致。
+
+## 约束
+
+- `TGEMV` 的角色、shape、dtype 和 target-profile 约束在这里全部成立。
+- 运行时固定 `m = 1`。
 - 对于跨后端的稳妥可移植写法，优先使用共享累加器。
+- A2A3 与 A5 的 dtype、布局和角色限制与 `TGEMV` 相同。
+- 当前 A2A3 / A5 实现不会先把 `cInMatrix` 复制到 `cOutMatrix`，而是直接对 `cOutMatrix` 继续叠加。
 
-### A2A3 与 A5 说明
+## 异常与非法情形
 
-- A2A3 与 A5 的 dtype、布局和角色限制，与 `TGEMV` 相同；
-- 当前 A2A3 / A5 实现不会先把 `cInMatrix` 搬到 `cOutMatrix`，而是直接对 `cOutMatrix` 所指向的累加器继续叠加。
+- 违反 `TGEMV` 的任一合法性约束。
+- 依赖"不同 `cInMatrix` / `cOutMatrix` 在所有后端上都严格等价"的假设。
 
-## 不允许的情形
+## Target-Profile 限制
 
-- 违反 `TGEMV` 的任一合法性约束；
-- 依赖“不同 `cInMatrix` / `cOutMatrix` 在所有后端上都严格等价”的假设。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 累加语义 | 严格按接口语义 | 就地累加 | 就地累加 |
 
 ## 性能与吞吐
 
@@ -97,17 +98,11 @@ PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft
 cycles = 14 + ceil(N/16) * ceil(K / baskK) * repeat_cost
 ```
 
-其中：
-
-- `baskK = 32 / sizeof(left_element_type)`；
-- int8、fp16 bucket 的 `repeat_cost = 1`；
-- fp32 bucket 的 `repeat_cost = 2`。
-
-当前仓库没有公开单列的 A5 latency / throughput 表。
+其中 `baskK = 32 / sizeof(left_element_type)`；int8、fp16 bucket 的 `repeat_cost = 1`；fp32 bucket 的 `repeat_cost = 2`。当前仓库没有公开单列的 A5 latency / throughput 表。
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -147,4 +142,4 @@ void example_accumulate() {
 
 - [TGEMV](./tgemv_zh.md)
 - [TGEMV_BIAS](./tgemv-bias_zh.md)
-- [矩阵与矩阵-向量指令集](../../matrix-and-matrix-vector_zh.md)
+- 指令集总览：[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)
diff --git a/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias_zh.md b/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias_zh.md
index 546770e9..96122e65 100644
--- a/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias_zh.md
+++ b/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias_zh.md
@@ -1,33 +1,24 @@
-# TGEMV_BIAS
+# pto.tgemv.bias
 
-## 指令示意图
+`pto.tgemv.bias` 属于[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)指令集。
 
-![TGEMV_BIAS tile operation](../../../../figures/isa/TGEMV_BIAS.svg)
+## 概述
 
-## 简介
+`TGEMV_BIAS` 表示"先做 GEMV，再并入 bias"。它仍然属于 cube 路径，只是运行时固定 `m = 1`。这条指令的 bias 不是任意 shape 的普通 tile，而是单行 Bias tile，因此它表达的是输出列上的偏置，而不是一般意义上的逐元素加法。
 
-`TGEMV_BIAS` 表示“先做 GEMV，再并入 bias”。它仍然属于 cube 路径，只是运行时固定 `m = 1`。
-
-这条指令的 bias 不是任意 shape 的普通 tile，而是单行 Bias tile。也正因此，它表达的是输出列上的偏置，而不是一般意义上的逐元素加法。
-
-## 数学语义
-
-设：
-
-- `K = bMatrix.GetValidRow()`
-- `N = bMatrix.GetValidCol()`
-
-对 `0 <= j < N`：
+设 `K = bMatrix.GetValidRow()`，`N = bMatrix.GetValidCol()`，对 `0 <= j < N`：
 
 $$ \mathrm{C}_{0,j} = \mathrm{Bias}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
 
 ## 机制
 
-`TGEMV_BIAS` 沿用 `TGEMV` 的 `Left` / `Right` / `Acc` 角色合同，再额外引入一块单行 `Bias` tile。由于输出本身只有一行，这里不需要像 `TMATMUL_BIAS` 那样再解释“按列向多行广播”；bias 的每一项直接对应结果中的一列。
+`TGEMV_BIAS` 沿用 `TGEMV` 的 `Left` / `Right` / `Acc` 角色合同，再额外引入一块单行 `Bias` tile。由于输出本身只有一行，bias 的每一项直接对应结果中的一列，不需要像 `TMATMUL_BIAS` 那样解释"按列向多行广播"。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+### PTO-AS
+
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -37,13 +28,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
@@ -62,41 +53,45 @@ PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &
                                 WaitEvents &... events);
 ```
 
-## 输入与输出
+## 输入
 
-- `aMatrix`：左操作数 tile，必须是 `Left`。
-- `bMatrix`：右操作数 tile，必须是 `Right`。
-- `biasData`：单行 Bias tile。
-- `cMatrix`：结果累加器 tile，必须是 `Acc`。
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | `Left` | 左操作数 tile |
+| `bMatrix` | `Right` | 右操作数 tile |
+| `biasData` | `Bias` | 单行 Bias tile |
+| `cMatrix` | `Acc` | 结果累加器 tile |
 
-## 约束
+## 预期输出
 
-### 通用约束
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` | `Acc` | 输出为一行结果，每列叠加上对应的 bias 值 |
 
-- `TGEMV` 的角色、shape、dtype 和 target-profile 约束在这里全部成立；
-- bias 的数据类型必须与结果累加器 `TileRes::DType` 一致；
-- bias 必须是单行 Bias tile。
-
-### A2A3 约束
+## 副作用
 
-`A2A3` 指 Ascend 910B 与 Ascend 910C。当前实现要求：
+结果写入 `cMatrix` 累加器。
 
-- `TileBias::Loc == TileType::Bias`
-- `TileBias::Rows == 1`
+## 约束
 
-### A5 约束
+- `TGEMV` 的角色、shape、dtype 和 target-profile 约束在这里全部成立。
+- bias 的数据类型必须与结果累加器 `TileRes::DType` 一致。
+- bias 必须是单行 Bias tile。
+- A2A3 要求 `TileBias::Loc == TileType::Bias` 且 `TileBias::Rows == 1`。
+- A5 要求 `TileBias::Loc == TileType::Bias`、`TileBias::Rows == 1` 且 `TileBias::isRowMajor == true`。
 
-`A5` 指 Ascend 950 PR 与 Ascend 950 DT。当前实现要求：
+## 异常与非法情形
 
-- `TileBias::Loc == TileType::Bias`
-- `TileBias::Rows == 1`
-- `TileBias::isRowMajor == true`
+- bias 不是单行。
+- bias 的角色或 dtype 不合法。
+- 违反 `TGEMV` 的任一合法性约束。
 
-## 不允许的情形
+## Target-Profile 限制
 
-- bias 不是单行；
-- bias 的角色或 dtype 不合法；
-- 违反 `TGEMV` 的任一合法性约束。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| bias 支持 | 支持 | 支持 | 支持 |
+| 布局要求 | 无额外要求 | 无额外要求 | bias 必须 row-major |
 
 ## 性能与吞吐
 
@@ -106,17 +101,11 @@ PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &
 cycles = 14 + ceil(N/16) * ceil(K / baskK) * repeat_cost
 ```
 
-其中：
-
-- `baskK = 32 / sizeof(left_element_type)`；
-- int8、fp16 bucket 的 `repeat_cost = 1`；
-- fp32 bucket 的 `repeat_cost = 2`。
-
-当前仓库没有公开单列的 A5 latency / throughput 表。
+其中 `baskK = 32 / sizeof(left_element_type)`；int8、fp16 bucket 的 `repeat_cost = 1`；fp32 bucket 的 `repeat_cost = 2`。当前仓库没有公开单列的 A5 latency / throughput 表。
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -136,7 +125,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -165,4 +154,4 @@ void example_manual() {
 - [TGEMV](./tgemv_zh.md)
 - [TGEMV_ACC](./tgemv-acc_zh.md)
 - [TGEMV_MX](./tgemv-mx_zh.md)
-- [矩阵与矩阵-向量指令集](../../matrix-and-matrix-vector_zh.md)
+- 指令集总览：[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)
diff --git a/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md b/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md
index 3b3f0a62..e065f365 100644
--- a/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md
+++ b/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md
@@ -1,26 +1,24 @@
-# TGEMV_MX
+# pto.tgemv.mx
 
-## 指令示意图
+`pto.tgemv.mx` 属于[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)指令集。
 
-![TGEMV_MX tile operation](../../../../figures/isa/TGEMV_MX.svg)
+## 概述
 
-## 简介
+`TGEMV_MX` 是 MX 路径下的矩阵向量乘版本。它和 `TMATMUL_MX` 共享同一套 scale Tile 思路，只是把乘法域收窄到 GEMV 形态（`m = 1`）。从接口上看，这条指令同样支持普通、累加和 bias 三种变体。GEMV 基础乘法域可以写成：
 
-`TGEMV_MX` 是 MX 路径下的矩阵向量乘版本。它和 `TMATMUL_MX` 共享同一套 scale Tile 思路，只是把乘法域收窄到 GEMV 形态。
-
-从接口上看，这条指令同样支持普通、累加和 bias 三种变体。
+$$ \mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
 
-## 数学语义
+`aScaleMatrix` / `bScaleMatrix` 会参与 MX 路径下的重建/缩放；它们的精确作用由目标定义，而不是由这页单独给出一套独立数值规则。
 
-GEMV 基础乘法域可以写成：
+## 机制
 
-$$ \mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+`TGEMV_MX` 复用与 `TMATMUL_MX` 相同的 scale Tile 机制，只是按 GEMV 语义固定使用单行输出。在 A5 上，`aScaleMatrix` / `bScaleMatrix` 参与 MX 重建/缩放，精确数值语义由目标 backend 定义。CPU 模拟器当前会忽略 `aScaleMatrix` / `bScaleMatrix`，退化为普通 `TGEMV` / `TGEMV_ACC` / `TGEMV_BIAS`。Kirin9030 当前没有 MX 实现路径。
 
-和 `TMATMUL_MX` 一样，`aScaleMatrix` / `bScaleMatrix` 会参与 MX 路径下的重建 / 缩放；它们的精确作用由目标定义，而不是由这页单独给出一套独立数值规则。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 示意形式：
 
@@ -30,13 +28,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%acc : !pto.tile_buf<...>)
 ```
 
@@ -61,31 +59,52 @@ PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale
                               TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | Left | 左操作数 tile |
+| `aScaleMatrix` | LeftScale | 左缩放 tile |
+| `bMatrix` | Right | 右操作数 tile |
+| `bScaleMatrix` | RightScale | 右缩放 tile |
+| `biasData` | Bias | 偏置 tile（仅 bias 变体） |
+| `cMatrix` | Acc | 结果累加器 tile |
 
-### A5 真正支持的 MX 语义
+## 预期输出
 
-- `TGEMV_MX` 复用与 `TMATMUL_MX` 相同的 `CheckMadMxValid(...)` 约束，因此：
-  - 结果累加器必须是 `float`
-  - 输入必须是受支持的 fp4 或 fp8 组合
-  - Left / Right / Acc 的位置与 fractal 方向必须合法
-- 运行时这里不再取 `m = aMatrix.GetValidRow()`，而是固定 GEMV 语义使用单行输出：
-  - `k = aMatrix.GetValidCol()`
-  - `n = bMatrix.GetValidCol()`
-  - `k/n` 均必须落在 `[1, 4095]`
-- Bias 变体要求：
-  - `biasData` 元素类型为 `float`
-  - `biasData` 是单行 `TileType::Bias`
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` | `Acc` | MX 路径下经 scale 重建/缩放后的矩阵-向量乘结果，输出为一行 |
 
-### 其他目标的现状
+## 副作用
 
+结果写入 `cMatrix` 累加器。
+
+## 约束
+
+- `TGEMV_MX` 复用与 `TMATMUL_MX` 相同的 `CheckMadMxValid(...)` 约束：结果累加器必须是 `float`；输入必须是受支持的 fp4 或 fp8 组合；Left / Right / Acc 的位置与 fractal 方向必须合法。
+- 运行时不再取 `m = aMatrix.GetValidRow()`，而是按 GEMV 语义固定使用单行输出：`k = aMatrix.GetValidCol()`、`n = bMatrix.GetValidCol()`，`k`、`n` 均必须落在 `[1, 4095]`。
+- Bias 变体要求 `biasData` 元素类型为 `float` 且为单行 `TileType::Bias`。
 - CPU 模拟器当前会忽略 `aScaleMatrix` / `bScaleMatrix`，退化为普通 `TGEMV` / `TGEMV_ACC` / `TGEMV_BIAS`。
 - Kirin9030 当前没有 MX 实现路径。
 
-因此，这条指令的真实 MX 数值语义目前仍以 A5 backend 为准。
+## 异常与非法情形
+
+- 违反 `CheckMadMxValid(...)` 约束。
+- 在不支持 MX 语义的 target 上使用。
+- k 或 n 超出 `[1, 4095]` 范围。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| MX 语义 | 退化（忽略 scale） | 不支持 | 支持 |
+| fp4/fp8 量化 | 不支持 | 不支持 | 支持 |
 
 ## 示例
 
+### C++ 普通模式
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -107,8 +126,56 @@ void example() {
 }
 ```
 
+### C++ 累加模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_acc() {
+  using A = TileLeft<float8_e5m2_t, 1, 64>;
+  using B = TileRight<float8_e5m2_t, 64, 32>;
+  using ScaleA = TileLeftScale<float8_e8m0_t, 1, 2>;
+  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
+  using C = TileAcc<float, 1, 32>;
+
+  A a;
+  B b;
+  ScaleA scaleA;
+  ScaleB scaleB;
+  C acc;
+  TGEMV_MX(acc, acc, a, scaleA, b, scaleB);
+}
+```
+
+### C++ Bias 模式
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_bias() {
+  using A = TileLeft<float8_e5m2_t, 1, 64>;
+  using B = TileRight<float8_e5m2_t, 64, 32>;
+  using ScaleA = TileLeftScale<float8_e8m0_t, 1, 2>;
+  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
+  using Bias = Tile<TileType::Bias, float, 1, 32>;
+  using C = TileAcc<float, 1, 32>;
+
+  A a;
+  B b;
+  ScaleA scaleA;
+  ScaleB scaleB;
+  Bias bias;
+  C c;
+  TGEMV_MX(c, a, scaleA, b, scaleB, bias);
+}
+```
+
 ## 相关页面
 
 - [TMATMUL_MX](./tmatmul-mx_zh.md)
 - [TGEMV](./tgemv_zh.md)
-- [矩阵与矩阵向量指令集](../../matrix-and-matrix-vector_zh.md)
+- 指令集总览：[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)
diff --git a/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv_zh.md b/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv_zh.md
index 39d8a299..6d027f12 100644
--- a/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv_zh.md
+++ b/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv_zh.md
@@ -1,41 +1,26 @@
-# TGEMV
+# pto.tgemv
 
-## 指令示意图
+`pto.tgemv` 属于[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)指令集。
 
-![TGEMV tile operation](../../../../figures/isa/TGEMV.svg)
+## 概述
 
-## 简介
+`TGEMV` 是 cube 路径上的矩阵-向量乘指令。它不是 vector 指令，而是矩阵乘合同在 `m = 1` 条件下的专门形式：左输入仍走 `Left`，右输入仍走 `Right`，结果仍写入 `Acc`。把 GEMV 单独列成一条指令，是为了让接口、用法和调度语义更直接，不必让读者总是从"一般 matmul 的退化情况"去倒推。
 
-`TGEMV` 是 cube 路径上的矩阵-向量乘指令。它不是 vector 指令，而是矩阵乘合同在 `m = 1` 条件下的专门形式：左输入仍走 `Left`，右输入仍走 `Right`，结果仍写入 `Acc`。
-
-把 GEMV 单独列成一条指令，是为了让接口、用法和调度语义更直接，不必让读者总是从“一般 matmul 的退化情况”去倒推。
-
-## 数学语义
-
-设：
-
-- `K = bMatrix.GetValidRow()`
-- `N = bMatrix.GetValidCol()`
-
-对 `0 <= j < N`：
+设 `K = bMatrix.GetValidRow()`，`N = bMatrix.GetValidCol()`，对 `0 <= j < N`：
 
 $$ \mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
 
-这里输出只有一行，因此 `TGEMV` 可以理解为矩阵乘一条向量，但它仍然遵守 cube 路径的角色和布局约束。
+输出只有一行，`TGEMV` 可以理解为矩阵乘一条向量，但它仍然遵守 cube 路径的角色和布局约束。
 
 ## 机制
 
-`TGEMV` 仍然使用：
-
-- `Left` 作为左操作数，对应 L0A 路径；
-- `Right` 作为右操作数，对应 L0B 路径；
-- `Acc` 作为输出累加器。
+`TGEMV` 使用 `Left` 作为左操作数（对应 L0A 路径）、`Right` 作为右操作数（对应 L0B 路径）、`Acc` 作为输出累加器。和 `TMATMUL` 的主要区别不在"是不是 cube 指令"，而在运行时合同里固定了 `m = 1`。因此它的 costmodel、角色限制和 target 边界都更接近 matmul，而不是向量算术。
 
-和 `TMATMUL` 的主要区别，不在“是不是 cube 指令”，而在运行时合同里固定了 `m = 1`。因此它的 costmodel、角色限制和 target 边界都更接近 matmul，而不是向量算术。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -45,13 +30,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tgemv ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
@@ -64,74 +49,62 @@ template <typename TileRes, typename TileLeft, typename TileRight, typename... W
 PTO_INST RecordEvent TGEMV(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
 ```
 
-## 输入与输出
+## 输入
 
-- `aMatrix`：左操作数 tile，必须是 `Left`。
-- `bMatrix`：右操作数 tile，必须是 `Right`。
-- `cMatrix`：结果累加器 tile，必须是 `Acc`。
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | Left | 左操作数 tile |
+| `bMatrix` | Right | 右操作数 tile |
+| `cMatrix` | Acc | 结果累加器 tile |
 
-输出合同是：生成一行结果 `C[0, j]`。这条指令不会把普通 vector buffer 直接提升成 cube 合同。
+## 预期输出
 
-## 约束
-
-### 通用约束
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` | `Acc` | 生成一行结果 `C[0, j]`；这条指令不会把普通 vector buffer 直接提升成 cube 合同 |
 
-- 静态 shape 必须满足：
-  - `TileLeft::Rows == TileRes::Rows`
-  - `TileLeft::Cols == TileRight::Rows`
-  - `TileRight::Cols == TileRes::Cols`
-- tile 角色必须满足：
-  - `TileLeft::Loc == Left`
-  - `TileRight::Loc == Right`
-  - `TileRes::Loc == Acc`
-- 运行时要求：
-  - `m = 1`
-  - `k`、`n` 位于 `[1, 4095]`
+## 副作用
 
-### A2A3 约束
+结果写入 `cMatrix` 累加器。
 
-`A2A3` 指 Ascend 910B 与 Ascend 910C。当前仓内实现公开支持的 `(CType, AType, BType)` 组合包括：
-
-- `(int32_t, int8_t, int8_t)`
-- `(float, half, half)`
-- `(float, float, float)`
-- `(float, bfloat16_t, bfloat16_t)`
+## 约束
 
-### A5 约束
+- 静态 shape 必须满足 `TileLeft::Rows == TileRes::Rows`、`TileLeft::Cols == TileRight::Rows` 且 `TileRight::Cols == TileRes::Cols`。
+- tile 角色必须满足 `TileLeft::Loc == Left`、`TileRight::Loc == Right` 且 `TileRes::Loc == Acc`。
+- 运行时要求 `m = 1`，`k`、`n` 位于 `[1, 4095]`。
+- A2A3 支持的 `(CType, AType, BType)` 组合包括 `(int32_t, int8_t, int8_t)`、`(float, half, half)`、`(float, float, float)` 和 `(float, bfloat16_t, bfloat16_t)`。
+- A5 要求累加器类型必须是 `int32_t` 或 `float`；若累加器为 `int32_t`，左右输入都必须是 `int8_t`；若累加器为 `float`，当前实现支持 `half`、`bfloat16_t`、`float` 和部分 fp8 输入对；A5 的 `Right` 角色有独立布局/fractal 约束，不能拿 A2A3 的右操作数布局直接套用。
 
-`A5` 指 Ascend 950 PR 与 Ascend 950 DT。当前实现要求：
+## 异常与非法情形
 
-- 累加器类型必须是 `int32_t` 或 `float`；
-- 若累加器为 `int32_t`，左右输入都必须是 `int8_t`；
-- 若累加器为 `float`，当前实现支持 `half`、`bfloat16_t`、`float` 和部分 fp8 输入对；
-- A5 的 `Right` 角色有独立布局 / fractal 约束，不能拿 A2A3 的右操作数布局直接套用。
+- `m != 1`。
+- 角色不是 `Left` / `Right` / `Acc`。
+- 形状不满足 GEMV 兼容关系。
+- 使用当前 target 不支持的 dtype 组合。
 
-## 不允许的情形
+## Target-Profile 限制
 
-- `m != 1`；
-- 角色不是 `Left` / `Right` / `Acc`；
-- 形状不满足 GEMV 兼容关系；
-- 在不支持的 target 上使用不支持的 dtype 组合。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| int8 matmul | 支持 | 支持 | 支持 |
+| fp16 matmul | 支持 | 支持 | 支持 |
+| fp32 matmul | 支持 | 支持 | 支持 |
+| bf16 matmul | 支持 | 支持 | 支持 |
+| MX 量化路径 | 退化 | 不支持 | 部分支持 |
 
 ## 性能与吞吐
 
-仓内 A2A3 costmodel 对 `TGEMV` 与 `TMATMUL` 共用 `mad/mmad` 模型，只是 GEMV 固定 `m = 1`，因此公式可直接写成：
+仓内 A2A3 costmodel 对 `TGEMV` 与 `TMATMUL` 共用 `mad/mmad` 模型，只是 GEMV 固定 `m = 1`，公式为：
 
 ```text
 cycles = 14 + ceil(N/16) * ceil(K / baskK) * repeat_cost
 ```
 
-其中：
-
-- `baskK = 32 / sizeof(left_element_type)`；
-- int8、fp16 bucket 的 `repeat_cost = 1`；
-- fp32 bucket 的 `repeat_cost = 2`。
-
-当前仓库没有公开单列的 A5 latency / throughput 表。
+其中 `baskK = 32 / sizeof(left_element_type)`；int8、fp16 bucket 的 `repeat_cost = 1`；fp32 bucket 的 `repeat_cost = 2`。当前仓库没有公开单列的 A5 latency / throughput 表。
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -149,7 +122,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -175,4 +148,4 @@ void example_manual() {
 - [TGEMV_ACC](./tgemv-acc_zh.md)
 - [TGEMV_BIAS](./tgemv-bias_zh.md)
 - [TGEMV_MX](./tgemv-mx_zh.md)
-- [矩阵与矩阵-向量指令集](../../matrix-and-matrix-vector_zh.md)
+- 指令集总览：[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)
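The A2A3 cost formula quoted on the `TGEMV` page can be evaluated directly. The sketch below is a hypothetical standalone helper, not part of the PTO costmodel sources: the names `ceil_div`, `mad_cycles` and `gemv_cycles` are assumptions, the expression `32 / left_elem_bytes` stands for the `baskK` term in the formula, and `repeat_cost` is supplied by the caller (1 for the int8 and fp16 buckets, 2 for the fp32 bucket) rather than derived from a dtype.

```cpp
#include <cstdint>

// Integer ceiling division; both operands are assumed to be positive.
constexpr std::uint64_t ceil_div(std::uint64_t a, std::uint64_t b) {
  return (a + b - 1) / b;
}

// Hypothetical helper for the documented A2A3 mad/mmad formula:
//   cycles = 14 + ceil(M/16) * ceil(N/16) * ceil(K / baskK) * repeat_cost
// where baskK = 32 / sizeof(left_element_type).
constexpr std::uint64_t mad_cycles(std::uint64_t m, std::uint64_t n,
                                   std::uint64_t k,
                                   std::uint64_t left_elem_bytes,
                                   std::uint64_t repeat_cost) {
  return 14 + ceil_div(m, 16) * ceil_div(n, 16) *
                  ceil_div(k, 32 / left_elem_bytes) * repeat_cost;
}

// GEMV fixes m = 1, so the ceil(M/16) factor collapses to 1.
constexpr std::uint64_t gemv_cycles(std::uint64_t n, std::uint64_t k,
                                    std::uint64_t left_elem_bytes,
                                    std::uint64_t repeat_cost) {
  return mad_cycles(1, n, k, left_elem_bytes, repeat_cost);
}

// The three costmodel samples listed on the TMATMUL page.
static_assert(mad_cycles(40, 60, 50, 2, 1) == 62, "half 40x50 * 50x60");
static_assert(mad_cycles(6, 8, 7, 1, 1) == 15, "int8 6x7 * 7x8");
static_assert(mad_cycles(120, 50, 110, 4, 2) == 910, "float 120x110 * 110x50");
```

For a half-precision GEMV with `K = 64` and `N = 32`, `gemv_cycles(32, 64, 2, 1)` evaluates to `14 + 2 * 4 = 22` under this model.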
diff --git a/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md b/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md
index 19df0692..13073902 100644
--- a/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md
+++ b/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md
@@ -1,41 +1,24 @@
-# TMATMUL_ACC
+# pto.tmatmul.acc
 
-## 指令示意图
+`pto.tmatmul.acc` 属于[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)指令集。
 
-![TMATMUL_ACC tile operation](../../../../figures/isa/TMATMUL_ACC.svg)
+## 概述
 
-## 简介
+`TMATMUL_ACC` 表示"在已有累加器上继续做一次矩阵乘叠加"。它不是 `TMATMUL` 的语法别名，而是 K 维分块 GEMM 中真正需要的累加形式。如果 `TMATMUL` 负责生成一个新块，那么 `TMATMUL_ACC` 负责把后续块继续叠到这个累加器上。这就是两条指令必须并列存在的原因。
 
-`TMATMUL_ACC` 表示“在已有累加器上继续做一次矩阵乘叠加”。它不是 `TMATMUL` 的语法别名，而是 K 维分块 GEMM 中真正需要的累加形式。
-
-如果 `TMATMUL` 负责生成一个新块，那么 `TMATMUL_ACC` 负责把后续块继续叠到这个累加器上。这就是两条指令必须并列存在的原因。
-
-## 数学语义
-
-设：
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-则：
+设 `M = aMatrix.GetValidRow()`，`K = aMatrix.GetValidCol()`，`N = bMatrix.GetValidCol()`，则：
 
 $$ \mathrm{C}_{\text{out}, i,j} = \mathrm{C}_{\text{in}, i,j} + \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
 
 ## 机制
 
-这条指令仍然使用 `Left` / `Right` / `Acc` 的 cube 路径合同。和 `TMATMUL` 相比，唯一核心差别是它把累加器视为已有值，而不是从零开始的新结果。
+这条指令仍然使用 `Left` / `Right` / `Acc` 的 cube 路径合同。和 `TMATMUL` 相比，唯一核心差别是它把累加器视为已有值，而不是从零开始的新结果。需要特别说明的是：接口把 `cInMatrix` 和 `cOutMatrix` 分开，是为了表达"显式输入累加器"和"显式输出累加器"这层语义。但当前仓内实现并不完全一致：CPU 模拟器会按接口字面语义使用 `cInMatrix` 作为输入累加器；当前 A2A3 / A5 后端实现会直接在 `cOutMatrix` 上继续累加，不会先把 `cInMatrix` 拷入 `cOutMatrix`。因此，如果显式传入两个不同的 tile，不应假定 CPU 与 NPU 的当前行为完全一致。最稳妥的可移植写法，是使用共享累加器重载，让输入和输出指向同一块 `Acc` tile。
 
-需要特别说明的是：接口把 `cInMatrix` 和 `cOutMatrix` 分开，是为了表达“显式输入累加器”和“显式输出累加器”这层语义。但当前仓内实现并不完全一致：
+## 语法
 
-- CPU 模拟器会按接口字面语义使用 `cInMatrix` 作为输入累加器；
-- 当前 A2A3 / A5 后端实现会直接在 `cOutMatrix` 上继续累加，不会先把 `cInMatrix` 拷入 `cOutMatrix`。
+### PTO-AS
 
-因此，如果你显式传入两个不同的 tile，对 CPU 与 NPU 的当前行为不应想当然地当作完全一致。最稳妥的可移植写法，是使用共享累加器重载，让输入和输出指向同一块 `Acc` tile。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -45,13 +28,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
 ```
 
@@ -75,34 +58,47 @@ PTO_INST RecordEvent TMATMUL_ACC(TileRes &cMatrix, TileLeft &aMatrix, TileRight
 
 最后一个重载是最稳妥的共享累加器写法。
 
-## 输入与输出
+## 输入
 
-- `cInMatrix`：输入累加器。
-- `aMatrix`：左操作数 tile，必须是 `Left`。
-- `bMatrix`：右操作数 tile，必须是 `Right`。
-- `cOutMatrix`：输出累加器，必须是 `Acc`。
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `cInMatrix` | Acc | 输入累加器 |
+| `aMatrix` | Left | 左操作数 tile |
+| `bMatrix` | Right | 右操作数 tile |
+| `cOutMatrix` | Acc | 输出累加器 |
 
 若使用共享累加器重载，则同一块 `Acc` tile 同时承担输入和输出。
 
-## 约束
+## 预期输出
 
-### 通用约束
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cOutMatrix` | `Acc` | 在输入累加器值上叠加本次矩阵乘结果 |
 
-- `TMATMUL` 的 shape、角色、dtype 和 target-profile 约束在这里全部成立；
-- `m`、`k`、`n` 仍取自 `aMatrix.GetValidRow()`、`aMatrix.GetValidCol()` 和 `bMatrix.GetValidCol()`；
-- 若追求跨 CPU / NPU 的稳妥可移植性，应优先使用共享累加器重载。
+## 副作用
 
-### A2A3 与 A5 说明
+结果写入 `cOutMatrix`。若 `cInMatrix` 与 `cOutMatrix` 指向不同 tile，CPU 模拟器以 `cInMatrix` 为输入累加器，而 A2A3 / A5 后端直接在 `cOutMatrix` 上累加，两者的当前行为并不一致。
 
-- A2A3 与 A5 的 dtype、布局和角色限制，与 `TMATMUL` 相同；
+## 约束
+
+- `TMATMUL` 的 shape、角色、dtype 和 target-profile 约束在这里全部成立。
+- `m`、`k`、`n` 仍取自 `aMatrix.GetValidRow()`、`aMatrix.GetValidCol()` 和 `bMatrix.GetValidCol()`。
+- 若追求跨 CPU / NPU 的稳妥可移植性，应优先使用共享累加器重载。
+- A2A3 与 A5 的 dtype、布局和角色限制与 `TMATMUL` 相同。
 - 当前 A2A3 / A5 后端的实现路径不会先把 `cInMatrix` 复制到 `cOutMatrix`，而是直接把 `cOutMatrix` 交给底层累加路径。
 
-## 不允许的情形
+## 异常与非法情形
 
-- 违反 `TMATMUL` 的任一合法性约束；
-- 依赖“不同 `cInMatrix` / `cOutMatrix` 在所有后端上都严格等价”的假设；
+- 违反 `TMATMUL` 的任一合法性约束。
+- 依赖"不同 `cInMatrix` / `cOutMatrix` 在所有后端上都严格等价"的假设。
 - 把实现现状误写成更强的架构合同。
 
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 累加语义 | 严格按接口语义 | 就地累加 | 就地累加 |
+
 ## 性能与吞吐
 
 当前仓内 A2A3 costmodel 对 `TMATMUL_ACC` 与 `TMATMUL` 使用同一条 `mad/mmad` 公式：
@@ -111,17 +107,11 @@ PTO_INST RecordEvent TMATMUL_ACC(TileRes &cMatrix, TileLeft &aMatrix, TileRight
 cycles = 14 + ceil(M/16) * ceil(N/16) * ceil(K / baskK) * repeat_cost
 ```
 
-其中：
-
-- `baskK = 32 / sizeof(left_element_type)`；
-- int8、fp16 bucket 的 `repeat_cost = 1`；
-- fp32 bucket 的 `repeat_cost = 2`。
-
-当前仓库没有单列的 A5 latency / throughput 表；A5 仍只公开合法性和 dtype / layout 边界。
+其中 `baskK = 32 / sizeof(left_element_type)`；int8、fp16 bucket 的 `repeat_cost = 1`；fp32 bucket 的 `repeat_cost = 2`。当前仓库没有单列的 A5 latency / throughput 表。
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -161,4 +151,5 @@ void example_accumulate() {
 
 - [TMATMUL](./tmatmul_zh.md)
 - [TMATMUL_BIAS](./tmatmul-bias_zh.md)
-- [矩阵与矩阵-向量指令集](../../matrix-and-matrix-vector_zh.md)
+- [TMATMUL_MX](./tmatmul-mx_zh.md)
+- 指令集总览：[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)
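The K-blocked pattern that motivates `TMATMUL_ACC` can be checked against a scalar reference. The sketch below is plain C++ and does not call the PTO API: `Mat`, `matmul_acc_ref` and `blocked_matmul_ref` are hypothetical names, matrices are assumed to be row-major in `std::vector<float>`, and a single shared accumulator is used, which is the form the page recommends for portability.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Mat = std::vector<float>;  // row-major storage (assumption)

// Hypothetical scalar reference for C_out = C_in + A * B with a shared
// accumulator: `c` is read as C_in and overwritten with C_out.
// a is M x K, b is K x N, c is M x N.
void matmul_acc_ref(Mat& c, const Mat& a, const Mat& b,
                    std::size_t M, std::size_t K, std::size_t N) {
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float sum = c[i * N + j];
      for (std::size_t k = 0; k < K; ++k) sum += a[i * K + k] * b[k * N + j];
      c[i * N + j] = sum;
    }
  }
}

// Split K into blocks of at most `kb` (kb > 0) and accumulate block by block.
// Starting from a zeroed accumulator stands in for the first TMATMUL;
// every later block corresponds to one TMATMUL_ACC step.
Mat blocked_matmul_ref(const Mat& a, const Mat& b,
                       std::size_t M, std::size_t K, std::size_t N,
                       std::size_t kb) {
  Mat c(M * N, 0.0f);
  for (std::size_t k0 = 0; k0 < K; k0 += kb) {
    const std::size_t kc = std::min(kb, K - k0);
    Mat a_blk(M * kc), b_blk(kc * N);
    // Copy the K-slice [k0, k0 + kc) out of A (columns) and B (rows).
    for (std::size_t i = 0; i < M; ++i)
      for (std::size_t k = 0; k < kc; ++k) a_blk[i * kc + k] = a[i * K + k0 + k];
    for (std::size_t k = 0; k < kc; ++k)
      for (std::size_t j = 0; j < N; ++j) b_blk[k * N + j] = b[(k0 + k) * N + j];
    matmul_acc_ref(c, a_blk, b_blk, M, kc, N);
  }
  return c;
}
```

With integer-valued inputs the blocked and unblocked results match exactly. With general floating-point data the summation order differs between block sizes, so a tolerance-based comparison is the safer check.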
diff --git a/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md b/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md
index 091f6379..d902da87 100644
--- a/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md
+++ b/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md
@@ -1,24 +1,12 @@
-# TMATMUL_BIAS
+# pto.tmatmul.bias
 
-## 指令示意图
+`pto.tmatmul.bias` 属于[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)指令集。
 
-![TMATMUL_BIAS tile operation](../../../../figures/isa/TMATMUL_BIAS.svg)
+## 概述
 
-## 简介
+`TMATMUL_BIAS` 表示"矩阵乘法后立即并入列偏置"。它表达的是矩阵乘积再加一行 bias，而不是另一种不同的乘法。把 bias 作为这条指令的显式输入有两个好处：一是合同清楚，二是文档不必把"先做 matmul、再做逐元素加"误写成完全等价的抽象。
 
-`TMATMUL_BIAS` 表示“矩阵乘法后立即并入列偏置”。它表达的是矩阵乘积再加一行 bias，而不是另一种不同的乘法。
-
-把 bias 作为这条指令的显式输入，有两个好处：一是合同清楚，二是文档不必把“先做 matmul、再做逐元素加”误写成完全等价的抽象。
-
-## 数学语义
-
-设：
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-对 `0 <= i < M`、`0 <= j < N`：
+设 `M = aMatrix.GetValidRow()`，`K = aMatrix.GetValidCol()`，`N = bMatrix.GetValidCol()`，对 `0 <= i < M`、`0 <= j < N`：
 
 $$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} + \mathrm{Bias}_{0,j} $$
 
@@ -26,13 +14,13 @@ Bias tile 只有一行，因此它按输出列广播。
 
 ## 机制
 
-`TMATMUL_BIAS` 仍然走 `Left` / `Right` / `Acc` 的 cube 路径，只是在乘积生成后再引入一块 `Bias` tile。
+`TMATMUL_BIAS` 仍然走 `Left` / `Right` / `Acc` 的 cube 路径，只是在乘积生成后再引入一块 `Bias` tile。这条指令要求 bias 是"单行偏置 tile"，而不是任意 shape 的普通 tile。也正因为如此，它表达的是列偏置，而不是一般意义上的逐元素加法。
 
-这条指令要求 bias 是“单行偏置 tile”，而不是任意 shape 的普通 tile。也正因为如此，它表达的是列偏置，而不是一般意义上的逐元素加法。
+## 语法
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -42,13 +30,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
@@ -67,44 +55,46 @@ PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight
                                   WaitEvents &... events);
 ```
 
-## 输入与输出
+## 输入
 
-- `aMatrix`：左操作数 tile，必须是 `Left`。
-- `bMatrix`：右操作数 tile，必须是 `Right`。
-- `biasData`：偏置 tile，必须是 `Bias`，且为单行。
-- `cMatrix`：结果累加器 tile，必须是 `Acc`。
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | Left | 左操作数 tile |
+| `bMatrix` | Right | 右操作数 tile |
+| `biasData` | Bias | 偏置 tile，必须为单行 |
+| `cMatrix` | Acc | 结果累加器 tile |
 
-输出合同是：先得到矩阵乘积，再把 `bias[0, j]` 加到每个输出列 `j` 上。
+## 预期输出
 
-## 约束
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` | `Acc` | 先得到矩阵乘积，再把 `bias[0, j]` 加到每个输出列 `j` 上 |
 
-### 通用约束
+## 副作用
 
-- `TMATMUL` 的所有 shape、角色、dtype 和 target-profile 约束在这里同样成立；
-- `biasData` 的数据类型必须与结果累加器 `TileRes::DType` 一致；
-- `biasData` 必须是单行 Bias tile。
+结果写入 `cMatrix` 累加器。
 
-### A2A3 约束
-
-`A2A3` 指 Ascend 910B 与 Ascend 910C。当前实现要求：
-
-- `TileBias::Loc == TileType::Bias`
-- `TileBias::Rows == 1`
+## 约束
 
-### A5 约束
+- `TMATMUL` 的所有 shape、角色、dtype 和 target-profile 约束在这里同样成立。
+- `biasData` 的数据类型必须与结果累加器 `TileRes::DType` 一致。
+- `biasData` 必须是单行 Bias tile。
+- A2A3 要求 `TileBias::Loc == TileType::Bias` 且 `TileBias::Rows == 1`。
+- A5 要求 `TileBias::Loc == TileType::Bias`、`TileBias::Rows == 1` 且 `TileBias::isRowMajor == true`。
 
-`A5` 指 Ascend 950 PR 与 Ascend 950 DT。当前实现要求：
+## 异常与非法情形
 
-- `TileBias::Loc == TileType::Bias`
-- `TileBias::Rows == 1`
-- `TileBias::isRowMajor == true`
+- 用普通 tile 代替 Bias tile。
+- bias 不是单行。
+- bias dtype 与结果累加器 dtype 不一致。
+- 违反 `TMATMUL` 的任一合法性约束。
 
-## 不允许的情形
+## Target-Profile 限制
 
-- 用普通 tile 代替 Bias tile；
-- bias 不是单行；
-- bias dtype 与结果累加器 dtype 不一致；
-- 违反 `TMATMUL` 的任一合法性约束。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| bias 支持 | 支持 | 支持 | 支持 |
+| 布局要求 | 无额外要求 | 无额外要求 | bias 必须 row-major |
 
 ## 性能与吞吐
 
@@ -114,17 +104,11 @@ PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight
 cycles = 14 + ceil(M/16) * ceil(N/16) * ceil(K / baskK) * repeat_cost
 ```
 
-其中：
-
-- `baskK = 32 / sizeof(left_element_type)`；
-- int8、fp16 bucket 的 `repeat_cost = 1`；
-- fp32 bucket 的 `repeat_cost = 2`。
-
-当前仓库没有单列的 A5 latency / throughput 表。
+其中 `baskK = 32 / sizeof(left_element_type)`；int8、fp16 bucket 的 `repeat_cost = 1`；fp32 bucket 的 `repeat_cost = 2`。当前仓库没有单列的 A5 latency / throughput 表。
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -144,7 +128,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -173,4 +157,4 @@ void example_manual() {
 - [TMATMUL](./tmatmul_zh.md)
 - [TMATMUL_ACC](./tmatmul-acc_zh.md)
 - [TMATMUL_MX](./tmatmul-mx_zh.md)
-- [矩阵与矩阵-向量指令集](../../matrix-and-matrix-vector_zh.md)
+- 指令集总览：[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)
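The column-bias broadcast in the `TMATMUL_BIAS` formula can be written as a scalar reference for checking results. The sketch below is plain C++, not PTO code: `matmul_bias_ref` and the row-major `std::vector<float>` layout are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar reference for
//   C[i][j] = sum_k A[i][k] * B[k][j] + Bias[0][j]
// a is M x K, b is K x N, bias holds the single bias row (N values).
std::vector<float> matmul_bias_ref(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   const std::vector<float>& bias,
                                   std::size_t M, std::size_t K,
                                   std::size_t N) {
  std::vector<float> c(M * N, 0.0f);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < K; ++k) sum += a[i * K + k] * b[k * N + j];
      // The same bias value is added to every row of output column j.
      c[i * N + j] = sum + bias[j];
    }
  }
  return c;
}
```

Calling the same function with `M = 1` gives the single-row case described for `TGEMV_BIAS`.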
diff --git a/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md b/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md
index 6759bc42..2bd5d3c3 100644
--- a/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md
+++ b/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md
@@ -1,32 +1,26 @@
-# TMATMUL_MX
+# pto.tmatmul.mx
 
-## 指令示意图
+`pto.tmatmul.mx` 属于[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)指令集。
 
-![TMATMUL_MX tile operation](../../../../figures/isa/TMATMUL_MX.svg)
+## 概述
 
-## 简介
+`TMATMUL_MX` 是带双 scale Tile 的矩阵乘法扩展，用来表达 MX 路径下的混合精度/量化 GEMM。它和普通 `TMATMUL` 共享 Left / Right / Acc 的整体结构，但额外携带 `aScaleMatrix` 和 `bScaleMatrix`。这条指令存在的意义，是把"矩阵乘法本体"和"MX 重建/缩放参数"同时放进一条架构可见指令里，而不是靠外部约定隐式拼接。
 
-`TMATMUL_MX` 是带双 scale Tile 的矩阵乘法扩展，用来表达 MX 路径下的混合精度 / 量化 GEMM。它和普通 `TMATMUL` 共享 Left / Right / Acc 的整体结构，但额外携带 `aScaleMatrix` 和 `bScaleMatrix`。
+设 `M = aMatrix.GetValidRow()`，`K = aMatrix.GetValidCol()`，`N = bMatrix.GetValidCol()`，基础乘法域仍然是：
 
-这条指令存在的意义，是把“矩阵乘法本体”和“MX 重建/缩放参数”同时放进一条架构可见指令里，而不是靠外部约定隐式拼接。
-
-## 数学语义
+$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
 
-设：
+区别在于，`aScaleMatrix` 和 `bScaleMatrix` 会参与 MX 路径下的重建/缩放。它们如何参与，不是通用 PTO 规则，而是目标定义的 MX 语义。
 
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
+## 机制
 
-基础乘法域仍然是：
+当前仓库里，只有 A5 backend 真正实现了 MX 语义。CPU 模拟器会接受 `TMATMUL_MX` 接口，但当前实现会忽略 `aScaleMatrix` / `bScaleMatrix`，直接退化为普通 `TMATMUL` / `TMATMUL_ACC` / `TMATMUL_BIAS`。Kirin9030 当前明确不支持 `TMATMUL_MX`，对应实现直接 `static_assert` 失败。因此，如果要验证真正的 MX 语义，应以 A5 为准；CPU 只能用来跑接口形态或近似流程，不适合作为 MX 数值语义的最终依据。
 
-$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
+## 语法
 
-区别在于，`aScaleMatrix` 和 `bScaleMatrix` 会参与 MX 路径下的重建 / 缩放。它们如何参与，不是通用 PTO 规则，而是目标定义的 MX 语义。
+### PTO-AS
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 示意形式：
 
@@ -38,7 +32,7 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 %c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 %c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
@@ -46,7 +40,7 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
 pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
@@ -75,53 +69,58 @@ PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftSca
 
 `AccPhase` 的模板重载与普通 `TMATMUL` 一样，主要用于目标实现侧的 unit-flag 选择。
 
-## 约束
-
-### A5 真正支持的 MX 语义
+## 输入
 
-- 当前仓库里，只有 A5 backend 真正实现了 MX 语义。
-- A5 的 `CheckMadMxValid(...)` 要求：
-  - 结果累加器类型必须是 `float`
-  - 输入必须是受支持的 fp4 或 fp8 组合
-  - `TileLeft::Cols` 必须是 `64` 的倍数
-  - 若走 fp4 路径，`TileLeft::Cols` 还必须是偶数
-  - Left / Right / Acc 的位置与 fractal 方向必须符合 cube 路径要求
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | Left | 左操作数 tile |
+| `aScaleMatrix` | LeftScale | 左缩放 tile |
+| `bMatrix` | Right | 右操作数 tile |
+| `bScaleMatrix` | RightScale | 右缩放 tile |
+| `biasData` | Bias | 偏置 tile（仅 bias 变体） |
+| `cMatrix` | Acc | 结果累加器 tile |
 
-支持的输入组合包括：
+## 预期输出
 
-- fp4：
-  - `float4_e1m2x2_t` / `float4_e1m2x2_t`
-  - `float4_e1m2x2_t` / `float4_e2m1x2_t`
-  - `float4_e2m1x2_t` / `float4_e2m1x2_t`
-  - `float4_e2m1x2_t` / `float4_e1m2x2_t`
-- fp8：
-  - `float8_e4m3_t` / `float8_e4m3_t`
-  - `float8_e4m3_t` / `float8_e5m2_t`
-  - `float8_e5m2_t` / `float8_e4m3_t`
-  - `float8_e5m2_t` / `float8_e5m2_t`
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` | `Acc` | MX 路径下经 scale 重建/缩放后的矩阵乘结果 |
 
-Bias 变体还要求：
+## 副作用
 
-- `biasData` 的元素类型必须是 `float`
-- `biasData` 必须是单行 `TileType::Bias`
+结果写入 `cMatrix` 累加器。
 
-### 运行时范围
+## 约束
 
+- A5 的 `CheckMadMxValid(...)` 要求：结果累加器类型必须是 `float`；输入必须是受支持的 fp4 或 fp8 组合；`TileLeft::Cols` 必须是 `64` 的倍数；若走 fp4 路径，`TileLeft::Cols` 还必须是偶数；Left / Right / Acc 的位置与 fractal 方向必须符合 cube 路径要求。
+- 支持的 fp4 组合：`float4_e1m2x2_t / float4_e1m2x2_t`、`float4_e1m2x2_t / float4_e2m1x2_t`、`float4_e2m1x2_t / float4_e2m1x2_t`、`float4_e2m1x2_t / float4_e1m2x2_t`。
+- 支持的 fp8 组合：`float8_e4m3_t / float8_e4m3_t`、`float8_e4m3_t / float8_e5m2_t`、`float8_e5m2_t / float8_e4m3_t`、`float8_e5m2_t / float8_e5m2_t`。
+- Bias 变体还要求 `biasData` 的元素类型必须是 `float` 且为单行 `TileType::Bias`。
 - A5 的 `m/k/n` 均必须落在 `[1, 4095]`。
+- CPU 模拟器会接受接口但忽略 scale，退化为普通 `TMATMUL` / `TMATMUL_ACC` / `TMATMUL_BIAS`。
+- Kirin9030 明确不支持 `TMATMUL_MX`，对应实现直接 `static_assert` 失败。
 
-### 其他目标的现状
+## 异常与非法情形
 
-- CPU 模拟器会接受 `TMATMUL_MX` 接口，但当前实现会忽略 `aScaleMatrix` / `bScaleMatrix`，直接退化为普通 `TMATMUL` / `TMATMUL_ACC` / `TMATMUL_BIAS`。
-- Kirin9030 当前明确不支持 `TMATMUL_MX`，对应实现直接 `static_assert` 失败。
+- 违反 `CheckMadMxValid(...)` 约束。
+- 使用不支持的 fp4/fp8 组合。
+- `TileLeft::Cols` 不是 `64` 的倍数（fp4 路径下还须为偶数）。
+- 在不支持 MX 语义的 target（Kirin9030）上使用。
+- m、k 或 n 超出 `[1, 4095]` 范围。
 
-这意味着：
+## Target-Profile 限制
 
-- 如果你要验证真正的 MX 语义，应以 A5 为准。
-- CPU 只能用来跑接口形态或近似流程，不适合作为 MX 数值语义的最终依据。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| MX 语义 | 退化（忽略 scale） | 不支持 | 支持 |
+| fp4 量化 | 不支持 | 不支持 | 支持 |
+| fp8 量化 | 不支持 | 不支持 | 支持 |
+| MX bias | 不支持 | 不支持 | 支持 |
+| Kirin9030 | 不支持 | 不支持 | 不支持 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式（普通）
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -145,7 +144,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -179,4 +178,6 @@ void example_manual() {
 
 - [TGEMV_MX](./tgemv-mx_zh.md)
 - [TMATMUL](./tmatmul_zh.md)
-- [矩阵与矩阵向量指令集](../../matrix-and-matrix-vector_zh.md)
+- [TMATMUL_ACC](./tmatmul-acc_zh.md)
+- [TMATMUL_BIAS](./tmatmul-bias_zh.md)
+- 指令集总览：[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)
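The numeric A5 constraints listed for `TMATMUL_MX` can be collected into a small pre-check. This is a hypothetical helper, not the repository's `CheckMadMxValid(...)`: it covers only the `[1, 4095]` range for `m`/`k`/`n` and the multiple-of-64 rule for `TileLeft::Cols`, and it leaves dtype pairs, tile roles and fractal layout unchecked.

```cpp
#include <cstdint>

// Range documented for m, k and n on A5.
constexpr bool in_mad_range(std::uint32_t v) { return v >= 1 && v <= 4095; }

// Hypothetical partial pre-check for the numeric MX constraints only.
// `left_cols` plays the role of the static TileLeft::Cols.
constexpr bool mx_shape_ok(std::uint32_t m, std::uint32_t k, std::uint32_t n,
                           std::uint32_t left_cols) {
  return in_mad_range(m) && in_mad_range(k) && in_mad_range(n) &&
         left_cols != 0 && left_cols % 64 == 0;
}

static_assert(mx_shape_ok(16, 64, 32, 64), "valid shape");
static_assert(!mx_shape_ok(16, 64, 32, 96), "96 is not a multiple of 64");
static_assert(!mx_shape_ok(16, 4096, 32, 4096), "k exceeds 4095");
```

A check of this kind only rules out shapes that are certainly illegal. Passing it does not imply that the A5 backend accepts the call.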
diff --git a/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul_zh.md b/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul_zh.md
index 5e542e60..ff8c48a2 100644
--- a/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul_zh.md
+++ b/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul_zh.md
@@ -1,43 +1,26 @@
-# TMATMUL
+# pto.tmatmul
 
-## 指令示意图
+`pto.tmatmul` 属于[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)指令集。
 
-![TMATMUL tile operation](../../../../figures/isa/TMATMUL.svg)
+## 概述
 
-## 简介
+`TMATMUL` 是 tile 路径里生成新累加器结果的基础矩阵乘指令。它从 `Left` 读取左操作数，从 `Right` 读取右操作数，把结果写入 `Acc`。这条指令和 `TMATMUL_ACC` 分开的原因很直接：`TMATMUL` 代表"这次计算生成一个新的输出块"，而 `TMATMUL_ACC` 代表"在已有累加器上继续叠加"；把两者混在一起，会让 K 维分块循环里的资源和调度语义变得不清楚。
 
-`TMATMUL` 是 tile 路径里生成新累加器结果的基础矩阵乘指令。它从 `Left` 读取左操作数，从 `Right` 读取右操作数，把结果写入 `Acc`。
-
-这条指令和 `TMATMUL_ACC` 分开的原因很直接：`TMATMUL` 代表“这次计算生成一个新的输出块”，而 `TMATMUL_ACC` 代表“在已有累加器上继续叠加”。把两者混在一起，会让 K 维分块循环里的资源和调度语义变得不清楚。
-
-## 数学语义
-
-设：
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-对 `0 <= i < M`、`0 <= j < N`：
+设 `M = aMatrix.GetValidRow()`，`K = aMatrix.GetValidCol()`，`N = bMatrix.GetValidCol()`，对 `0 <= i < M`、`0 <= j < N`：
 
 $$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
 
-有效计算域由输入 tile 的 valid region 决定，而不是单纯由静态 tile 尺寸决定。指令不会隐式做广播、reshape，或把不合法的输入域“修复”为某种可移植结果。
+有效计算域由输入 tile 的 valid region 决定，而不是单纯由静态 tile 尺寸决定。指令不会隐式做广播、reshape，或把不合法的输入域"修复"为某种可移植结果。
 
 ## 机制
 
-`TMATMUL` 属于 cube 路径，不是普通逐元素 tile 运算：
+`TMATMUL` 属于 cube 路径，不是普通逐元素 tile 运算：左操作数必须是 `Left` tile（对应 L0A 路径）；右操作数必须是 `Right` tile（对应 L0B 路径）；输出必须是 `Acc` tile；结果域由 `(M, K) x (K, N) -> (M, N)` 的合法矩阵乘域决定。`Right` 虽然在 A2A3 与 A5 上都叫同一个架构角色，但它们的具体布局要求并不完全相同，不能把某一侧的物理布局理解成"全 target 通用"。
 
-- 左操作数必须是 `Left` tile，对应 L0A 路径；
-- 右操作数必须是 `Right` tile，对应 L0B 路径；
-- 输出必须是 `Acc` tile；
-- 结果域由 `(M, K) x (K, N) -> (M, N)` 的合法矩阵乘域决定。
+## 语法
 
-`Right` 虽然在 A2A3 与 A5 上都叫同一个架构角色，但它们的具体布局要求并不完全相同，不能把某一侧的物理布局理解成“全 target 通用”。
+### PTO-AS
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 同步形式：
 
@@ -47,13 +30,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
 ```
 
@@ -71,84 +54,62 @@ PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMa
 
 带 `AccPhase` 的模板重载不改变矩阵乘法的算术语义；它只是让具体后端在 unit-flag 或实现细节上做选择。
 
-## 输入与输出
-
-- `aMatrix`：左操作数 tile，必须是 `Left`。
-- `bMatrix`：右操作数 tile，必须是 `Right`。
-- `cMatrix`：结果累加器 tile，必须是 `Acc`。
-
-结果写入 `cMatrix`。对读者可见的合同是：输出块由本次 `A * B` 生成，而不是在旧累加器上继续叠加。
+## 输入
 
-## 约束
-
-### 通用约束
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `aMatrix` | Left | 左操作数 tile |
+| `bMatrix` | Right | 右操作数 tile |
+| `cMatrix` | Acc | 结果累加器 tile |
 
-- 静态 shape 必须满足：
-  - `TileLeft::Rows == TileRes::Rows`
-  - `TileLeft::Cols == TileRight::Rows`
-  - `TileRight::Cols == TileRes::Cols`
-- tile 角色必须满足：
-  - `TileLeft::Loc == Left`
-  - `TileRight::Loc == Right`
-  - `TileRes::Loc == Acc`
-- 运行时 `m`、`k`、`n` 必须位于 `[1, 4095]`。
+## 预期输出
 
-### A2A3 约束
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `cMatrix` | `Acc` | 输出块由本次 `A * B` 生成，而不是在旧累加器上继续叠加 |
 
-`A2A3` 指 Ascend 910B 与 Ascend 910C。当前仓内实现公开支持的 `(CType, AType, BType)` 组合包括：
+## 副作用
 
-- `(int32_t, int8_t, int8_t)`
-- `(float, half, half)`
-- `(float, float, float)`
-- `(float, bfloat16_t, bfloat16_t)`
+结果写入 `cMatrix` 累加器。
 
-### A5 约束
-
-`A5` 指 Ascend 950 PR 与 Ascend 950 DT。当前仓内实现要求：
+## 约束
 
-- 累加器类型必须是 `int32_t` 或 `float`；
-- 若累加器为 `int32_t`，左右输入都必须是 `int8_t`；
-- 若累加器为 `float`，当前实现支持 `half`、`bfloat16_t`、`float` 和部分 fp8 输入对；
-- A5 还要求固定的角色布局组合：
-  - Left：`Loc == Left`，非 row-major，`SFractal == RowMajor`
-  - Right：`Loc == Right`，row-major，`SFractal == ColMajor`
-  - Acc：`Loc == Acc`，非 row-major，`SFractal == RowMajor`
+- 静态 shape 必须满足 `TileLeft::Rows == TileRes::Rows`、`TileLeft::Cols == TileRight::Rows` 且 `TileRight::Cols == TileRes::Cols`。
+- tile 角色必须满足 `TileLeft::Loc == Left`、`TileRight::Loc == Right` 且 `TileRes::Loc == Acc`。
+- 运行时 `m`、`k`、`n` 必须位于 `[1, 4095]`。
+- A2A3 支持的 `(CType, AType, BType)` 组合包括 `(int32_t, int8_t, int8_t)`、`(float, half, half)`、`(float, float, float)` 和 `(float, bfloat16_t, bfloat16_t)`。
+- A5 要求累加器类型必须是 `int32_t` 或 `float`；若累加器为 `int32_t`，左右输入都必须是 `int8_t`；若累加器为 `float`，当前实现支持 `half`、`bfloat16_t`、`float` 和部分 fp8 输入对；A5 还要求固定的角色布局组合：Left 为 `Loc == Left`，非 row-major，`SFractal == RowMajor`；Right 为 `Loc == Right`，row-major，`SFractal == ColMajor`；Acc 为 `Loc == Acc`，非 row-major，`SFractal == RowMajor`。
 
-## 不允许的情形
+## 异常与非法情形
 
-- 使用不是 `Left` / `Right` / `Acc` 的角色组合；
-- 形状不满足 `(M, K) x (K, N) -> (M, N)`；
-- 在不支持的 target 上使用不支持的 dtype 组合；
+- 使用不是 `Left` / `Right` / `Acc` 的角色组合。
+- 形状不满足 `(M, K) x (K, N) -> (M, N)`。
+- 在不支持的 target 上使用不支持的 dtype 组合。
 - 把某个 target 上偶然可运行的布局当成可移植合同。
 
-## 性能与吞吐
+## Target-Profile 限制
 
-仓内当前公开的性能数据主要来自 A2A3 costmodel。`TMATMUL`、`TMATMUL_ACC` 与 `TMATMUL_BIAS` 使用同一条 `mad/mmad` cube 模型：
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| int8 matmul | 支持 | 支持 | 支持 |
+| fp16 matmul | 支持 | 支持 | 支持 |
+| fp32 matmul | 支持 | 支持 | 支持 |
+| bf16 matmul | 支持 | 支持 | 支持 |
+| 布局约束 | 无 | 无 | Left/Acc 非 row-major，Right row-major |
 
-- 启动开销：`14` cycles；
-- repeat 次数：`ceil(M/16) * ceil(N/16) * ceil(K / baskK)`；
-- `baskK = 32 / sizeof(left_element_type)`；
-- 单个 repeat 的稳态代价：
-  - int8、fp16 bucket 为 `1` cycle；
-  - fp32 bucket 为 `2` cycles。
+## 性能与吞吐
 
-因此 A2A3 的公开公式是：
+仓内当前公开的性能数据主要来自 A2A3 costmodel。`TMATMUL`、`TMATMUL_ACC` 与 `TMATMUL_BIAS` 使用同一条 `mad/mmad` cube 模型：启动开销为 `14` cycles；repeat 次数为 `ceil(M/16) * ceil(N/16) * ceil(K / baskK)`；`baskK = 32 / sizeof(left_element_type)`；单个 repeat 的稳态代价：int8、fp16 bucket 为 `1` cycle；fp32 bucket 为 `2` cycles。A2A3 公开公式为：
 
 ```text
 cycles = 14 + ceil(M/16) * ceil(N/16) * ceil(K / baskK) * repeat_cost
 ```
 
-仓内 costmodel 测试样例包括：
-
-- half `40x50 * 50x60`：`62` cycles；
-- int8 `6x7 * 7x8`：`15` cycles；
-- float `120x110 * 110x50`：`910` cycles。
-
-当前仓库没有公开单列的 A5 latency / throughput 表，因此 A5 这里只能精确写合法性和 dtype 边界，不能编造周期数字。
+仓内 costmodel 测试样例包括：half `40x50 * 50x60` 为 `62` cycles；int8 `6x7 * 7x8` 为 `15` cycles；float `120x110 * 110x50` 为 `910` cycles。当前仓库没有公开单列的 A5 latency / throughput 表。
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -166,7 +127,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -189,7 +150,7 @@ void example_manual() {
 
 ## 相关页面
 
-- [矩阵与矩阵-向量指令集](../../matrix-and-matrix-vector_zh.md)
 - [TMATMUL_ACC](./tmatmul-acc_zh.md)
 - [TMATMUL_BIAS](./tmatmul-bias_zh.md)
 - [TMATMUL_MX](./tmatmul-mx_zh.md)
+- 指令集总览：[矩阵与矩阵-向量](../../matrix-and-matrix-vector_zh.md)
diff --git a/docs/isa/tile/ops/memory-and-data-movement/tload.md b/docs/isa/tile/ops/memory-and-data-movement/tload.md
index 7585bdfc..00a56a89 100644
--- a/docs/isa/tile/ops/memory-and-data-movement/tload.md
+++ b/docs/isa/tile/ops/memory-and-data-movement/tload.md
@@ -61,8 +61,11 @@ PTO_INST RecordEvent TLOAD(TileData &dst, GlobalData &src, WaitEvents &... event
 
 ## Inputs
 
-- `src` is the source GlobalTensor to load from.
-- `dst` names the destination tile. The operation uses dst's valid region for the transfer shape.
+| Operand | Role | Description |
+|---------|------|-------------|
+| `src` | Source | GlobalTensor to load data from |
+| `dst` | Destination | Tile that receives the loaded data; transfer size is determined by `dst`'s valid region |
+| `WaitEvents...` | Optional synchronization | `RecordEvent` tokens to wait on before issuing the load |
 
 ## Expected Outputs
 
@@ -180,4 +183,8 @@ pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.til
 ## Related Ops / Instruction Set Links
 
 - Instruction set overview: [Memory And Data Movement](../../memory-and-data-movement.md)
+- Previous op in instruction set: (none — first in set)
 - Next op in instruction set: [pto.tprefetch](./tprefetch.md)
+- Complementary operation: [pto.tstore](./tstore.md)
+- Vector load counterpart: [pto.vlds](../../../vector/ops/vector-load-store/vlds.md)
+- Instruction set: [Tile Instructions](../../README.md)
diff --git a/docs/isa/tile/ops/memory-and-data-movement/tstore-fp_zh.md b/docs/isa/tile/ops/memory-and-data-movement/tstore-fp_zh.md
index ef339c9e..41a8e92e 100644
--- a/docs/isa/tile/ops/memory-and-data-movement/tstore-fp_zh.md
+++ b/docs/isa/tile/ops/memory-and-data-movement/tstore-fp_zh.md
@@ -1,19 +1,16 @@
-# TSTORE_FP
+# pto.tstore.fp
 
-## 指令示意图
+`pto.tstore.fp` 属于[内存与数据搬运指令](../../memory-and-data-movement_zh.md)集。
 
-![TSTORE_FP tile operation](../../../../figures/isa/TSTORE_FP.svg)
+## 概述
 
-## 简介
+`TSTORE_FP` 是带向量量化参数的累加器写回指令。它把 `Acc` Tile 写回 `GlobalTensor`，同时使用额外的 `fp` Tile 提供向量量化或反量化所需的参数。如果普通 `TSTORE` 足够表达写回，就不需要这条指令；只有当写回过程本身依赖一组外部缩放参数时，才使用 `TSTORE_FP`。
 
-`TSTORE_FP` 是带向量量化参数的累加器写回指令。它把 `Acc` Tile 写回 `GlobalTensor`，同时使用额外的 `fp` Tile 提供向量量化或反量化所需的参数。
+## 机制
 
-如果普通 `TSTORE` 足够表达你的写回，就不需要这条指令；只有当写回过程本身依赖一组外部缩放参数时，才使用 `TSTORE_FP`。
-
-## 数学语义
+### 数学语义
 
 设：
-
 - `R = src.GetValidRow()`
 - `C = src.GetValidCol()`
 
@@ -21,13 +18,11 @@
 
 $$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
 
-其中 `Convert` 表示“结合 `fp` 参数完成的向量量化 / 反量化写回”。地址计算规则与普通 `TSTORE` 一致，仍由 `GlobalTensor` 的 shape / stride 决定。
+其中 `Convert` 表示"结合 `fp` 参数完成的向量量化 / 反量化写回"。地址计算规则与普通 `TSTORE` 一致，仍由 `GlobalTensor` 的 shape / stride 决定。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
-
-同步形式：
+### PTO-AS
 
 ```text
 tstore.fp %src, %fp, %sv_out[%c0, %c0]
@@ -35,13 +30,13 @@ tstore.fp %src, %fp, %sv_out[%c0, %c0]
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
 ```
 
@@ -55,50 +50,84 @@ template <typename TileData, typename GlobalData, typename FpTileData, AtomicTyp
 PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | 源 Tile，必须为 `Acc` Tile |
+| `fp` | 输入 | 提供量化/反量化参数的 Tile |
+| `dst` | 输出 | 目标 GlobalTensor |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | GlobalTensor | 量化写回后的数据 |
+
+## 副作用
+
+将累加器数据写入全局内存，可能触发量化/反量化转换。
+
 ## 约束
 
 ### 通用约束
 
-- `src` 必须是 `Acc` Tile。
-- `TSTORE_FP` 复用累加器到 GM 的量化写回检查，因此目标布局、行列范围和 dtype 合法域都沿用对应 backend 的 quantized `Acc -> GM` 路径。
-- 当前实现不会对 `FpTileData::Loc` 做直接 `static_assert`，但可移植代码应将 `fp` 建成 `TileType::Scaling`。
+- `src` 必须是 `Acc` Tile
+- `TSTORE_FP` 复用累加器到 GM 的量化写回检查，因此目标布局、行列范围和 dtype 合法域都沿用对应 backend 的 quantized `Acc -> GM` 路径
+- 当前实现不会对 `FpTileData::Loc` 做直接 `static_assert`，但可移植代码应将 `fp` 建成 `TileType::Scaling`
 
 ### A2/A3 实现检查
 
-- 目标 `GlobalTensor` 布局必须是 `ND`、`NZ` 或 `NC1HWC0`。
-- 源累加器类型必须是 `float` 或 `int32_t`。
+- 目标 `GlobalTensor` 布局必须是 `ND`、`NZ` 或 `NC1HWC0`
+- 源累加器类型必须是 `float` 或 `int32_t`
 - 静态范围：
-  - `1 <= TileData::Cols <= 4095`
-  - 若目标为 `ND`，则 `1 <= TileData::Rows <= 8192`
-  - 若目标为 `NZ` 或 `NC1HWC0`，则 `1 <= TileData::Rows <= 65535` 且 `TileData::Cols % 16 == 0`
+    - `1 <= TileData::Cols <= 4095`
+    - 若目标为 `ND`，则 `1 <= TileData::Rows <= 8192`
+    - 若目标为 `NZ` 或 `NC1HWC0`，则 `1 <= TileData::Rows <= 65535` 且 `TileData::Cols % 16 == 0`
 - 运行时要求：
-  - `1 <= src.GetValidCol() <= 4095`
-  - 目标 shape 各维与源 valid region 都必须大于 `0`
+    - `1 <= src.GetValidCol() <= 4095`
+    - 目标 shape 各维与源 valid region 都必须大于 `0`
 - 量化支持集为：
-  - `float Acc -> __gm__ int8_t / __gm__ uint8_t`
-  - `int32_t Acc -> __gm__ int8_t / __gm__ uint8_t / __gm__ half`
+    - `float Acc -> __gm__ int8_t / __gm__ uint8_t`
+    - `int32_t Acc -> __gm__ int8_t / __gm__ uint8_t / __gm__ half`
 
 ### A5 实现检查
 
-- 目标 `GlobalTensor` 布局必须是 `ND`、`NZ`、`NHWC`、`NCHW` 或 `NCDHW`。
-- 源累加器类型必须是 `float` 或 `int32_t`。
+- 目标 `GlobalTensor` 布局必须是 `ND`、`NZ`、`NHWC`、`NCHW` 或 `NCDHW`
+- 源累加器类型必须是 `float` 或 `int32_t`
 - 静态范围：
-  - `1 <= TileData::Cols <= 4095`
-  - 若目标为 `ND`，则 `1 <= TileData::Rows <= 8192`
-  - 若目标为 `NZ/NHWC/NCHW/NCDHW`，则 `1 <= TileData::Rows <= 65535` 且 `TileData::Cols % 16 == 0`
+    - `1 <= TileData::Cols <= 4095`
+    - 若目标为 `ND`，则 `1 <= TileData::Rows <= 8192`
+    - 若目标为 `NZ/NHWC/NCHW/NCDHW`，则 `1 <= TileData::Rows <= 65535` 且 `TileData::Cols % 16 == 0`
 - 量化支持集为：
-  - `float Acc -> __gm__ int8_t / __gm__ uint8_t / __gm__ bfloat16_t / __gm__ half / __gm__ hifloat8_t / __gm__ float8_e4m3_t / __gm__ float`
-  - `int32_t Acc -> __gm__ int8_t / __gm__ uint8_t / __gm__ bfloat16_t / __gm__ half`
-- 若启用 `AtomicAdd`，目标 dtype 还必须属于 backend 允许的原子类型集合。
+    - `float Acc -> __gm__ int8_t / __gm__ uint8_t / __gm__ bfloat16_t / __gm__ half / __gm__ hifloat8_t / __gm__ float8_e4m3_t / __gm__ float`
+    - `int32_t Acc -> __gm__ int8_t / __gm__ uint8_t / __gm__ bfloat16_t / __gm__ half`
+- 若启用 `AtomicAdd`，目标 dtype 还必须属于 backend 允许的原子类型集合
 
 ### CPU 模拟器说明
 
-- 当前 CPU 模拟器会接受 `TSTORE_FP` 接口，但不会真正消费 `fp` 参数，而是退化为普通 `TSTORE`。
-- 因而依赖 `fp` 具体数值的量化效果，应以 NPU backend 为准。
+- 当前 CPU 模拟器会接受 `TSTORE_FP` 接口，但不会真正消费 `fp` 参数，而是退化为普通 `TSTORE`
+- 因而依赖 `fp` 具体数值的量化效果，应以 NPU backend 为准
+
+## 异常与非法情形
+
+- 若 `src` 不是累加器类型，行为未定义
+- 若目标布局不在支持列表中，编译失败
+- 若启用 `AtomicAdd` 但目标 dtype 不支持原子操作，行为未定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| float Acc → int8_t/uint8_t | 是 | 是 | 是 |
+| float Acc → bfloat16/half/float | 是 | 否 | 是 |
+| int32_t Acc → int8_t/uint8_t/half | 是 | 是 | 是 |
+| int32_t Acc → bfloat16 | 是 | 否 | 是 |
+| AtomicAdd | 否 | 是 | 是 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -119,7 +148,7 @@ void example_auto(__gm__ int8_t* out) {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -144,6 +173,6 @@ void example_manual(__gm__ int8_t* out) {
 
 ## 相关页面
 
+- 指令集总览：[内存与数据搬运指令](../../memory-and-data-movement_zh.md)
 - [TSTORE](./tstore_zh.md)
 - [TMOV_FP](../layout-and-rearrangement/tmov-fp_zh.md)
-- [内存与数据搬运指令集](../../memory-and-data-movement_zh.md)
diff --git a/docs/isa/tile/ops/memory-and-data-movement/tstore.md b/docs/isa/tile/ops/memory-and-data-movement/tstore.md
index 7addc8c8..618de82e 100644
--- a/docs/isa/tile/ops/memory-and-data-movement/tstore.md
+++ b/docs/isa/tile/ops/memory-and-data-movement/tstore.md
@@ -185,3 +185,6 @@ pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view
 - Instruction set overview: [Memory And Data Movement](../../memory-and-data-movement.md)
 - Previous op in instruction set: [pto.tprefetch](./tprefetch.md)
 - Next op in instruction set: [pto.tstore_fp](./tstore-fp.md)
+- Complementary operation: [pto.tload](./tload.md)
+- Vector store counterpart: [pto.vsts](../../../vector/ops/vector-load-store/vsts.md)
+- Instruction set: [Tile Instructions](../../README.md)
diff --git a/docs/isa/tile/ops/reduce-and-expand/trowsum.md b/docs/isa/tile/ops/reduce-and-expand/trowsum.md
index 66321b3e..547c251f 100644
--- a/docs/isa/tile/ops/reduce-and-expand/trowsum.md
+++ b/docs/isa/tile/ops/reduce-and-expand/trowsum.md
@@ -4,12 +4,10 @@
 
 ## Summary
 
-Reduce each row by summing across columns.
+Reduce each row by summing across columns. For each row `i` in the valid region, all `C` elements in that row are summed and written to `dst[i, 0]`.
 
 ## Mechanism
 
-Reduce each row by summing across columns.
-
 Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
 
 $$ \mathrm{dst}_{i,0} = \sum_{j=0}^{C-1} \mathrm{src}_{i,j} $$
diff --git a/docs/isa/tile/ops/sync-and-config/tassign_zh.md b/docs/isa/tile/ops/sync-and-config/tassign_zh.md
index 1d6d36b5..52659a51 100644
--- a/docs/isa/tile/ops/sync-and-config/tassign_zh.md
+++ b/docs/isa/tile/ops/sync-and-config/tassign_zh.md
@@ -1,30 +1,20 @@
-# TASSIGN
+# pto.tassign
 
-## 指令示意图
+`pto.tassign` 属于[同步与配置指令](../../sync-and-config_zh.md)集。
 
-![TASSIGN tile operation](../../../../figures/isa/TASSIGN.svg)
+## 概述
 
-## 简介
-
-`TASSIGN` 把一个 Tile 或 `GlobalTensor` 对象绑定到具体存储地址。它不做算术，也不搬运数据；它做的是“把抽象对象落到某个物理或模拟地址上”。
-
-这条指令主要服务于手动放置和手动调度场景。很多自动流程也会在 lowering 过程中显式插入它，用来把 SSA 层的 tile 名称映射到真实缓冲。
+`TASSIGN` 把一个 Tile 或 `GlobalTensor` 对象绑定到具体存储地址。它不做算术，也不搬运数据；它做的是"把抽象对象落到某个物理或模拟地址上"。这条指令主要服务于手动放置和手动调度场景。
 
 ## 机制
 
-对 Tile 来说，`TASSIGN` 的作用是把内部数据指针指向某个片上地址；对 `GlobalTensor` 来说，则是把对象绑定到一段外部指针地址。
-
-它本身没有独立的数学语义。真正重要的是：
+对 Tile 来说，`TASSIGN` 的作用是把内部数据指针指向某个片上地址；对 `GlobalTensor` 来说，则是把对象绑定到一段外部指针地址。它本身没有独立的数学语义，真正重要的是：绑定的是哪类对象、地址是在运行时给出还是在编译期给出、以及当前目标是否允许这种地址落点。
 
-- 绑定的是哪类对象
-- 地址是在运行时给出，还是在编译期给出
-- 当前目标是否允许这种地址落点
+对于运行时地址形式：在 NPU manual 模式下，地址会被直接解释成 tile 存储地址；在 `__PTO_AUTO__` 打开的自动模式下，NPU backend 中的 `TASSIGN(tile, addr)` 当前是空操作；CPU 模拟器不会直接把整型当裸地址使用，而是通过 `NPUMemoryModel` 把它解析到对应架构的模拟缓冲区。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
-
-同步形式：
+### PTO-AS
 
 ```text
 tassign %tile, %addr : !pto.tile<...>, index
@@ -32,13 +22,13 @@ tassign %tile, %addr : !pto.tile<...>, index
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tassign %tile, %addr : !pto.tile<...>, dtype
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
 ```
 
@@ -60,34 +50,51 @@ template <std::size_t Addr, typename T>
 PTO_INST std::enable_if_t<is_tile_data_v<T> || is_conv_tile_v<T>> TASSIGN(T& obj);
 ```
 
-第二种写法会在编译期执行静态边界与对齐检查，因此更适合固定地址的手动布局。
+编译期地址写法会在编译期执行静态边界与对齐检查，因此更适合固定地址的手动布局。
 
-## 约束
+## 输入
 
-### Tile / ConvTile
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `obj` | 输入/输出 | Tile 或 GlobalTensor 对象 |
+| `addr` | 输入 | 目标地址（运行时为整型，编译期为模板参数） |
 
-- 运行时地址形式要求 `addr` 是整型地址。
-- 在 NPU manual 模式下，这个地址会被直接解释成 tile 存储地址。
-- 在 `__PTO_AUTO__` 打开的自动模式下，NPU backend 中的 `TASSIGN(tile, addr)` 当前是 no-op。
-- CPU 模拟器不会直接把整型当裸地址使用，而是通过 `NPUMemoryModel` 把它解析到对应架构的模拟缓冲区。
+## 预期输出
 
-### GlobalTensor
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `obj` | Tile/GlobalTensor | 被绑定到指定地址的对象 |
 
-- `addr` 必须是指针类型。
-- 指针指向的元素类型必须和 `GlobalTensor::DType` 一致。
+## 副作用
 
-### 编译期地址检查
+将 Tile 或 GlobalTensor 绑定到指定地址，可能影响后续操作的数据位置。
 
-`TASSIGN<Addr>(tile)` 会根据 Tile 的 `Loc` 自动推导对应内存空间，并检查：
+## 约束
 
-- 该内存空间在当前架构上是否存在
-- Tile 是否能放得下
-- `Addr + tile_size` 是否越界
-- 地址是否满足对齐要求
+- 运行时地址形式要求 `addr` 是整型地址。
+- 在 NPU manual 模式下，地址会被直接解释成 tile 存储地址。
+- 在自动模式下，`TASSIGN(tile, addr)` 当前是空操作。
+- CPU 模拟器通过 `NPUMemoryModel` 解析地址到模拟缓冲区。
+- 对于 GlobalTensor，`addr` 必须是指针类型，且指针指向的元素类型必须和 `GlobalTensor::DType` 一致。
+- 编译期地址检查会根据 Tile 的 `Loc` 自动推导对应内存空间，并检查该内存空间是否存在、Tile 是否能放得下、`Addr + tile_size` 是否越界、以及地址是否满足对齐要求。
+
+## 异常与非法情形
+
+- 当指定地址越界时，行为未定义。
+- 当指针类型与 `GlobalTensor::DType` 不匹配时，编译错误。
+- 当目标架构不支持指定的内存空间时，编译错误。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 运行时地址形式 | 是 | 是 | 是 |
+| 编译期地址形式 | 是 | 是 | 是 |
+| GlobalTensor 支持 | 是 | 是 | 是 |
 
 ## 示例
 
-### 运行时地址
+### C++ 运行时地址
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -104,7 +111,7 @@ void example_runtime() {
 }
 ```
 
-### 编译期地址
+### C++ 编译期地址
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -122,8 +129,15 @@ void example_checked() {
 }
 ```
 
+### PTO-AS
+
+```text
+tassign %tile, %addr : !pto.tile<...>, index
+```
+
 ## 相关页面
 
+- 指令集总览：[同步与配置](../../sync-and-config_zh.md)
 - [TSUBVIEW](./tsubview_zh.md)
 - [TALIAS](../../../TALIAS_zh.md)
-- [同步与配置指令集](../../sync-and-config_zh.md)
+- [TASSIGN](./tassign_zh.md)
diff --git a/docs/isa/tile/ops/sync-and-config/tget-scale-addr_zh.md b/docs/isa/tile/ops/sync-and-config/tget-scale-addr_zh.md
index 3401d3c7..172185c6 100644
--- a/docs/isa/tile/ops/sync-and-config/tget-scale-addr_zh.md
+++ b/docs/isa/tile/ops/sync-and-config/tget-scale-addr_zh.md
@@ -1,30 +1,36 @@
-# TGET_SCALE_ADDR
+# pto.tget_scale_addr
 
-## Tile Operation Diagram
+`pto.tget_scale_addr` 属于[同步与配置指令](../../sync-and-config_zh.md)集。
 
-![TGET_SCALE_ADDR tile operation](../../../../figures/isa/TGET_SCALE_ADDR.svg)
+## 概述
 
-## 简介
+`TGET_SCALE_ADDR` 将输入 Tile 的片上地址数值按比例扩展，将其结果数值绑定为输出 Tile 的片上地址。这个扩展因子是由 `include/pto/npu/a5/utils.hpp` 中的右移值 `SHIFT_MX_ADDR` 定义的。
 
-将输入Tile的片上地址数值按比例扩展，将其结果数值绑定为输出Tile的片上地址。
+## 机制
 
-这个扩展因子是由`include/pto/npu/a5/utils.hpp`中的右移值`SHIFT_MX_ADDR`定义的。
+地址映射关系为：
 
-## 数学语义
+$$ \mathrm{Address}(\mathrm{dst}) = \mathrm{Address}(\mathrm{src}) \gg \mathrm{SHIFT\_MX\_ADDR} $$
 
-Address(`dst`) = Address(`src`) >> `SHIFT_MX_ADDR`
+该指令主要用于在自动模式下处理不同粒度的地址映射。
 
-## 汇编语法
+## 语法
 
-PTO-AS form: see [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md).
+### PTO-AS
 
-### IR Level 1（SSA）
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
-TODO
+### AS Level 1（SSA）
 
-### IR Level 2（DPS）
+```mlir
+%dst = pto.tget_scale_addr %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
 
-TODO
+```mlir
+pto.tget_scale_addr ins(%src : !pto.tile<...>) outs(%dst : !pto.tile<...>)
+```
 
 ## C++ 内建接口
 
@@ -32,23 +38,51 @@ TODO
 
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TGET_SCALE_ADDR(TileDataDst &dst, TileDataSrc &src, aitEvents&... events);
+PTO_INST RecordEvent TGET_SCALE_ADDR(TileDataDst &dst, TileDataSrc &src, WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile（缩放后地址） |
+| `src` | 输入 | 源 Tile |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 片上地址按比例缩放后的 Tile |
+
+## 副作用
+
+将目标 Tile 的片上地址绑定为源 Tile 地址经右移 `SHIFT_MX_ADDR` 位后的值。
+
 ## 约束
 
-- **输入和输出都必须为Tile对象**
-- **目前只能用在auto模式下**（以后会将支持manual模式下的实现）
+- 输入和输出都必须为 Tile 对象。
+- 目前只能用在自动模式下。
+
+## 异常与非法情形
+
+- 在手动模式下使用此指令，行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 自动模式支持 | - | - | 是 |
 
 ## 示例
 
+### C++
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
-> wa
 using namespace pto;
 
-template <typename T, int ARows, int ACols, BRows, BCols>
+template <typename T, int ARows, int ACols, int BRows, int BCols>
 void example() {
     using LeftTile = TileLeft<T, ARows, ACols>;
     using RightTile = TileRight<T, BRows, BCols>;
@@ -66,16 +100,13 @@ void example() {
 }
 ```
 
-## asm form examples
+### PTO-AS
 
-### Auto Mode
-
-TODO
-
-### Manual Mode
-
-TODO
+```text
+%dst = pto.tget_scale_addr %src : !pto.tile<...> -> !pto.tile<...>
+```
 
-### PTO 汇编形式
+## 相关页面
 
-TODO
+- 指令集总览：[同步与配置](../../sync-and-config_zh.md)
+- [TGET_SCALE_ADDR](./tget-scale-addr_zh.md)
diff --git a/docs/isa/tile/ops/sync-and-config/tset-img2col-padding_zh.md b/docs/isa/tile/ops/sync-and-config/tset-img2col-padding_zh.md
index 484c657b..0c5809b6 100644
--- a/docs/isa/tile/ops/sync-and-config/tset-img2col-padding_zh.md
+++ b/docs/isa/tile/ops/sync-and-config/tset-img2col-padding_zh.md
@@ -1,32 +1,18 @@
-# TSET_IMG2COL_PADDING
+# pto.tset_img2col_padding
 
-## 指令示意图
+`pto.tset_img2col_padding` 属于[同步与配置指令](../../sync-and-config_zh.md)集。
 
-![TSET_IMG2COL_PADDING tile operation](../../../../figures/isa/TSET_IMG2COL_PADDING.svg)
+## 概述
 
-## 简介
-
-`TSET_IMG2COL_PADDING` 把 `Img2colTileConfig` 里的 pad value 写入 IMG2COL 相关 padding 配置寄存器，供后续 `TIMG2COL` 一类操作在边界补值时使用。
-
-它解决的问题不是“pad 多大”，而是“边界外应该补什么值”。
+`TSET_IMG2COL_PADDING` 把 `Img2colTileConfig` 里的 pad value 写入 IMG2COL 相关 padding 配置寄存器，供后续 `TIMG2COL` 一类操作在边界补值时使用。它解决的问题不是"pad 多大"，而是"边界外应该补什么值"。
 
 ## 机制
 
-这条指令从配置 Tile 读取 `src.GetPadValue()`，然后把这个值编码成硬件需要的 padding 字段。
-
-当前 backend 的编码规则是：
-
-- 若元素宽度为 1 字节：把同一个字节复制两次后写入 padding 字段
-- 若元素宽度为 2 字节：按 16 位值写入
-- 若元素宽度为 4 字节：按 32 位值写入
+这条指令从配置 Tile 读取 `src.GetPadValue()`，然后把这个值编码成硬件需要的 padding 字段。当前 backend 的编码规则是：若元素宽度为 1 字节，把同一个字节复制两次后写入 padding 字段；若元素宽度为 2 字节，按 16 位值写入；若元素宽度为 4 字节，按 32 位值写入。在 A5 上，这条配置同样支持 A 侧和 B 侧两组寄存器。
 
-在 A5 上，这条配置同样支持 A 侧和 B 侧两组寄存器。
+## 语法
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
-
-示意形式：
+### PTO-AS
 
 ```text
 tset_img2col_padding %cfg
@@ -34,13 +20,13 @@ tset_img2col_padding %cfg
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tset_img2col_padding ins(%cfg : !pto.fmatrix_config) outs()
 ```
 
@@ -56,31 +42,48 @@ template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FM
 PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | IMG2COL 配置 Tile |
+| `events...` | 可选 | 等待事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | - | 配置指令不产生数据结果 |
+
+## 副作用
 
-### 通用约束
+将 padding 配置写入 IMG2COL 相关寄存器。
+
+## 约束
 
 - `src` 应是有效的 IMG2COL 配置 Tile。
 - 这条指令只更新控制状态，不直接产生新的 tile 数据。
 - 通常应在同一执行流中先配置 padding，再发出依赖它的 `TIMG2COL`。
+- CPU 模拟器当前把这条指令实现为空操作，不额外写入寄存器状态。
+- A2/A3 直接把 `padValue` 编码后写入 padding 寄存器，这条路径没有区分 A/B 两组配置寄存器。
+- A5 仅在 `FMATRIX_A_MANUAL` 或 `FMATRIX_B_MANUAL` 时真正写寄存器。
+- `FMATRIX_A_MANUAL` 写 A 侧 padding 配置，`FMATRIX_B_MANUAL` 写 B 侧 padding 配置。
 
-### CPU 模拟器
-
-- CPU 当前把这条指令实现为 no-op，不额外写入寄存器状态。
-
-### A2/A3 实现
+## 异常与非法情形
 
-- A2/A3 直接把 `padValue` 编码后写入 padding 寄存器。
-- 这条路径没有区分 A/B 两组配置寄存器。
+- 当 `src` 不是有效的 IMG2COL 配置 Tile 时，行为未定义。
 
-### A5 实现
+## Target-Profile 限制
 
-- A5 仅在 `FMATRIX_A_MANUAL` 或 `FMATRIX_B_MANUAL` 时真正写寄存器。
-- `FMATRIX_A_MANUAL` 写 A 侧 padding 配置。
-- `FMATRIX_B_MANUAL` 写 B 侧 padding 配置。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 基础支持 | 是 | 是 | 是 |
+| A/B 侧区分 | - | - | 是 |
 
 ## 示例
 
+### C++
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -91,8 +94,16 @@ void example_set_img2col_padding(Img2colTileConfig<uint16_t>& cfg) {
 }
 ```
 
+### PTO-AS
+
+```text
+tset_img2col_padding %cfg
+```
+
 ## 相关页面
 
+- 指令集总览：[同步与配置](../../sync-and-config_zh.md)
 - [TSETFMATRIX](../../../scalar/ops/control-and-configuration/tsetfmatrix_zh.md)
 - [TSET_IMG2COL_RPT](./tset-img2col-rpt_zh.md)
 - [TIMG2COL](../layout-and-rearrangement/timg2col_zh.md)
+- [TSET_IMG2COL_PADDING](./tset-img2col-padding_zh.md)
diff --git a/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt_zh.md b/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt_zh.md
index 45db6515..2dad455f 100644
--- a/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt_zh.md
+++ b/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt_zh.md
@@ -1,41 +1,22 @@
-# TSET_IMG2COL_RPT
+# pto.tset_img2col_rpt
 
-## 指令示意图
+`pto.tset_img2col_rpt` 属于[同步与配置指令](../../sync-and-config_zh.md)集。
 
-![TSET_IMG2COL_RPT tile operation](../../../../figures/isa/TSET_IMG2COL_RPT.svg)
+## 概述
 
-## 简介
-
-`TSET_IMG2COL_RPT` 把 `Img2colTileConfig` 里的重复控制字段写入 IMG2COL 相关寄存器，供后续 `TIMG2COL` 一类操作使用。
-
-如果 `TSETFMATRIX` 负责写“输入特征图几何信息”，那么这条指令负责写“IMG2COL 应该按什么重复方式工作”。
+`TSET_IMG2COL_RPT` 把 `Img2colTileConfig` 里的重复控制字段写入 IMG2COL 相关寄存器，供后续 `TIMG2COL` 一类操作使用。如果 `TSETFMATRIX` 负责写"输入特征图几何信息"，那么这条指令负责写"IMG2COL 应该按什么重复方式工作"。
 
 ## 机制
 
 A2/A3 与 A5 都会从配置 Tile 中读取 `repeat` 相关字段，但两代硬件暴露的字段不完全一样。
 
-### A2/A3
-
-会把以下字段打包到一份 repeat 配置里：
-
-- `repeatStride`
-- `repeatTime`
-- `repeatMode`
-
-### A5
-
-除了上面三项，还会额外写入：
-
-- `dstStride`
-- `dstMposition`
+A2/A3 会把以下字段打包到一份 repeat 配置里：`repeatStride`、`repeatTime`、`repeatMode`。
 
-并且支持写到 A 侧或 B 侧的对应寄存器。
+A5 除了上面三项，还会额外写入 `dstStride` 和 `dstMposition`，并且支持写到 A 侧或 B 侧的对应寄存器。
 
-## 汇编语法
+## 语法
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
-
-示意形式：
+### PTO-AS
 
 ```text
 tset_img2col_rpt %cfg
@@ -43,13 +24,13 @@ tset_img2col_rpt %cfg
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tset_img2col_rpt ins(%cfg : !pto.fmatrix_config) outs()
 ```
 
@@ -65,37 +46,49 @@ template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FM
 PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events);
 ```
 
-## 约束
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | IMG2COL 配置 Tile |
+| `events...` | 可选 | 等待事件 |
+
+## 预期输出
 
-### 通用约束
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | - | 配置指令不产生数据结果 |
+
+## 副作用
+
+将重复控制字段写入 IMG2COL 相关寄存器。
+
+## 约束
 
 - `src` 应是有效的 IMG2COL 配置 Tile。
 - 这条指令只更新控制状态，不直接产生新的 tile 数据。
 - 通常应在同一执行流中先配置 repeat，再发出依赖它的 `TIMG2COL`。
+- CPU 模拟器只检查 `repeatTime` 已被初始化（`GetRepeatTime() >= 0`）。
+- A2/A3 的无 `SetFmatrixMode` 重载会直接写 repeat 配置，当前仓库里的字段是：`repeatStride`、`repeatTime`、`repeatMode`。
+- A5 仅在 `FMATRIX_A_MANUAL` 或 `FMATRIX_B_MANUAL` 时真正写寄存器，比 A2/A3 多写 `dstStride` 和 `dstMposition`。
+- `FMATRIX_A_MANUAL` 写 A 侧 repeat 配置，`FMATRIX_B_MANUAL` 写 B 侧 repeat 配置。
 
-### CPU 模拟器
-
-- CPU 只检查 `repeatTime` 已被初始化（`GetRepeatTime() >= 0`）。
-
-### A2/A3 实现
+## 异常与非法情形
 
-- A2/A3 的无 `SetFmatrixMode` 重载会直接写 repeat 配置。
-- 当前仓库里的 A2/A3 实现字段是：
-  - `repeatStride`
-  - `repeatTime`
-  - `repeatMode`
+- 当 `src` 不是有效的 IMG2COL 配置 Tile 时，行为未定义。
+- 当 `repeatTime` 未初始化时，CPU 模拟器将报错。
 
-### A5 实现
+## Target-Profile 限制
 
-- A5 仅在 `FMATRIX_A_MANUAL` 或 `FMATRIX_B_MANUAL` 时真正写寄存器。
-- A5 比 A2/A3 多写两项：
-  - `dstStride`
-  - `dstMposition`
-- `FMATRIX_A_MANUAL` 写 A 侧 repeat 配置。
-- `FMATRIX_B_MANUAL` 写 B 侧 repeat 配置。
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 基础支持 | 是 | 是 | 是 |
+| A/B 侧区分 | - | - | 是 |
 
 ## 示例
 
+### C++
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -106,8 +99,16 @@ void example_set_img2col_rpt(Img2colTileConfig<uint16_t>& cfg) {
 }
 ```
 
+### PTO-AS
+
+```text
+tset_img2col_rpt %cfg
+```
+
 ## 相关页面
 
+- 指令集总览：[同步与配置](../../sync-and-config_zh.md)
 - [TSETFMATRIX](../../../scalar/ops/control-and-configuration/tsetfmatrix_zh.md)
 - [TSET_IMG2COL_PADDING](./tset-img2col-padding_zh.md)
 - [TIMG2COL](../layout-and-rearrangement/timg2col_zh.md)
+- [TSET_IMG2COL_RPT](./tset-img2col-rpt_zh.md)
diff --git a/docs/isa/tile/ops/sync-and-config/tsettf32mode_zh.md b/docs/isa/tile/ops/sync-and-config/tsettf32mode_zh.md
index 0dd5f8aa..518d2db7 100644
--- a/docs/isa/tile/ops/sync-and-config/tsettf32mode_zh.md
+++ b/docs/isa/tile/ops/sync-and-config/tsettf32mode_zh.md
@@ -1,22 +1,18 @@
-# TSETTF32MODE
+# pto.tsettf32mode
 
-## 指令示意图
+`pto.tsettf32mode` 属于[同步与配置指令](../../sync-and-config_zh.md)集。
 
-![TSETTF32MODE tile operation](../../../../figures/isa/TSETTF32MODE.svg)
+## 概述
 
-## 简介
+`TSETTF32MODE` 设置 TF32 变换模式，该模式供后续指令使用。这是一个状态配置指令，本身不产生直接的张量算术结果，而是更新控制状态。
 
-设置 TF32 变换模式（实现定义）。
+## 机制
 
-## 数学语义
+该指令将 TF32 启用状态和舍入模式写入目标配置寄存器，精确模式取值和硬件行为由目标实现定义。后续使用 TF32 的计算指令将根据此设置决定变换行为。
 
-该指令本身不产生直接的张量算术结果，而是更新供后续指令使用的目标模式状态。
+## 语法
 
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
-
-示意形式：
+### PTO-AS
 
 ```text
 tsettf32mode {enable = true, mode = ...}
@@ -24,13 +20,13 @@ tsettf32mode {enable = true, mode = ...}
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 pto.tsettf32mode {enable = true, mode = ...}
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tsettf32mode ins({enable = true, mode = ...}) outs()
 ```
 
@@ -43,14 +39,42 @@ template <bool isEnable, RoundMode tf32TransMode = RoundMode::CAST_ROUND, typena
 PTO_INST RecordEvent TSETTF32MODE(WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `events...` | 可选 | 等待事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | - | 状态配置指令不产生数据结果 |
+
+## 副作用
+
+设置 TF32 变换模式状态，后续 TF32 计算指令将使用此配置。
+
 ## 约束
 
 - 仅在对应 backend capability macro 启用时可用。
 - 精确模式取值和硬件行为由目标实现定义。
 - 该指令具有控制状态副作用，应与依赖它的计算指令建立正确顺序。
 
+## 异常与非法情形
+
+- 当目标架构不支持 TF32 时，行为未定义。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| TF32 支持 | - | - | 是 |
+
 ## 示例
 
+### C++
+
 ```cpp
 #include <pto/pto-inst.hpp>
 using namespace pto;
@@ -59,3 +83,14 @@ void example_enable_tf32() {
   TSETTF32MODE<true, RoundMode::CAST_ROUND>();
 }
 ```
+
+### PTO-AS
+
+```text
+tsettf32mode {enable = true, mode = ...}
+```
+
+## 相关页面
+
+- 指令集总览：[同步与配置](../../sync-and-config_zh.md)
+- [TSETTF32MODE](./tsettf32mode_zh.md)
diff --git a/docs/isa/tile/ops/sync-and-config/tsubview_zh.md b/docs/isa/tile/ops/sync-and-config/tsubview_zh.md
index ade99a1d..76afb256 100644
--- a/docs/isa/tile/ops/sync-and-config/tsubview_zh.md
+++ b/docs/isa/tile/ops/sync-and-config/tsubview_zh.md
@@ -1,54 +1,88 @@
-﻿# TSUBVIEW
+# pto.tsubview
 
-## Tile操作图例
+`pto.tsubview` 属于[同步与配置指令](../../sync-and-config_zh.md)集。
 
-![TSUBVIEW tile operation](../../../../figures/isa/TSUBVIEW.svg)
+## 概述
 
-## 简介
+`TSUBVIEW` 表达一个 Tile 是另一个 Tile 的子视图，通过指定起始行和列偏移量来定义源 Tile 内的一个区域作为目标 Tile 的数据来源。
 
-表达一个Tile是另一个Tile的subview。
+## 机制
 
-## 数学表达
+对于 `dst` 中有效区域内的每一个元素 `(i, j)`：
 
-- `rowIdx`: 在`src`的有效区域内的起始行的索引。
-- `colIdx`: 在`src`的有效区域内的起始列的索引。
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{rowIdx} + i,\mathrm{colIdx} + j} $$
 
-对于`dst`中有效区域内的每一个元素`(i, j)`：
+其中 `rowIdx` 是源 Tile 有效区域内的起始行索引，`colIdx` 是源 Tile 有效区域内的起始列索引。
 
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{rowIdx} + i,\mathrm{colIdx} + j} $$
+## 语法
 
-## 汇编语法
+### PTO-AS
 
-PTO-AS form: 详见 [PTO-AS Specification](../../../../assembly/PTO-AS_zh.md).
+参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
-### IR Level 1（SSA）
+### AS Level 1（SSA）
 
-TODO
+```mlir
+%dst = pto.tsubview %src, %rowIdx, %colIdx : !pto.tile<...>
+```
 
-### IR Level 2（DPS）
+### AS Level 2（DPS）
 
-TODO
+```mlir
+pto.tsubview ins(%src, %rowIdx, %colIdx : !pto.tile<...>, ui16, ui16)
+             outs(%dst : !pto.tile<...>)
+```
 
-## C++ Intrinsic
+## C++ 内建接口
 
-定义在 `include/pto/common/pto_instr.hpp`:
+定义在 `include/pto/common/pto_instr.hpp`：
 
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
 PTO_INST RecordEvent TSUBVIEW(TileDataDst &dst, TileDataSrc &src, uint16_t rowIdx, uint16_t colIdx, WaitEvents&... events);
 ```
 
-## 限制
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `dst` | 输出 | 目标 Tile（子视图） |
+| `src` | 输入 | 源 Tile |
+| `rowIdx` | 输入 | 起始行偏移量 |
+| `colIdx` | 输入 | 起始列偏移量 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 绑定到源 Tile 指定区域的子视图 |
+
+## 副作用
+
+将 `dst` 绑定为 `src` 的子视图，后续对 `dst` 的操作实际作用于源 Tile 的对应区域。
 
-规定在`TSUBVIEW_IMPL`中:
+## 约束
 
-- **Tile类型必须相同**: `TileDataSrc::Loc == TileDataDst::Loc`.
-- **输入和输出Tile的静态shape必须相同**: `TileDataSrc::Rows == TileDataDst::Rows` and `TileDataSrc::Cols == TileDataDst::Cols`.
-- **输入和输出Tile的BLayout必须相同**: `TileDataSrc::BFractal == TileDataDst::BFractal`.
-- **src的validRow和validCol必须大于等于dst的validRow和validCol**
+- Tile 类型必须相同：`TileDataSrc::Loc == TileDataDst::Loc`。
+- 输入和输出 Tile 的静态 shape 必须相同：`TileDataSrc::Rows == TileDataDst::Rows` 且 `TileDataSrc::Cols == TileDataDst::Cols`。
+- 输入和输出 Tile 的 BLayout 必须相同：`TileDataSrc::BFractal == TileDataDst::BFractal`。
+- 源 Tile 的 validRow 和 validCol 必须大于等于目标 Tile 的 validRow 和 validCol。
+
+## 异常与非法情形
+
+- 当 `rowIdx` 或 `colIdx` 超出源 Tile 有效区域时，行为未定义。
+- 当源和目标 Tile 类型不匹配时，编译错误。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| 子视图支持 | 是 | 是 | 是 |
 
 ## 示例
 
+### C++
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -72,16 +106,14 @@ void example() {
 }
 ```
 
-## ASM示例
+### PTO-AS
 
-### Auto模式
-
-TODO
-
-### Manual模式
-
-TODO
+```text
+%dst = pto.tsubview %src, %rowIdx, %colIdx : !pto.tile<...>
+```
 
-### PTO汇编格式
+## 相关页面
 
-TODO
+- 指令集总览：[同步与配置](../../sync-and-config_zh.md)
+- [TASSIGN](./tassign_zh.md)
+- [TSUBVIEW](./tsubview_zh.md)
diff --git a/docs/isa/tile/ops/sync-and-config/tsync.md b/docs/isa/tile/ops/sync-and-config/tsync.md
index 4de00a38..a86093e9 100644
--- a/docs/isa/tile/ops/sync-and-config/tsync.md
+++ b/docs/isa/tile/ops/sync-and-config/tsync.md
@@ -144,4 +144,7 @@ tsync.op #pto.op<TADD>
 ## Related Ops / Instruction Set Links
 
 - Instruction set overview: [Sync And Config](../../sync-and-config.md)
+- Previous op in instruction set: (none — first in set)
 - Next op in instruction set: [pto.tassign](./tassign.md)
+- Related synchronization: [pto.pipe_barrier](../../../scalar/ops/pipeline-sync/pipe-barrier.md), [pto.set_flag](../../../scalar/ops/pipeline-sync/set-flag.md), [pto.wait_flag](../../../scalar/ops/pipeline-sync/wait-flag.md)
+- Instruction set: [Tile Instructions](../../README.md)
diff --git a/docs/isa/tile/ops/sync-and-config/tsync_zh.md b/docs/isa/tile/ops/sync-and-config/tsync_zh.md
index ed82ced1..346466cb 100644
--- a/docs/isa/tile/ops/sync-and-config/tsync_zh.md
+++ b/docs/isa/tile/ops/sync-and-config/tsync_zh.md
@@ -1,58 +1,40 @@
-﻿# TSYNC
+# pto.tsync
 
-## 指令示意图
+`pto.tsync` 属于[同步与配置指令](../../sync-and-config_zh.md)集。
 
-![TSYNC tile operation](../../../../figures/isa/TSYNC.svg)
+## 概述
 
-## 简介
+`TSYNC` 用于同步 PTO 执行，它有两种形式：`TSYNC(events...)` 等待一组显式事件令牌，`TSYNC<Op>()` 为单个向量操作类插入流水线屏障。许多内建函数在发射指令前会在内部调用 `TSYNC(events...)`。
 
-同步 PTO 执行（等待事件或插入每操作流水线屏障）。
+## 机制
 
-- `TSYNC(events...)` 等待一组显式事件令牌。
-- `TSYNC<Op>()` 为单个向量操作类插入流水线屏障。
+`TSYNC(events...)` 调用 `WaitAllEvents(events...)`，后者对每个事件令牌调用 `events.Wait()`。在自动模式下这是空操作。
 
-`include/pto/common/pto_instr.hpp` 中的许多内建函数在发射指令前会在内部调用 `TSYNC(events...)`。
+`TSYNC<Op>()` 仅为向量流水线操作（`PIPE_V`）插入同步屏障，其他流水线类型不支持。
 
-## 数学语义
+## 语法
 
-不适用。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
-
-Event operand form:
+### PTO-AS
 
 ```text
 tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
 ```
 
-Single-op barrier form:
-
-```text
-tsync.op #pto.op<TADD>
-```
-
 ### AS Level 1（SSA）
 
-```text
-// Level 1 (SSA) does not support explicit synchronization primitives.
-```
+SSA 形式不支持显式同步原语。
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.record_event[src_op, dst_op, eventID]
-// 支持的op：TLOAD， TSTORE_ACC，TSTORE_VEC，TMOV_M2L，TMOV_M2S，TMOV_M2B，TMOV_M2V，TMOV_V2M，TMATMUL，TVEC
 pto.wait_event[src_op, dst_op, eventID]
-// 支持的op：TLOAD， TSTORE_ACC，TSTORE_VEC，TMOV_M2L，TMOV_M2S，TMOV_M2B，TMOV_M2V，TMOV_V2M，TMATMUL，TVEC
 pto.barrier(op)
-// 支持的op：TVEC,TMATMUL
 ```
 
-在当前 PTO-DSL 前端流程中，`record_event` 和 `wait_event` 应视为 TSYNC 的低层形式。
-前端 kernel 通常不应手工编写事件连线，而应依赖 `ptoas --enable-insert-sync`
-自动插入同步。
+`pto.record_event` / `pto.wait_event` 支持的 op：TLOAD、TSTORE_ACC、TSTORE_VEC、TMOV_M2L、TMOV_M2S、TMOV_M2B、TMOV_M2V、TMOV_V2M、TMATMUL、TVEC；`pto.barrier` 支持的 op：TVEC、TMATMUL。
+
+`record_event` 和 `wait_event` 应视为 TSYNC 的低层形式。前端 kernel 通常不应手工编写事件连线，而应依赖 `ptoas --enable-insert-sync` 自动插入同步。
 
 ## C++ 内建接口
 
@@ -66,16 +48,43 @@ template <typename... WaitEvents>
 PTO_INST void TSYNC(WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `events...` | 输入 | 要等待的事件令牌 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| 无 | - | 同步操作不产生数据结果 |
+
+## 副作用
+
+`TSYNC` 可能阻塞执行流水线直到指定事件完成，或插入硬件屏障。
+
 ## 约束
 
-- **实现检查（`TSYNC<Op>()`）**:
-  - `TSYNC_IMPL<Op>()` 仅支持向量流水线操作（`include/pto/common/event.hpp` 中通过 `static_assert(pipe == PIPE_V)` 强制执行）。
-- **`TSYNC(events...)` 语义**:
-  - `TSYNC(events...)` 调用 `WaitAllEvents(events...)`，后者对每个事件令牌调用 `events.Wait()`。在auto模式下是no-op。
+- `TSYNC<Op>()` 仅支持向量流水线操作（通过 `static_assert(pipe == PIPE_V)` 强制执行）。
+- `TSYNC(events...)` 在自动模式下是空操作。
+- 手动模式下会等待所有指定事件完成。
+
+## 异常与非法情形
+
+- 在非向量流水线上调用 `TSYNC<Op>()` 将导致编译错误。
+- 当指定的事件未正确发出时，`TSYNC(events...)` 可能永久阻塞。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `TSYNC(events...)` | 是 | 是 | 是 |
+| `TSYNC<Op>()` | 是 | 是 | 是 |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -96,7 +105,7 @@ void example_auto(__gm__ float* in) {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -113,29 +122,19 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%result = pto.tsync ...
+tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
 ```
 
-### 手动模式
+### AS Level 2（DPS）
 
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%result = pto.tsync ...
+```mlir
+pto.record_event[src_op, dst_op, eventID]
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
-# AS Level 2 (DPS)
-pto.record_event[src_op, dst_op, eventID]
-```
+- 指令集总览：[同步与配置](../../sync-and-config_zh.md)
+- [TSYNC](./tsync_zh.md)
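Reviewer note on `tsync_zh.md`: the page describes `TSYNC(events...)` as calling `WaitAllEvents(events...)`, which in turn calls `Wait()` on each event token. The sketch below only illustrates that forwarding pattern. `DemoEvent`, `WaitAllEventsSketch`, and `TsyncSketch` are hypothetical names invented for this note; they are not the PTO library types, and a real token would block rather than set a flag.

```cpp
#include <cassert>

// Hypothetical stand-in for a PTO event token (not the real library type).
struct DemoEvent {
    bool waited = false;
    void Wait() { waited = true; }  // a real token would block until the event is signalled
};

// Wait on every event token, left to right.
template <typename... WaitEvents>
void WaitAllEventsSketch(WaitEvents &... events) {
    (events.Wait(), ...);  // C++17 fold over the comma operator; empty pack is a no-op
}

// Assumed shape of TSYNC(events...): forward all tokens to the wait-all helper.
template <typename... WaitEvents>
void TsyncSketch(WaitEvents &... events) {
    WaitAllEventsSketch(events...);
}
```

Calling `TsyncSketch(e0, e1)` marks both tokens as waited, and calling it with no arguments does nothing, which loosely matches the documented no-op behaviour when there is nothing to wait for.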
diff --git a/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md b/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md
index 26a13442..57594372 100644
--- a/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md
+++ b/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md
@@ -1,24 +1,22 @@
-﻿# TEXPANDS
+﻿# pto.texpands
 
-## 指令示意图
+`pto.texpands` 属于[Tile 标量与立即数指令](../../tile-scalar-and-immediate_zh.md)集。
 
-![TEXPANDS tile operation](../../../../figures/isa/TEXPANDS.svg)
+## 概述
 
-## 简介
+将标量广播到目标 Tile 中所有有效位置。
 
-将标量广播到目标 Tile 中。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{scalar} $$
 
-## 汇编语法
+对于向量 Tile，迭代域由 `dst.GetValidRow()` / `dst.GetValidCol()` 决定；对于 Mat Tile，迭代域由 `TileData::Rows` / `TileData::Cols` 决定。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = texpands %scalar : f32, !pto.tile<...>
@@ -26,54 +24,70 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.texpands %scalar : dtype -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TEXPANDS(TileData &dst, typename TileData::DType scalar, WaitEvents &... events);
+PTO_INST RecordEvent TEXPANDS(TileData &dst, typename TileData::DType scalar, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%scalar` | 标量 | 广播到目标 tile 的值 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | 有效区域内所有元素等于标量值 |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - 对于Tile位置是向量（`TileData::Loc == TileType::Vec`）:
-    - `TileData::DType` 必须是以下之一：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
-    - 静态有效边界： `TileData::ValidRow <= TileData::Rows`且`TileData::ValidCol <= TileData::Cols`.
-    - 对于Tile位置是Mat（`TileData::Loc == TileType::Mat`）:
-    - `TileData::DType` 必须是以下之一：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
-    - 有效边界：`TileData::Rows * TileData::Cols * sizeof(T) / 32` 必须在`[1, 32767]`范围内。
-- **实现检查 (A5)**:
-    - 对于Tile位置是向量（`TileData::Loc == TileType::Vec`）:
-    - 静态有效边界： `TileData::ValidRow <= TileData::Rows`且`TileData::ValidCol <= TileData::Cols`.
-    - `TileData::DType` 必须是以下之一： `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
-    - 对于Tile位置是Mat（`TileData::Loc == TileType::Mat`）:
-    - `TileData::DType` 必须是以下之一： `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
-    - 对于`TileDataDst::layout == pto::Layout::NC1HWC0 || TileDataDst::layout == pto::Layout::FRACTAL_Z`:
-      - `TileData::shape0 * TileData::shape1 * TileData::shape2 * TileData::shape3` 必须在`[1, 32767]`范围内。
-    - 对于`TileDataDst::layout == pto::Layout::NDC1HWC0 || TileDataDst::layout == pto::Layout::FRACTAL_Z_3D`:
-      - `TileData::shape0 * TileData::shape1 * TileData::shape2 * TileData::shape3 * TileData::shape4` 必须在`[1, 32767]`范围内。
-- **有效区域**:
-    - 对于Tile位置是向量（`TileData::Loc == TileType::Vec`）:
-    - 该操作在 `dst.GetValidRow()` / `dst.GetValidCol()` 上填充 `dst`。
-    - 对于Tile位置是Mat（`TileData::Loc == TileType::Mat`）:
-    - 对于Tile，该操作在 `TileData::Rows` / `TileData::Cols` 上填充 `dst`。
-    - 对于convTile，该操作在`ConvTileData`的`shape`内填充`dst`。
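+- Tile 位置可以是向量（`TileType::Vec`）或 Mat（`TileType::Mat`）。
+- A2/A3 支持的元素类型（向量与 Mat）：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
+- A5 支持的元素类型（向量与 Mat）：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`。
+- 向量 Tile 静态有效边界（A2/A3 与 A5）：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+- A2/A3 Mat 要求：`TileData::Rows * TileData::Cols * sizeof(T) / 32` 在 `[1, 32767]` 范围内。
+- A5 Mat 约束因布局而异：`NC1HWC0` / `FRACTAL_Z` 要求 `shape0 * shape1 * shape2 * shape3` 在 `[1, 32767]` 范围内；`NDC1HWC0` / `FRACTAL_Z_3D` 要求 `shape0 * shape1 * shape2 * shape3 * shape4` 在 `[1, 32767]` 范围内。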
+
+## 异常与非法情形
+
+- 不支持的元素类型会被 verifier 拒绝。
+- 所选 target profile 不支持的形状/布局约束会被后端拒绝。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| `bf16` | Simulated | Supported | No |
+| `i32 / u32` | Simulated | Supported | Supported |
+| `i16 / u16` | Simulated | Supported | Supported |
+| `i8 / u8` | Simulated | Supported | Supported |
+| Vec Layout | Any | Any | RowMajor |
+| Mat Layout | Any | Supported | Supported |
 
 ## 示例
 
-### 自动（Auto）
+### C++ 自动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -87,7 +101,7 @@ void example_auto() {
 }
 ```
 
-### 手动（Manual）
+### C++ 手动模式
 
 ```cpp
 #include <pto/pto-inst.hpp>
@@ -102,29 +116,14 @@ void example_manual() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+%dst = texpands %scalar : f32, !pto.tile<...>
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = texpands %scalar : f32, !pto.tile<...>
-# AS Level 2 (DPS)
 pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
 ```
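Reviewer note on `texpands_zh.md`: the formula `dst(i, j) = scalar` over the valid region can be checked against a small host-side reference model. The sketch below assumes a row-major vector tile; `DemoTile` and `TexpandsSketch` are hypothetical names for this note, not PTO types. Elements outside the valid region are left untouched here, which is a choice of the sketch and not something the page specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major tile: rows x cols storage with a smaller valid region.
template <typename T>
struct DemoTile {
    std::size_t rows, cols, validRow, validCol;
    std::vector<T> data;

    DemoTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc)
        : rows(r), cols(c), validRow(vr), validCol(vc), data(r * c, T{}) {
        assert(vr <= r && vc <= c);  // static valid bounds from the constraints list
    }

    T &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// Broadcast `scalar` to every (i, j) inside the valid region of `dst`.
template <typename T>
void TexpandsSketch(DemoTile<T> &dst, T scalar) {
    for (std::size_t i = 0; i < dst.validRow; ++i) {
        for (std::size_t j = 0; j < dst.validCol; ++j) {
            dst.at(i, j) = scalar;
        }
    }
}
```

For a 4x8 tile with a 2x3 valid region, only the six valid elements take the scalar value; the Mat-tile case described on the page would iterate over the full `Rows` x `Cols` extent instead.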
diff --git a/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md b/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md
index e55ee411..a65d1ed1 100644
--- a/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md
+++ b/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md
@@ -1,24 +1,22 @@
-﻿# TLRELU
+﻿# pto.tlrelu
 
-## 指令示意图
+`pto.tlrelu` 属于[Tile 标量与立即数指令](../../tile-scalar-and-immediate_zh.md)集。
 
-![TLRELU tile operation](../../../../figures/isa/TLRELU.svg)
+## 概述
 
-## 简介
+带标量斜率的 Leaky ReLU 激活，对 Tile 逐元素执行，结果写入目标 tile。
 
-带标量斜率的 Leaky ReLU。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = (\mathrm{src}_{i,j} > 0) ? \mathrm{src}_{i,j} : (\mathrm{src}_{i,j} \cdot \mathrm{slope}) $$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tlrelu %src, %slope : !pto.tile<...>, f32
@@ -26,40 +24,66 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
 PTO_INST RecordEvent TLRELU(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | Leaky ReLU 的输入 |
+| `%slope` | 标量 | 负值的乘数 |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素为 Leaky ReLU 结果 |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - `TileData::DType` 必须是以下之一：`half`、`float16_t`、`float`、`float32_t`（仅浮点类型）。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **实现检查 (A5)**:
-    - `TileData::DType` 必须是以下之一：`half`、`float`（仅浮点类型）。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **通用约束**:
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`dst` 和 `src` 的有效行列数必须相同。
-    - 斜率标量类型必须与 Tile 数据类型一致。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- `dst` 和 `src` 必须使用相同的元素类型。
+- 标量类型必须与 Tile 数据类型一致。
+- Tile 位置必须是向量。
+- 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+- 运行时：`dst` 和 `src` 的有效行列数必须相同。
+- 布局必须是行主序。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+- A2/A3 支持的浮点类型：`half`、`float16_t`、`float`、`float32_t`。
+- A5 支持的浮点类型：`half`、`float`。
+
+## 异常与非法情形
+
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `f32` | Simulated | Supported | Supported |
+| `f16` | Simulated | Supported | Supported |
+| 布局 | Any | RowMajor | RowMajor |
 
 ## 示例
 
@@ -75,29 +99,14 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = tlrelu %src, %slope : !pto.tile<...>, f32
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = tlrelu %src, %slope : !pto.tile<...>, f32
-# AS Level 2 (DPS)
 pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
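Reviewer note on `tlrelu_zh.md`: the Leaky ReLU formula on this page can be mirrored by a scalar reference loop, which is handy when comparing against simulator output. This is a sketch under assumptions: buffers are plain row-major `float` arrays, and `TlreluSketch` is a hypothetical helper name, not a PTO API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// dst(i, j) = src(i, j) > 0 ? src(i, j) : src(i, j) * slope, over validRow x validCol.
// Both buffers are row-major with `cols` elements per row (assumption of this sketch).
void TlreluSketch(std::vector<float> &dst, const std::vector<float> &src,
                  std::size_t cols, std::size_t validRow, std::size_t validCol,
                  float slope) {
    assert(validCol <= cols);
    assert(src.size() >= validRow * cols && dst.size() >= validRow * cols);
    for (std::size_t i = 0; i < validRow; ++i) {
        for (std::size_t j = 0; j < validCol; ++j) {
            const float x = src[i * cols + j];
            dst[i * cols + j] = (x > 0.0f) ? x : x * slope;
        }
    }
}
```

Note that zero takes the `x * slope` branch, as in the documented formula, and still yields zero.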
diff --git a/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md b/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md
index a4ea40b3..d7805a48 100644
--- a/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md
+++ b/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md
@@ -1,24 +1,24 @@
-﻿# TSHLS
+﻿# pto.tshls
 
-## 指令示意图
+`pto.tshls` 属于[Tile 标量与立即数指令](../../tile-scalar-and-immediate_zh.md)集。
 
-![TSHLS tile operation](../../../../figures/isa/TSHLS.svg)
+## 概述
 
-## 简介
+Tile 按标量逐元素左移。对有效区域内的每个元素，将其值左移 `scalar` 指定的位数。
 
-Tile 按标量逐元素左移。
+## 机制
 
-## 数学语义
+### 数学语义
 
 对每个元素 `(i, j)` 在有效区域内：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \ll \mathrm{scalar} $$
 
-## 汇编语法
+其中 `scalar` 为非负整数移位量。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tshls %src, %scalar : !pto.tile<...>, i32
@@ -26,13 +26,13 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
@@ -45,26 +45,62 @@ template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
 PTO_INST RecordEvent TSHLS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `src` | 输入 | 源 Tile |
+| `scalar` | 立即数 | 非负移位量 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `dst` | Tile | 左移结果，有效区域内有效 |
+
+## 副作用
+
+无。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - 支持的元素类型为 `int32_t`、`int`、`int16_t`、`uint32_t`、`unsigned int` 和 `uint16_t`。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - 标量仅支持零和正值。
-- **实现检查 (A5)**:
-    - 支持的元素类型为 `int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t` 和 `uint8_t`。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 两个 Tile 的静态有效边界都必须满足 `ValidRow <= Rows` 且 `ValidCol <= Cols`。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - 标量仅支持零和正值。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- A2/A3 实现检查：
+    - 支持的元素类型为 `int32_t`、`int`、`int16_t`、`uint32_t`、`unsigned int` 和 `uint16_t`
+    - `dst` 和 `src` 必须使用相同的元素类型
+    - `dst` 和 `src` 必须是向量 Tile
+    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`
+    - 标量仅支持零和正值
+- A5 实现检查：
+    - 支持的元素类型为 `int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t` 和 `uint8_t`
+    - `dst` 和 `src` 必须使用相同的元素类型
+    - `dst` 和 `src` 必须是向量 Tile
+    - 两个 Tile 的静态有效边界都必须满足 `ValidRow <= Rows` 且 `ValidCol <= Cols`
+    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`
+    - 标量仅支持零和正值
+- 有效区域：
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域
+
+## 异常与非法情形
+
+- 若 `src` 和 `dst` 的元素类型不匹配，编译失败
+- 若 `scalar` 为负值，行为未定义
+- 若 valid region 不匹配，行为未定义
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| int32_t | 是 | 是 | 是 |
+| int16_t | 是 | 是 | 是 |
+| int8_t | 是 | 否 | 是 |
+| uint32_t | 是 | 是 | 是 |
+| uint16_t | 是 | 是 | 是 |
+| uint8_t | 是 | 否 | 是 |
 
 ## 示例
 
+### C++
+
 ```cpp
 #include <pto/pto-inst.hpp>
 
@@ -79,29 +115,18 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
+### PTO-AS
 
 ```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
+# 自动模式：由编译器/运行时负责资源放置与调度
 %dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
 
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
+# 手动模式：先显式绑定资源，再发射指令
 # pto.tassign %arg0, @tile(0x1000)
 # pto.tassign %arg1, @tile(0x2000)
 %dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
-### PTO 汇编形式
+## 相关页面
 
-```text
-%dst = tshls %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
+- 指令集总览：[Tile 标量与立即数指令](../../tile-scalar-and-immediate_zh.md)
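Reviewer note on `tshls_zh.md`: a host-side reference for `dst = src << scalar` has to avoid undefined behaviour when left-shifting negative signed values in pre-C++20 code, so the sketch below shifts in the unsigned domain and converts back. `TshlsSketch` is a hypothetical name, the flat `std::vector` stands in for the valid region, and the upper bound on the shift amount is an extra assumption of the sketch (the page only requires a non-negative value).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Element-wise left shift by a non-negative scalar over equally sized buffers.
template <typename T>
void TshlsSketch(std::vector<T> &dst, const std::vector<T> &src, int shift) {
    static_assert(std::is_integral<T>::value, "integer element types only");
    using U = typename std::make_unsigned<T>::type;
    // Non-negative shift as documented; the width bound is an assumption of this sketch.
    assert(shift >= 0 && shift < static_cast<int>(sizeof(T) * 8));
    assert(dst.size() == src.size());  // stands in for matching valid regions
    for (std::size_t i = 0; i < src.size(); ++i) {
        // Shift as unsigned, truncate to the element width, then convert back.
        dst[i] = static_cast<T>(static_cast<U>(static_cast<U>(src[i]) << shift));
    }
}
```

On two's-complement targets this gives the usual wrap-around result for signed inputs, for example `-1 << 4` becomes `-16` for `int32_t`.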
diff --git a/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md b/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md
index 0ce08803..a196b4ec 100644
--- a/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md
+++ b/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md
@@ -1,24 +1,22 @@
-﻿# TSHRS
+﻿# pto.tshrs
 
-## 指令示意图
+`pto.tshrs` 属于[Tile 标量与立即数指令](../../tile-scalar-and-immediate_zh.md)集。
 
-![TSHRS tile operation](../../../../figures/isa/TSHRS.svg)
+## 概述
 
-## 简介
+对 Tile 按标量做逐元素右移，结果写入目标 tile。
 
-Tile 按标量逐元素右移。
+## 机制
 
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
+对有效区域内每个元素 `(i, j)`：
 
 $$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \gg \mathrm{scalar} $$
 
-## 汇编语法
+迭代域由目标 tile 的 valid region 决定。标量仅支持零和正值。
 
-PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
+## 语法
 
-同步形式：
+### PTO-AS
 
 ```text
 %dst = tshrs %src, %scalar : !pto.tile<...>, i32
@@ -26,42 +24,69 @@ PTO-AS 形式：参见 [PTO-AS 规范](../../../../assembly/PTO-AS_zh.md)。
 
 ### AS Level 1（SSA）
 
-```text
+```mlir
 %dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
 ```
 
 ### AS Level 2（DPS）
 
-```text
+```mlir
 pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
 
 ## C++ 内建接口
 
-声明于 `include/pto/common/pto_instr.hpp`：
-
 ```cpp
 template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSHRS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+PTO_INST RecordEvent TSHRS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents & ... events);
 ```
 
+## 输入
+
+| 操作数 | 角色 | 说明 |
+| --- | --- | --- |
+| `%src` | 源 tile | 被移位的值 |
+| `%scalar` | 标量 | 移位量（仅零和正值） |
+| `WaitEvents...` | 可选同步 | 发射前需要等待的事件 |
+
+## 预期输出
+
+| 结果 | 类型 | 说明 |
+| --- | --- | --- |
+| `%dst` | `!pto.tile<...>` | valid region 内每个元素等于 `src >> scalar` |
+
+## 副作用
+
+除产生目标 tile 外，没有额外架构副作用。
+
 ## 约束
 
-- **实现检查 (A2A3)**:
-    - 支持的元素类型为 `int32_t`、`int`、`int16_t`、`uint32_t`、`unsigned int` 和 `uint16_t`。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - 标量仅支持零和正值。
-- **实现检查 (A5)**:
-    - 支持的元素类型为 `int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t` 和 `uint8_t`。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 两个 Tile 的静态有效边界都必须满足 `ValidRow <= Rows` 且 `ValidCol <= Cols`。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - 标量仅支持零和正值。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+- `dst` 和 `src` 必须使用相同的元素类型。
+- `dst` 和 `src` 必须是向量 Tile。
+- 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
+- 标量仅支持零和正值。
+- A2/A3 支持的元素类型：`int32_t`、`int`、`int16_t`、`uint32_t`、`unsigned int` 和 `uint16_t`。
+- A5 支持的元素类型：`int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t` 和 `uint8_t`。
+- 两个 Tile 的静态有效边界都必须满足 `ValidRow <= Rows` 且 `ValidCol <= Cols`。
+- 迭代域总是 `dst.GetValidRow() × dst.GetValidCol()`。
+
+## 异常与非法情形
+
+- 源/目标类型不匹配会被 verifier 拒绝。
+- 所选 target profile 不支持的元素类型会被后端拒绝。
+- 负数移位量会被拒绝。
+- 程序不能依赖 `dst` valid region 之外的值。
+
+## Target-Profile 限制
+
+| 特性 | CPU Simulator | A2/A3 | A5 |
+| --- | :---: | :---: | :---: |
+| `i32` | Simulated | Supported | Supported |
+| `i16` | Simulated | Supported | Supported |
+| `i8` | Simulated | No | Supported |
+| `u32` | Simulated | Supported | Supported |
+| `u16` | Simulated | Supported | Supported |
+| `u8` | Simulated | No | Supported |
 
 ## 示例
 
@@ -79,29 +104,14 @@ void example() {
 }
 ```
 
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
+### PTO-AS
 
 ```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = tshrs %src, %scalar : !pto.tile<...>, i32
 ```
 
-### PTO 汇编形式
+### AS Level 2（DPS）
 
 ```text
-%dst = tshrs %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
 pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
 ```
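Reviewer note on `tshrs_zh.md`: the page does not say whether the right shift on signed element types is arithmetic or logical. The sketch below uses the native C++ `>>`, which is arithmetic for signed types in C++20 and on common compilers before that; treat that as an assumption rather than documented PTO behaviour. `TshrsSketch` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Element-wise right shift by a non-negative scalar over equally sized buffers.
template <typename T>
void TshrsSketch(std::vector<T> &dst, const std::vector<T> &src, int shift) {
    static_assert(std::is_integral<T>::value, "integer element types only");
    // Non-negative shift as documented; the width bound is an assumption of this sketch.
    assert(shift >= 0 && shift < static_cast<int>(sizeof(T) * 8));
    assert(dst.size() == src.size());  // stands in for matching valid regions
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = static_cast<T>(src[i] >> shift);  // native shift semantics of T
    }
}
```

If the hardware turns out to use a logical shift for signed types, the loop body would need to shift in the unsigned domain instead.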
diff --git a/docs/isa/vector/README.md b/docs/isa/vector/README.md
index ceb2590e..662a05f9 100644
--- a/docs/isa/vector/README.md
+++ b/docs/isa/vector/README.md
@@ -1,53 +1,54 @@
 # Vector ISA Reference
 
-The `pto.v*` vector micro-instruction set of PTO ISA is organized by instruction set, with standalone per-op pages under `vector/ops/`.
-
-## Instruction Sets
-
-| Instruction Set | Description | Operations |
-|--------|-------------|------------|
-| [Vector Load Store](./vector-load-store.md) | UB↔vector register transfer, gather/scatter | ~25 |
-| [Predicate and Materialization](./predicate-and-materialization.md) | Vector broadcast and duplication | 2 |
-| [Unary Vector Instructions](./unary-vector-ops.md) | Single-input elementwise operations | 12 |
-| [Binary Vector Instructions](./binary-vector-ops.md) | Two-input elementwise operations | 14 |
-| [Vector-Scalar Instructions](./vec-scalar-ops.md) | Vector combined with scalar operand | 14 |
-| [Conversion Ops](./conversion-ops.md) | Type conversion between numeric types | 3 |
-| [Reduction Instructions](./reduction-ops.md) | Cross-lane reductions | 6 |
-| [Compare and Select](./compare-select.md) | Comparison and conditional selection | 5 |
-| [Data Rearrangement](./data-rearrangement.md) | Lane permutation and packing | 10 |
-| [SFU and DSA Instructions](./sfu-and-dsa-ops.md) | Special function units and DSA instructions | 11 |
+`pto.v*` is the PTO ISA vector micro-instruction set. It directly exposes the vector pipeline, vector registers, predicates, and vector-visible UB movement.
 
-## Quick Reference
-
-### Common Vector Types
+## Organization
 
-| Type | Description |
-|------|-------------|
-| `!pto.vreg<NxT>` | Vector register with N lanes of type T |
-| `!pto.mask` | Predicate mask (width matches vector length) |
-| `!pto.scalar<T>` | Scalar register |
+The vector reference is organized by instruction family, with individual per-op pages under `vector/ops/`.
 
-### Vector Lengths
+## Instruction Families
 
-Vector length `N` is a power of 2. Common values depend on the target profile.
+| Family | Description | Operations |
+|--------|-------------|-----------|
+| [Vector Load Store](./vector-load-store.md) | GM↔UB and UB↔vreg data movement | `vlds`, `vldas`, `vldus`, `vldx2`, `vsld`, `vsldb`, `vgather2`, `vgatherb`, `vgather2_bc`, `vsts`, `vstx2`, `vsst`, `vsstb`, `vscatter`, `vsta`, `vstas`, `vstar`, `vstu`, `vstus`, `vstur` |
+| [Predicate and Materialization](./predicate-and-materialization.md) | Predicate broadcast and duplication | `vbr`, `vdup` |
+| [Unary Vector Ops](./unary-vector-ops.md) | Single-operand vector operations | `vabs`, `vneg`, `vexp`, `vln`, `vsqrt`, `vrsqrt`, `vrec`, `vrelu`, `vnot`, `vbcnt`, `vcls`, `vmov` |
+| [Binary Vector Ops](./binary-vector-ops.md) | Two-operand vector operations | `vadd`, `vsub`, `vmul`, `vdiv`, `vmax`, `vmin`, `vand`, `vor`, `vxor`, `vshl`, `vshr`, `vaddc`, `vsubc` |
+| [Vector-Scalar Ops](./vec-scalar-ops.md) | Vector combined with scalar operand | `vadds`, `vsubs`, `vmuls`, `vmaxs`, `vmins`, `vands`, `vors`, `vxors`, `vshls`, `vshrs`, `vlrelu`, `vaddcs`, `vsubcs` |
+| [Conversion Ops](./conversion-ops.md) | Type conversion | `vci`, `vcvt`, `vtrc` |
+| [Reduction Ops](./reduction-ops.md) | Cross-lane reduction | `vcadd`, `vcmax`, `vcmin`, `vcgadd`, `vcgmax`, `vcgmin`, `vcpadd` |
+| [Compare and Select](./compare-select.md) | Predicate generation and conditional selection | `vcmp`, `vcmps`, `vsel`, `vselr`, `vselrv2` |
+| [Data Rearrangement](./data-rearrangement.md) | Lane permutation and packing | `vintlv`, `vdintlv`, `vslide`, `vshift`, `vsqz`, `vusqz`, `vperm`, `vpack`, `vsunpack`, `vzunpack`, `vintlvv2`, `vdintlvv2` |
+| [SFU and DSA](./sfu-and-dsa-ops.md) | Special function units and DSA operations | `vprelu`, `vexpdiff`, `vaddrelu`, `vsubrelu`, `vaxpy`, `vaddreluconv`, `vmulconv`, `vmull`, `vmula`, `vtranspose`, `vsort32`, `vbitsort`, `vmrgsort` |
 
-## Navigation
+## Common Constraints
 
-The left sidebar provides standalone per-op pages for all vector instructions. Use the instruction set overviews above to understand shared constraints and mechanisms before reading individual opcode pages.
+- Vector width is determined by element type.
+- Predicate width must match vector width.
+- Alignment, distribution, and advanced forms depend on the target profile.
+- There is no tile-level valid-region semantics at the vector layer.
 
-## Source And Timing Provenance
+## Quick Reference
 
-This vector reference treats the current public VPTO semantic and timing material as the input for all per-op timing disclosures.
+### Common Vector Types
 
-Timing disclosure policy for the per-op pages is intentionally strict:
+| Type | Elements/VLane | Total Elements/vreg |
+|------|---------------|-------------------|
+| f32 / i32 | 8 | 64 |
+| f16 / bf16 / i16 | 16 | 128 |
+| i8 / si8 / ui8 | 32 | 256 |
 
-- If the public sources publish a numeric latency or throughput, the page states that number.
-- If the public sources only publish a stream-level statement, such as the one-CPI note on unaligned-load priming, the page states exactly that narrower contract.
-- If the public sources do not publish a numeric timing value, the page now says so explicitly instead of guessing.
+### Mask Types
 
+| Mask Type | Bytes/Element Slot | Total Lanes |
+|-----------|-------------------|-------------|
+| `mask<b32>` | 4 | 64 |
+| `mask<b16>` | 2 | 128 |
+| `mask<b8>` | 1 | 256 |
 
 ## See Also
 
-- [Vector instructions](../instruction-surfaces/vector-instructions.md)
-- [Vector Instruction Set](../instruction-families/vector-families.md)
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md)
+- [Vector instruction surface](../instruction-surfaces/vector-instructions.md) — High-level description
+- [Vector instruction families](../instruction-families/vector-families.md) — Normative contracts
+- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page format standard
+- [Micro-instruction summary](./micro-instruction-summary.md) — Scalar micro-instructions for vector scope
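Reviewer note on `docs/isa/vector/README.md`: the element counts in the Quick Reference tables follow from a 256-byte vector register split into 32-byte VLanes, as stated in the micro-instruction summary added later in this change. The sketch below just recomputes those numbers; the constant and function names are hypothetical and exist only for this note.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVregBytes = 256;  // one vector register (A5 profile)
constexpr std::size_t kVLaneBytes = 32;  // one VLane inside a register

// Total elements of type T that fit in one vector register.
template <typename T>
constexpr std::size_t ElementsPerVreg() {
    return kVregBytes / sizeof(T);
}

// Elements of type T that fit in one 32-byte VLane.
template <typename T>
constexpr std::size_t ElementsPerVLane() {
    return kVLaneBytes / sizeof(T);
}

static_assert(ElementsPerVreg<float>() == 64, "f32: 64 elements per vreg");
static_assert(ElementsPerVLane<float>() == 8, "f32: 8 elements per VLane");
```

The same arithmetic gives 128 and 16 for 16-bit types and 256 and 32 for 8-bit types, matching the table rows and the `mask<b16>` / `mask<b8>` lane counts.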
diff --git a/docs/isa/vector/README_zh.md b/docs/isa/vector/README_zh.md
index 05e6e324..91c64e71 100644
--- a/docs/isa/vector/README_zh.md
+++ b/docs/isa/vector/README_zh.md
@@ -8,36 +8,55 @@
 
 ## 指令族
 
-- 向量加载存储
-- 谓词与物化
-- 一元向量操作
-- 二元向量操作
-- 向量-标量操作
-- 转换操作
-- 归约操作
-- 比较与选择
-- 数据重排
-- SFU 与 DSA
+| 族 | 说明 | 典型操作 |
+|----|------|----------|
+| [向量加载存储](./vector-load-store_zh.md) | GM↔UB 和 UB↔vreg 数据搬运 | `vlds`、`vldas`、`vgather2`、`vsld`、`vsst`、`vscatter` 等 |
+| [谓词与物化](./predicate-and-materialization_zh.md) | 谓词广播与复制 | `vbr`、`vdup` |
+| [一元向量运算](./unary-vector-ops_zh.md) | 单操作数向量运算 | `vabs`、`vneg`、`vexp`、`vsqrt`、`vrec`、`vrelu` 等 |
+| [二元向量运算](./binary-vector-ops_zh.md) | 双操作数向量运算 | `vadd`、`vsub`、`vmul`、`vmax`、`vmin`、`vand`、`vor` 等 |
+| [向量-标量运算](./vec-scalar-ops_zh.md) | 向量与标量组合运算 | `vadds`、`vsubs`、`vmuls`、`vshls`、`vlrelu` 等 |
+| [类型转换](./conversion-ops_zh.md) | 类型转换 | `vci`、`vcvt`、`vtrc` |
+| [归约指令](./reduction-ops_zh.md) | 跨 lane 归约 | `vcadd`、`vcmax`、`vcmin`、`vcgadd`、`vcgmax` 等 |
+| [比较与选择](./compare-select_zh.md) | 谓词生成与条件选择 | `vcmp`、`vcmps`、`vsel`、`vselr`、`vselrv2` |
+| [数据重排](./data-rearrangement_zh.md) | Lane 置换与打包 | `vintlv`、`vslide`、`vshift`、`vpack`、`vzunpack` 等 |
+| [SFU 与 DSA](./sfu-and-dsa-ops_zh.md) | 特殊函数单元与 DSA 操作 | `vprelu`、`vexpdiff`、`vaxpy`、`vtranspose`、`vsort32` 等 |
 
 ## 共享约束
 
-- 向量宽度由元素类型决定
-- 谓词宽度必须匹配向量宽度
-- 对齐、分布和部分高级形式依赖目标 profile
-- 向量层没有 tile 级 valid region 语义
+- 向量宽度由元素类型决定。
+- 谓词宽度必须匹配向量宽度。
+- 对齐、分布和部分高级形式依赖目标 profile。
+- 向量层没有 tile 级 valid region 语义。
+
+## 快速参考
+
+### 常见向量类型
+
+| 类型 | 单元素宽度 | 每个 vreg 的总元素数 |
+|------|----------|-------------------|
+| f32 / i32 | 4 B | 64 |
+| f16 / bf16 / i16 | 2 B | 128 |
+| i8 / si8 / ui8 | 1 B | 256 |
+
+### Mask 类型
+
+| Mask 类型 | 每个元素槽字节数 | 总 lane 数 |
+|-----------|----------------|----------|
+| `mask<b32>` | 4 | 64 |
+| `mask<b16>` | 2 | 128 |
+| `mask<b8>` | 1 | 256 |
 
 ## 相关页面
 
-- [向量指令集](../instruction-surfaces/vector-instructions_zh.md)
-- [向量指令族](../instruction-families/vector-families_zh.md)
-- [指令描述格式](../reference/format-of-instruction-descriptions_zh.md)
+- [向量指令集](../instruction-surfaces/vector-instructions_zh.md) — 高层描述
+- [向量指令族](../instruction-families/vector-families_zh.md) — 规范契约
+- [指令描述格式](../reference/format-of-instruction-descriptions_zh.md) — per-op 页面格式标准
+- [微指令汇总](./micro-instruction-summary.md) — 向量作用域的标量微指令
 
 ## 来源与时序披露
 
-当前向量微指令参考页以现有公开 VPTO 语义和时序材料为准，并据此统一生成各 per-op 页的时序披露。
-
-各 per-op 页现在统一遵循下面的时序披露规则：
+当前向量微指令参考页以现有公开 VPTO 语义和时序材料为准，并据此统一生成各 per-op 页的时序披露：
 
 - 公开来源给出了数字时延或吞吐时，页面直接写出该数字。
 - 公开来源只给出流级描述时，页面只写出该更窄的公开契约。
-- 公开来源没有给出数字时，页面会明确写成“公开来源未给出”，而不是推测一个常数。
+- 公开来源没有给出数字时，页面会明确写成"公开来源未给出"，而不是推测一个常数。
diff --git a/docs/isa/vector/micro-instruction-summary.md b/docs/isa/vector/micro-instruction-summary.md
new file mode 100644
index 00000000..c779188e
--- /dev/null
+++ b/docs/isa/vector/micro-instruction-summary.md
@@ -0,0 +1,434 @@
+# Vector Micro-Instruction Reference Summary
+
+This document summarizes the micro-architectural details for vector instructions (`pto.v*`) from the PTO micro-instruction SPEC. It supplements the per-operation reference pages with hardware-specific timing, pipeline behavior, and implementation notes for the Ascend 950 (A5) architecture.
+
+> **Scope:** This information applies to the A5 profile (Ascend 950). The CPU simulator and A2/A3 profiles emulate vector instructions using scalar loops; performance numbers below are specific to A5 hardware.
+
+## Architecture Overview
+
+### Vector Core and Pipelines
+
+The Ascend 950 vector core operates asynchronously with three primary pipelines:
+
+| Pipeline | Role | Instructions |
+|----------|------|-------------|
+| **PIPE_MTE2** | DMA inbound: GM → UB | `copy_gm_to_ubuf` |
+| **PIPE_V** | Vector compute: UB ↔ vreg + operations | All `pto.v*` instructions |
+| **PIPE_MTE3** | DMA outbound: UB → GM | `copy_ubuf_to_gm` |
+
+Synchronization between these pipelines is explicit via `pto.set_flag` / `pto.wait_flag` or `pto.get_buf` / `pto.rls_buf`.
+
+## Element Types (Extended, A5)
+
+The `vreg<NxT>` type has exactly 256 bytes total (2048 bits). `N × bitwidth(T) = 2048`:
+
+| Type | Bits | Description |
+|------|------|-------------|
+| `i8` / `si8` / `ui8` | 8 | Signless/signed/unsigned 8-bit integer |
+| `i16` / `si16` / `ui16` | 16 | Signless/signed/unsigned 16-bit integer |
+| `i32` / `si32` / `ui32` | 32 | Signless/signed/unsigned 32-bit integer |
+| `i64` / `si64` / `ui64` | 64 | Signless/signed/unsigned 64-bit integer |
+| `f16` | 16 | IEEE 754 half precision |
+| `bf16` | 16 | Brain floating point |
+| `f32` | 32 | IEEE 754 single precision |
+| `f8e4m3` | 8 | Float8 E4M3 (A5+) |
+| `f8e5m2` | 8 | Float8 E5M2 (A5+) |
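+
+The relation `N × bitwidth(T) = 2048` fixes the lane count for every element type. The following is a minimal Python sketch of that arithmetic only; the helper name `lanes_per_vreg` and the `BITWIDTH` table are illustrative and are not part of any PTO API.
+
+```python
+VREG_BITS = 2048  # 256 bytes per vector register, as stated above
+
+# Bit widths copied from the element-type table; signed and unsigned
+# variants share the width of their signless counterpart.
+BITWIDTH = {
+    "i8": 8, "i16": 16, "i32": 32, "i64": 64,
+    "f16": 16, "bf16": 16, "f32": 32,
+    "f8e4m3": 8, "f8e5m2": 8,
+}
+
+
+def lanes_per_vreg(elem_type: str) -> int:
+    """Return N such that N * bitwidth(elem_type) == VREG_BITS."""
+    return VREG_BITS // BITWIDTH[elem_type]
+
+
+for name in ("f32", "f16", "i8"):
+    print(name, lanes_per_vreg(name))  # f32 64, f16 128, i8 256
+```
+
+For example, `lanes_per_vreg("f32")` is 64, which matches the `!pto.vreg<64xf32>` type used in the examples later in this document.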
+
+### Predicate Masks
+
+The mask type `!pto.mask<G>` models an A5 predicate register (256-bit) under a typed granularity view.
+
+`G` MUST be one of `b32`, `b16`, `b8`:
+
+| Mask Type | Bytes/Element | Typical Element | Logical Lanes |
+|-----------|--------------|-----------------|---------------|
+| `!pto.mask<b32>` | 4 | `f32` / `i32` | 64 |
+| `!pto.mask<b16>` | 2 | `f16` / `bf16` / `i16` | 128 |
+| `!pto.mask<b8>` | 1 | 8-bit family | 256 |
+
+The physical predicate register is always 256 bits. The `G` parameter records how VPTO interprets the register for matching mask-producing and mask-consuming ops, and for verifier legality rules.
+
+**Predication Mode — ZEROING:** Inactive lanes produce zero, not preserved destination values:
+
+```c
+dst[i] = mask[i] ? op(src0[i], src1[i]) : 0    // ZEROING mode
+```
+
+The typed granularity view `!pto.mask<G>` is intentionally different from a lane-vector model such as `mask<64xi1>`.
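+
+The ZEROING rule can be modelled on plain Python lists. This is a scalar reference sketch under the assumption that each list entry stands for one logical lane; the function names are hypothetical, and the merging variant is included only as a contrast case.
+
+```python
+def predicated_zeroing(op, src0, src1, mask):
+    """Apply op lane by lane; inactive lanes (mask bit 0) yield 0."""
+    return [op(a, b) if m else 0 for a, b, m in zip(src0, src1, mask)]
+
+
+def predicated_merging(op, src0, src1, mask, dst):
+    """Contrast case: inactive lanes keep the previous destination value."""
+    return [op(a, b) if m else d for a, b, m, d in zip(src0, src1, mask, dst)]
+
+
+def add(a, b):
+    return a + b
+
+
+src0 = [1, 2, 3, 4]
+src1 = [10, 20, 30, 40]
+mask = [1, 0, 1, 0]
+
+print(predicated_zeroing(add, src0, src1, mask))                # [11, 0, 33, 0]
+print(predicated_merging(add, src0, src1, mask, [7, 7, 7, 7]))  # [11, 7, 33, 7]
+```
+
+The only difference between the two models is the value written to lanes whose mask bit is 0.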
+
+## Register and Memory Organization
+
+Each 256-byte vector register (`vreg`) is organized as **8 VLanes** of 32 bytes each. A VLane is the atomic unit for group reduction operations.
+
+```
+vreg (256 bytes total):
+┌─────────┬─────────┬─────────┬─────┬─────────┬─────────┐
+│ VLane 0 │ VLane 1 │ VLane 2 │ ... │ VLane 6 │ VLane 7 │
+│   32B   │   32B   │   32B   │     │   32B   │   32B   │
+└─────────┴─────────┴─────────┴─────┴─────────┴─────────┘
+```
+
+**Elements per VLane by data type:**
+
+| Data Type | Elements/VLane | Total Elements/vreg |
+|-----------|---------------|-------------------|
+| `i8`/`u8` | 32 | 256 |
+| `i16`/`u16`/`f16`/`bf16` | 16 | 128 |
+| `i32`/`u32`/`f32` | 8 | 64 |
+| `i64`/`u64` | 4 | 32 |
+
+### Unified Buffer (UB)
+
+- **Capacity:** 256 KB on-chip SRAM per core
+- **Address space:** `!pto.ptr<T, ub>` distinguishes UB from GM (`!pto.ptr<T, gm>`)
+- **Role:** Staging area between DMA and vector registers; the only valid source for vector load instructions
+
+### Predicate Masks
+
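+- **Type:** `!pto.mask<G>` (256-bit physical predicate register on A5)
+- **Width:** The logical lane count of the mask must match the element count `N` of the vector it predicates
+- **Semantics:** Mask bit `1` = lane active; `0` = lane inactive (zeroed under ZEROING predication; merging forms preserve the destination)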
+
+## Instruction Groups and Timing
+
+### Group 3: Vector Load/Store (`pto.vlds`, `pto.vsts`, etc.)
+
+**Category:** UB ↔ Vector Register data movement
+**Pipeline:** PIPE_V
+
+Vector loads move data from UB to vector registers; stores move data from vector registers back to UB. All vector compute operates only on `vreg`; UB staging is explicit.
+
+#### Common Operand Model
+
+- `%source`/`%dest`: base address in UB space (`!pto.ptr<T, ub>`)
+- `%offset`: displacement; encoding is instruction-specific
+- `%mask`: predicate operand; inactive lanes do not issue memory requests
+- `!pto.align`: alignment state for unaligned operations
+
+#### A5 Latency and Throughput
+
+The timings below are issue→retire cycle counts from the cycle-accurate simulator (Ascend910_9599 CA). **These are simulator results, not guaranteed silicon values.**
+
+| Instruction | A5 mnemonic | Mode / note | Issue→Retire (cycles) |
+|-------------|-------------|-------------|----------------------|
+| `pto.vlds` | `RV_VLD` | `dist:NORM` / `NORMAL` | **9** |
+| `pto.vldsx2` | `RV_VLDI` | `dist:DINTLV` (dual vreg) | **9** |
+| `pto.vsts` / `pto.vstx2` | `RV_VST` / `RV_VSTI` | `dist:NORM` | **9** (12 for `INTLV`) |
+| `pto.vgather2` | `RV_VGATHER2` | `Dtype: B32` | **27–28** |
+| `pto.vgatherb` | `RV_VGATHERB` | indexed byte gather | **~21** |
+| `pto.vscatter` | `RV_VSCATTER` | `Dtype: B16` | **~17** |
+| `pto.vadd` | `RV_VADD` | F32 between UB-backed ops | **7** |
+
+**Dual-issue capability:** `pto.vlds` is dual-issue capable: two independent `vlds` can issue in the same cycle. Alternatively, one `vlds` and one `vsts` can issue together in the same cycle (**1+1**). These two modes are mutually exclusive.
+
+**Throughput summary:**
+
+| `dist:` token (load) | RV op | Cycles |
+|----------------------|-------|--------|
+| `NORM`, `UNPK`, `DINTLV`, `BRC`, `BRC_BLK`, `BDINTLV`, `US`, `DS`, `SPLT4CHN`, `SPLT2CHN` | `RV_VLD`/`RV_VLDI` | **9** |
+| `NORM`/`PK` (store) | `RV_VSTI` | **9** |
+| `INTLV` (`vstx2`) | `RV_VSTI` | **12** |
+
+**Gather/scatter latency:**
+
+| PTO op | A5-level | Latency |
+|--------|----------|---------|
+| `pto.vgather2` | `RV_VGATHER2` | 27–28 cycles (pattern-dependent) |
+| `pto.vgather2_bc` | broadcast gather | 27–28 cycles |
+| `pto.vgatherb` | `RV_VGATHERB` | ~21 cycles |
+| `pto.vscatter` | `RV_VSCATTER` | ~17 cycles (B16) |
+
+### Group 6: Unary Vector Ops (`pto.vabs`, `pto.vexp`, etc.)
+
+**Category:** Single-input element-wise operations
+**Pipeline:** PIPE_V
+
+#### Operand Model
+
+- `%input`: source vector register
+- `%mask`: predicate mask (active lanes participate)
+- `%result`: destination vector register (same width/type as input unless specified)
+
+Zeroing forms (`-z` suffix variants) zero-fill inactive lanes; merging forms preserve destination values.
+
+#### A5 Latency (Cycle-Accurate Simulator)
+
+| PTO op | RV mnemonic | fp32 | fp16 | bf16 | Types |
+|--------|-------------|------|------|------|-------|
+| `pto.vabs` | `RV_VABS_FP` | **5** | **5** | — | i8–i32, f16, f32 |
+| `pto.vneg` | `RV_VMULS` | **8** | **8** | — | i8–i32, f16, f32 |
+| `pto.vexp` | `RV_VEXP` | **16** | **21** | — | f16, f32 |
+| `pto.vln` | `RV_VLN` | **18** | **23** | — | f16, f32 |
+| `pto.vsqrt` | `RV_VSQRT` | **17** | **22** | — | f16, f32 |
+| `pto.vrelu` | `RV_VRELU` | **5** | **5** | — | i8–i32, f16, f32 |
+| `pto.vnot` | `RV_VNOT` | — | int-only | — | integer types |
+| `pto.vmov` | `RV_VLD` proxy | **9** | **9** | — | all |
+
+**Notes:**
+- Integer overflow follows ISA default truncation for `pto.vabs`.
+- `vexp`, `vln`, and `vsqrt` are hardware-accelerated SFU operations.
+
+### Group 7: Binary Vector Ops (`pto.vadd`, `pto.vmul`, etc.)
+
+**Category:** Two-input element-wise operations
+**Pipeline:** PIPE_V
+
+#### Operand Model
+
+- `%lhs`, `%rhs`: source vector registers
+- `%mask`: predicate mask
+- `%result`: destination (same width and type as sources)
+
+#### A5 Latency
+
+| PTO op | RV mnemonic | fp32 | fp16 | i32 | i16 | i8 |
+|--------|-------------|------|------|-----|-----|----|
+| `pto.vadd` | `RV_VADD` | **7** | **7** | **7** | **7** | **7** |
+| `pto.vsub` | `RV_VSUB` | **7** | **7** | **7** | **7** | **7** |
+| `pto.vmul` | `RV_VMUL` | **7** | **7** | **7** | **7** | **7** |
+| `pto.vdiv` | `RV_VDIV` | ~**14–28** | varies | — | — | — |
+| `pto.vmax`/`pto.vmin` | `RV_VMAX`/`RV_VMIN` | **7** | **7** | **7** | **7** | **7** |
+| `pto.vand`/`pto.vor`/`pto.vxor` | bitwise | **7** | **7** | **7** | **7** | **7** |
+| `pto.vshl`/`pto.vshr` | shift | **7** | **7** | **7** | **7** | **7** |
+| `pto.vaddc` | carry add | **7–10** | — | — | — | — |
+| `pto.vsubc` | borrow subtract | **7–10** | — | — | — | — |
+
+**Throughput:** Binary ops have 2× the per-repeat throughput of unary ops. Dual-issue paths exist for certain type combinations.
+
+### Group 8: Vec-Scalar Ops (`pto.vadds`, `pto.vmuls`, etc.)
+
+**Category:** Vector combined with scalar operand
+**Pipeline:** PIPE_V
+
+Vector-scalar operations broadcast a scalar to all lanes before computing.
+
+#### Operand Model
+
+- `%vector`: vector register
+- `%scalar`: scalar value (broadcast to all lanes)
+- `%mask`: predicate mask
+
+#### Latency
+
+Similar to binary ops but with scalar broadcast overhead (typically +1–2 cycles depending on type).
+
+### Group 9: Conversion Ops (`pto.vcvt`, `pto.vtrc`)
+
+**Category:** Type conversion with rounding/saturation control
+**Pipeline:** PIPE_V
+
+#### Key Instructions
+
+- `pto.vci`: Convert with implementation-defined rounding
+- `pto.vcvt`: Explicit rounding mode controlled by attribute
+- `pto.vtrc`: Truncate toward zero (round-to-nearest-even vs truncation)
+
+**Rounding modes:** `RN` (round-to-nearest-even), `RZ` (round-toward-zero), `RP` (round-toward-positive), `RM` (round-toward-negative)
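+
+The four rounding modes can be illustrated with Python's standard library. This is a sketch of the rounding rules for the float-to-integer case only; the function name `round_to_int` is hypothetical, and it does not model saturation, NaN handling, or float-to-float narrowing on the hardware.
+
+```python
+import math
+
+
+def round_to_int(x: float, mode: str) -> int:
+    """Round x to an integer using one of the four modes listed above."""
+    if mode == "RN":
+        return round(x)       # Python's round() breaks ties to the even integer
+    if mode == "RZ":
+        return math.trunc(x)  # toward zero
+    if mode == "RP":
+        return math.ceil(x)   # toward positive infinity
+    if mode == "RM":
+        return math.floor(x)  # toward negative infinity
+    raise ValueError(f"unknown rounding mode: {mode}")
+
+
+for mode in ("RN", "RZ", "RP", "RM"):
+    print(mode, round_to_int(2.5, mode), round_to_int(-2.5, mode))
+# RN 2 -2
+# RZ 2 -2
+# RP 3 -2
+# RM 2 -3
+```
+
+The tie case `2.5` separates `RN` from the directed modes: `RN` picks the even neighbour, while `RP` and `RM` always move in one direction.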
+
+### Group 10: Reduction Ops (`pto.vcadd`, `pto.vcmax`, etc.)
+
+**Category:** Cross-lane reduction and prefix operations
+**Pipeline:** PIPE_V
+
+Reductions combine elements across lanes. Two categories:
+
+- **Full vector reductions** (`vcadd`, `vcmax`, `vcmin`): Reduce entire vector to a single value distributed to all lanes
+- **Per-VLane (Group) reductions** (`vcgadd`, `vcgmax`, `vcgmin`, `vcpadd`): Reduce within each VLane (a 32-byte chunk, for example 8 `f32` elements)
+
+**VLane grouping:** The 256-byte vector register is logically split into 8 VLanes of 32 bytes each; group reductions operate within each VLane independently.
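+
+The difference between the two reduction categories can be sketched on plain Python lists. This is a reference model under stated assumptions: a 256-byte register with 32-byte VLanes as described in the register layout earlier in this document, hypothetical function names, and no attempt to model where the hardware places the per-VLane results in the destination register.
+
+```python
+VREG_BYTES = 256
+VLANE_BYTES = 32
+
+
+def full_reduce_add(vec):
+    """Full reduction: one sum, replicated to every lane."""
+    total = sum(vec)
+    return [total] * len(vec)
+
+
+def group_reduce_add(vec, elem_bytes):
+    """Per-VLane reduction: one partial sum for each 32-byte VLane."""
+    if len(vec) * elem_bytes != VREG_BYTES:
+        raise ValueError("vector does not fill a 256-byte register")
+    per_vlane = VLANE_BYTES // elem_bytes
+    return [sum(vec[i:i + per_vlane]) for i in range(0, len(vec), per_vlane)]
+
+
+vec = list(range(64))                 # 64 four-byte elements
+print(full_reduce_add(vec)[0])        # 2016
+print(group_reduce_add(vec, 4)[:2])   # [28, 92]
+```
+
+With 4-byte elements each VLane holds 8 values, so the group reduction yields 8 partial sums; with 2-byte elements each VLane would hold 16 values.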
+
+### Group 11: Compare & Select (`pto.vcmp`, `pto.vsel`, etc.)
+
+**Category:** Comparison and conditional selection
+**Pipeline:** PIPE_V
+
+- `pto.vcmp`: Compare two vectors, produce predicate mask
+- `pto.vcmps`: Compare vector with scalar
+- `pto.vsel`: Select between two vectors based on predicate
+- `pto.vselr`: Select with reversal (invert condition)
+- `pto.vselrv2`: Select variant (not available on A5)
+
+### Group 12: Data Rearrangement (`pto.vintlv`, `pto.vdintlv`, etc.)
+
+**Category:** In-register data movement and permutation
+**Pipeline:** PIPE_V
+
+Lane-level permutation, interleave/deinterleave, pack/unpack operations.
+
+**Available on A5:** `vintlv`, `vdintlv`
+**Not available on A5:** `vintlvv2`, `vdintlvv2` (removed from the A5 surface)
+
+### Group 13: DSA/SFU Ops (`pto.vprelu`, `pto.vexpdiff`, `pto.vaxpy`, `pto.vsort32`, etc.)
+
+**Category:** Specialized domain-specific and special function unit operations
+**Pipeline:** PIPE_V
+
+Fused operations, transcendental helpers, sorting, and index generation:
+
+| Instruction | Semantics |
+|-------------|-----------|
+| `pto.vprelu` | Parametric ReLU with broadcast scalar slope |
+| `pto.vexpdiff` | `exp(src0 - src1)` — fused exponential difference |
+| `pto.vaddrelu` | `max(src0 + src1, 0)` — fused add + ReLU |
+| `pto.vsubrelu` | `max(src0 - src1, 0)` — fused subtract + ReLU |
+| `pto.vaxpy` | `a*x + y` — fused multiply-add |
+| `pto.vaddreluconv` | `max(src0 + src1, 0) + src2` — complex fused pattern |
+| `pto.vmulconv` | `(src0 * src1) + src2` — multiply + add |
+| `pto.vmull` | Extended-precision multiply (wider product) |
+| `pto.vmula` | Multiply-accumulate variant |
+| `pto.vtranspose` | In-register matrix transpose (4×4 or 8×8 blocks) |
+| `pto.vsort32` | Sort 32-element vector block |
+| `pto.vbitsort` | Bitonic sort variant |
+| `pto.vmrgsort` | Merge sort for pre-sorted sequences |
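+
+The per-element formulas for some of the fused ops in the table can be written as a scalar reference in Python. This is a sketch for checking expected values only: the function names mirror the op names for readability, the `elementwise` helper is hypothetical, and predication and type-specific rounding are not modelled.
+
+```python
+import math
+
+
+def vaddrelu(a, b):
+    return max(a + b, 0)     # fused add + ReLU
+
+
+def vsubrelu(a, b):
+    return max(a - b, 0)     # fused subtract + ReLU
+
+
+def vexpdiff(a, b):
+    return math.exp(a - b)   # fused exponential difference
+
+
+def vaxpy(alpha, x, y):
+    return alpha * x + y     # a*x + y
+
+
+def elementwise(fn, *vectors):
+    """Apply a scalar reference function lane by lane."""
+    return [fn(*lanes) for lanes in zip(*vectors)]
+
+
+print(elementwise(vaddrelu, [1.0, 2.0], [-3.0, 0.5]))  # [0, 2.5]
+print(elementwise(vsubrelu, [5, 1], [2, 4]))           # [3, 0]
+```
+
+A model like this is useful as a golden reference when comparing simulator or device output for a fused op against the unfused sequence.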
+
+## Execution Scope (`__VEC_SCOPE__`)
+
+Vector instructions must be enclosed in a `pto.vecscope` or `pto.strict_vecscope` region. This defines the vector execution interval and establishes producer-consumer ordering with surrounding DMA operations.
+
+```mlir
+pto.vecscope {
+  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask<b32>
+  %v = pto.vlds %ub[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
+  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
+  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
+}
+```
+
+**`pto.strict_vecscope`** requires all vector-scope inputs to be explicit region arguments, rejecting implicit capture.
+
+## Synchronization Patterns
+
+### Producer-Consumer (MTE2 → Vector)
+
+```mlir
+// DMA: GM → UB
+pto.copy_gm_to_ubuf %gm, %ub, ...
+// Signal: MTE2 → Vector
+pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
+// Wait: Vector sees data
+pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
+pto.vecscope {
+  %v = pto.vlds %ub[...]
+  // compute
+}
+```
+
+### Producer-Consumer (Vector → MTE3)
+
+```mlir
+pto.vecscope {
+  // vector compute
+  pto.vsts %v, %ub, %mask
+}
+// Signal: Vector → MTE3
+pto.set_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
+// Wait: MTE3 sees data
+pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
+// DMA: UB → GM
+pto.copy_ubuf_to_gm %ub, %gm, ...
+```
+
+### Barrier Synchronization
+
+`pto.pipe_barrier "PIPE_V"` drains all pending vector operations. Use when ordering within a single pipeline matters (e.g., two stores to the same GM address).
+
+### Resource-Based Sync (`get_buf`/`rls_buf`)
+
+Buffer IDs provide finer-grained producer-consumer coordination than event flags. The producer calls `rls_buf` after writing; the consumer calls `get_buf` before reading. This tracks *which* buffer slot is ready, not just *that* something is ready.
+
+## Memory Barriers Within VecScope
+
+When UB addresses alias between vector load/store operations within the same `vecscope`, explicit memory barriers are required:
+
+```mlir
+pto.mem_bar "VV_ALL"      // All prior vector ops complete before subsequent
+pto.mem_bar "VST_VLD"     // All prior vector stores visible before subsequent loads
+pto.mem_bar "VLD_VST"     // All prior vector loads complete before subsequent stores
+```
+
+Without proper barriers, loads may see stale data or stores may be reordered.
+
+## Common Usage Patterns
+
+### Full-Vector Compute (All Lanes Active)
+
+```cpp
+Mask<64> mask;
+mask.set_all(true);
+VADD(vdst, va, vb, mask);
+```
+
+### Partial Predication
+
+```mlir
+%result = pto.vadd %va, %vb, %cond_mask : (!pto.vreg<128xf16>, ...) -> !pto.vreg<128xf16>
+```
+
+Only lanes where `%cond_mask` is true participate. Inactive lanes are zeroed under ZEROING predication; merging forms preserve the destination.
+
+### Double-Buffering (Ping-Pong)
+
+```mlir
+// Buffer A
+pto.copy_gm_to_ubuf %gm_a, %ub_a, ...
+pto.set_flag["PIPE_MTE2", "PIPE_V", "EVT_A"]
+// Buffer B (overlap with A copy)
+pto.copy_gm_to_ubuf %gm_b, %ub_b, ...
+pto.set_flag["PIPE_MTE2", "PIPE_V", "EVT_B"]
+
+scf.for %iter = 0 to %N step 1 {
+  pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVT_A"]
+  pto.vecscope { pto.vlds %v_a, %ub_a[...] ... compute ... pto.vsts ... }
+  pto.set_flag["PIPE_V", "PIPE_MTE3", "EVT_A"]
+
+  // Overlap compute on A with copy for B
+  pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVT_B"]
+  // ... process B ...
+}
+```
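+
+The slot rotation behind ping-pong buffering can be sketched independently of the PTO ops. This is a host-side Python model of the schedule only, with hypothetical names; it does not model event flags, DMA, or the vector pipeline. Tile `i` is computed from slot `i % 2` while the next tile is prefetched into the other slot.
+
+```python
+def ping_pong_schedule(num_tiles):
+    """Return (tile, compute_slot, prefetch_tile, prefetch_slot) for each step."""
+    steps = []
+    for i in range(num_tiles):
+        slot = i % 2
+        nxt = i + 1 if i + 1 < num_tiles else None
+        prefetch_slot = None if nxt is None else 1 - slot
+        steps.append((i, slot, nxt, prefetch_slot))
+    return steps
+
+
+for step in ping_pong_schedule(3):
+    print(step)
+# (0, 0, 1, 1)
+# (1, 1, 2, 0)
+# (2, 0, None, None)
+```
+
+In a real kernel, each prefetch needs its own signal and wait pair per iteration so that a slot is not overwritten before the compute step that reads it has finished.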
+
+## Type Support Matrix (A5)
+
+| Element Type | Vector Load/Store | Unary | Binary | Vec-Scalar | Conversion | Reduction |
+|--------------|------------------|-------|--------|------------|------------|-----------|
+| `f32` | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `f16` | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `bf16` | ✓ (limited) | — | — | — | — | — |
+| `i8`/`u8` | ✓ | ✓ | ✓ | ✓ | ✓ (with restrictions) | ✓ |
+| `i16`/`u16` | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `i32`/`u32` | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `i64`/`u64` | — | — | — | — | — | — |
+| `f8e4m3`/`f8e5m2` | ✓ (A5+) | A5+ | A5+ | A5+ | A5+ | A5+ |
+
+**Note:** BF16 coverage varies by instruction; integer types support saturating variants where documented.
+
+## Performance Tuning Tips
+
+1. **Maximize dual-issue:** Structure loops to issue two `vlds` per cycle or `vlds`+`vsts` pairs.
+2. **Hide latency with double-buffering:** Overlap `copy_gm_to_ubuf` with vector compute using two buffer slots.
+3. **Predicate early:** Apply masks as early as possible to avoid unnecessary computation on inactive lanes.
+4. **Align UB addresses:** Misaligned accesses incur alignment-state overhead; prefer 32B alignment for contiguous patterns.
+5. **Group reductions by VLane:** Use group reduction ops (`vcg*`, `vcpadd`) when data locality within a 32-byte VLane is exploitable.
+6. **Prefer fused ops:** `vaddrelu`, `vexpdiff`, `vaxpy` reduce register pressure and instruction count.
+7. **Gather/scatter sparingly:** `vgather2` latency is ~27–28 cycles; ensure sufficient computation to amortize.
+
+## Relationship to Tile Instructions
+
+Vector instructions provide fine-grained control where tile instructions expose higher-level tile semantics:
+
+| Aspect | Tile (`pto.t*`) | Vector (`pto.v*`) |
+|--------|----------------|------------------|
+| **Abstraction** | Tile buffers with valid regions | Raw vector registers, no valid region |
+| **Data movement** | Implicit TLOAD/TSTORE (GM → UB → tile) | Explicit `copy_gm_to_ubuf` + `vlds`/`vsts` |
+| **Predication** | No per-lane masking | Full per-lane predicate on every op |
+| **Typical use** | High-level tensor algebra | Hand-tuned kernels, per-element control |
+| **Target profile** | All (A2/A3/A5/CPU) | A5 native; emulated on others |
+
+Many high-level tile operations are internally lowered to vector micro-instructions plus DMA orchestration. Understanding both levels enables hand-tuning critical kernels.
+
+## References
+
+- **[PTO ISA Manual](../README.md)** — Full instruction reference
+- **[Vector Instruction Reference](../vector/README.md)** — Per-op reference pages
+- **[Format of Instruction Descriptions](../reference/format-of-instruction-descriptions.md)** — Page structure standard
+- **[PTO-Gym micro-instruction SPEC](https://github.com/PTO-ISA/PTO-Gym/blob/main/docs/PTO-micro-Instruction-SPEC.md)** — Source specification
diff --git a/docs/machine/README_zh.md b/docs/machine/README_zh.md
index 7b7b3c75..8b66e036 100644
--- a/docs/machine/README_zh.md
+++ b/docs/machine/README_zh.md
@@ -1,6 +1,12 @@
-# PTO 抽象机器（Abstract Machine）
+# PTO Abstract Machine
 
-本目录定义 PTO ISA 与 PTO Tile Lib 使用的抽象执行模型。该模型刻意站在“写代码的人”的视角：描述正确程序可以依赖的行为假设，而不要求读者理解每一代设备的所有微架构细节。
+本目录定义 PTO ISA 与 PTO Tile Lib 使用的**抽象执行模型**。该模型刻意站在"写代码的人"的视角：描述正确程序可以依赖的行为假设，而不要求读者理解每一代设备的微架构细节。
+
+## 按任务选择
+
+| 你的需求 | 从这里开始 |
+|----------|----------|
+| 理解 core / device / host 分层 | [PTO 机器模型](abstract-machine_zh.md) |
 
 ## 文档
 
@@ -8,9 +14,15 @@
 
 ## 与其他文档的关系
 
-- 数据模型与编程模型：
-  - [Tiles](../coding/Tile_zh.md)
-  - [全局内存张量](../coding/GlobalTensor_zh.md)
-  - [事件与同步](../coding/Event_zh.md)
-  - [标量值与枚举](../coding/Scalar_zh.md)
-- [指令参考](../isa/README_zh.md)
+| 相关文档 | 内容 |
+|----------|------|
+| [docs/coding/Tile_zh.md](../coding/Tile_zh.md) | Tile 抽象与编程模型 |
+| [docs/coding/GlobalTensor_zh.md](../coding/GlobalTensor_zh.md) | 全局内存张量 |
+| [docs/coding/Event_zh.md](../coding/Event_zh.md) | 事件与同步 |
+| [docs/coding/Scalar_zh.md](../coding/Scalar_zh.md) | 标量值与枚举 |
+| [docs/isa/README_zh.md](../isa/README_zh.md) | 指令参考 |
+
+## 相关文档
+
+- [docs/README_zh.md](../../README_zh.md) — 文档总入口
+- [docs/isa/README_zh.md](../isa/README_zh.md) — ISA 参考入口
diff --git a/docs/mkdocs/mkdocs.yml b/docs/mkdocs/mkdocs.yml
index 32c7ca34..4e40ac6b 100644
--- a/docs/mkdocs/mkdocs.yml
+++ b/docs/mkdocs/mkdocs.yml
@@ -7,7 +7,7 @@ theme:
   name: readthedocs
   language: en
 
-docs_dir: src
+docs_dir: ../
 site_dir: ../../site
 exclude_docs: |
   /README.md
@@ -79,403 +79,404 @@ extra_css:
 
 nav:
   - 1. Introduction:
-      - Introduction: docs/isa/introduction/what-is-pto-visa.md
-      - Document Structure: docs/isa/introduction/document-structure.md
-      - Goals Of PTO: docs/isa/introduction/goals-of-pto.md
-      - PTO ISA Version 1.0: docs/isa/introduction/pto-isa-version-1-0.md
-      - Scope And Boundaries: docs/isa/introduction/design-goals-and-boundaries.md
+      - Introduction: isa/introduction/what-is-pto-visa.md
+      - Document Structure: isa/introduction/document-structure.md
+      - Goals Of PTO: isa/introduction/goals-of-pto.md
+      - PTO ISA Version 1.0: isa/introduction/pto-isa-version-1-0.md
+      - Scope And Boundaries: isa/introduction/design-goals-and-boundaries.md
   - 2. Programming Model:
-      - Tiles And Valid Regions: docs/isa/programming-model/tiles-and-valid-regions.md
-      - GlobalTensor And Data Movement: docs/isa/programming-model/globaltensor-and-data-movement.md
-      - Auto Vs Manual: docs/isa/programming-model/auto-vs-manual.md
+      - Tiles And Valid Regions: isa/programming-model/tiles-and-valid-regions.md
+      - GlobalTensor And Data Movement: isa/programming-model/globaltensor-and-data-movement.md
+      - Auto Vs Manual: isa/programming-model/auto-vs-manual.md
   - 3. Machine Model:
-      - Execution Agents And Target Profiles: docs/isa/machine-model/execution-agents.md
-      - Ordering And Synchronization: docs/isa/machine-model/ordering-and-synchronization.md
+      - Execution Agents And Target Profiles: isa/machine-model/execution-agents.md
+      - Ordering And Synchronization: isa/machine-model/ordering-and-synchronization.md
   - 4. Syntax And Operands:
-      - Assembly Spelling And Operands: docs/isa/syntax-and-operands/assembly-model.md
-      - Operands And Attributes: docs/isa/syntax-and-operands/operands-and-attributes.md
-      - Common Conventions: docs/isa/conventions.md
+      - Assembly Spelling And Operands: isa/syntax-and-operands/assembly-model.md
+      - Operands And Attributes: isa/syntax-and-operands/operands-and-attributes.md
+      - Common Conventions: isa/conventions.md
   - 5. State And Types:
-      - Type System: docs/isa/state-and-types/type-system.md
-      - Layout Reference: docs/isa/state-and-types/layout.md
-      - Data Format Reference: docs/isa/state-and-types/data-format.md
-      - Location Intent And Legality: docs/isa/state-and-types/location-intent-and-legality.md
+      - Type System: isa/state-and-types/type-system.md
+      - Layout Reference: isa/state-and-types/layout.md
+      - Data Format Reference: isa/state-and-types/data-format.md
+      - Location Intent And Legality: isa/state-and-types/location-intent-and-legality.md
   - 6. Memory Model:
-      - Consistency Baseline: docs/isa/memory-model/consistency-baseline.md
-      - Producer Consumer Ordering: docs/isa/memory-model/producer-consumer-ordering.md
+      - Consistency Baseline: isa/memory-model/consistency-baseline.md
+      - Producer Consumer Ordering: isa/memory-model/producer-consumer-ordering.md
   - 7. Instruction Set Overview:
-      - Overview: docs/isa/instruction-surfaces/README.md
-      - Tile Instructions: docs/isa/instruction-surfaces/tile-instructions.md
-      - Vector Instructions: docs/isa/instruction-surfaces/vector-instructions.md
-      - Scalar And Control Instructions: docs/isa/instruction-surfaces/scalar-and-control-instructions.md
-      - Other Instructions: docs/isa/instruction-surfaces/other-instructions.md
+      - Overview: isa/instruction-surfaces/README.md
+      - Tile Instructions: isa/instruction-surfaces/tile-instructions.md
+      - Vector Instructions: isa/instruction-surfaces/vector-instructions.md
+      - Scalar And Control Instructions: isa/instruction-surfaces/scalar-and-control-instructions.md
+      - Other Instructions: isa/instruction-surfaces/other-instructions.md
   - 8. Instruction Set Contracts:
-      - Overview: docs/isa/instruction-families/README.md
-      - Tile Instruction Set: docs/isa/instruction-families/tile-families.md
-      - Vector Instruction Set: docs/isa/instruction-families/vector-families.md
-      - Scalar And Control Instruction Set: docs/isa/instruction-families/scalar-and-control-families.md
-      - Other Instruction Set: docs/isa/instruction-families/other-families.md
+      - Overview: isa/instruction-families/README.md
+      - Tile Instruction Set: isa/instruction-families/tile-families.md
+      - Vector Instruction Set: isa/instruction-families/vector-families.md
+      - Scalar And Control Instruction Set: isa/instruction-families/scalar-and-control-families.md
+      - Other Instruction Set: isa/instruction-families/other-families.md
   - 9. Tile Instruction Reference:
-          - Overview: docs/isa/tile/README.md
+          - Overview: isa/tile/README.md
           - Sync And Config:
-              - Instruction Set Contract: docs/isa/tile/sync-and-config.md
-              - pto.tsync: docs/isa/tile/ops/sync-and-config/tsync.md
-              - pto.tassign: docs/isa/tile/ops/sync-and-config/tassign.md
-              - pto.tsethf32mode: docs/isa/tile/ops/sync-and-config/tsethf32mode.md
-              - pto.tsettf32mode: docs/isa/tile/ops/sync-and-config/tsettf32mode.md
-              - pto.tsetfmatrix: docs/isa/tile/ops/sync-and-config/tsetfmatrix.md
-              - pto.tset_img2col_rpt: docs/isa/tile/ops/sync-and-config/tset-img2col-rpt.md
-              - pto.tset_img2col_padding: docs/isa/tile/ops/sync-and-config/tset-img2col-padding.md
-              - pto.tsubview: docs/isa/tile/ops/sync-and-config/tsubview.md
-              - pto.tget_scale_addr: docs/isa/tile/ops/sync-and-config/tget-scale-addr.md
+              - Instruction Set Contract: isa/tile/sync-and-config.md
+              - pto.tsync: isa/tile/ops/sync-and-config/tsync.md
+              - pto.tassign: isa/tile/ops/sync-and-config/tassign.md
+              - pto.tsethf32mode: isa/tile/ops/sync-and-config/tsethf32mode.md
+              - pto.tsettf32mode: isa/tile/ops/sync-and-config/tsettf32mode.md
+              - pto.tsetfmatrix: isa/tile/ops/sync-and-config/tsetfmatrix.md
+              - pto.tset_img2col_rpt: isa/tile/ops/sync-and-config/tset-img2col-rpt.md
+              - pto.tset_img2col_padding: isa/tile/ops/sync-and-config/tset-img2col-padding.md
+              - pto.tsubview: isa/tile/ops/sync-and-config/tsubview.md
+              - pto.tget_scale_addr: isa/tile/ops/sync-and-config/tget-scale-addr.md
           - Elementwise Tile Tile:
-              - Instruction Set Contract: docs/isa/tile/elementwise-tile-tile.md
-              - pto.tadd: docs/isa/tile/ops/elementwise-tile-tile/tadd.md
-              - pto.tabs: docs/isa/tile/ops/elementwise-tile-tile/tabs.md
-              - pto.tand: docs/isa/tile/ops/elementwise-tile-tile/tand.md
-              - pto.tor: docs/isa/tile/ops/elementwise-tile-tile/tor.md
-              - pto.tsub: docs/isa/tile/ops/elementwise-tile-tile/tsub.md
-              - pto.tmul: docs/isa/tile/ops/elementwise-tile-tile/tmul.md
-              - pto.tmin: docs/isa/tile/ops/elementwise-tile-tile/tmin.md
-              - pto.tmax: docs/isa/tile/ops/elementwise-tile-tile/tmax.md
-              - pto.tcmp: docs/isa/tile/ops/elementwise-tile-tile/tcmp.md
-              - pto.tdiv: docs/isa/tile/ops/elementwise-tile-tile/tdiv.md
-              - pto.tshl: docs/isa/tile/ops/elementwise-tile-tile/tshl.md
-              - pto.tshr: docs/isa/tile/ops/elementwise-tile-tile/tshr.md
-              - pto.txor: docs/isa/tile/ops/elementwise-tile-tile/txor.md
-              - pto.tlog: docs/isa/tile/ops/elementwise-tile-tile/tlog.md
-              - pto.trecip: docs/isa/tile/ops/elementwise-tile-tile/trecip.md
-              - pto.tprelu: docs/isa/tile/ops/elementwise-tile-tile/tprelu.md
-              - pto.taddc: docs/isa/tile/ops/elementwise-tile-tile/taddc.md
-              - pto.tsubc: docs/isa/tile/ops/elementwise-tile-tile/tsubc.md
-              - pto.tcvt: docs/isa/tile/ops/elementwise-tile-tile/tcvt.md
-              - pto.tsel: docs/isa/tile/ops/elementwise-tile-tile/tsel.md
-              - pto.trsqrt: docs/isa/tile/ops/elementwise-tile-tile/trsqrt.md
-              - pto.tsqrt: docs/isa/tile/ops/elementwise-tile-tile/tsqrt.md
-              - pto.texp: docs/isa/tile/ops/elementwise-tile-tile/texp.md
-              - pto.tnot: docs/isa/tile/ops/elementwise-tile-tile/tnot.md
-              - pto.trelu: docs/isa/tile/ops/elementwise-tile-tile/trelu.md
-              - pto.tneg: docs/isa/tile/ops/elementwise-tile-tile/tneg.md
-              - pto.trem: docs/isa/tile/ops/elementwise-tile-tile/trem.md
-              - pto.tfmod: docs/isa/tile/ops/elementwise-tile-tile/tfmod.md
+              - Instruction Set Contract: isa/tile/elementwise-tile-tile.md
+              - pto.tadd: isa/tile/ops/elementwise-tile-tile/tadd.md
+              - pto.tabs: isa/tile/ops/elementwise-tile-tile/tabs.md
+              - pto.tand: isa/tile/ops/elementwise-tile-tile/tand.md
+              - pto.tor: isa/tile/ops/elementwise-tile-tile/tor.md
+              - pto.tsub: isa/tile/ops/elementwise-tile-tile/tsub.md
+              - pto.tmul: isa/tile/ops/elementwise-tile-tile/tmul.md
+              - pto.tmin: isa/tile/ops/elementwise-tile-tile/tmin.md
+              - pto.tmax: isa/tile/ops/elementwise-tile-tile/tmax.md
+              - pto.tcmp: isa/tile/ops/elementwise-tile-tile/tcmp.md
+              - pto.tdiv: isa/tile/ops/elementwise-tile-tile/tdiv.md
+              - pto.tshl: isa/tile/ops/elementwise-tile-tile/tshl.md
+              - pto.tshr: isa/tile/ops/elementwise-tile-tile/tshr.md
+              - pto.txor: isa/tile/ops/elementwise-tile-tile/txor.md
+              - pto.tlog: isa/tile/ops/elementwise-tile-tile/tlog.md
+              - pto.trecip: isa/tile/ops/elementwise-tile-tile/trecip.md
+              - pto.tprelu: isa/tile/ops/elementwise-tile-tile/tprelu.md
+              - pto.taddc: isa/tile/ops/elementwise-tile-tile/taddc.md
+              - pto.tsubc: isa/tile/ops/elementwise-tile-tile/tsubc.md
+              - pto.tcvt: isa/tile/ops/elementwise-tile-tile/tcvt.md
+              - pto.tsel: isa/tile/ops/elementwise-tile-tile/tsel.md
+              - pto.trsqrt: isa/tile/ops/elementwise-tile-tile/trsqrt.md
+              - pto.tsqrt: isa/tile/ops/elementwise-tile-tile/tsqrt.md
+              - pto.texp: isa/tile/ops/elementwise-tile-tile/texp.md
+              - pto.tnot: isa/tile/ops/elementwise-tile-tile/tnot.md
+              - pto.trelu: isa/tile/ops/elementwise-tile-tile/trelu.md
+              - pto.tneg: isa/tile/ops/elementwise-tile-tile/tneg.md
+              - pto.trem: isa/tile/ops/elementwise-tile-tile/trem.md
+              - pto.tfmod: isa/tile/ops/elementwise-tile-tile/tfmod.md
           - Tile Scalar And Immediate:
-              - Instruction Set Contract: docs/isa/tile/tile-scalar-and-immediate.md
-              - pto.texpands: docs/isa/tile/ops/tile-scalar-and-immediate/texpands.md
-              - pto.tcmps: docs/isa/tile/ops/tile-scalar-and-immediate/tcmps.md
-              - pto.tsels: docs/isa/tile/ops/tile-scalar-and-immediate/tsels.md
-              - pto.tmins: docs/isa/tile/ops/tile-scalar-and-immediate/tmins.md
-              - pto.tadds: docs/isa/tile/ops/tile-scalar-and-immediate/tadds.md
-              - pto.tsubs: docs/isa/tile/ops/tile-scalar-and-immediate/tsubs.md
-              - pto.tdivs: docs/isa/tile/ops/tile-scalar-and-immediate/tdivs.md
-              - pto.tmuls: docs/isa/tile/ops/tile-scalar-and-immediate/tmuls.md
-              - pto.tfmods: docs/isa/tile/ops/tile-scalar-and-immediate/tfmods.md
-              - pto.trems: docs/isa/tile/ops/tile-scalar-and-immediate/trems.md
-              - pto.tmaxs: docs/isa/tile/ops/tile-scalar-and-immediate/tmaxs.md
-              - pto.tands: docs/isa/tile/ops/tile-scalar-and-immediate/tands.md
-              - pto.tors: docs/isa/tile/ops/tile-scalar-and-immediate/tors.md
-              - pto.tshls: docs/isa/tile/ops/tile-scalar-and-immediate/tshls.md
-              - pto.tshrs: docs/isa/tile/ops/tile-scalar-and-immediate/tshrs.md
-              - pto.txors: docs/isa/tile/ops/tile-scalar-and-immediate/txors.md
-              - pto.tlrelu: docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu.md
-              - pto.taddsc: docs/isa/tile/ops/tile-scalar-and-immediate/taddsc.md
-              - pto.tsubsc: docs/isa/tile/ops/tile-scalar-and-immediate/tsubsc.md
+              - Instruction Set Contract: isa/tile/tile-scalar-and-immediate.md
+              - pto.texpands: isa/tile/ops/tile-scalar-and-immediate/texpands.md
+              - pto.tcmps: isa/tile/ops/tile-scalar-and-immediate/tcmps.md
+              - pto.tsels: isa/tile/ops/tile-scalar-and-immediate/tsels.md
+              - pto.tmins: isa/tile/ops/tile-scalar-and-immediate/tmins.md
+              - pto.tadds: isa/tile/ops/tile-scalar-and-immediate/tadds.md
+              - pto.tsubs: isa/tile/ops/tile-scalar-and-immediate/tsubs.md
+              - pto.tdivs: isa/tile/ops/tile-scalar-and-immediate/tdivs.md
+              - pto.tmuls: isa/tile/ops/tile-scalar-and-immediate/tmuls.md
+              - pto.tfmods: isa/tile/ops/tile-scalar-and-immediate/tfmods.md
+              - pto.trems: isa/tile/ops/tile-scalar-and-immediate/trems.md
+              - pto.tmaxs: isa/tile/ops/tile-scalar-and-immediate/tmaxs.md
+              - pto.tands: isa/tile/ops/tile-scalar-and-immediate/tands.md
+              - pto.tors: isa/tile/ops/tile-scalar-and-immediate/tors.md
+              - pto.tshls: isa/tile/ops/tile-scalar-and-immediate/tshls.md
+              - pto.tshrs: isa/tile/ops/tile-scalar-and-immediate/tshrs.md
+              - pto.txors: isa/tile/ops/tile-scalar-and-immediate/txors.md
+              - pto.tlrelu: isa/tile/ops/tile-scalar-and-immediate/tlrelu.md
+              - pto.taddsc: isa/tile/ops/tile-scalar-and-immediate/taddsc.md
+              - pto.tsubsc: isa/tile/ops/tile-scalar-and-immediate/tsubsc.md
           - Reduce And Expand:
-              - Instruction Set Contract: docs/isa/tile/reduce-and-expand.md
-              - pto.trowsum: docs/isa/tile/ops/reduce-and-expand/trowsum.md
-              - pto.tcolsum: docs/isa/tile/ops/reduce-and-expand/tcolsum.md
-              - pto.trowprod: docs/isa/tile/ops/reduce-and-expand/trowprod.md
-              - pto.tcolprod: docs/isa/tile/ops/reduce-and-expand/tcolprod.md
-              - pto.tcolmax: docs/isa/tile/ops/reduce-and-expand/tcolmax.md
-              - pto.tcolmin: docs/isa/tile/ops/reduce-and-expand/tcolmin.md
-              - pto.tcolargmax: docs/isa/tile/ops/reduce-and-expand/tcolargmax.md
-              - pto.tcolargmin: docs/isa/tile/ops/reduce-and-expand/tcolargmin.md
-              - pto.trowmax: docs/isa/tile/ops/reduce-and-expand/trowmax.md
-              - pto.trowmin: docs/isa/tile/ops/reduce-and-expand/trowmin.md
-              - pto.trowargmax: docs/isa/tile/ops/reduce-and-expand/trowargmax.md
-              - pto.trowargmin: docs/isa/tile/ops/reduce-and-expand/trowargmin.md
-              - pto.trowexpand: docs/isa/tile/ops/reduce-and-expand/trowexpand.md
-              - pto.trowexpanddiv: docs/isa/tile/ops/reduce-and-expand/trowexpanddiv.md
-              - pto.trowexpandmul: docs/isa/tile/ops/reduce-and-expand/trowexpandmul.md
-              - pto.trowexpandsub: docs/isa/tile/ops/reduce-and-expand/trowexpandsub.md
-              - pto.trowexpandadd: docs/isa/tile/ops/reduce-and-expand/trowexpandadd.md
-              - pto.trowexpandmax: docs/isa/tile/ops/reduce-and-expand/trowexpandmax.md
-              - pto.trowexpandmin: docs/isa/tile/ops/reduce-and-expand/trowexpandmin.md
-              - pto.trowexpandexpdif: docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif.md
-              - pto.tcolexpand: docs/isa/tile/ops/reduce-and-expand/tcolexpand.md
-              - pto.tcolexpanddiv: docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv.md
-              - pto.tcolexpandmul: docs/isa/tile/ops/reduce-and-expand/tcolexpandmul.md
-              - pto.tcolexpandadd: docs/isa/tile/ops/reduce-and-expand/tcolexpandadd.md
-              - pto.tcolexpandmax: docs/isa/tile/ops/reduce-and-expand/tcolexpandmax.md
-              - pto.tcolexpandmin: docs/isa/tile/ops/reduce-and-expand/tcolexpandmin.md
-              - pto.tcolexpandsub: docs/isa/tile/ops/reduce-and-expand/tcolexpandsub.md
-              - pto.tcolexpandexpdif: docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md
+              - Instruction Set Contract: isa/tile/reduce-and-expand.md
+              - pto.trowsum: isa/tile/ops/reduce-and-expand/trowsum.md
+              - pto.tcolsum: isa/tile/ops/reduce-and-expand/tcolsum.md
+              - pto.trowprod: isa/tile/ops/reduce-and-expand/trowprod.md
+              - pto.tcolprod: isa/tile/ops/reduce-and-expand/tcolprod.md
+              - pto.tcolmax: isa/tile/ops/reduce-and-expand/tcolmax.md
+              - pto.tcolmin: isa/tile/ops/reduce-and-expand/tcolmin.md
+              - pto.tcolargmax: isa/tile/ops/reduce-and-expand/tcolargmax.md
+              - pto.tcolargmin: isa/tile/ops/reduce-and-expand/tcolargmin.md
+              - pto.trowmax: isa/tile/ops/reduce-and-expand/trowmax.md
+              - pto.trowmin: isa/tile/ops/reduce-and-expand/trowmin.md
+              - pto.trowargmax: isa/tile/ops/reduce-and-expand/trowargmax.md
+              - pto.trowargmin: isa/tile/ops/reduce-and-expand/trowargmin.md
+              - pto.trowexpand: isa/tile/ops/reduce-and-expand/trowexpand.md
+              - pto.trowexpanddiv: isa/tile/ops/reduce-and-expand/trowexpanddiv.md
+              - pto.trowexpandmul: isa/tile/ops/reduce-and-expand/trowexpandmul.md
+              - pto.trowexpandsub: isa/tile/ops/reduce-and-expand/trowexpandsub.md
+              - pto.trowexpandadd: isa/tile/ops/reduce-and-expand/trowexpandadd.md
+              - pto.trowexpandmax: isa/tile/ops/reduce-and-expand/trowexpandmax.md
+              - pto.trowexpandmin: isa/tile/ops/reduce-and-expand/trowexpandmin.md
+              - pto.trowexpandexpdif: isa/tile/ops/reduce-and-expand/trowexpandexpdif.md
+              - pto.tcolexpand: isa/tile/ops/reduce-and-expand/tcolexpand.md
+              - pto.tcolexpanddiv: isa/tile/ops/reduce-and-expand/tcolexpanddiv.md
+              - pto.tcolexpandmul: isa/tile/ops/reduce-and-expand/tcolexpandmul.md
+              - pto.tcolexpandadd: isa/tile/ops/reduce-and-expand/tcolexpandadd.md
+              - pto.tcolexpandmax: isa/tile/ops/reduce-and-expand/tcolexpandmax.md
+              - pto.tcolexpandmin: isa/tile/ops/reduce-and-expand/tcolexpandmin.md
+              - pto.tcolexpandsub: isa/tile/ops/reduce-and-expand/tcolexpandsub.md
+              - pto.tcolexpandexpdif: isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md
           - Memory And Data Movement:
-              - Instruction Set Contract: docs/isa/tile/memory-and-data-movement.md
-              - pto.tload: docs/isa/tile/ops/memory-and-data-movement/tload.md
-              - pto.tprefetch: docs/isa/tile/ops/memory-and-data-movement/tprefetch.md
-              - pto.tstore: docs/isa/tile/ops/memory-and-data-movement/tstore.md
-              - pto.tstore_fp: docs/isa/tile/ops/memory-and-data-movement/tstore-fp.md
-              - pto.mgather: docs/isa/tile/ops/memory-and-data-movement/mgather.md
-              - pto.mscatter: docs/isa/tile/ops/memory-and-data-movement/mscatter.md
+              - Instruction Set Contract: isa/tile/memory-and-data-movement.md
+              - pto.tload: isa/tile/ops/memory-and-data-movement/tload.md
+              - pto.tprefetch: isa/tile/ops/memory-and-data-movement/tprefetch.md
+              - pto.tstore: isa/tile/ops/memory-and-data-movement/tstore.md
+              - pto.tstore_fp: isa/tile/ops/memory-and-data-movement/tstore-fp.md
+              - pto.mgather: isa/tile/ops/memory-and-data-movement/mgather.md
+              - pto.mscatter: isa/tile/ops/memory-and-data-movement/mscatter.md
           - Matrix And Matrix Vector:
-              - Instruction Set Contract: docs/isa/tile/matrix-and-matrix-vector.md
-              - pto.tgemv_mx: docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx.md
-              - pto.tmatmul_mx: docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx.md
-              - pto.tmatmul: docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul.md
-              - pto.tmatmul_acc: docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc.md
-              - pto.tmatmul_bias: docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias.md
-              - pto.tgemv: docs/isa/tile/ops/matrix-and-matrix-vector/tgemv.md
-              - pto.tgemv_acc: docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc.md
-              - pto.tgemv_bias: docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias.md
+              - Instruction Set Contract: isa/tile/matrix-and-matrix-vector.md
+              - pto.tgemv_mx: isa/tile/ops/matrix-and-matrix-vector/tgemv-mx.md
+              - pto.tmatmul_mx: isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx.md
+              - pto.tmatmul: isa/tile/ops/matrix-and-matrix-vector/tmatmul.md
+              - pto.tmatmul_acc: isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc.md
+              - pto.tmatmul_bias: isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias.md
+              - pto.tgemv: isa/tile/ops/matrix-and-matrix-vector/tgemv.md
+              - pto.tgemv_acc: isa/tile/ops/matrix-and-matrix-vector/tgemv-acc.md
+              - pto.tgemv_bias: isa/tile/ops/matrix-and-matrix-vector/tgemv-bias.md
           - Layout And Rearrangement:
-              - Instruction Set Contract: docs/isa/tile/layout-and-rearrangement.md
-              - pto.textract: docs/isa/tile/ops/layout-and-rearrangement/textract.md
-              - pto.textract_fp: docs/isa/tile/ops/layout-and-rearrangement/textract-fp.md
-              - pto.timg2col: docs/isa/tile/ops/layout-and-rearrangement/timg2col.md
-              - pto.tinsert: docs/isa/tile/ops/layout-and-rearrangement/tinsert.md
-              - pto.tinsert_fp: docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp.md
-              - pto.tfillpad: docs/isa/tile/ops/layout-and-rearrangement/tfillpad.md
-              - pto.tfillpad_inplace: docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md
-              - pto.tfillpad_expand: docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md
-              - pto.tmov: docs/isa/tile/ops/layout-and-rearrangement/tmov.md
-              - pto.tmov_fp: docs/isa/tile/ops/layout-and-rearrangement/tmov-fp.md
-              - pto.treshape: docs/isa/tile/ops/layout-and-rearrangement/treshape.md
-              - pto.ttrans: docs/isa/tile/ops/layout-and-rearrangement/ttrans.md
+              - Instruction Set Contract: isa/tile/layout-and-rearrangement.md
+              - pto.textract: isa/tile/ops/layout-and-rearrangement/textract.md
+              - pto.textract_fp: isa/tile/ops/layout-and-rearrangement/textract-fp.md
+              - pto.timg2col: isa/tile/ops/layout-and-rearrangement/timg2col.md
+              - pto.tinsert: isa/tile/ops/layout-and-rearrangement/tinsert.md
+              - pto.tinsert_fp: isa/tile/ops/layout-and-rearrangement/tinsert-fp.md
+              - pto.tfillpad: isa/tile/ops/layout-and-rearrangement/tfillpad.md
+              - pto.tfillpad_inplace: isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md
+              - pto.tfillpad_expand: isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md
+              - pto.tmov: isa/tile/ops/layout-and-rearrangement/tmov.md
+              - pto.tmov_fp: isa/tile/ops/layout-and-rearrangement/tmov-fp.md
+              - pto.treshape: isa/tile/ops/layout-and-rearrangement/treshape.md
+              - pto.ttrans: isa/tile/ops/layout-and-rearrangement/ttrans.md
           - Irregular And Complex:
-              - Instruction Set Contract: docs/isa/tile/irregular-and-complex.md
-              - pto.tprint: docs/isa/tile/ops/irregular-and-complex/tprint.md
-              - pto.tmrgsort: docs/isa/tile/ops/irregular-and-complex/tmrgsort.md
-              - pto.tsort32: docs/isa/tile/ops/irregular-and-complex/tsort32.md
-              - pto.tgather: docs/isa/tile/ops/irregular-and-complex/tgather.md
-              - pto.tgatherb: docs/isa/tile/ops/irregular-and-complex/tgatherb.md
-              - pto.tscatter: docs/isa/tile/ops/irregular-and-complex/tscatter.md
-              - pto.tci: docs/isa/tile/ops/irregular-and-complex/tci.md
-              - pto.ttri: docs/isa/tile/ops/irregular-and-complex/ttri.md
-              - pto.tpartadd: docs/isa/tile/ops/irregular-and-complex/tpartadd.md
-              - pto.tpartmul: docs/isa/tile/ops/irregular-and-complex/tpartmul.md
-              - pto.tpartmax: docs/isa/tile/ops/irregular-and-complex/tpartmax.md
-              - pto.tpartmin: docs/isa/tile/ops/irregular-and-complex/tpartmin.md
-              - pto.tquant: docs/isa/tile/ops/irregular-and-complex/tquant.md
+              - Instruction Set Contract: isa/tile/irregular-and-complex.md
+              - pto.tprint: isa/tile/ops/irregular-and-complex/tprint.md
+              - pto.tmrgsort: isa/tile/ops/irregular-and-complex/tmrgsort.md
+              - pto.tsort32: isa/tile/ops/irregular-and-complex/tsort32.md
+              - pto.tgather: isa/tile/ops/irregular-and-complex/tgather.md
+              - pto.tgatherb: isa/tile/ops/irregular-and-complex/tgatherb.md
+              - pto.tscatter: isa/tile/ops/irregular-and-complex/tscatter.md
+              - pto.tci: isa/tile/ops/irregular-and-complex/tci.md
+              - pto.ttri: isa/tile/ops/irregular-and-complex/ttri.md
+              - pto.tpartadd: isa/tile/ops/irregular-and-complex/tpartadd.md
+              - pto.tpartmul: isa/tile/ops/irregular-and-complex/tpartmul.md
+              - pto.tpartmax: isa/tile/ops/irregular-and-complex/tpartmax.md
+              - pto.tpartmin: isa/tile/ops/irregular-and-complex/tpartmin.md
+              - pto.tquant: isa/tile/ops/irregular-and-complex/tquant.md
   - 10. Vector ISA Reference:
-          - Overview: docs/isa/vector/README.md
+          - Overview: isa/vector/README.md
           - Vector Load Store:
-              - Instruction Set Overview: docs/isa/vector/vector-load-store.md
-              - pto.vlds: docs/isa/vector/ops/vector-load-store/vlds.md
-              - pto.vldas: docs/isa/vector/ops/vector-load-store/vldas.md
-              - pto.vldus: docs/isa/vector/ops/vector-load-store/vldus.md
-              - pto.vldx2: docs/isa/vector/ops/vector-load-store/vldx2.md
-              - pto.vsld: docs/isa/vector/ops/vector-load-store/vsld.md
-              - pto.vsldb: docs/isa/vector/ops/vector-load-store/vsldb.md
-              - pto.vgather2: docs/isa/vector/ops/vector-load-store/vgather2.md
-              - pto.vgatherb: docs/isa/vector/ops/vector-load-store/vgatherb.md
-              - pto.vgather2_bc: docs/isa/vector/ops/vector-load-store/vgather2-bc.md
-              - pto.vsts: docs/isa/vector/ops/vector-load-store/vsts.md
-              - pto.vstx2: docs/isa/vector/ops/vector-load-store/vstx2.md
-              - pto.vsst: docs/isa/vector/ops/vector-load-store/vsst.md
-              - pto.vsstb: docs/isa/vector/ops/vector-load-store/vsstb.md
-              - pto.vscatter: docs/isa/vector/ops/vector-load-store/vscatter.md
-              - pto.vsta: docs/isa/vector/ops/vector-load-store/vsta.md
-              - pto.vstas: docs/isa/vector/ops/vector-load-store/vstas.md
-              - pto.vstar: docs/isa/vector/ops/vector-load-store/vstar.md
-              - pto.vstu: docs/isa/vector/ops/vector-load-store/vstu.md
-              - pto.vstus: docs/isa/vector/ops/vector-load-store/vstus.md
-              - pto.vstur: docs/isa/vector/ops/vector-load-store/vstur.md
+              - Instruction Set Overview: isa/vector/vector-load-store.md
+              - pto.vlds: isa/vector/ops/vector-load-store/vlds.md
+              - pto.vldas: isa/vector/ops/vector-load-store/vldas.md
+              - pto.vldus: isa/vector/ops/vector-load-store/vldus.md
+              - pto.vldx2: isa/vector/ops/vector-load-store/vldx2.md
+              - pto.vsld: isa/vector/ops/vector-load-store/vsld.md
+              - pto.vsldb: isa/vector/ops/vector-load-store/vsldb.md
+              - pto.vgather2: isa/vector/ops/vector-load-store/vgather2.md
+              - pto.vgatherb: isa/vector/ops/vector-load-store/vgatherb.md
+              - pto.vgather2_bc: isa/vector/ops/vector-load-store/vgather2-bc.md
+              - pto.vsts: isa/vector/ops/vector-load-store/vsts.md
+              - pto.vstx2: isa/vector/ops/vector-load-store/vstx2.md
+              - pto.vsst: isa/vector/ops/vector-load-store/vsst.md
+              - pto.vsstb: isa/vector/ops/vector-load-store/vsstb.md
+              - pto.vscatter: isa/vector/ops/vector-load-store/vscatter.md
+              - pto.vsta: isa/vector/ops/vector-load-store/vsta.md
+              - pto.vstas: isa/vector/ops/vector-load-store/vstas.md
+              - pto.vstar: isa/vector/ops/vector-load-store/vstar.md
+              - pto.vstu: isa/vector/ops/vector-load-store/vstu.md
+              - pto.vstus: isa/vector/ops/vector-load-store/vstus.md
+              - pto.vstur: isa/vector/ops/vector-load-store/vstur.md
           - Predicate And Materialization:
-              - Instruction Set Overview: docs/isa/vector/predicate-and-materialization.md
-              - pto.vbr: docs/isa/vector/ops/predicate-and-materialization/vbr.md
-              - pto.vdup: docs/isa/vector/ops/predicate-and-materialization/vdup.md
+              - Instruction Set Overview: isa/vector/predicate-and-materialization.md
+              - pto.vbr: isa/vector/ops/predicate-and-materialization/vbr.md
+              - pto.vdup: isa/vector/ops/predicate-and-materialization/vdup.md
           - Unary Vector Instructions:
-              - Instruction Set Overview: docs/isa/vector/unary-vector-ops.md
-              - pto.vabs: docs/isa/vector/ops/unary-vector-ops/vabs.md
-              - pto.vneg: docs/isa/vector/ops/unary-vector-ops/vneg.md
-              - pto.vexp: docs/isa/vector/ops/unary-vector-ops/vexp.md
-              - pto.vln: docs/isa/vector/ops/unary-vector-ops/vln.md
-              - pto.vsqrt: docs/isa/vector/ops/unary-vector-ops/vsqrt.md
-              - pto.vrsqrt: docs/isa/vector/ops/unary-vector-ops/vrsqrt.md
-              - pto.vrec: docs/isa/vector/ops/unary-vector-ops/vrec.md
-              - pto.vrelu: docs/isa/vector/ops/unary-vector-ops/vrelu.md
-              - pto.vnot: docs/isa/vector/ops/unary-vector-ops/vnot.md
-              - pto.vbcnt: docs/isa/vector/ops/unary-vector-ops/vbcnt.md
-              - pto.vcls: docs/isa/vector/ops/unary-vector-ops/vcls.md
-              - pto.vmov: docs/isa/vector/ops/unary-vector-ops/vmov.md
+              - Instruction Set Overview: isa/vector/unary-vector-ops.md
+              - pto.vabs: isa/vector/ops/unary-vector-ops/vabs.md
+              - pto.vneg: isa/vector/ops/unary-vector-ops/vneg.md
+              - pto.vexp: isa/vector/ops/unary-vector-ops/vexp.md
+              - pto.vln: isa/vector/ops/unary-vector-ops/vln.md
+              - pto.vsqrt: isa/vector/ops/unary-vector-ops/vsqrt.md
+              - pto.vrsqrt: isa/vector/ops/unary-vector-ops/vrsqrt.md
+              - pto.vrec: isa/vector/ops/unary-vector-ops/vrec.md
+              - pto.vrelu: isa/vector/ops/unary-vector-ops/vrelu.md
+              - pto.vnot: isa/vector/ops/unary-vector-ops/vnot.md
+              - pto.vbcnt: isa/vector/ops/unary-vector-ops/vbcnt.md
+              - pto.vcls: isa/vector/ops/unary-vector-ops/vcls.md
+              - pto.vmov: isa/vector/ops/unary-vector-ops/vmov.md
           - Binary Vector Instructions:
-              - Instruction Set Overview: docs/isa/vector/binary-vector-ops.md
-              - pto.vadd: docs/isa/vector/ops/binary-vector-ops/vadd.md
-              - pto.vsub: docs/isa/vector/ops/binary-vector-ops/vsub.md
-              - pto.vmul: docs/isa/vector/ops/binary-vector-ops/vmul.md
-              - pto.vdiv: docs/isa/vector/ops/binary-vector-ops/vdiv.md
-              - pto.vmax: docs/isa/vector/ops/binary-vector-ops/vmax.md
-              - pto.vmin: docs/isa/vector/ops/binary-vector-ops/vmin.md
-              - pto.vand: docs/isa/vector/ops/binary-vector-ops/vand.md
-              - pto.vor: docs/isa/vector/ops/binary-vector-ops/vor.md
-              - pto.vxor: docs/isa/vector/ops/binary-vector-ops/vxor.md
-              - pto.vshl: docs/isa/vector/ops/binary-vector-ops/vshl.md
-              - pto.vshr: docs/isa/vector/ops/binary-vector-ops/vshr.md
-              - pto.vaddc: docs/isa/vector/ops/binary-vector-ops/vaddc.md
-              - pto.vsubc: docs/isa/vector/ops/binary-vector-ops/vsubc.md
+              - Instruction Set Overview: isa/vector/binary-vector-ops.md
+              - pto.vadd: isa/vector/ops/binary-vector-ops/vadd.md
+              - pto.vsub: isa/vector/ops/binary-vector-ops/vsub.md
+              - pto.vmul: isa/vector/ops/binary-vector-ops/vmul.md
+              - pto.vdiv: isa/vector/ops/binary-vector-ops/vdiv.md
+              - pto.vmax: isa/vector/ops/binary-vector-ops/vmax.md
+              - pto.vmin: isa/vector/ops/binary-vector-ops/vmin.md
+              - pto.vand: isa/vector/ops/binary-vector-ops/vand.md
+              - pto.vor: isa/vector/ops/binary-vector-ops/vor.md
+              - pto.vxor: isa/vector/ops/binary-vector-ops/vxor.md
+              - pto.vshl: isa/vector/ops/binary-vector-ops/vshl.md
+              - pto.vshr: isa/vector/ops/binary-vector-ops/vshr.md
+              - pto.vaddc: isa/vector/ops/binary-vector-ops/vaddc.md
+              - pto.vsubc: isa/vector/ops/binary-vector-ops/vsubc.md
           - Vector-Scalar Instructions:
-              - Instruction Set Overview: docs/isa/vector/vec-scalar-ops.md
-              - pto.vadds: docs/isa/vector/ops/vec-scalar-ops/vadds.md
-              - pto.vsubs: docs/isa/vector/ops/vec-scalar-ops/vsubs.md
-              - pto.vmuls: docs/isa/vector/ops/vec-scalar-ops/vmuls.md
-              - pto.vmaxs: docs/isa/vector/ops/vec-scalar-ops/vmaxs.md
-              - pto.vmins: docs/isa/vector/ops/vec-scalar-ops/vmins.md
-              - pto.vands: docs/isa/vector/ops/vec-scalar-ops/vands.md
-              - pto.vors: docs/isa/vector/ops/vec-scalar-ops/vors.md
-              - pto.vxors: docs/isa/vector/ops/vec-scalar-ops/vxors.md
-              - pto.vshls: docs/isa/vector/ops/vec-scalar-ops/vshls.md
-              - pto.vshrs: docs/isa/vector/ops/vec-scalar-ops/vshrs.md
-              - pto.vlrelu: docs/isa/vector/ops/vec-scalar-ops/vlrelu.md
-              - pto.vaddcs: docs/isa/vector/ops/vec-scalar-ops/vaddcs.md
-              - pto.vsubcs: docs/isa/vector/ops/vec-scalar-ops/vsubcs.md
+              - Instruction Set Overview: isa/vector/vec-scalar-ops.md
+              - pto.vadds: isa/vector/ops/vec-scalar-ops/vadds.md
+              - pto.vsubs: isa/vector/ops/vec-scalar-ops/vsubs.md
+              - pto.vmuls: isa/vector/ops/vec-scalar-ops/vmuls.md
+              - pto.vmaxs: isa/vector/ops/vec-scalar-ops/vmaxs.md
+              - pto.vmins: isa/vector/ops/vec-scalar-ops/vmins.md
+              - pto.vands: isa/vector/ops/vec-scalar-ops/vands.md
+              - pto.vors: isa/vector/ops/vec-scalar-ops/vors.md
+              - pto.vxors: isa/vector/ops/vec-scalar-ops/vxors.md
+              - pto.vshls: isa/vector/ops/vec-scalar-ops/vshls.md
+              - pto.vshrs: isa/vector/ops/vec-scalar-ops/vshrs.md
+              - pto.vlrelu: isa/vector/ops/vec-scalar-ops/vlrelu.md
+              - pto.vaddcs: isa/vector/ops/vec-scalar-ops/vaddcs.md
+              - pto.vsubcs: isa/vector/ops/vec-scalar-ops/vsubcs.md
           - Conversion Ops:
-              - Instruction Set Overview: docs/isa/vector/conversion-ops.md
-              - pto.vci: docs/isa/vector/ops/conversion-ops/vci.md
-              - pto.vcvt: docs/isa/vector/ops/conversion-ops/vcvt.md
-              - pto.vtrc: docs/isa/vector/ops/conversion-ops/vtrc.md
+              - Instruction Set Overview: isa/vector/conversion-ops.md
+              - pto.vci: isa/vector/ops/conversion-ops/vci.md
+              - pto.vcvt: isa/vector/ops/conversion-ops/vcvt.md
+              - pto.vtrc: isa/vector/ops/conversion-ops/vtrc.md
           - Reduction Instructions:
-              - Instruction Set Overview: docs/isa/vector/reduction-ops.md
-              - pto.vcadd: docs/isa/vector/ops/reduction-ops/vcadd.md
-              - pto.vcmax: docs/isa/vector/ops/reduction-ops/vcmax.md
-              - pto.vcmin: docs/isa/vector/ops/reduction-ops/vcmin.md
-              - pto.vcgadd: docs/isa/vector/ops/reduction-ops/vcgadd.md
-              - pto.vcgmax: docs/isa/vector/ops/reduction-ops/vcgmax.md
-              - pto.vcgmin: docs/isa/vector/ops/reduction-ops/vcgmin.md
-              - pto.vcpadd: docs/isa/vector/ops/reduction-ops/vcpadd.md
+              - Instruction Set Overview: isa/vector/reduction-ops.md
+              - pto.vcadd: isa/vector/ops/reduction-ops/vcadd.md
+              - pto.vcmax: isa/vector/ops/reduction-ops/vcmax.md
+              - pto.vcmin: isa/vector/ops/reduction-ops/vcmin.md
+              - pto.vcgadd: isa/vector/ops/reduction-ops/vcgadd.md
+              - pto.vcgmax: isa/vector/ops/reduction-ops/vcgmax.md
+              - pto.vcgmin: isa/vector/ops/reduction-ops/vcgmin.md
+              - pto.vcpadd: isa/vector/ops/reduction-ops/vcpadd.md
           - Compare And Select:
-              - Instruction Set Overview: docs/isa/vector/compare-select.md
-              - pto.vcmp: docs/isa/vector/ops/compare-select/vcmp.md
-              - pto.vcmps: docs/isa/vector/ops/compare-select/vcmps.md
-              - pto.vsel: docs/isa/vector/ops/compare-select/vsel.md
-              - pto.vselr: docs/isa/vector/ops/compare-select/vselr.md
-              - pto.vselrv2: docs/isa/vector/ops/compare-select/vselrv2.md
+              - Instruction Set Overview: isa/vector/compare-select.md
+              - pto.vcmp: isa/vector/ops/compare-select/vcmp.md
+              - pto.vcmps: isa/vector/ops/compare-select/vcmps.md
+              - pto.vsel: isa/vector/ops/compare-select/vsel.md
+              - pto.vselr: isa/vector/ops/compare-select/vselr.md
+              - pto.vselrv2: isa/vector/ops/compare-select/vselrv2.md
           - Data Rearrangement:
-              - Instruction Set Overview: docs/isa/vector/data-rearrangement.md
-              - pto.vintlv: docs/isa/vector/ops/data-rearrangement/vintlv.md
-              - pto.vdintlv: docs/isa/vector/ops/data-rearrangement/vdintlv.md
-              - pto.vslide: docs/isa/vector/ops/data-rearrangement/vslide.md
-              - pto.vshift: docs/isa/vector/ops/data-rearrangement/vshift.md
-              - pto.vsqz: docs/isa/vector/ops/data-rearrangement/vsqz.md
-              - pto.vusqz: docs/isa/vector/ops/data-rearrangement/vusqz.md
-              - pto.vperm: docs/isa/vector/ops/data-rearrangement/vperm.md
-              - pto.vpack: docs/isa/vector/ops/data-rearrangement/vpack.md
-              - pto.vsunpack: docs/isa/vector/ops/data-rearrangement/vsunpack.md
-              - pto.vzunpack: docs/isa/vector/ops/data-rearrangement/vzunpack.md
-              - pto.vintlvv2: docs/isa/vector/ops/data-rearrangement/vintlvv2.md
-              - pto.vdintlvv2: docs/isa/vector/ops/data-rearrangement/vdintlvv2.md
+              - Instruction Set Overview: isa/vector/data-rearrangement.md
+              - pto.vintlv: isa/vector/ops/data-rearrangement/vintlv.md
+              - pto.vdintlv: isa/vector/ops/data-rearrangement/vdintlv.md
+              - pto.vslide: isa/vector/ops/data-rearrangement/vslide.md
+              - pto.vshift: isa/vector/ops/data-rearrangement/vshift.md
+              - pto.vsqz: isa/vector/ops/data-rearrangement/vsqz.md
+              - pto.vusqz: isa/vector/ops/data-rearrangement/vusqz.md
+              - pto.vperm: isa/vector/ops/data-rearrangement/vperm.md
+              - pto.vpack: isa/vector/ops/data-rearrangement/vpack.md
+              - pto.vsunpack: isa/vector/ops/data-rearrangement/vsunpack.md
+              - pto.vzunpack: isa/vector/ops/data-rearrangement/vzunpack.md
+              - pto.vintlvv2: isa/vector/ops/data-rearrangement/vintlvv2.md
+              - pto.vdintlvv2: isa/vector/ops/data-rearrangement/vdintlvv2.md
           - SFU And DSA Instructions:
-              - Instruction Set Overview: docs/isa/vector/sfu-and-dsa-ops.md
-              - pto.vprelu: docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md
-              - pto.vexpdiff: docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md
-              - pto.vaddrelu: docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md
-              - pto.vsubrelu: docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md
-              - pto.vaxpy: docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md
-              - pto.vaddreluconv: docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md
-              - pto.vmulconv: docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv.md
-              - pto.vmull: docs/isa/vector/ops/sfu-and-dsa-ops/vmull.md
-              - pto.vmula: docs/isa/vector/ops/sfu-and-dsa-ops/vmula.md
-              - pto.vtranspose: docs/isa/vector/ops/sfu-and-dsa-ops/vtranspose.md
-              - pto.vsort32: docs/isa/vector/ops/sfu-and-dsa-ops/vsort32.md
-              - pto.vbitsort: docs/isa/vector/ops/sfu-and-dsa-ops/vbitsort.md
-              - pto.vmrgsort: docs/isa/vector/ops/sfu-and-dsa-ops/vmrgsort.md
+              - Instruction Set Overview: isa/vector/sfu-and-dsa-ops.md
+              - pto.vprelu: isa/vector/ops/sfu-and-dsa-ops/vprelu.md
+              - pto.vexpdiff: isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md
+              - pto.vaddrelu: isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md
+              - pto.vsubrelu: isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md
+              - pto.vaxpy: isa/vector/ops/sfu-and-dsa-ops/vaxpy.md
+              - pto.vaddreluconv: isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md
+              - pto.vmulconv: isa/vector/ops/sfu-and-dsa-ops/vmulconv.md
+              - pto.vmull: isa/vector/ops/sfu-and-dsa-ops/vmull.md
+              - pto.vmula: isa/vector/ops/sfu-and-dsa-ops/vmula.md
+              - pto.vtranspose: isa/vector/ops/sfu-and-dsa-ops/vtranspose.md
+              - pto.vsort32: isa/vector/ops/sfu-and-dsa-ops/vsort32.md
+              - pto.vbitsort: isa/vector/ops/sfu-and-dsa-ops/vbitsort.md
+              - pto.vmrgsort: isa/vector/ops/sfu-and-dsa-ops/vmrgsort.md
   - 11. Scalar And Control Reference:
-          - Overview: docs/isa/scalar/README.md
-          - Control And Configuration: docs/isa/scalar/control-and-configuration.md
+          - Overview: isa/scalar/README.md
+          - Control And Configuration: isa/scalar/control-and-configuration.md
           - Pipeline Sync:
-              - Instruction Set Overview: docs/isa/scalar/pipeline-sync.md
-              - pto.set_flag: docs/isa/scalar/ops/pipeline-sync/set-flag.md
-              - pto.wait_flag: docs/isa/scalar/ops/pipeline-sync/wait-flag.md
-              - pto.pipe_barrier: docs/isa/scalar/ops/pipeline-sync/pipe-barrier.md
-              - pto.get_buf: docs/isa/scalar/ops/pipeline-sync/get-buf.md
-              - pto.rls_buf: docs/isa/scalar/ops/pipeline-sync/rls-buf.md
-              - pto.mem_bar: docs/isa/scalar/ops/pipeline-sync/mem-bar.md
-              - pto.set_cross_core: docs/isa/scalar/ops/pipeline-sync/set-cross-core.md
-              - pto.wait_flag_dev: docs/isa/scalar/ops/pipeline-sync/wait-flag-dev.md
-              - pto.set_intra_block: docs/isa/scalar/ops/pipeline-sync/set-intra-block.md
-              - pto.wait_intra_core: docs/isa/scalar/ops/pipeline-sync/wait-intra-core.md
+              - Instruction Set Overview: isa/scalar/pipeline-sync.md
+              - pto.set_flag: isa/scalar/ops/pipeline-sync/set-flag.md
+              - pto.wait_flag: isa/scalar/ops/pipeline-sync/wait-flag.md
+              - pto.pipe_barrier: isa/scalar/ops/pipeline-sync/pipe-barrier.md
+              - pto.get_buf: isa/scalar/ops/pipeline-sync/get-buf.md
+              - pto.rls_buf: isa/scalar/ops/pipeline-sync/rls-buf.md
+              - pto.mem_bar: isa/scalar/ops/pipeline-sync/mem-bar.md
+              - pto.set_cross_core: isa/scalar/ops/pipeline-sync/set-cross-core.md
+              - pto.wait_flag_dev: isa/scalar/ops/pipeline-sync/wait-flag-dev.md
+              - pto.set_intra_block: isa/scalar/ops/pipeline-sync/set-intra-block.md
+              - pto.wait_intra_core: isa/scalar/ops/pipeline-sync/wait-intra-core.md
           - DMA Copy:
-              - Instruction Set Overview: docs/isa/scalar/dma-copy.md
-              - pto.set_loop_size_outtoub: docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub.md
-              - pto.set_loop2_stride_outtoub: docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md
-              - pto.set_loop1_stride_outtoub: docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md
-              - pto.set_loop_size_ubtoout: docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md
-              - pto.set_loop2_stride_ubtoout: docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md
-              - pto.set_loop1_stride_ubtoout: docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md
-              - pto.copy_gm_to_ubuf: docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md
-              - pto.copy_ubuf_to_gm: docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md
-              - pto.copy_ubuf_to_ubuf: docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md
+              - Instruction Set Overview: isa/scalar/dma-copy.md
+              - pto.set_loop_size_outtoub: isa/scalar/ops/dma-copy/set-loop-size-outtoub.md
+              - pto.set_loop2_stride_outtoub: isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md
+              - pto.set_loop1_stride_outtoub: isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md
+              - pto.set_loop_size_ubtoout: isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md
+              - pto.set_loop2_stride_ubtoout: isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md
+              - pto.set_loop1_stride_ubtoout: isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md
+              - pto.copy_gm_to_ubuf: isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md
+              - pto.copy_ubuf_to_gm: isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md
+              - pto.copy_ubuf_to_ubuf: isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md
           - Predicate Load Store:
-              - Instruction Set Overview: docs/isa/scalar/predicate-load-store.md
-              - pto.plds: docs/isa/scalar/ops/predicate-load-store/plds.md
-              - pto.pld: docs/isa/scalar/ops/predicate-load-store/pld.md
-              - pto.pldi: docs/isa/scalar/ops/predicate-load-store/pldi.md
-              - pto.psts: docs/isa/scalar/ops/predicate-load-store/psts.md
-              - pto.pst: docs/isa/scalar/ops/predicate-load-store/pst.md
-              - pto.psti: docs/isa/scalar/ops/predicate-load-store/psti.md
-              - pto.pstu: docs/isa/scalar/ops/predicate-load-store/pstu.md
+              - Instruction Set Overview: isa/scalar/predicate-load-store.md
+              - pto.plds: isa/scalar/ops/predicate-load-store/plds.md
+              - pto.pld: isa/scalar/ops/predicate-load-store/pld.md
+              - pto.pldi: isa/scalar/ops/predicate-load-store/pldi.md
+              - pto.psts: isa/scalar/ops/predicate-load-store/psts.md
+              - pto.pst: isa/scalar/ops/predicate-load-store/pst.md
+              - pto.psti: isa/scalar/ops/predicate-load-store/psti.md
+              - pto.pstu: isa/scalar/ops/predicate-load-store/pstu.md
           - Predicate Generation And Algebra:
-              - Instruction Set Overview: docs/isa/scalar/predicate-generation-and-algebra.md
-              - pto.pset_b8: docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md
-              - pto.pset_b16: docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md
-              - pto.pset_b32: docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md
-              - pto.pge_b8: docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md
-              - pto.pge_b16: docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md
-              - pto.pge_b32: docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md
-              - pto.plt_b8: docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md
-              - pto.plt_b16: docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md
-              - pto.plt_b32: docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md
-              - pto.ppack: docs/isa/scalar/ops/predicate-generation-and-algebra/ppack.md
-              - pto.punpack: docs/isa/scalar/ops/predicate-generation-and-algebra/punpack.md
-              - pto.pand: docs/isa/scalar/ops/predicate-generation-and-algebra/pand.md
-              - pto.por: docs/isa/scalar/ops/predicate-generation-and-algebra/por.md
-              - pto.pxor: docs/isa/scalar/ops/predicate-generation-and-algebra/pxor.md
-              - pto.pnot: docs/isa/scalar/ops/predicate-generation-and-algebra/pnot.md
-              - pto.psel: docs/isa/scalar/ops/predicate-generation-and-algebra/psel.md
-              - pto.pdintlv_b8: docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md
-              - pto.pintlv_b16: docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md
-          - Shared Arithmetic: docs/isa/scalar/shared-arith.md
-          - Shared SCF: docs/isa/scalar/shared-scf.md
+              - Instruction Set Overview: isa/scalar/predicate-generation-and-algebra.md
+              - pto.pset_b8: isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md
+              - pto.pset_b16: isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md
+              - pto.pset_b32: isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md
+              - pto.pge_b8: isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md
+              - pto.pge_b16: isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md
+              - pto.pge_b32: isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md
+              - pto.plt_b8: isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md
+              - pto.plt_b16: isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md
+              - pto.plt_b32: isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md
+              - pto.ppack: isa/scalar/ops/predicate-generation-and-algebra/ppack.md
+              - pto.punpack: isa/scalar/ops/predicate-generation-and-algebra/punpack.md
+              - pto.pand: isa/scalar/ops/predicate-generation-and-algebra/pand.md
+              - pto.por: isa/scalar/ops/predicate-generation-and-algebra/por.md
+              - pto.pxor: isa/scalar/ops/predicate-generation-and-algebra/pxor.md
+              - pto.pnot: isa/scalar/ops/predicate-generation-and-algebra/pnot.md
+              - pto.psel: isa/scalar/ops/predicate-generation-and-algebra/psel.md
+              - pto.pdintlv_b8: isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md
+              - pto.pintlv_b16: isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md
+          - Shared Arithmetic: isa/scalar/shared-arith.md
+          - Shared SCF: isa/scalar/shared-scf.md
           - PTO Micro-Instruction Reference:
-              - Overview: docs/isa/scalar/ops/micro-instruction/README.md
-              - BlockDim Query: docs/isa/scalar/ops/micro-instruction/block-dim-query.md
-              - Pointer Operations: docs/isa/scalar/ops/micro-instruction/pointer-operations.md
-              - Vector Execution Scope: docs/isa/scalar/ops/micro-instruction/vecscope.md
-              - Alignment State Type: docs/isa/scalar/ops/micro-instruction/align-type.md
+              - Overview: isa/scalar/ops/micro-instruction/README.md
+              - Micro-Instruction Summary: isa/vector/micro-instruction-summary.md
+              - BlockDim Query: isa/scalar/ops/micro-instruction/block-dim-query.md
+              - Pointer Operations: isa/scalar/ops/micro-instruction/pointer-operations.md
+              - Vector Execution Scope: isa/scalar/ops/micro-instruction/vecscope.md
+              - Alignment State Type: isa/scalar/ops/micro-instruction/align-type.md
   - 11A. PTO-AS Reference:
-          - PTO-AS Specification: docs/assembly/PTO-AS.md
-          - PTO-AS 规范: docs/assembly/PTO-AS_zh.md
+          - PTO-AS Specification: assembly/PTO-AS.md
+          - PTO-AS 规范: assembly/PTO-AS_zh.md
   - 12. Other And Communication Reference:
-          - Overview: docs/isa/other/README.md
+          - Overview: isa/other/README.md
           - Communication And Runtime:
-              - Instruction Set Contract: docs/isa/other/communication-and-runtime.md
-              - Communication Overview: docs/isa/comm/README.md
-              - TBROADCAST: docs/isa/comm/TBROADCAST.md
-              - TGET: docs/isa/comm/TGET.md
-              - TGET_ASYNC: docs/isa/comm/TGET_ASYNC.md
-              - TGATHER: docs/isa/comm/TGATHER.md
-              - TNOTIFY: docs/isa/comm/TNOTIFY.md
-              - TPUT: docs/isa/comm/TPUT.md
-              - TPUT_ASYNC: docs/isa/comm/TPUT_ASYNC.md
-              - TREDUCE: docs/isa/comm/TREDUCE.md
-              - TSCATTER: docs/isa/comm/TSCATTER.md
-              - TTEST: docs/isa/comm/TTEST.md
-              - TWAIT: docs/isa/comm/TWAIT.md
+              - Instruction Set Contract: isa/other/communication-and-runtime.md
+              - Communication Overview: isa/comm/README.md
+              - TBROADCAST: isa/comm/TBROADCAST.md
+              - TGET: isa/comm/TGET.md
+              - TGET_ASYNC: isa/comm/TGET_ASYNC.md
+              - TGATHER: isa/comm/TGATHER.md
+              - TNOTIFY: isa/comm/TNOTIFY.md
+              - TPUT: isa/comm/TPUT.md
+              - TPUT_ASYNC: isa/comm/TPUT_ASYNC.md
+              - TREDUCE: isa/comm/TREDUCE.md
+              - TSCATTER: isa/comm/TSCATTER.md
+              - TTEST: isa/comm/TTEST.md
+              - TWAIT: isa/comm/TWAIT.md
           - Non ISA And Supporting Ops:
-              - Instruction Set Contract: docs/isa/other/non-isa-and-supporting-ops.md
-              - TALIAS: docs/isa/TALIAS.md
-              - TAXPY: docs/isa/TAXPY.md
-              - TCONCAT: docs/isa/TCONCAT.md
-              - TDEQUANT: docs/isa/TDEQUANT.md
-              - TFREE: docs/isa/TFREE.md
-              - THISTOGRAM: docs/isa/THISTOGRAM.md
-              - TPACK: docs/isa/TPACK.md
-              - TPOP: docs/isa/TPOP.md
-              - TPUSH: docs/isa/TPUSH.md
-              - TRANDOM: docs/isa/TRANDOM.md
+              - Instruction Set Contract: isa/other/non-isa-and-supporting-ops.md
+              - TALIAS: isa/TALIAS.md
+              - TAXPY: isa/TAXPY.md
+              - TCONCAT: isa/TCONCAT.md
+              - TDEQUANT: isa/TDEQUANT.md
+              - TFREE: isa/TFREE.md
+              - THISTOGRAM: isa/THISTOGRAM.md
+              - TPACK: isa/TPACK.md
+              - TPOP: isa/TPOP.md
+              - TPUSH: isa/TPUSH.md
+              - TRANDOM: isa/TRANDOM.md
   - 13. Reference Notes:
-      - Overview: docs/isa/reference/README.md
-      - Format Of Instruction Descriptions: docs/isa/reference/format-of-instruction-descriptions.md
-      - Diagnostics And Illegal Cases: docs/isa/reference/diagnostics-and-illegal-cases.md
-      - Glossary: docs/isa/reference/glossary.md
-      - Portability And Target Profiles: docs/isa/reference/portability-and-target-profiles.md
-      - Source Of Truth: docs/isa/reference/source-of-truth.md
+      - Overview: isa/reference/README.md
+      - Format Of Instruction Descriptions: isa/reference/format-of-instruction-descriptions.md
+      - Diagnostics And Illegal Cases: isa/reference/diagnostics-and-illegal-cases.md
+      - Glossary: isa/reference/glossary.md
+      - Portability And Target Profiles: isa/reference/portability-and-target-profiles.md
+      - Source Of Truth: isa/reference/source-of-truth.md
diff --git a/docs/mkdocs/src/docs/isa/MGATHER.md b/docs/mkdocs/src/docs/isa/MGATHER.md
deleted file mode 100644
index b2ea79ca..00000000
--- a/docs/mkdocs/src/docs/isa/MGATHER.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/MGATHER.md` -->
-
-# pto.mgather
-
-This compatibility page points to the canonical tile-surface reference page for [pto.mgather](./tile/ops/memory-and-data-movement/mgather.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Memory And Data Movement](./tile/memory-and-data-movement.md)
-- Canonical per-op page: [pto.mgather](./tile/ops/memory-and-data-movement/mgather.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/MSCATTER.md b/docs/mkdocs/src/docs/isa/MSCATTER.md
deleted file mode 100644
index 4ea57cbf..00000000
--- a/docs/mkdocs/src/docs/isa/MSCATTER.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/MSCATTER.md` -->
-
-# pto.mscatter
-
-This compatibility page points to the canonical tile-surface reference page for [pto.mscatter](./tile/ops/memory-and-data-movement/mscatter.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Memory And Data Movement](./tile/memory-and-data-movement.md)
-- Canonical per-op page: [pto.mscatter](./tile/ops/memory-and-data-movement/mscatter.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/README.md b/docs/mkdocs/src/docs/isa/README.md
deleted file mode 100644
index ead4a989..00000000
--- a/docs/mkdocs/src/docs/isa/README.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- Generated from `docs/isa/README.md` -->
-
-<p align="center">
-  <img src="../../figures/pto_logo.svg" alt="PTO Tile Lib" width="180" />
-</p>
-
-# PTO ISA Manual And Reference
-
-This directory is the canonical PTO ISA tree. It combines the architecture manual, the surface guides, the family contracts, and the exact instruction-reference groupings in one place.
-
-## Textual Assembly Inside PTO ISA
-
-This tree is the canonical PTO ISA manual. Textual assembly spelling belongs to the PTO ISA syntax surface, not to a second parallel architecture manual.
-
-- PTO ISA defines architecture-visible semantics, legality, state, ordering, target-profile boundaries, and the visible behavior of `pto.t*`, `pto.v*`, `pto.*`, and other operations.
-- PTO-AS is the assembler-facing spelling used to write those operations and operands. It is part of how PTO ISA is expressed, not a separate ISA with different semantics.
-
-If the question is "what does this legal PTO program mean across CPU, A2/A3, and A5?", stay in this tree. If the question is "what is the operand shape or textual spelling of this operation?", use the syntax-and-operands pages in this same tree.
-
-## Start Here
-
-- [PTO ISA landing page](../PTO-Virtual-ISA-Manual.md)
-- [Introduction](introduction/what-is-pto-visa.md)
-- [Document structure](introduction/document-structure.md) (chapter map and reading order)
-- [Goals Of PTO](introduction/goals-of-pto.md)
-- [PTO ISA Version 1.0](introduction/pto-isa-version-1-0.md)
-- [Scope And Boundaries](introduction/design-goals-and-boundaries.md)
-- [Tiles And Valid Regions](programming-model/tiles-and-valid-regions.md)
-- [Execution Agents And Target Profiles](machine-model/execution-agents.md)
-- [Assembly spelling and operands](syntax-and-operands/assembly-model.md)
-- [Operands and attributes](syntax-and-operands/operands-and-attributes.md)
-- [Common conventions](conventions.md)
-- [Type system](state-and-types/type-system.md)
-- [Location intent and legality](state-and-types/location-intent-and-legality.md)
-- [Consistency baseline](memory-model/consistency-baseline.md)
-
-## Model Layers
-
-Reading order matches the manual chapter map: programming and machine models, then syntax and state, then memory, then opcode reference.
-
-- [Programming model](programming-model/tiles-and-valid-regions.md)
-- [Machine model](machine-model/execution-agents.md)
-- [Syntax and operands](syntax-and-operands/assembly-model.md)
-- [Type system](state-and-types/type-system.md)
-- [Location intent and legality](state-and-types/location-intent-and-legality.md)
-- [Memory model](memory-model/consistency-baseline.md)
-
-## Instruction Structure
-
-- [Instruction surfaces](instruction-surfaces/README.md)
-- [Instruction families](instruction-families/README.md)
-- [Format of instruction descriptions](reference/format-of-instruction-descriptions.md)
-- [Tile surface reference](tile/README.md)
-- [Vector surface reference](vector/README.md)
-- [Scalar and control reference](scalar/README.md)
-- [Other and communication reference](other/README.md)
-- [Common conventions](conventions.md)
-
-## Supporting Reference
-
-- [Reference notes](reference/README.md) (glossary, diagnostics, portability, source of truth)
-
-## Compatibility Wrappers
-
-The grouped surface trees under `tile/`, `vector/`, `scalar/`, and `other/` are the canonical PTO ISA paths.
-
-Some older root-level tile pages such as `TADD.md`, `TLOAD.md`, and `TMATMUL.md` now remain only as compatibility wrappers so existing links do not break immediately. New PTO ISA documentation should link to the grouped surface paths, especially the standalone per-op pages under:
-
-- `docs/isa/tile/ops/`
-- `docs/isa/vector/ops/`
-- `docs/isa/scalar/ops/`
diff --git a/docs/mkdocs/src/docs/isa/README_zh.md b/docs/mkdocs/src/docs/isa/README_zh.md
deleted file mode 100644
index 4f84f677..00000000
--- a/docs/mkdocs/src/docs/isa/README_zh.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- Generated from `docs/isa/README_zh.md` -->
-
-<p align="center">
-  <img src="../../figures/pto_logo.svg" alt="PTO Tile Lib" width="180" />
-</p>
-
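-# PTO ISA Manual And Reference
-
-This directory is the canonical PTO ISA documentation tree. It brings together the architecture manual, the surface guides, the family contracts, and the exact instruction-reference groupings in one place.
-
-## Textual Assembly Inside PTO ISA
-
-This tree is the canonical PTO ISA manual. Textual assembly spelling belongs to the PTO ISA syntax surface, not to a second parallel architecture manual.
-
-- PTO ISA defines architecture-visible semantics, legality, state, ordering, target-profile boundaries, and the visible behavior of `pto.t*`, `pto.v*`, `pto.*`, and other operations.
-- PTO-AS is the assembly spelling used to write those operations and operands. It is part of how PTO ISA is expressed, not a separate ISA with different semantics.
-
-If the question is "what does a PTO program mean on CPU, A2/A3, and A5?", stay in this tree. If the question is "what is the operand shape or textual spelling of this operation?", use the syntax-and-operands pages in this same tree.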
-
-## Start Here
-
-- [PTO ISA landing page](../PTO-Virtual-ISA-Manual_zh.md)
-- [Introduction](introduction/what-is-pto-visa.md)
-- [Document structure](introduction/document-structure.md) (chapter map and reading order)
-- [Goals Of PTO](introduction/goals-of-pto.md)
-- [PTO ISA Version 1.0](introduction/pto-isa-version-1-0.md)
-- [Scope And Boundaries](introduction/design-goals-and-boundaries.md)
-- [Tiles And Valid Regions](programming-model/tiles-and-valid-regions.md)
-- [Execution Agents And Target Profiles](machine-model/execution-agents.md)
-- [Assembly spelling and operands](syntax-and-operands/assembly-model.md)
-- [Operands and attributes](syntax-and-operands/operands-and-attributes.md)
-- [Common conventions](conventions.md)
-- [Type system](state-and-types/type-system.md)
-- [Location intent and legality](state-and-types/location-intent-and-legality.md)
-- [Consistency baseline](memory-model/consistency-baseline.md)
-
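-## Model Layers
-
-Reading order matches the manual chapter map: programming and machine models first, then syntax and state, then memory, and finally the opcode reference.
-
-- [Programming model](programming-model/tiles-and-valid-regions.md)
-- [Machine model](machine-model/execution-agents.md)
-- [Syntax and operands](syntax-and-operands/assembly-model.md)
-- [Type system](state-and-types/type-system.md)
-- [Location intent and legality](state-and-types/location-intent-and-legality.md)
-- [Memory model](memory-model/consistency-baseline.md)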
-
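-## Instruction Structure
-
-- [Instruction surfaces](instruction-surfaces/README.md)
-- [Instruction families](instruction-families/README.md)
-- [Format of instruction descriptions](reference/format-of-instruction-descriptions.md)
-- [Tile surface reference](tile/README.md)
-- [Vector surface reference](vector/README.md)
-- [Scalar and control reference](scalar/README.md)
-- [Other and communication reference](other/README.md)
-- [Common conventions](conventions.md)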
-
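-## Supporting Reference
-
-- [Reference notes](reference/README.md) (glossary, diagnostics, portability, source of truth)
-
-## Compatibility Redirects
-
-The grouped surface trees under `tile/`, `vector/`, `scalar/`, and `other/` are the canonical PTO ISA paths.
-
-Some older root-level tile pages (such as `TADD_zh.md`, `TLOAD_zh.md`, and `TMATMUL_zh.md`) now remain only as compatibility redirects so that existing links do not break immediately. New PTO ISA documentation should link to the grouped surface paths, especially the standalone per-op pages under: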
-
-- `docs/isa/tile/ops/`
-- `docs/isa/vector/ops/`
-- `docs/isa/scalar/ops/`
diff --git a/docs/mkdocs/src/docs/isa/TABS.md b/docs/mkdocs/src/docs/isa/TABS.md
deleted file mode 100644
index 1ee0daa4..00000000
--- a/docs/mkdocs/src/docs/isa/TABS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TABS.md` -->
-
-# pto.tabs
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tabs](./tile/ops/elementwise-tile-tile/tabs.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tabs](./tile/ops/elementwise-tile-tile/tabs.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TADD.md b/docs/mkdocs/src/docs/isa/TADD.md
deleted file mode 100644
index fa222d24..00000000
--- a/docs/mkdocs/src/docs/isa/TADD.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TADD.md` -->
-
-# pto.tadd
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tadd](./tile/ops/elementwise-tile-tile/tadd.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tadd](./tile/ops/elementwise-tile-tile/tadd.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TADDC.md b/docs/mkdocs/src/docs/isa/TADDC.md
deleted file mode 100644
index 2e8650f7..00000000
--- a/docs/mkdocs/src/docs/isa/TADDC.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TADDC.md` -->
-
-# pto.taddc
-
-This compatibility page points to the canonical tile-surface reference page for [pto.taddc](./tile/ops/elementwise-tile-tile/taddc.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.taddc](./tile/ops/elementwise-tile-tile/taddc.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TADDS.md b/docs/mkdocs/src/docs/isa/TADDS.md
deleted file mode 100644
index 1c99ff93..00000000
--- a/docs/mkdocs/src/docs/isa/TADDS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TADDS.md` -->
-
-# pto.tadds
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tadds](./tile/ops/tile-scalar-and-immediate/tadds.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tadds](./tile/ops/tile-scalar-and-immediate/tadds.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TADDSC.md b/docs/mkdocs/src/docs/isa/TADDSC.md
deleted file mode 100644
index 8578f178..00000000
--- a/docs/mkdocs/src/docs/isa/TADDSC.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TADDSC.md` -->
-
-# pto.taddsc
-
-This compatibility page points to the canonical tile-surface reference page for [pto.taddsc](./tile/ops/tile-scalar-and-immediate/taddsc.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.taddsc](./tile/ops/tile-scalar-and-immediate/taddsc.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TALIAS.md b/docs/mkdocs/src/docs/isa/TALIAS.md
deleted file mode 100644
index 0d7289df..00000000
--- a/docs/mkdocs/src/docs/isa/TALIAS.md
+++ /dev/null
@@ -1,42 +0,0 @@
-<!-- Generated from `docs/isa/TALIAS.md` -->
-
-# TALIAS
-
-## Tile Operation Diagram
-
-![TALIAS tile operation](../figures/isa/TALIAS.svg)
-
-## Introduction
-
-Create an alias tile view that shares the original tile storage.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.talias ...
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.talias ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TALIAS_zh.md b/docs/mkdocs/src/docs/isa/TALIAS_zh.md
deleted file mode 100644
index af90a9e6..00000000
--- a/docs/mkdocs/src/docs/isa/TALIAS_zh.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/TALIAS_zh.md` -->
-
-# TALIAS
-
-## Tile Operation Diagram
-
-![TALIAS tile operation](../figures/isa/TALIAS.svg)
-
-## Introduction
-
-Create an alias view that shares the underlying storage of the original tile.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-PTO-AS form: see `docs/assembly/PTO-AS.md`.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.talias ...
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.talias ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TAND.md b/docs/mkdocs/src/docs/isa/TAND.md
deleted file mode 100644
index b60b4a74..00000000
--- a/docs/mkdocs/src/docs/isa/TAND.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TAND.md` -->
-
-# pto.tand
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tand](./tile/ops/elementwise-tile-tile/tand.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tand](./tile/ops/elementwise-tile-tile/tand.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TANDS.md b/docs/mkdocs/src/docs/isa/TANDS.md
deleted file mode 100644
index ddf950e0..00000000
--- a/docs/mkdocs/src/docs/isa/TANDS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TANDS.md` -->
-
-# pto.tands
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tands](./tile/ops/tile-scalar-and-immediate/tands.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tands](./tile/ops/tile-scalar-and-immediate/tands.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TASSIGN.md b/docs/mkdocs/src/docs/isa/TASSIGN.md
deleted file mode 100644
index eed8ace0..00000000
--- a/docs/mkdocs/src/docs/isa/TASSIGN.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TASSIGN.md` -->
-
-# pto.tassign
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tassign](./tile/ops/sync-and-config/tassign.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Sync And Config](./tile/sync-and-config.md)
-- Canonical per-op page: [pto.tassign](./tile/ops/sync-and-config/tassign.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TAXPY.md b/docs/mkdocs/src/docs/isa/TAXPY.md
deleted file mode 100644
index 06de6e48..00000000
--- a/docs/mkdocs/src/docs/isa/TAXPY.md
+++ /dev/null
@@ -1,42 +0,0 @@
-<!-- Generated from `docs/isa/TAXPY.md` -->
-
-# TAXPY
-
-## Tile Operation Diagram
-
-![TAXPY tile operation](../figures/isa/TAXPY.svg)
-
-## Introduction
-
-AXPY-style fused update: multiply a tile by a scalar and accumulate into the destination tile.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.taxpy ...
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.taxpy ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TAXPY_zh.md b/docs/mkdocs/src/docs/isa/TAXPY_zh.md
deleted file mode 100644
index 904ea7da..00000000
--- a/docs/mkdocs/src/docs/isa/TAXPY_zh.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/TAXPY_zh.md` -->
-
-# TAXPY
-
-## Tile Operation Diagram
-
-![TAXPY tile operation](../figures/isa/TAXPY.svg)
-
-## Introduction
-
-AXPY-style fused update: multiply a tile by a scalar and accumulate into the destination tile.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-PTO-AS form: see `docs/assembly/PTO-AS.md`.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.taxpy ...
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.taxpy ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TCI.md b/docs/mkdocs/src/docs/isa/TCI.md
deleted file mode 100644
index 4055c7cf..00000000
--- a/docs/mkdocs/src/docs/isa/TCI.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCI.md` -->
-
-# pto.tci
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tci](./tile/ops/irregular-and-complex/tci.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tci](./tile/ops/irregular-and-complex/tci.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCMP.md b/docs/mkdocs/src/docs/isa/TCMP.md
deleted file mode 100644
index 06815b4a..00000000
--- a/docs/mkdocs/src/docs/isa/TCMP.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCMP.md` -->
-
-# pto.tcmp
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcmp](./tile/ops/elementwise-tile-tile/tcmp.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tcmp](./tile/ops/elementwise-tile-tile/tcmp.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCMPS.md b/docs/mkdocs/src/docs/isa/TCMPS.md
deleted file mode 100644
index cdd0b4b0..00000000
--- a/docs/mkdocs/src/docs/isa/TCMPS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCMPS.md` -->
-
-# pto.tcmps
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcmps](./tile/ops/tile-scalar-and-immediate/tcmps.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tcmps](./tile/ops/tile-scalar-and-immediate/tcmps.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLARGMAX.md b/docs/mkdocs/src/docs/isa/TCOLARGMAX.md
deleted file mode 100644
index 9f5ef2ed..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLARGMAX.md
+++ /dev/null
@@ -1,167 +0,0 @@
-<!-- Generated from `docs/isa/TCOLARGMAX.md` -->
-
-# pto.tcolargmax
-
-## Tile Operation Diagram
-
-![TCOLARGMAX tile operation](../figures/isa/TCOLARGMAX.svg)
-
-## Introduction
-
-Get the row index of the maximum element for each column.
-
-## Math Interpretation
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
-
-$$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmax}} \; \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
-```
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-- `src` may use ND or DN non-fractal layout because the checked helper only requires `SLayout::NoneBox`.
-- `dst` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- Supported destination element types: `uint32_t`, `int32_t`.
-- Compile-time check: `TileDataIn::ValidCol == 1 || TileDataIn::ValidCol == -1`.
-- Runtime checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `dst.GetValidRow() == 1`
-  - `src.GetValidCol() == dst.GetValidCol()`
-
-### A2A3 implementation checks
-
-- Supported source element types: `half`, `float`, `uint16_t`, `uint32_t`.
-- `tmp` must use the same element type as `src`.
-- In the checked A2A3 implementation path, `tmp` is used as scratch storage for index tracking and current comparison values.
-
-### A5 implementation checks
-
-- Supported source element sizes are 8-bit, 16-bit, or 32-bit; the checked implementation therefore covers `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `float`.
-- In the checked A5 implementation path, `tmp` is accepted by the interface but not used by `TCOLARGMAX_IMPL`.
-
-### About temporary tile `tmp` for A2A3
-
-- `tmp` is always used in the A2A3 implementation as scratch space for intermediate results (current index, argmax index, and current max elements).
-- The `tmp` tile's data type must be the same as `src`'s data type.
-- The `tmp` tile is organized into three regions within a single row:
-  - Region 0 (`[0, tmpGapEles)`): current row index counter (incremented per row).
-  - Region 1 (`[tmpGapEles, 2 * tmpGapEles)`): current maximum elements for comparison.
-  - Region 2 (`[2 * tmpGapEles, 3 * tmpGapEles)`): argmax index result (before final conversion to `dst`).
-- `tmpGapEles` is determined as follows:
-  - When `srcValidCol >= elemPerRpt`: `tmpGapEles = elemPerRpt`.
-  - When `srcValidCol < elemPerRpt`: `tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`.
-- When `src` is small, `tmp` can simply be given the same size as `src`; otherwise, compute the required stride from `src`'s `validCol` using the following formula:
-
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
-
-### About temporary tile `tmp` for A5
-
-- The `tmp` tile is **not used** in the A5 implementation. A5 uses vector-register-based computation (`__VEC_SCOPE__`) and does not require scratch tile storage.
-- `tmp` is retained in the C++ intrinsic signature solely for API compatibility with A2A3.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
-  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src(16, 255);
-  DstT dst(1, 255);
-  TmpT tmp(1, 32);
-  TCOLARGMAX(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
-  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src(16, 255);
-  DstT dst(1, 255);
-  TmpT tmp(1, 32);
-  TASSIGN(src, 0x0);
-  TASSIGN(dst, 0x1000);
-  TASSIGN(tmp, 0x2000);
-  TCOLARGMAX(dst, src, tmp);
-}
-```
-
-## ASM Form Examples
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/TCOLARGMAX_zh.md b/docs/mkdocs/src/docs/isa/TCOLARGMAX_zh.md
deleted file mode 100644
index 6f6cd5f8..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLARGMAX_zh.md
+++ /dev/null
@@ -1,167 +0,0 @@
-<!-- Generated from `docs/isa/TCOLARGMAX_zh.md` -->
-
-# TCOLARGMAX
-
-## Tile Operation Diagram
-
-![TCOLARGMAX tile operation](../figures/isa/TCOLARGMAX.svg)
-
-## Introduction
-
-Get the row index of the maximum element for each column.
-
-## Math Interpretation
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
-
-$$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmax}} \; \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
-```
-
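-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-- `src` may use an ND or DN non-fractal layout, because the checked helper only requires `SLayout::NoneBox`.
-- `dst` must use the standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- Supported destination element types: `uint32_t`, `int32_t`.
-- Compile-time check: `TileDataIn::ValidCol == 1 || TileDataIn::ValidCol == -1`.
-- Runtime checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `dst.GetValidRow() == 1`
-  - `src.GetValidCol() == dst.GetValidCol()`
-
-### A2A3 implementation checks
-
-- Supported source element types: `half`, `float`, `uint16_t`, `uint32_t`.
-- `tmp` must use the same element type as `src`.
-- In the checked A2A3 implementation path, `tmp` serves as scratch storage for index tracking and the current comparison values.
-
-### A5 implementation checks
-
-- Supported source element widths are 8-bit, 16-bit, or 32-bit, so the checked implementation covers `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `float`.
-- In the checked A5 implementation path, the interface still accepts `tmp`, but `TCOLARGMAX_IMPL` does not actually use it.
-
-### About temporary tile `tmp` for A2A3
-
-- `tmp` is **always used** in the A2A3 implementation as scratch space for intermediate results (current row index, argmax index, and current maximum elements).
-- The `tmp` tile's data type must be the same as `src`'s data type.
-- The `tmp` tile is divided into three regions within a single row:
-  - Region 0 (`[0, tmpGapEles)`): current row index counter (incremented per row).
-  - Region 1 (`[tmpGapEles, 2 * tmpGapEles)`): current maximum elements, used for comparison.
-  - Region 2 (`[2 * tmpGapEles, 3 * tmpGapEles)`): argmax index result (written to `dst` after the final conversion).
-- `tmpGapEles` is determined as follows:
-  - When `srcValidCol >= elemPerRpt`: `tmpGapEles = elemPerRpt`.
-  - When `srcValidCol < elemPerRpt`: `tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`.
-- When `src` is small, the `tmp` tile can simply be given the same size as `src`; alternatively, compute the stride the `tmp` tile requires from `src`'s `validCol` using the following formula: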
-
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
-
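-### About temporary tile `tmp` for A5
-
-- The `tmp` tile is **not used** in the A5 implementation. A5 uses vector-register-based computation (`__VEC_SCOPE__`) and does not require scratch tile storage.
-- `tmp` is retained in the C++ intrinsic signature solely for API compatibility with A2A3.
-
-## Examples
-
-### Auto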
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
-  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src(16, 255);
-  DstT dst(1, 255);
-  TmpT tmp(1, 32);
-  TCOLARGMAX(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
-  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src(16, 255);
-  DstT dst(1, 255);
-  TmpT tmp(1, 32);
-  TASSIGN(src, 0x0);
-  TASSIGN(dst, 0x1000);
-  TASSIGN(tmp, 0x2000);
-  TCOLARGMAX(dst, src, tmp);
-}
-```
-
-## ASM Form Examples
-
-### Auto Mode
-
-```text
-# Auto mode: resource placement and scheduling are handled by the compiler/runtime.
-%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/TCOLARGMIN.md b/docs/mkdocs/src/docs/isa/TCOLARGMIN.md
deleted file mode 100644
index e89b5303..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLARGMIN.md
+++ /dev/null
@@ -1,167 +0,0 @@
-<!-- Generated from `docs/isa/TCOLARGMIN.md` -->
-
-# pto.tcolargmin
-
-## Tile Operation Diagram
-
-![TCOLARGMIN tile operation](../figures/isa/TCOLARGMIN.svg)
-
-## Introduction
-
-Get the row index of the minimum element for each column.
-
-## Math Interpretation
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
-
-$$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
-```
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-- `src` may use ND or DN non-fractal layout because the checked helper only requires `SLayout::NoneBox`.
-- `dst` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- Supported destination element types: `uint32_t`, `int32_t`.
-- Compile-time check: `TileDataIn::ValidCol == 1 || TileDataIn::ValidCol == -1`.
-- Runtime checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `dst.GetValidRow() == 1`
-  - `src.GetValidCol() == dst.GetValidCol()`
-
-### A2A3 implementation checks
-
-- Supported source element types: `half`, `float`, `uint16_t`, `uint32_t`.
-- `tmp` must use the same element type as `src`.
-- In the checked A2A3 implementation path, `tmp` is used as scratch storage for index tracking and current comparison values.
-
-### A5 implementation checks
-
-- Supported source element sizes are 8-bit, 16-bit, or 32-bit; the checked implementation therefore covers `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `float`.
-- In the checked A5 implementation path, `tmp` is accepted by the interface but not used by `TCOLARGMIN_IMPL`.
-
-### About temporary tile `tmp` for A2A3
-
-- `tmp` is always used in the A2A3 implementation as scratch space for intermediate results (current index, argmin index, and current min elements).
-- The `tmp` tile's data type must be the same as `src`'s data type.
-- The `tmp` tile is organized into three regions within a single row:
-  - Region 0 (`[0, tmpGapEles)`): current row index counter (incremented per row).
-  - Region 1 (`[tmpGapEles, 2 * tmpGapEles)`): current minimum elements for comparison.
-  - Region 2 (`[2 * tmpGapEles, 3 * tmpGapEles)`): argmin index result (before final conversion to `dst`).
-- `tmpGapEles` is determined as follows:
-  - When `srcValidCol >= elemPerRpt`: `tmpGapEles = elemPerRpt`.
-  - When `srcValidCol < elemPerRpt`: `tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`.
-- When `src` is small, `tmp` can simply be given the same size as `src`; otherwise, compute the required stride from `src`'s `validCol` using the following formula:
-
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
-
-### About temporary tile `tmp` for A5
-
-- The `tmp` tile is **not used** in the A5 implementation. A5 uses vector-register-based computation (`__VEC_SCOPE__`) and does not require scratch tile storage.
-- `tmp` is retained in the C++ intrinsic signature solely for API compatibility with A2A3.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
-  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src(16, 255);
-  DstT dst(1, 255);
-  TmpT tmp(1, 32);
-  TCOLARGMIN(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
-  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src(16, 255);
-  DstT dst(1, 255);
-  TmpT tmp(1, 32);
-  TASSIGN(src, 0x0);
-  TASSIGN(dst, 0x1000);
-  TASSIGN(tmp, 0x2000);
-  TCOLARGMIN(dst, src, tmp);
-}
-```
-
-## ASM Form Examples
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/TCOLARGMIN_zh.md b/docs/mkdocs/src/docs/isa/TCOLARGMIN_zh.md
deleted file mode 100644
index 29bb9831..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLARGMIN_zh.md
+++ /dev/null
@@ -1,167 +0,0 @@
-<!-- Generated from `docs/isa/TCOLARGMIN_zh.md` -->
-
-# TCOLARGMIN
-
-## Tile Operation Diagram
-
-![TCOLARGMIN tile operation](../figures/isa/TCOLARGMIN.svg)
-
-## Introduction
-
-Get the row index of the minimum element for each column.
-
-## Math Interpretation
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
-
-$$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
-```
-
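-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-- `src` may use an ND or DN non-fractal layout, because the checked helper only requires `SLayout::NoneBox`.
-- `dst` must use the standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- Supported destination element types: `uint32_t`, `int32_t`.
-- Compile-time check: `TileDataIn::ValidCol == 1 || TileDataIn::ValidCol == -1`.
-- Runtime checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `dst.GetValidRow() == 1`
-  - `src.GetValidCol() == dst.GetValidCol()`
-
-### A2A3 implementation checks
-
-- Supported source element types: `half`, `float`, `uint16_t`, `uint32_t`.
-- `tmp` must use the same element type as `src`.
-- In the checked A2A3 implementation path, `tmp` serves as scratch storage for index tracking and the current comparison values.
-
-### A5 implementation checks
-
-- Supported source element widths are 8-bit, 16-bit, or 32-bit, so the checked implementation covers `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `float`.
-- In the checked A5 implementation path, the interface still accepts `tmp`, but `TCOLARGMIN_IMPL` does not actually use it.
-
-### About temporary tile `tmp` for A2A3
-
-- `tmp` is **always used** in the A2A3 implementation as scratch space for intermediate results (current row index, argmin index, and current minimum elements).
-- The `tmp` tile's data type must be the same as `src`'s data type.
-- The `tmp` tile is divided into three regions within a single row:
-  - Region 0 (`[0, tmpGapEles)`): current row index counter (incremented per row).
-  - Region 1 (`[tmpGapEles, 2 * tmpGapEles)`): current minimum elements, used for comparison.
-  - Region 2 (`[2 * tmpGapEles, 3 * tmpGapEles)`): argmin index result (written to `dst` after the final conversion).
-- `tmpGapEles` is determined as follows:
-  - When `srcValidCol >= elemPerRpt`: `tmpGapEles = elemPerRpt`.
-  - When `srcValidCol < elemPerRpt`: `tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`.
-- When `src` is small, the `tmp` tile can simply be given the same size as `src`; alternatively, compute the stride the `tmp` tile requires from `src`'s `validCol` using the following formula: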
-
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
-
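-### About temporary tile `tmp` for A5
-
-- The `tmp` tile is **not used** in the A5 implementation. A5 uses vector-register-based computation (`__VEC_SCOPE__`) and does not require scratch tile storage.
-- `tmp` is retained in the C++ intrinsic signature solely for API compatibility with A2A3.
-
-## Examples
-
-### Auto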
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
-  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src(16, 255);
-  DstT dst(1, 255);
-  TmpT tmp(1, 32);
-  TCOLARGMIN(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
-  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src(16, 255);
-  DstT dst(1, 255);
-  TmpT tmp(1, 32);
-  TASSIGN(src, 0x0);
-  TASSIGN(dst, 0x1000);
-  TASSIGN(tmp, 0x2000);
-  TCOLARGMIN(dst, src, tmp);
-}
-```
-
-## ASM Form Examples
-
-### Auto Mode
-
-```text
-# Auto mode: resource placement and scheduling are handled by the compiler/runtime.
-%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/TCOLEXPAND.md b/docs/mkdocs/src/docs/isa/TCOLEXPAND.md
deleted file mode 100644
index 1e701bb7..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLEXPAND.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLEXPAND.md` -->
-
-# pto.tcolexpand
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolexpand](./tile/ops/reduce-and-expand/tcolexpand.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolexpand](./tile/ops/reduce-and-expand/tcolexpand.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLEXPANDADD.md b/docs/mkdocs/src/docs/isa/TCOLEXPANDADD.md
deleted file mode 100644
index 250f41d3..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLEXPANDADD.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLEXPANDADD.md` -->
-
-# pto.tcolexpandadd
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolexpandadd](./tile/ops/reduce-and-expand/tcolexpandadd.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolexpandadd](./tile/ops/reduce-and-expand/tcolexpandadd.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLEXPANDDIV.md b/docs/mkdocs/src/docs/isa/TCOLEXPANDDIV.md
deleted file mode 100644
index 45dc9915..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLEXPANDDIV.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLEXPANDDIV.md` -->
-
-# pto.tcolexpanddiv
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolexpanddiv](./tile/ops/reduce-and-expand/tcolexpanddiv.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolexpanddiv](./tile/ops/reduce-and-expand/tcolexpanddiv.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLEXPANDEXPDIF.md b/docs/mkdocs/src/docs/isa/TCOLEXPANDEXPDIF.md
deleted file mode 100644
index 6acfe670..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLEXPANDEXPDIF.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLEXPANDEXPDIF.md` -->
-
-# pto.tcolexpandexpdif
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolexpandexpdif](./tile/ops/reduce-and-expand/tcolexpandexpdif.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolexpandexpdif](./tile/ops/reduce-and-expand/tcolexpandexpdif.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLEXPANDMAX.md b/docs/mkdocs/src/docs/isa/TCOLEXPANDMAX.md
deleted file mode 100644
index cd1a9e73..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLEXPANDMAX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLEXPANDMAX.md` -->
-
-# pto.tcolexpandmax
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolexpandmax](./tile/ops/reduce-and-expand/tcolexpandmax.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolexpandmax](./tile/ops/reduce-and-expand/tcolexpandmax.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLEXPANDMIN.md b/docs/mkdocs/src/docs/isa/TCOLEXPANDMIN.md
deleted file mode 100644
index 2ba2a37c..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLEXPANDMIN.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLEXPANDMIN.md` -->
-
-# pto.tcolexpandmin
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolexpandmin](./tile/ops/reduce-and-expand/tcolexpandmin.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolexpandmin](./tile/ops/reduce-and-expand/tcolexpandmin.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLEXPANDMUL.md b/docs/mkdocs/src/docs/isa/TCOLEXPANDMUL.md
deleted file mode 100644
index 74baa6d5..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLEXPANDMUL.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLEXPANDMUL.md` -->
-
-# pto.tcolexpandmul
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolexpandmul](./tile/ops/reduce-and-expand/tcolexpandmul.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolexpandmul](./tile/ops/reduce-and-expand/tcolexpandmul.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLEXPANDSUB.md b/docs/mkdocs/src/docs/isa/TCOLEXPANDSUB.md
deleted file mode 100644
index 43e49d72..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLEXPANDSUB.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLEXPANDSUB.md` -->
-
-# pto.tcolexpandsub
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolexpandsub](./tile/ops/reduce-and-expand/tcolexpandsub.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolexpandsub](./tile/ops/reduce-and-expand/tcolexpandsub.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLMAX.md b/docs/mkdocs/src/docs/isa/TCOLMAX.md
deleted file mode 100644
index 69e2e825..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLMAX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLMAX.md` -->
-
-# pto.tcolmax
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolmax](./tile/ops/reduce-and-expand/tcolmax.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolmax](./tile/ops/reduce-and-expand/tcolmax.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLMIN.md b/docs/mkdocs/src/docs/isa/TCOLMIN.md
deleted file mode 100644
index 93103118..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLMIN.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLMIN.md` -->
-
-# pto.tcolmin
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolmin](./tile/ops/reduce-and-expand/tcolmin.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolmin](./tile/ops/reduce-and-expand/tcolmin.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLPROD.md b/docs/mkdocs/src/docs/isa/TCOLPROD.md
deleted file mode 100644
index 917d3874..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLPROD.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLPROD.md` -->
-
-# pto.tcolprod
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolprod](./tile/ops/reduce-and-expand/tcolprod.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolprod](./tile/ops/reduce-and-expand/tcolprod.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCOLSUM.md b/docs/mkdocs/src/docs/isa/TCOLSUM.md
deleted file mode 100644
index 28ca1d92..00000000
--- a/docs/mkdocs/src/docs/isa/TCOLSUM.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCOLSUM.md` -->
-
-# pto.tcolsum
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcolsum](./tile/ops/reduce-and-expand/tcolsum.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.tcolsum](./tile/ops/reduce-and-expand/tcolsum.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TCONCAT.md b/docs/mkdocs/src/docs/isa/TCONCAT.md
deleted file mode 100644
index 03c54751..00000000
--- a/docs/mkdocs/src/docs/isa/TCONCAT.md
+++ /dev/null
@@ -1,42 +0,0 @@
-<!-- Generated from `docs/isa/TCONCAT.md` -->
-
-# TCONCAT
-
-## Tile Operation Diagram
-
-![TCONCAT tile operation](../figures/isa/TCONCAT.svg)
-
-## Introduction
-
-Concatenate two source tiles along the column dimension into a destination tile.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tconcat ...
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tconcat ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TCONCAT_zh.md b/docs/mkdocs/src/docs/isa/TCONCAT_zh.md
deleted file mode 100644
index ba0ed8b1..00000000
--- a/docs/mkdocs/src/docs/isa/TCONCAT_zh.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/TCONCAT_zh.md` -->
-
-# TCONCAT
-
-## Tile Operation Diagram
-
-![TCONCAT tile operation](../figures/isa/TCONCAT.svg)
-
-## Introduction
-
-Concatenate two source tiles along the column dimension into a destination tile.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-PTO-AS form: see `docs/assembly/PTO-AS.md`.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tconcat ...
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tconcat ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TCVT.md b/docs/mkdocs/src/docs/isa/TCVT.md
deleted file mode 100644
index 90ce4a31..00000000
--- a/docs/mkdocs/src/docs/isa/TCVT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TCVT.md` -->
-
-# pto.tcvt
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tcvt](./tile/ops/elementwise-tile-tile/tcvt.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tcvt](./tile/ops/elementwise-tile-tile/tcvt.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TDEQUANT.md b/docs/mkdocs/src/docs/isa/TDEQUANT.md
deleted file mode 100644
index fdbc9299..00000000
--- a/docs/mkdocs/src/docs/isa/TDEQUANT.md
+++ /dev/null
@@ -1,42 +0,0 @@
-<!-- Generated from `docs/isa/TDEQUANT.md` -->
-
-# TDEQUANT
-
-## Tile Operation Diagram
-
-![TDEQUANT tile operation](../figures/isa/TDEQUANT.svg)
-
-## Introduction
-
-Dequantize an integer tile into a floating-point tile using scale and offset tiles.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tdequant ...
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tdequant ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TDEQUANT_zh.md b/docs/mkdocs/src/docs/isa/TDEQUANT_zh.md
deleted file mode 100644
index e2a98f68..00000000
--- a/docs/mkdocs/src/docs/isa/TDEQUANT_zh.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/TDEQUANT_zh.md` -->
-
-# TDEQUANT
-
-## Tile Operation Diagram
-
-![TDEQUANT tile operation](../figures/isa/TDEQUANT.svg)
-
-## Introduction
-
-Dequantize an integer quantized tile into a floating-point tile using scale and offset tiles.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-PTO-AS form: see `docs/assembly/PTO-AS.md`.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tdequant ...
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tdequant ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TDIV.md b/docs/mkdocs/src/docs/isa/TDIV.md
deleted file mode 100644
index 2a87109f..00000000
--- a/docs/mkdocs/src/docs/isa/TDIV.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TDIV.md` -->
-
-# pto.tdiv
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tdiv](./tile/ops/elementwise-tile-tile/tdiv.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tdiv](./tile/ops/elementwise-tile-tile/tdiv.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TDIVS.md b/docs/mkdocs/src/docs/isa/TDIVS.md
deleted file mode 100644
index 7ecb8e5d..00000000
--- a/docs/mkdocs/src/docs/isa/TDIVS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TDIVS.md` -->
-
-# pto.tdivs
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tdivs](./tile/ops/tile-scalar-and-immediate/tdivs.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tdivs](./tile/ops/tile-scalar-and-immediate/tdivs.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TEXP.md b/docs/mkdocs/src/docs/isa/TEXP.md
deleted file mode 100644
index d10389c9..00000000
--- a/docs/mkdocs/src/docs/isa/TEXP.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TEXP.md` -->
-
-# pto.texp
-
-This compatibility page points to the canonical tile-surface reference page for [pto.texp](./tile/ops/elementwise-tile-tile/texp.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.texp](./tile/ops/elementwise-tile-tile/texp.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TEXPANDS.md b/docs/mkdocs/src/docs/isa/TEXPANDS.md
deleted file mode 100644
index 84129b2a..00000000
--- a/docs/mkdocs/src/docs/isa/TEXPANDS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TEXPANDS.md` -->
-
-# pto.texpands
-
-This compatibility page points to the canonical tile-surface reference page for [pto.texpands](./tile/ops/tile-scalar-and-immediate/texpands.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.texpands](./tile/ops/tile-scalar-and-immediate/texpands.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TEXTRACT.md b/docs/mkdocs/src/docs/isa/TEXTRACT.md
deleted file mode 100644
index 18f5b654..00000000
--- a/docs/mkdocs/src/docs/isa/TEXTRACT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TEXTRACT.md` -->
-
-# pto.textract
-
-This compatibility page points to the canonical tile-surface reference page for [pto.textract](./tile/ops/layout-and-rearrangement/textract.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.textract](./tile/ops/layout-and-rearrangement/textract.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TEXTRACT_FP.md b/docs/mkdocs/src/docs/isa/TEXTRACT_FP.md
deleted file mode 100644
index dfae7ce7..00000000
--- a/docs/mkdocs/src/docs/isa/TEXTRACT_FP.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TEXTRACT_FP.md` -->
-
-# pto.textract_fp
-
-This compatibility page points to the canonical tile-surface reference page for [pto.textract_fp](./tile/ops/layout-and-rearrangement/textract-fp.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.textract_fp](./tile/ops/layout-and-rearrangement/textract-fp.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TFILLPAD.md b/docs/mkdocs/src/docs/isa/TFILLPAD.md
deleted file mode 100644
index 057315f8..00000000
--- a/docs/mkdocs/src/docs/isa/TFILLPAD.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TFILLPAD.md` -->
-
-# pto.tfillpad
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tfillpad](./tile/ops/layout-and-rearrangement/tfillpad.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.tfillpad](./tile/ops/layout-and-rearrangement/tfillpad.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TFILLPAD_EXPAND.md b/docs/mkdocs/src/docs/isa/TFILLPAD_EXPAND.md
deleted file mode 100644
index 393100df..00000000
--- a/docs/mkdocs/src/docs/isa/TFILLPAD_EXPAND.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TFILLPAD_EXPAND.md` -->
-
-# pto.tfillpad_expand
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tfillpad_expand](./tile/ops/layout-and-rearrangement/tfillpad-expand.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.tfillpad_expand](./tile/ops/layout-and-rearrangement/tfillpad-expand.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TFILLPAD_INPLACE.md b/docs/mkdocs/src/docs/isa/TFILLPAD_INPLACE.md
deleted file mode 100644
index 639b6abf..00000000
--- a/docs/mkdocs/src/docs/isa/TFILLPAD_INPLACE.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TFILLPAD_INPLACE.md` -->
-
-# pto.tfillpad_inplace
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tfillpad_inplace](./tile/ops/layout-and-rearrangement/tfillpad-inplace.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.tfillpad_inplace](./tile/ops/layout-and-rearrangement/tfillpad-inplace.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TFMOD.md b/docs/mkdocs/src/docs/isa/TFMOD.md
deleted file mode 100644
index c2025557..00000000
--- a/docs/mkdocs/src/docs/isa/TFMOD.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TFMOD.md` -->
-
-# pto.tfmod
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tfmod](./tile/ops/elementwise-tile-tile/tfmod.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tfmod](./tile/ops/elementwise-tile-tile/tfmod.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TFMODS.md b/docs/mkdocs/src/docs/isa/TFMODS.md
deleted file mode 100644
index 81fe3176..00000000
--- a/docs/mkdocs/src/docs/isa/TFMODS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TFMODS.md` -->
-
-# pto.tfmods
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tfmods](./tile/ops/tile-scalar-and-immediate/tfmods.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tfmods](./tile/ops/tile-scalar-and-immediate/tfmods.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TFREE.md b/docs/mkdocs/src/docs/isa/TFREE.md
deleted file mode 100644
index 7fedb165..00000000
--- a/docs/mkdocs/src/docs/isa/TFREE.md
+++ /dev/null
@@ -1,42 +0,0 @@
-<!-- Generated from `docs/isa/TFREE.md` -->
-
-# TFREE
-
-## Tile Operation Diagram
-
-![TFREE tile operation](../figures/isa/TFREE.svg)
-
-## Introduction
-
-Release the currently held pipe or FIFO slot back to the producer.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tfree ...
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tfree ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TFREE_zh.md b/docs/mkdocs/src/docs/isa/TFREE_zh.md
deleted file mode 100644
index 0f428458..00000000
--- a/docs/mkdocs/src/docs/isa/TFREE_zh.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/TFREE_zh.md` -->
-
-# TFREE
-
-## Tile Operation Diagram
-
-![TFREE tile operation](../figures/isa/TFREE.svg)
-
-## Introduction
-
-Release the currently held pipe or FIFO slot back to the producer.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-PTO-AS form: see `docs/assembly/PTO-AS.md`.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfree ...
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfree ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TGATHER.md b/docs/mkdocs/src/docs/isa/TGATHER.md
deleted file mode 100644
index cb05aa73..00000000
--- a/docs/mkdocs/src/docs/isa/TGATHER.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TGATHER.md` -->
-
-# pto.tgather
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tgather](./tile/ops/irregular-and-complex/tgather.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tgather](./tile/ops/irregular-and-complex/tgather.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TGATHERB.md b/docs/mkdocs/src/docs/isa/TGATHERB.md
deleted file mode 100644
index 906756c6..00000000
--- a/docs/mkdocs/src/docs/isa/TGATHERB.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TGATHERB.md` -->
-
-# pto.tgatherb
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tgatherb](./tile/ops/irregular-and-complex/tgatherb.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tgatherb](./tile/ops/irregular-and-complex/tgatherb.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TGEMV.md b/docs/mkdocs/src/docs/isa/TGEMV.md
deleted file mode 100644
index 84d3f5b3..00000000
--- a/docs/mkdocs/src/docs/isa/TGEMV.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TGEMV.md` -->
-
-# pto.tgemv
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tgemv](./tile/ops/matrix-and-matrix-vector/tgemv.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Matrix And Matrix Vector](./tile/matrix-and-matrix-vector.md)
-- Canonical per-op page: [pto.tgemv](./tile/ops/matrix-and-matrix-vector/tgemv.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TGEMV_ACC.md b/docs/mkdocs/src/docs/isa/TGEMV_ACC.md
deleted file mode 100644
index 3000247b..00000000
--- a/docs/mkdocs/src/docs/isa/TGEMV_ACC.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TGEMV_ACC.md` -->
-
-# pto.tgemv_acc
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tgemv_acc](./tile/ops/matrix-and-matrix-vector/tgemv-acc.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Matrix And Matrix Vector](./tile/matrix-and-matrix-vector.md)
-- Canonical per-op page: [pto.tgemv_acc](./tile/ops/matrix-and-matrix-vector/tgemv-acc.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TGEMV_BIAS.md b/docs/mkdocs/src/docs/isa/TGEMV_BIAS.md
deleted file mode 100644
index 9fb05806..00000000
--- a/docs/mkdocs/src/docs/isa/TGEMV_BIAS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TGEMV_BIAS.md` -->
-
-# pto.tgemv_bias
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tgemv_bias](./tile/ops/matrix-and-matrix-vector/tgemv-bias.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Matrix And Matrix Vector](./tile/matrix-and-matrix-vector.md)
-- Canonical per-op page: [pto.tgemv_bias](./tile/ops/matrix-and-matrix-vector/tgemv-bias.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TGEMV_MX.md b/docs/mkdocs/src/docs/isa/TGEMV_MX.md
deleted file mode 100644
index f0089604..00000000
--- a/docs/mkdocs/src/docs/isa/TGEMV_MX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TGEMV_MX.md` -->
-
-# pto.tgemv_mx
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tgemv_mx](./tile/ops/matrix-and-matrix-vector/tgemv-mx.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Matrix And Matrix Vector](./tile/matrix-and-matrix-vector.md)
-- Canonical per-op page: [pto.tgemv_mx](./tile/ops/matrix-and-matrix-vector/tgemv-mx.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TGET_SCALE_ADDR.md b/docs/mkdocs/src/docs/isa/TGET_SCALE_ADDR.md
deleted file mode 100644
index c281c91a..00000000
--- a/docs/mkdocs/src/docs/isa/TGET_SCALE_ADDR.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TGET_SCALE_ADDR.md` -->
-
-# pto.tget_scale_addr
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tget_scale_addr](./tile/ops/sync-and-config/tget-scale-addr.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Sync And Config](./tile/sync-and-config.md)
-- Canonical per-op page: [pto.tget_scale_addr](./tile/ops/sync-and-config/tget-scale-addr.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/THISTOGRAM.md b/docs/mkdocs/src/docs/isa/THISTOGRAM.md
deleted file mode 100644
index 6ab799cf..00000000
--- a/docs/mkdocs/src/docs/isa/THISTOGRAM.md
+++ /dev/null
@@ -1,42 +0,0 @@
-<!-- Generated from `docs/isa/THISTOGRAM.md` -->
-
-# THISTOGRAM
-
-## Tile Operation Diagram
-
-![THISTOGRAM tile operation](../figures/isa/THISTOGRAM.svg)
-
-## Introduction
-
-Accumulate histogram bin counts from source values using an index tile.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.thistogram ...
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.thistogram ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/THISTOGRAM_zh.md b/docs/mkdocs/src/docs/isa/THISTOGRAM_zh.md
deleted file mode 100644
index 8353108b..00000000
--- a/docs/mkdocs/src/docs/isa/THISTOGRAM_zh.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/THISTOGRAM_zh.md` -->
-
-# THISTOGRAM
-
-## Tile Operation Diagram
-
-![THISTOGRAM tile operation](../figures/isa/THISTOGRAM.svg)
-
-## Introduction
-
-Accumulate histogram bin counts from source values using an index tile.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-PTO-AS form: see `docs/assembly/PTO-AS.md`.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.thistogram ...
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.thistogram ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TIMG2COL.md b/docs/mkdocs/src/docs/isa/TIMG2COL.md
deleted file mode 100644
index 83b1ddc2..00000000
--- a/docs/mkdocs/src/docs/isa/TIMG2COL.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TIMG2COL.md` -->
-
-# pto.timg2col
-
-This compatibility page points to the canonical tile-surface reference page for [pto.timg2col](./tile/ops/layout-and-rearrangement/timg2col.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.timg2col](./tile/ops/layout-and-rearrangement/timg2col.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TIMG2COL_zh.md b/docs/mkdocs/src/docs/isa/TIMG2COL_zh.md
deleted file mode 100644
index 2803541a..00000000
--- a/docs/mkdocs/src/docs/isa/TIMG2COL_zh.md
+++ /dev/null
@@ -1,60 +0,0 @@
-<!-- Generated from `docs/isa/TIMG2COL_zh.md` -->
-
-# TIMG2COL
-
-## Tile Operation Diagram
-
-![TIMG2COL tile operation](../figures/isa/TIMG2COL.svg)
-
-## Introduction
-
-Image-to-column transform for convolution-like workloads.
-
-## Math Interpretation
-
-Unless stated otherwise, semantics are defined over the valid region, and target-dependent behavior is marked as implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST RecordEvent TIMG2COL(TileData &dst, ConvTileData &src, uint16_t posM = 0, uint16_t posK = 0,
-                              WaitEvents&... events);
-```
-
-## 约束
-
-- This instruction is target/implementation-specific. See `include/pto/npu/*/TImg2col.hpp` for the supported tile types/layouts and config fields.
-
-## 示例
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/TINSERT.md b/docs/mkdocs/src/docs/isa/TINSERT.md
deleted file mode 100644
index f7805e0d..00000000
--- a/docs/mkdocs/src/docs/isa/TINSERT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TINSERT.md` -->
-
-# pto.tinsert
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tinsert](./tile/ops/layout-and-rearrangement/tinsert.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.tinsert](./tile/ops/layout-and-rearrangement/tinsert.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TINSERT_FP.md b/docs/mkdocs/src/docs/isa/TINSERT_FP.md
deleted file mode 100644
index 5f454735..00000000
--- a/docs/mkdocs/src/docs/isa/TINSERT_FP.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TINSERT_FP.md` -->
-
-# pto.tinsert_fp
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tinsert_fp](./tile/ops/layout-and-rearrangement/tinsert-fp.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.tinsert_fp](./tile/ops/layout-and-rearrangement/tinsert-fp.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TLOAD.md b/docs/mkdocs/src/docs/isa/TLOAD.md
deleted file mode 100644
index f84b4c83..00000000
--- a/docs/mkdocs/src/docs/isa/TLOAD.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TLOAD.md` -->
-
-# pto.tload
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tload](./tile/ops/memory-and-data-movement/tload.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Memory And Data Movement](./tile/memory-and-data-movement.md)
-- Canonical per-op page: [pto.tload](./tile/ops/memory-and-data-movement/tload.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TLOG.md b/docs/mkdocs/src/docs/isa/TLOG.md
deleted file mode 100644
index 2aa5b1b0..00000000
--- a/docs/mkdocs/src/docs/isa/TLOG.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TLOG.md` -->
-
-# pto.tlog
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tlog](./tile/ops/elementwise-tile-tile/tlog.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tlog](./tile/ops/elementwise-tile-tile/tlog.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TLRELU.md b/docs/mkdocs/src/docs/isa/TLRELU.md
deleted file mode 100644
index eba06688..00000000
--- a/docs/mkdocs/src/docs/isa/TLRELU.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TLRELU.md` -->
-
-# pto.tlrelu
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tlrelu](./tile/ops/tile-scalar-and-immediate/tlrelu.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tlrelu](./tile/ops/tile-scalar-and-immediate/tlrelu.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMATMUL.md b/docs/mkdocs/src/docs/isa/TMATMUL.md
deleted file mode 100644
index 14c2691d..00000000
--- a/docs/mkdocs/src/docs/isa/TMATMUL.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMATMUL.md` -->
-
-# pto.tmatmul
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmatmul](./tile/ops/matrix-and-matrix-vector/tmatmul.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Matrix And Matrix Vector](./tile/matrix-and-matrix-vector.md)
-- Canonical per-op page: [pto.tmatmul](./tile/ops/matrix-and-matrix-vector/tmatmul.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMATMUL_ACC.md b/docs/mkdocs/src/docs/isa/TMATMUL_ACC.md
deleted file mode 100644
index c8715daf..00000000
--- a/docs/mkdocs/src/docs/isa/TMATMUL_ACC.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMATMUL_ACC.md` -->
-
-# pto.tmatmul_acc
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmatmul_acc](./tile/ops/matrix-and-matrix-vector/tmatmul-acc.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Matrix And Matrix Vector](./tile/matrix-and-matrix-vector.md)
-- Canonical per-op page: [pto.tmatmul_acc](./tile/ops/matrix-and-matrix-vector/tmatmul-acc.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMATMUL_BIAS.md b/docs/mkdocs/src/docs/isa/TMATMUL_BIAS.md
deleted file mode 100644
index 18ee7eb7..00000000
--- a/docs/mkdocs/src/docs/isa/TMATMUL_BIAS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMATMUL_BIAS.md` -->
-
-# pto.tmatmul_bias
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmatmul_bias](./tile/ops/matrix-and-matrix-vector/tmatmul-bias.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Matrix And Matrix Vector](./tile/matrix-and-matrix-vector.md)
-- Canonical per-op page: [pto.tmatmul_bias](./tile/ops/matrix-and-matrix-vector/tmatmul-bias.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMATMUL_MX.md b/docs/mkdocs/src/docs/isa/TMATMUL_MX.md
deleted file mode 100644
index a58f688e..00000000
--- a/docs/mkdocs/src/docs/isa/TMATMUL_MX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMATMUL_MX.md` -->
-
-# pto.tmatmul_mx
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmatmul_mx](./tile/ops/matrix-and-matrix-vector/tmatmul-mx.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Matrix And Matrix Vector](./tile/matrix-and-matrix-vector.md)
-- Canonical per-op page: [pto.tmatmul_mx](./tile/ops/matrix-and-matrix-vector/tmatmul-mx.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMAX.md b/docs/mkdocs/src/docs/isa/TMAX.md
deleted file mode 100644
index cf859a52..00000000
--- a/docs/mkdocs/src/docs/isa/TMAX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMAX.md` -->
-
-# pto.tmax
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmax](./tile/ops/elementwise-tile-tile/tmax.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tmax](./tile/ops/elementwise-tile-tile/tmax.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMAXS.md b/docs/mkdocs/src/docs/isa/TMAXS.md
deleted file mode 100644
index eda0e955..00000000
--- a/docs/mkdocs/src/docs/isa/TMAXS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMAXS.md` -->
-
-# pto.tmaxs
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmaxs](./tile/ops/tile-scalar-and-immediate/tmaxs.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tmaxs](./tile/ops/tile-scalar-and-immediate/tmaxs.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMIN.md b/docs/mkdocs/src/docs/isa/TMIN.md
deleted file mode 100644
index 4856d91f..00000000
--- a/docs/mkdocs/src/docs/isa/TMIN.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMIN.md` -->
-
-# pto.tmin
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmin](./tile/ops/elementwise-tile-tile/tmin.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tmin](./tile/ops/elementwise-tile-tile/tmin.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMINS.md b/docs/mkdocs/src/docs/isa/TMINS.md
deleted file mode 100644
index 77d2430a..00000000
--- a/docs/mkdocs/src/docs/isa/TMINS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMINS.md` -->
-
-# pto.tmins
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmins](./tile/ops/tile-scalar-and-immediate/tmins.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tmins](./tile/ops/tile-scalar-and-immediate/tmins.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMOV.md b/docs/mkdocs/src/docs/isa/TMOV.md
deleted file mode 100644
index 5c46dd9e..00000000
--- a/docs/mkdocs/src/docs/isa/TMOV.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMOV.md` -->
-
-# pto.tmov
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmov](./tile/ops/layout-and-rearrangement/tmov.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.tmov](./tile/ops/layout-and-rearrangement/tmov.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMOV_FP.md b/docs/mkdocs/src/docs/isa/TMOV_FP.md
deleted file mode 100644
index 7950e450..00000000
--- a/docs/mkdocs/src/docs/isa/TMOV_FP.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMOV_FP.md` -->
-
-# pto.tmov_fp
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmov_fp](./tile/ops/layout-and-rearrangement/tmov-fp.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.tmov_fp](./tile/ops/layout-and-rearrangement/tmov-fp.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMRGSORT.md b/docs/mkdocs/src/docs/isa/TMRGSORT.md
deleted file mode 100644
index e2992e04..00000000
--- a/docs/mkdocs/src/docs/isa/TMRGSORT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMRGSORT.md` -->
-
-# pto.tmrgsort
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmrgsort](./tile/ops/irregular-and-complex/tmrgsort.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tmrgsort](./tile/ops/irregular-and-complex/tmrgsort.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMUL.md b/docs/mkdocs/src/docs/isa/TMUL.md
deleted file mode 100644
index 7129628b..00000000
--- a/docs/mkdocs/src/docs/isa/TMUL.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMUL.md` -->
-
-# pto.tmul
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmul](./tile/ops/elementwise-tile-tile/tmul.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tmul](./tile/ops/elementwise-tile-tile/tmul.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TMULS.md b/docs/mkdocs/src/docs/isa/TMULS.md
deleted file mode 100644
index cef47aea..00000000
--- a/docs/mkdocs/src/docs/isa/TMULS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TMULS.md` -->
-
-# pto.tmuls
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tmuls](./tile/ops/tile-scalar-and-immediate/tmuls.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tmuls](./tile/ops/tile-scalar-and-immediate/tmuls.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TNEG.md b/docs/mkdocs/src/docs/isa/TNEG.md
deleted file mode 100644
index f49b4b82..00000000
--- a/docs/mkdocs/src/docs/isa/TNEG.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TNEG.md` -->
-
-# pto.tneg
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tneg](./tile/ops/elementwise-tile-tile/tneg.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tneg](./tile/ops/elementwise-tile-tile/tneg.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TNOT.md b/docs/mkdocs/src/docs/isa/TNOT.md
deleted file mode 100644
index ae074984..00000000
--- a/docs/mkdocs/src/docs/isa/TNOT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TNOT.md` -->
-
-# pto.tnot
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tnot](./tile/ops/elementwise-tile-tile/tnot.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tnot](./tile/ops/elementwise-tile-tile/tnot.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TOR.md b/docs/mkdocs/src/docs/isa/TOR.md
deleted file mode 100644
index 88fe27b9..00000000
--- a/docs/mkdocs/src/docs/isa/TOR.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TOR.md` -->
-
-# pto.tor
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tor](./tile/ops/elementwise-tile-tile/tor.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tor](./tile/ops/elementwise-tile-tile/tor.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TORS.md b/docs/mkdocs/src/docs/isa/TORS.md
deleted file mode 100644
index b8ec70fd..00000000
--- a/docs/mkdocs/src/docs/isa/TORS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TORS.md` -->
-
-# pto.tors
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tors](./tile/ops/tile-scalar-and-immediate/tors.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tors](./tile/ops/tile-scalar-and-immediate/tors.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TPACK.md b/docs/mkdocs/src/docs/isa/TPACK.md
deleted file mode 100644
index a9126cc1..00000000
--- a/docs/mkdocs/src/docs/isa/TPACK.md
+++ /dev/null
@@ -1,42 +0,0 @@
-<!-- Generated from `docs/isa/TPACK.md` -->
-
-# TPACK
-
-## Tile Operation Diagram
-
-![TPACK tile operation](../figures/isa/TPACK.svg)
-
-## Introduction
-
-Pack or convert tile elements into a narrower destination representation.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tpack ...
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tpack ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TPACK_zh.md b/docs/mkdocs/src/docs/isa/TPACK_zh.md
deleted file mode 100644
index 8ec9f04c..00000000
--- a/docs/mkdocs/src/docs/isa/TPACK_zh.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/TPACK_zh.md` -->
-
-# TPACK
-
-## 指令示意图
-
-![TPACK tile operation](../figures/isa/TPACK.svg)
-
-## 简介
-
-将 Tile 元素打包或转换为更窄的目标表示。
-
-## 数学语义
-
-语义随指令而变化。 Unless stated otherwise, behavior is defined over the destination valid region.
-
-## 汇编语法
-
-PTO-AS 形式：参见 `docs/assembly/PTO-AS.md`.
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tpack ...
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tpack ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`.
-
-## 约束
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## 示例
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TPARTADD.md b/docs/mkdocs/src/docs/isa/TPARTADD.md
deleted file mode 100644
index 988c2c92..00000000
--- a/docs/mkdocs/src/docs/isa/TPARTADD.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TPARTADD.md` -->
-
-# pto.tpartadd
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tpartadd](./tile/ops/irregular-and-complex/tpartadd.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tpartadd](./tile/ops/irregular-and-complex/tpartadd.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TPARTMAX.md b/docs/mkdocs/src/docs/isa/TPARTMAX.md
deleted file mode 100644
index ca9fb450..00000000
--- a/docs/mkdocs/src/docs/isa/TPARTMAX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TPARTMAX.md` -->
-
-# pto.tpartmax
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tpartmax](./tile/ops/irregular-and-complex/tpartmax.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tpartmax](./tile/ops/irregular-and-complex/tpartmax.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TPARTMIN.md b/docs/mkdocs/src/docs/isa/TPARTMIN.md
deleted file mode 100644
index cbe51ade..00000000
--- a/docs/mkdocs/src/docs/isa/TPARTMIN.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TPARTMIN.md` -->
-
-# pto.tpartmin
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tpartmin](./tile/ops/irregular-and-complex/tpartmin.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tpartmin](./tile/ops/irregular-and-complex/tpartmin.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TPARTMUL.md b/docs/mkdocs/src/docs/isa/TPARTMUL.md
deleted file mode 100644
index bc2b8b95..00000000
--- a/docs/mkdocs/src/docs/isa/TPARTMUL.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TPARTMUL.md` -->
-
-# pto.tpartmul
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tpartmul](./tile/ops/irregular-and-complex/tpartmul.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tpartmul](./tile/ops/irregular-and-complex/tpartmul.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TPOP.md b/docs/mkdocs/src/docs/isa/TPOP.md
deleted file mode 100644
index c0cde74b..00000000
--- a/docs/mkdocs/src/docs/isa/TPOP.md
+++ /dev/null
@@ -1,42 +0,0 @@
-<!-- Generated from `docs/isa/TPOP.md` -->
-
-# TPOP
-
-## Tile Operation Diagram
-
-![TPOP tile operation](../figures/isa/TPOP.svg)
-
-## Introduction
-
-Pop a tile from a pipe or FIFO consumer endpoint.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tpop ...
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tpop ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TPOP_zh.md b/docs/mkdocs/src/docs/isa/TPOP_zh.md
deleted file mode 100644
index f05e52fa..00000000
--- a/docs/mkdocs/src/docs/isa/TPOP_zh.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/TPOP_zh.md` -->
-
-# TPOP
-
-## 指令示意图
-
-![TPOP tile operation](../figures/isa/TPOP.svg)
-
-## 简介
-
-从 pipe 或 FIFO 的消费者端弹出一个 Tile。
-
-## 数学语义
-
-语义随指令而变化。 Unless stated otherwise, behavior is defined over the destination valid region.
-
-## 汇编语法
-
-PTO-AS 形式：参见 `docs/assembly/PTO-AS.md`.
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tpop ...
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tpop ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`.
-
-## 约束
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## 示例
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TPREFETCH.md b/docs/mkdocs/src/docs/isa/TPREFETCH.md
deleted file mode 100644
index fffd66f9..00000000
--- a/docs/mkdocs/src/docs/isa/TPREFETCH.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TPREFETCH.md` -->
-
-# pto.tprefetch
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tprefetch](./tile/ops/memory-and-data-movement/tprefetch.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Memory And Data Movement](./tile/memory-and-data-movement.md)
-- Canonical per-op page: [pto.tprefetch](./tile/ops/memory-and-data-movement/tprefetch.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TPRELU.md b/docs/mkdocs/src/docs/isa/TPRELU.md
deleted file mode 100644
index fd4d0dbe..00000000
--- a/docs/mkdocs/src/docs/isa/TPRELU.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TPRELU.md` -->
-
-# pto.tprelu
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tprelu](./tile/ops/elementwise-tile-tile/tprelu.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tprelu](./tile/ops/elementwise-tile-tile/tprelu.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TPRINT.md b/docs/mkdocs/src/docs/isa/TPRINT.md
deleted file mode 100644
index c8089fee..00000000
--- a/docs/mkdocs/src/docs/isa/TPRINT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TPRINT.md` -->
-
-# pto.tprint
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tprint](./tile/ops/irregular-and-complex/tprint.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tprint](./tile/ops/irregular-and-complex/tprint.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TPUSH.md b/docs/mkdocs/src/docs/isa/TPUSH.md
deleted file mode 100644
index 992e3fe7..00000000
--- a/docs/mkdocs/src/docs/isa/TPUSH.md
+++ /dev/null
@@ -1,42 +0,0 @@
-<!-- Generated from `docs/isa/TPUSH.md` -->
-
-# TPUSH
-
-## Tile Operation Diagram
-
-![TPUSH tile operation](../figures/isa/TPUSH.svg)
-
-## Introduction
-
-Push a tile into a pipe or FIFO producer endpoint.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tpush ...
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tpush ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TPUSH_zh.md b/docs/mkdocs/src/docs/isa/TPUSH_zh.md
deleted file mode 100644
index 4f71fc03..00000000
--- a/docs/mkdocs/src/docs/isa/TPUSH_zh.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/TPUSH_zh.md` -->
-
-# TPUSH
-
-## Tile Operation Diagram
-
-![TPUSH tile operation](../figures/isa/TPUSH.svg)
-
-## Introduction
-
-Push a tile into the producer endpoint of a pipe or FIFO.
-
-## Math Interpretation
-
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
-
-## Assembly Syntax
-
-PTO-AS form: see `docs/assembly/PTO-AS.md`.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tpush ...
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tpush ins(...) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-## Constraints
-
-Refer to backend-specific legality checks for data type/layout/location/shape constraints.
-
-## Examples
-
-See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/docs/mkdocs/src/docs/isa/TQUANT.md b/docs/mkdocs/src/docs/isa/TQUANT.md
deleted file mode 100644
index 3f0c3a05..00000000
--- a/docs/mkdocs/src/docs/isa/TQUANT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TQUANT.md` -->
-
-# pto.tquant
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tquant](./tile/ops/irregular-and-complex/tquant.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tquant](./tile/ops/irregular-and-complex/tquant.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TRANDOM.md b/docs/mkdocs/src/docs/isa/TRANDOM.md
deleted file mode 100644
index e226e144..00000000
--- a/docs/mkdocs/src/docs/isa/TRANDOM.md
+++ /dev/null
@@ -1,123 +0,0 @@
-<!-- Generated from `docs/isa/TRANDOM.md` -->
-
-# TRANDOM
-
-
-## Tile Operation Diagram
-
-![TRANDOM tile operation](../figures/isa/TRANDOM.svg)
-
-## Introduction
-
-Generates random numbers in the destination tile using a counter-based cipher algorithm.
-
-## Math Interpretation
-
-This instruction implements a counter-based random number generator. For each element in the valid region, it generates pseudo-random values based on a key and counter state using a cipher-like transformation with configurable rounds.
-
-The algorithm uses:
-- 128-bit state (4 × 32-bit counters)
-- 64-bit key (2 × 32-bit words)
-- ChaCha-like quarter-round operations with vector instructions
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-trandom %dst, %key, %counter : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/npu/a5/TRandom.hpp`:
-
-```cpp
-template <uint16_t Rounds = 10, typename DstTile>
-PTO_INST void TRANDOM_IMPL(DstTile &dst, TRandomKey &key, TRandomCounter &counter);
-```
-
-## Constraints
-
-- **Implementation checks (A5)**:
-    - `DstTile::DType` must be one of: `int32_t`, `uint32_t`.
-    - Tile layout must be row-major (`DstTile::isRowMajor`).
-    - `Rounds` must be either 7 or 10 (default: 10).
-    - `key` and `counter` must not be null.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileT dst;
-  TRandomKey key = {0x01234, 0x56789};
-  TRandomCounter counter = {0, 0, 0, 0};
-  TRANDOM_IMPL(dst, key, counter);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileT dst;
-  TRandomKey key = {0x01234, 0x56789};
-  TRandomCounter counter = {0, 0, 0, 0};
-  TASSIGN(dst, 0x0);
-  TRANDOM_IMPL<10>(dst, key, counter);
-}
-```
-
-## ASM Form Examples
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x3000)
-%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-trandom %dst, %key, %counter : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/TRANDOM_zh.md b/docs/mkdocs/src/docs/isa/TRANDOM_zh.md
deleted file mode 100644
index 211c7cd9..00000000
--- a/docs/mkdocs/src/docs/isa/TRANDOM_zh.md
+++ /dev/null
@@ -1,123 +0,0 @@
-<!-- Generated from `docs/isa/TRANDOM_zh.md` -->
-
-# TRANDOM
-
-
-## Tile Operation Diagram
-
-![TRANDOM tile operation](../figures/isa/TRANDOM.svg)
-
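-## Introduction
-
-Generates random numbers in the destination tile using a counter-based cipher algorithm.
-
-## Math Interpretation
-
-This instruction implements a counter-based random number generator. For each element in the valid region, it generates pseudo-random values from a key and counter state using a cipher-like transformation with a configurable number of rounds.
-
-The algorithm uses:
-- 128-bit state (4 × 32-bit counters)
-- 64-bit key (2 × 32-bit words)
-- ChaCha-like quarter-round operations implemented with vector instructions
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS.md).
-
-Synchronous form: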
-
-```text
-trandom %dst, %key, %counter : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/npu/a5/TRandom.hpp`:
-
-```cpp
-template <uint16_t Rounds = 10, typename DstTile>
-PTO_INST void TRANDOM_IMPL(DstTile &dst, TRandomKey &key, TRandomCounter &counter);
-```
-
-## Constraints
-
-- **Implementation checks (A5)**:
-    - `DstTile::DType` must be one of: `int32_t`, `uint32_t`.
-    - Tile layout must be row-major (`DstTile::isRowMajor`).
-    - `Rounds` must be either 7 or 10 (default: 10).
-    - `key` and `counter` must not be null.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileT dst;
-  TRandomKey key = {0x01234, 0x56789};
-  TRandomCounter counter = {0, 0, 0, 0};
-  TRANDOM_IMPL(dst, key, counter);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileT dst;
-  TRandomKey key = {0x01234, 0x56789};
-  TRandomCounter counter = {0, 0, 0, 0};
-  TASSIGN(dst, 0x0);
-  TRANDOM_IMPL<10>(dst, key, counter);
-}
-```
-
-## ASM Form Examples
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x3000)
-%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-trandom %dst, %key, %counter : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/TRECIP.md b/docs/mkdocs/src/docs/isa/TRECIP.md
deleted file mode 100644
index ccaffdb4..00000000
--- a/docs/mkdocs/src/docs/isa/TRECIP.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TRECIP.md` -->
-
-# pto.trecip
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trecip](./tile/ops/elementwise-tile-tile/trecip.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.trecip](./tile/ops/elementwise-tile-tile/trecip.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TRELU.md b/docs/mkdocs/src/docs/isa/TRELU.md
deleted file mode 100644
index 5d509a93..00000000
--- a/docs/mkdocs/src/docs/isa/TRELU.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TRELU.md` -->
-
-# pto.trelu
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trelu](./tile/ops/elementwise-tile-tile/trelu.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.trelu](./tile/ops/elementwise-tile-tile/trelu.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TREM.md b/docs/mkdocs/src/docs/isa/TREM.md
deleted file mode 100644
index 7538b209..00000000
--- a/docs/mkdocs/src/docs/isa/TREM.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TREM.md` -->
-
-# pto.trem
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trem](./tile/ops/elementwise-tile-tile/trem.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.trem](./tile/ops/elementwise-tile-tile/trem.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TREMS.md b/docs/mkdocs/src/docs/isa/TREMS.md
deleted file mode 100644
index d68bac52..00000000
--- a/docs/mkdocs/src/docs/isa/TREMS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TREMS.md` -->
-
-# pto.trems
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trems](./tile/ops/tile-scalar-and-immediate/trems.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.trems](./tile/ops/tile-scalar-and-immediate/trems.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TRESHAPE.md b/docs/mkdocs/src/docs/isa/TRESHAPE.md
deleted file mode 100644
index ad69ae85..00000000
--- a/docs/mkdocs/src/docs/isa/TRESHAPE.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TRESHAPE.md` -->
-
-# pto.treshape
-
-This compatibility page points to the canonical tile-surface reference page for [pto.treshape](./tile/ops/layout-and-rearrangement/treshape.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.treshape](./tile/ops/layout-and-rearrangement/treshape.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWARGMAX.md b/docs/mkdocs/src/docs/isa/TROWARGMAX.md
deleted file mode 100644
index 94e9c3bf..00000000
--- a/docs/mkdocs/src/docs/isa/TROWARGMAX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWARGMAX.md` -->
-
-# pto.trowargmax
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowargmax](./tile/ops/reduce-and-expand/trowargmax.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowargmax](./tile/ops/reduce-and-expand/trowargmax.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWARGMIN.md b/docs/mkdocs/src/docs/isa/TROWARGMIN.md
deleted file mode 100644
index aba9ab7e..00000000
--- a/docs/mkdocs/src/docs/isa/TROWARGMIN.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWARGMIN.md` -->
-
-# pto.trowargmin
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowargmin](./tile/ops/reduce-and-expand/trowargmin.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowargmin](./tile/ops/reduce-and-expand/trowargmin.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWEXPAND.md b/docs/mkdocs/src/docs/isa/TROWEXPAND.md
deleted file mode 100644
index 85dc547a..00000000
--- a/docs/mkdocs/src/docs/isa/TROWEXPAND.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWEXPAND.md` -->
-
-# pto.trowexpand
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowexpand](./tile/ops/reduce-and-expand/trowexpand.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowexpand](./tile/ops/reduce-and-expand/trowexpand.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWEXPANDADD.md b/docs/mkdocs/src/docs/isa/TROWEXPANDADD.md
deleted file mode 100644
index de26075a..00000000
--- a/docs/mkdocs/src/docs/isa/TROWEXPANDADD.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWEXPANDADD.md` -->
-
-# pto.trowexpandadd
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowexpandadd](./tile/ops/reduce-and-expand/trowexpandadd.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowexpandadd](./tile/ops/reduce-and-expand/trowexpandadd.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWEXPANDDIV.md b/docs/mkdocs/src/docs/isa/TROWEXPANDDIV.md
deleted file mode 100644
index 1eccca35..00000000
--- a/docs/mkdocs/src/docs/isa/TROWEXPANDDIV.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWEXPANDDIV.md` -->
-
-# pto.trowexpanddiv
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowexpanddiv](./tile/ops/reduce-and-expand/trowexpanddiv.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowexpanddiv](./tile/ops/reduce-and-expand/trowexpanddiv.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWEXPANDEXPDIF.md b/docs/mkdocs/src/docs/isa/TROWEXPANDEXPDIF.md
deleted file mode 100644
index 2e47d5d5..00000000
--- a/docs/mkdocs/src/docs/isa/TROWEXPANDEXPDIF.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWEXPANDEXPDIF.md` -->
-
-# pto.trowexpandexpdif
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowexpandexpdif](./tile/ops/reduce-and-expand/trowexpandexpdif.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowexpandexpdif](./tile/ops/reduce-and-expand/trowexpandexpdif.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWEXPANDMAX.md b/docs/mkdocs/src/docs/isa/TROWEXPANDMAX.md
deleted file mode 100644
index ae07768e..00000000
--- a/docs/mkdocs/src/docs/isa/TROWEXPANDMAX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWEXPANDMAX.md` -->
-
-# pto.trowexpandmax
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowexpandmax](./tile/ops/reduce-and-expand/trowexpandmax.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowexpandmax](./tile/ops/reduce-and-expand/trowexpandmax.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWEXPANDMIN.md b/docs/mkdocs/src/docs/isa/TROWEXPANDMIN.md
deleted file mode 100644
index 8e862b09..00000000
--- a/docs/mkdocs/src/docs/isa/TROWEXPANDMIN.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWEXPANDMIN.md` -->
-
-# pto.trowexpandmin
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowexpandmin](./tile/ops/reduce-and-expand/trowexpandmin.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowexpandmin](./tile/ops/reduce-and-expand/trowexpandmin.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWEXPANDMUL.md b/docs/mkdocs/src/docs/isa/TROWEXPANDMUL.md
deleted file mode 100644
index 3c1005a6..00000000
--- a/docs/mkdocs/src/docs/isa/TROWEXPANDMUL.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWEXPANDMUL.md` -->
-
-# pto.trowexpandmul
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowexpandmul](./tile/ops/reduce-and-expand/trowexpandmul.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowexpandmul](./tile/ops/reduce-and-expand/trowexpandmul.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWEXPANDSUB.md b/docs/mkdocs/src/docs/isa/TROWEXPANDSUB.md
deleted file mode 100644
index 79643052..00000000
--- a/docs/mkdocs/src/docs/isa/TROWEXPANDSUB.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWEXPANDSUB.md` -->
-
-# pto.trowexpandsub
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowexpandsub](./tile/ops/reduce-and-expand/trowexpandsub.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowexpandsub](./tile/ops/reduce-and-expand/trowexpandsub.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWMAX.md b/docs/mkdocs/src/docs/isa/TROWMAX.md
deleted file mode 100644
index fc6588f6..00000000
--- a/docs/mkdocs/src/docs/isa/TROWMAX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWMAX.md` -->
-
-# pto.trowmax
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowmax](./tile/ops/reduce-and-expand/trowmax.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowmax](./tile/ops/reduce-and-expand/trowmax.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWMIN.md b/docs/mkdocs/src/docs/isa/TROWMIN.md
deleted file mode 100644
index 67f1efa8..00000000
--- a/docs/mkdocs/src/docs/isa/TROWMIN.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWMIN.md` -->
-
-# pto.trowmin
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowmin](./tile/ops/reduce-and-expand/trowmin.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowmin](./tile/ops/reduce-and-expand/trowmin.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TROWPROD.md b/docs/mkdocs/src/docs/isa/TROWPROD.md
deleted file mode 100644
index 2b78ac74..00000000
--- a/docs/mkdocs/src/docs/isa/TROWPROD.md
+++ /dev/null
@@ -1,150 +0,0 @@
-<!-- Generated from `docs/isa/TROWPROD.md` -->
-
-# pto.trowprod
-
-## Tile Operation Diagram
-
-![TROWPROD tile operation](../figures/isa/TROWPROD.svg)
-
-## Introduction
-
-Reduce each row by multiplying across columns.
-
-## Math Interpretation
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \prod_{j=0}^{C-1} \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowprod ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWPROD(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must both be `TileType::Vec`.
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- `dst` must use one of the following non-fractal layouts:
-  - ND layout (`BLayout::RowMajor`, `SLayout::NoneBox`), or
-  - DN layout with exactly one column (`BLayout::ColMajor`, `SLayout::NoneBox`, `Cols == 1`).
-- `dst` and `src` must use the same element type.
-- Runtime valid-region checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-- The intrinsic signature requires an explicit `tmp` operand.
-
-### A5 implementation checks
-
-- Supported element types: `half`, `float`, `int32_t`, `int16_t`.
-- In the currently inspected implementation path, the enforced constraints are on `src` and `dst`.
-- No extra shape/layout assertions on `tmp` are enforced in the current implementation path.
-
-## Implementation Notes
-
-`TROWPROD` follows the currently implemented A5 backend path in this codebase. It performs row-wise multiplication reduction directly from `src` to `dst` after validating `src`/`dst` constraints.
-
-The C++ intrinsic still takes a `tmp` operand for interface consistency:
-
-1. `tmp` remains part of the intrinsic signature and AS lowering form.
-2. The currently inspected implementation-enforced constraints are on `src` and `dst`.
-3. If another backend introduces additional `tmp` requirements later, the documentation should be updated to match that backend implementation exactly.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWPROD(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWPROD(dst, src, tmp);
-}
-```
-
-## ASM Form Examples
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowprod %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowprod ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/TROWPROD_zh.md b/docs/mkdocs/src/docs/isa/TROWPROD_zh.md
deleted file mode 100644
index bd065b49..00000000
--- a/docs/mkdocs/src/docs/isa/TROWPROD_zh.md
+++ /dev/null
@@ -1,150 +0,0 @@
-<!-- Generated from `docs/isa/TROWPROD_zh.md` -->
-
-# TROWPROD
-
-## Instruction Diagram
-
-![TROWPROD tile operation](../figures/isa/TROWPROD.svg)
-
-## Introduction
-
-Performs a product reduction over the elements of each row.
-
-## Math Definition
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \prod_{j=0}^{C-1} \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = trowprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce an internal temporary tile; the C++ intrinsic requires an explicit `tmp` operand.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowprod ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWPROD(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
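-## Constraints
-
-### General constraints and checks
-
-- `dst` and `src` must both be `TileType::Vec`.
-- `src` must use the standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- `dst` must use one of the following two non-fractal layouts:
-  - ND layout (`BLayout::RowMajor`, `SLayout::NoneBox`), or
-  - DN layout with exactly one column (`BLayout::ColMajor`, `SLayout::NoneBox`, `Cols == 1`).
-- `dst` and `src` must have the same element type.
-- Runtime valid-region checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-- The intrinsic signature requires an explicit `tmp` operand.
-
-### A5 implementation checks
-
-- Supported element types: `half`, `float`, `int32_t`, `int16_t`.
-- In the currently inspected implementation path, the enforced constraints are on `src` and `dst`.
-- The current implementation path enforces no extra shape/layout constraints on `tmp`.
-
-## Implementation Notes
-
-`TROWPROD` follows the currently implemented A5 backend path in this codebase. After validating the `src`/`dst` constraints, the implementation performs the row-wise product reduction directly.
-
-The C++ intrinsic still takes a `tmp` parameter for interface consistency:
-
-1. `tmp` remains part of the intrinsic signature and the AS lowering form.
-2. In the currently inspected implementation path, the enforced constraints are on `src` and `dst`.
-3. If another backend implementation of this instruction later introduces additional `tmp` requirements, the documentation should be updated to match that implementation.
-
-## Examples
-
-### Auto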
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWPROD(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWPROD(dst, src, tmp);
-}
-```
-
-## ASM Form Examples
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowprod %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowprod ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
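Reviewer note: the math definition in the removed `TROWPROD_zh.md` page (`dst[i,0]` is the product of `src[i,0..C-1]`) can be checked against a plain host-side reference. The sketch below is only an illustration under stated assumptions: `row_prod_reference` is a hypothetical helper that is not part of the PTO library, it uses `std::vector` instead of `Tile` types, and it ignores `tmp`, layouts, and events entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, NOT a PTO API.
// `src` holds a row-major rows x cols matrix (the valid region only).
// Returns one product per row, mirroring dst[i][0] in the documented formula.
std::vector<float> row_prod_reference(const std::vector<float>& src,
                                      std::size_t rows, std::size_t cols) {
  // Mirror the documented runtime checks: the valid region must be non-empty.
  assert(rows != 0 && cols != 0);
  assert(src.size() == rows * cols);

  std::vector<float> dst(rows, 1.0f);  // 1 is the multiplicative identity
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      dst[i] *= src[i * cols + j];
    }
  }
  return dst;
}
```

A reference like this could serve as the golden output when comparing against a CPU-SIM or device run; the accumulation order on real hardware may differ, so floating-point comparisons would likely need a tolerance.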
diff --git a/docs/mkdocs/src/docs/isa/TROWSUM.md b/docs/mkdocs/src/docs/isa/TROWSUM.md
deleted file mode 100644
index 29f2c9cc..00000000
--- a/docs/mkdocs/src/docs/isa/TROWSUM.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TROWSUM.md` -->
-
-# pto.trowsum
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trowsum](./tile/ops/reduce-and-expand/trowsum.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Reduce And Expand](./tile/reduce-and-expand.md)
-- Canonical per-op page: [pto.trowsum](./tile/ops/reduce-and-expand/trowsum.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TRSQRT.md b/docs/mkdocs/src/docs/isa/TRSQRT.md
deleted file mode 100644
index 92da2e3e..00000000
--- a/docs/mkdocs/src/docs/isa/TRSQRT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TRSQRT.md` -->
-
-# pto.trsqrt
-
-This compatibility page points to the canonical tile-surface reference page for [pto.trsqrt](./tile/ops/elementwise-tile-tile/trsqrt.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.trsqrt](./tile/ops/elementwise-tile-tile/trsqrt.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSCATTER.md b/docs/mkdocs/src/docs/isa/TSCATTER.md
deleted file mode 100644
index 85cef507..00000000
--- a/docs/mkdocs/src/docs/isa/TSCATTER.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSCATTER.md` -->
-
-# pto.tscatter
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tscatter](./tile/ops/irregular-and-complex/tscatter.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tscatter](./tile/ops/irregular-and-complex/tscatter.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSEL.md b/docs/mkdocs/src/docs/isa/TSEL.md
deleted file mode 100644
index 7871a652..00000000
--- a/docs/mkdocs/src/docs/isa/TSEL.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSEL.md` -->
-
-# pto.tsel
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsel](./tile/ops/elementwise-tile-tile/tsel.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tsel](./tile/ops/elementwise-tile-tile/tsel.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSELS.md b/docs/mkdocs/src/docs/isa/TSELS.md
deleted file mode 100644
index 32c15036..00000000
--- a/docs/mkdocs/src/docs/isa/TSELS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSELS.md` -->
-
-# pto.tsels
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsels](./tile/ops/tile-scalar-and-immediate/tsels.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tsels](./tile/ops/tile-scalar-and-immediate/tsels.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSETFMATRIX.md b/docs/mkdocs/src/docs/isa/TSETFMATRIX.md
deleted file mode 100644
index 6de17398..00000000
--- a/docs/mkdocs/src/docs/isa/TSETFMATRIX.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSETFMATRIX.md` -->
-
-# pto.tsetfmatrix
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsetfmatrix](./tile/ops/sync-and-config/tsetfmatrix.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Sync And Config](./tile/sync-and-config.md)
-- Canonical per-op page: [pto.tsetfmatrix](./tile/ops/sync-and-config/tsetfmatrix.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSETHF32MODE.md b/docs/mkdocs/src/docs/isa/TSETHF32MODE.md
deleted file mode 100644
index 913e52b4..00000000
--- a/docs/mkdocs/src/docs/isa/TSETHF32MODE.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSETHF32MODE.md` -->
-
-# pto.tsethf32mode
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsethf32mode](./tile/ops/sync-and-config/tsethf32mode.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Sync And Config](./tile/sync-and-config.md)
-- Canonical per-op page: [pto.tsethf32mode](./tile/ops/sync-and-config/tsethf32mode.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSETHF32MODE_zh.md b/docs/mkdocs/src/docs/isa/TSETHF32MODE_zh.md
deleted file mode 100644
index 84a0f2c8..00000000
--- a/docs/mkdocs/src/docs/isa/TSETHF32MODE_zh.md
+++ /dev/null
@@ -1,63 +0,0 @@
-<!-- Generated from `docs/isa/TSETHF32MODE_zh.md` -->
-
-# TSETHF32MODE
-
-## Instruction Diagram
-
-![TSETHF32MODE tile operation](../figures/isa/TSETHF32MODE.svg)
-
-## Introduction
-
-Sets the HF32 transform mode (implementation-defined).
-
-## Math Semantics
-
-No direct tensor arithmetic is produced by this instruction. It updates target mode state used by subsequent instructions.
-
-## Assembly Syntax
-
-PTO-AS form: see `docs/assembly/PTO-AS.md`.
-
-Schematic form:
-
-```text
-tsethf32mode {enable = true, mode = ...}
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tsethf32mode {enable = true, mode = ...}
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsethf32mode ins({enable = true, mode = ...}) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <bool isEnable, RoundMode hf32TransMode = RoundMode::CAST_ROUND, typename... WaitEvents>
-PTO_INST RecordEvent TSETHF32MODE(WaitEvents &... events);
-```
-
-## Constraints
-
-- Available only when the corresponding backend capability macro is enabled.
-- Exact mode values and hardware behavior are target-defined.
-- This instruction has control-state side effects and should be ordered appropriately relative to dependent compute instructions.
-
-## Example
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void example_enable_hf32() {
-  TSETHF32MODE<true, RoundMode::CAST_ROUND>();
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/TSETTF32MODE.md b/docs/mkdocs/src/docs/isa/TSETTF32MODE.md
deleted file mode 100644
index 4753d8ed..00000000
--- a/docs/mkdocs/src/docs/isa/TSETTF32MODE.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSETTF32MODE.md` -->
-
-# pto.tsettf32mode
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsettf32mode](./tile/ops/sync-and-config/tsettf32mode.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Sync And Config](./tile/sync-and-config.md)
-- Canonical per-op page: [pto.tsettf32mode](./tile/ops/sync-and-config/tsettf32mode.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSETTF32MODE_zh.md b/docs/mkdocs/src/docs/isa/TSETTF32MODE_zh.md
deleted file mode 100644
index aa3019b3..00000000
--- a/docs/mkdocs/src/docs/isa/TSETTF32MODE_zh.md
+++ /dev/null
@@ -1,63 +0,0 @@
-<!-- Generated from `docs/isa/TSETTF32MODE_zh.md` -->
-
-# TSETTF32MODE
-
-## Instruction Diagram
-
-![TSETTF32MODE tile operation](../figures/isa/TSETTF32MODE.svg)
-
-## Introduction
-
-Sets the TF32 transform mode (implementation-defined).
-
-## Math Semantics
-
-No direct tensor arithmetic is produced by this instruction. It updates target mode state used by subsequent instructions.
-
-## Assembly Syntax
-
-PTO-AS form: see `docs/assembly/PTO-AS.md`.
-
-Schematic form:
-
-```text
-tsettf32mode {enable = true, mode = ...}
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tsettf32mode {enable = true, mode = ...}
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsettf32mode ins({enable = true, mode = ...}) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <bool isEnable, RoundMode tf32TransMode = RoundMode::CAST_ROUND, typename... WaitEvents>
-PTO_INST RecordEvent TSETTF32MODE(WaitEvents &... events);
-```
-
-## Constraints
-
-- Available only when the corresponding backend capability macro is enabled.
-- Exact mode values and hardware behavior are target-defined.
-- This instruction has control-state side effects and should be ordered appropriately relative to dependent compute instructions.
-
-## Example
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void example_enable_tf32() {
-  TSETTF32MODE<true, RoundMode::CAST_ROUND>();
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/TSET_IMG2COL_PADDING.md b/docs/mkdocs/src/docs/isa/TSET_IMG2COL_PADDING.md
deleted file mode 100644
index 67e6f32f..00000000
--- a/docs/mkdocs/src/docs/isa/TSET_IMG2COL_PADDING.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSET_IMG2COL_PADDING.md` -->
-
-# pto.tset_img2col_padding
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tset_img2col_padding](./tile/ops/sync-and-config/tset-img2col-padding.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Sync And Config](./tile/sync-and-config.md)
-- Canonical per-op page: [pto.tset_img2col_padding](./tile/ops/sync-and-config/tset-img2col-padding.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSET_IMG2COL_PADDING_zh.md b/docs/mkdocs/src/docs/isa/TSET_IMG2COL_PADDING_zh.md
deleted file mode 100644
index 33616ac9..00000000
--- a/docs/mkdocs/src/docs/isa/TSET_IMG2COL_PADDING_zh.md
+++ /dev/null
@@ -1,82 +0,0 @@
-<!-- Generated from `docs/isa/TSET_IMG2COL_PADDING_zh.md` -->
-
-# TSET_IMG2COL_PADDING
-
-## Instruction Diagram
-
-![TSET_IMG2COL_PADDING tile operation](../figures/isa/TSET_IMG2COL_PADDING.svg)
-
-## Introduction
-
-Sets IMG2COL padding metadata from an IMG2COL configuration tile.
-
-## Math Semantics
-
-No direct tensor arithmetic is produced by this instruction. It updates IMG2COL padding control state consumed by subsequent data-movement operations.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Schematic form:
-
-```text
-tset_img2col_padding %cfg
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tset_img2col_padding ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-### AS Level 1（SSA）
-
-```text
-pto.tset_img2col_padding %cfg
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tset_img2col_padding ins(%cfg) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename ConvTileData, typename... WaitEvents>
-PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... events);
-
-template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
-PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... events);
-```
-
-For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
-
-## Constraints
-
-- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
-- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
-- The exact padding fields updated by this instruction are implementation-defined.
-- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
-
-## Example
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_set_img2col_padding(Img2colTileConfig<uint64_t>& cfg) {
-  TSET_IMG2COL_PADDING(cfg);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/TSET_IMG2COL_RPT.md b/docs/mkdocs/src/docs/isa/TSET_IMG2COL_RPT.md
deleted file mode 100644
index b733a24a..00000000
--- a/docs/mkdocs/src/docs/isa/TSET_IMG2COL_RPT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSET_IMG2COL_RPT.md` -->
-
-# pto.tset_img2col_rpt
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tset_img2col_rpt](./tile/ops/sync-and-config/tset-img2col-rpt.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Sync And Config](./tile/sync-and-config.md)
-- Canonical per-op page: [pto.tset_img2col_rpt](./tile/ops/sync-and-config/tset-img2col-rpt.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSET_IMG2COL_RPT_zh.md b/docs/mkdocs/src/docs/isa/TSET_IMG2COL_RPT_zh.md
deleted file mode 100644
index 33316a23..00000000
--- a/docs/mkdocs/src/docs/isa/TSET_IMG2COL_RPT_zh.md
+++ /dev/null
@@ -1,82 +0,0 @@
-<!-- Generated from `docs/isa/TSET_IMG2COL_RPT_zh.md` -->
-
-# TSET_IMG2COL_RPT
-
-## Instruction Diagram
-
-![TSET_IMG2COL_RPT tile operation](../figures/isa/TSET_IMG2COL_RPT.svg)
-
-## Introduction
-
-Sets IMG2COL repeat-count metadata from an IMG2COL configuration tile.
-
-## Math Semantics
-
-No direct tensor arithmetic is produced by this instruction. It updates IMG2COL control state used by subsequent data-movement operations.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Schematic form:
-
-```text
-tset_img2col_rpt %cfg
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tset_img2col_rpt ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-### AS Level 1（SSA）
-
-```text
-pto.tset_img2col_rpt %cfg
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tset_img2col_rpt ins(%cfg) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename ConvTileData, typename... WaitEvents>
-PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events);
-
-template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
-PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events);
-```
-
-For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
-
-## Constraints
-
-- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
-- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
-- The exact register/metadata fields updated by this instruction are implementation-defined.
-- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
-
-## Example
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_set_img2col_rpt(Img2colTileConfig<uint64_t>& cfg) {
-  TSET_IMG2COL_RPT(cfg);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/TSHL.md b/docs/mkdocs/src/docs/isa/TSHL.md
deleted file mode 100644
index 5f194781..00000000
--- a/docs/mkdocs/src/docs/isa/TSHL.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSHL.md` -->
-
-# pto.tshl
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tshl](./tile/ops/elementwise-tile-tile/tshl.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tshl](./tile/ops/elementwise-tile-tile/tshl.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSHLS.md b/docs/mkdocs/src/docs/isa/TSHLS.md
deleted file mode 100644
index 3e712f04..00000000
--- a/docs/mkdocs/src/docs/isa/TSHLS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSHLS.md` -->
-
-# pto.tshls
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tshls](./tile/ops/tile-scalar-and-immediate/tshls.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tshls](./tile/ops/tile-scalar-and-immediate/tshls.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSHR.md b/docs/mkdocs/src/docs/isa/TSHR.md
deleted file mode 100644
index 0e66ee8b..00000000
--- a/docs/mkdocs/src/docs/isa/TSHR.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSHR.md` -->
-
-# pto.tshr
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tshr](./tile/ops/elementwise-tile-tile/tshr.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tshr](./tile/ops/elementwise-tile-tile/tshr.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSHRS.md b/docs/mkdocs/src/docs/isa/TSHRS.md
deleted file mode 100644
index fc93595d..00000000
--- a/docs/mkdocs/src/docs/isa/TSHRS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSHRS.md` -->
-
-# pto.tshrs
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tshrs](./tile/ops/tile-scalar-and-immediate/tshrs.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tshrs](./tile/ops/tile-scalar-and-immediate/tshrs.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSORT32.md b/docs/mkdocs/src/docs/isa/TSORT32.md
deleted file mode 100644
index 8e03e32a..00000000
--- a/docs/mkdocs/src/docs/isa/TSORT32.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSORT32.md` -->
-
-# pto.tsort32
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsort32](./tile/ops/irregular-and-complex/tsort32.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.tsort32](./tile/ops/irregular-and-complex/tsort32.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSORT32_zh.md b/docs/mkdocs/src/docs/isa/TSORT32_zh.md
deleted file mode 100644
index c207090d..00000000
--- a/docs/mkdocs/src/docs/isa/TSORT32_zh.md
+++ /dev/null
@@ -1,152 +0,0 @@
-<!-- Generated from `docs/isa/TSORT32_zh.md` -->
-
-# TSORT32
-
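-## Instruction Diagram
-
-![TSORT32 tile operation](../figures/isa/TSORT32.svg)
-
-## Introduction
-
-Sorts each 32-element block of `src` together with the corresponding indices in `idx`, and writes the sorted value-index pairs to `dst`.
-
-## Math Semantics
-
-For each row, `TSORT32` processes `src` in independent 32-element blocks. Let block `b` cover columns `32b ... 32b+31`; with `C = src.GetValidCol()`, the block has `n_b = min(32, C - 32b)` valid elements.
-
-For each valid element in the block, first form a pair:
-
-$$
-(v_k, i_k) = (\mathrm{src}_{r,32b+k}, \mathrm{idx}_{r,32b+k}), \quad 0 \le k < n_b
-$$
-
-The pairs are then sorted by value, and the sorted value-index pairs are written to `dst`. The exact packed layout in `dst` is defined by the target implementation, but semantically the output of each block can be written as:
-
-$$
-[(v_{\pi(0)}, i_{\pi(0)}), (v_{\pi(1)}, i_{\pi(1)}), \ldots, (v_{\pi(n_b-1)}, i_{\pi(n_b-1)})]
-$$
-
-where `π` is the sorting permutation for that 32-element block.
-
-Notes:
-
-- `idx` is an input tile, not an output tile.
-- `dst` holds the sorted value-index pairs, not just the sorted values.
-- In the CPU simulation implementation, values are sorted in descending order; when values are equal, the smaller index comes first.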
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename IdxTileData>
-PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx);
-
-template <typename DstTileData, typename SrcTileData, typename IdxTileData, typename TmpTileData>
-PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx, TmpTileData &tmp);
-```
-
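-## Constraints
-
-- `TSORT32` does not accept `WaitEvents&...` parameters and does not call `TSYNC(...)` internally; synchronize explicitly where needed.
-- `idx` is a required input operand in both overloads; it supplies the indices that are reordered together with `src`.
-- **Implementation checks (A2A3/A5)**:
-    - `DstTileData::DType` must be `half` or `float`.
-    - `SrcTileData::DType` must match `DstTileData::DType`.
-    - `IdxTileData::DType` must be `uint32_t`.
-    - The `dst`/`src`/`idx` tile locations must be `TileType::Vec`, and all three must be row-major (`isRowMajor`).
-- **Valid region**:
-    - The implementation uses `dst.GetValidRow()` as the row count.
-    - The implementation uses `src.GetValidCol()` to determine how many elements per row take part in the sort.
-    - Sorting proceeds in independent 32-element blocks; the 4-argument overload additionally uses `tmp` to support a tail block that is not 32-aligned.
-
-## Examples
-
-### Auto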
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 1, 32>;
-  using IdxT = Tile<TileType::Vec, uint32_t, 1, 32>;
-  using DstT = Tile<TileType::Vec, float, 1, 64>;
-  SrcT src;
-  IdxT idx;
-  DstT dst;
-  TSORT32(dst, src, idx);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 1, 32>;
-  using IdxT = Tile<TileType::Vec, uint32_t, 1, 32>;
-  using DstT = Tile<TileType::Vec, float, 1, 64>;
-  SrcT src;
-  IdxT idx;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(idx, 0x2000);
-  TASSIGN(dst, 0x3000);
-  TSORT32(dst, src, idx);
-}
-```
-
-## ASM Examples
-
-### Auto Mode
-
-```text
-# Auto mode: the compiler/runtime handles resource placement and scheduling.
-%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly, then issue the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-# pto.tassign %arg2, @tile(0x3000)
-%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
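Reviewer note: the block-wise semantics described in the removed `TSORT32_zh.md` page can be sketched as a host-side reference for a single row. This is a rough illustration under several assumptions, not the library implementation: `sort32_row_reference` and `ValueIndex` are hypothetical names, the output is a logical list of pairs (the real packed layout in `dst` is target-defined), the ordering follows the CPU-simulation rule quoted in the page (descending by value), ties are assumed to be broken by the smaller value taken from `idx`, and NaN inputs are not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical logical output element: (value, index). NOT a PTO type.
using ValueIndex = std::pair<float, std::uint32_t>;

// Hypothetical host-side reference for one row, NOT a PTO API.
// `src` and `idx` hold the valid columns of that row.
std::vector<ValueIndex> sort32_row_reference(const std::vector<float>& src,
                                             const std::vector<std::uint32_t>& idx) {
  assert(src.size() == idx.size());

  // Pair every value with its index, as in (v_k, i_k).
  std::vector<ValueIndex> out;
  out.reserve(src.size());
  for (std::size_t k = 0; k < src.size(); ++k) {
    out.emplace_back(src[k], idx[k]);
  }

  // Sort each 32-element block independently; the tail block may be shorter.
  constexpr std::size_t kBlock = 32;
  for (std::size_t begin = 0; begin < out.size(); begin += kBlock) {
    const std::size_t end = std::min(begin + kBlock, out.size());
    std::sort(out.begin() + static_cast<std::ptrdiff_t>(begin),
              out.begin() + static_cast<std::ptrdiff_t>(end),
              [](const ValueIndex& a, const ValueIndex& b) {
                if (a.first != b.first) return a.first > b.first;  // descending value
                return a.second < b.second;  // assumed tie-break: smaller index first
              });
  }
  return out;
}
```

Because blocks are sorted independently, elements never move across a 32-element boundary; a full-row sort would require a separate merge step that this sketch does not attempt.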
diff --git a/docs/mkdocs/src/docs/isa/TSQRT.md b/docs/mkdocs/src/docs/isa/TSQRT.md
deleted file mode 100644
index 682985aa..00000000
--- a/docs/mkdocs/src/docs/isa/TSQRT.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSQRT.md` -->
-
-# pto.tsqrt
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsqrt](./tile/ops/elementwise-tile-tile/tsqrt.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tsqrt](./tile/ops/elementwise-tile-tile/tsqrt.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSTORE.md b/docs/mkdocs/src/docs/isa/TSTORE.md
deleted file mode 100644
index 7312a193..00000000
--- a/docs/mkdocs/src/docs/isa/TSTORE.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSTORE.md` -->
-
-# pto.tstore
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tstore](./tile/ops/memory-and-data-movement/tstore.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Memory And Data Movement](./tile/memory-and-data-movement.md)
-- Canonical per-op page: [pto.tstore](./tile/ops/memory-and-data-movement/tstore.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSTORE_FP.md b/docs/mkdocs/src/docs/isa/TSTORE_FP.md
deleted file mode 100644
index e954298d..00000000
--- a/docs/mkdocs/src/docs/isa/TSTORE_FP.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSTORE_FP.md` -->
-
-# pto.tstore_fp
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tstore_fp](./tile/ops/memory-and-data-movement/tstore-fp.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Memory And Data Movement](./tile/memory-and-data-movement.md)
-- Canonical per-op page: [pto.tstore_fp](./tile/ops/memory-and-data-movement/tstore-fp.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSUB.md b/docs/mkdocs/src/docs/isa/TSUB.md
deleted file mode 100644
index 01e4560b..00000000
--- a/docs/mkdocs/src/docs/isa/TSUB.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSUB.md` -->
-
-# pto.tsub
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsub](./tile/ops/elementwise-tile-tile/tsub.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tsub](./tile/ops/elementwise-tile-tile/tsub.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSUBC.md b/docs/mkdocs/src/docs/isa/TSUBC.md
deleted file mode 100644
index d09f3393..00000000
--- a/docs/mkdocs/src/docs/isa/TSUBC.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSUBC.md` -->
-
-# pto.tsubc
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsubc](./tile/ops/elementwise-tile-tile/tsubc.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.tsubc](./tile/ops/elementwise-tile-tile/tsubc.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSUBS.md b/docs/mkdocs/src/docs/isa/TSUBS.md
deleted file mode 100644
index bf7f61a2..00000000
--- a/docs/mkdocs/src/docs/isa/TSUBS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSUBS.md` -->
-
-# pto.tsubs
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsubs](./tile/ops/tile-scalar-and-immediate/tsubs.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tsubs](./tile/ops/tile-scalar-and-immediate/tsubs.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSUBSC.md b/docs/mkdocs/src/docs/isa/TSUBSC.md
deleted file mode 100644
index 3d6886ee..00000000
--- a/docs/mkdocs/src/docs/isa/TSUBSC.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSUBSC.md` -->
-
-# pto.tsubsc
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsubsc](./tile/ops/tile-scalar-and-immediate/tsubsc.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.tsubsc](./tile/ops/tile-scalar-and-immediate/tsubsc.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSUBVIEW.md b/docs/mkdocs/src/docs/isa/TSUBVIEW.md
deleted file mode 100644
index f66aba73..00000000
--- a/docs/mkdocs/src/docs/isa/TSUBVIEW.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSUBVIEW.md` -->
-
-# pto.tsubview
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsubview](./tile/ops/sync-and-config/tsubview.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Sync And Config](./tile/sync-and-config.md)
-- Canonical per-op page: [pto.tsubview](./tile/ops/sync-and-config/tsubview.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TSYNC.md b/docs/mkdocs/src/docs/isa/TSYNC.md
deleted file mode 100644
index 76f982f6..00000000
--- a/docs/mkdocs/src/docs/isa/TSYNC.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TSYNC.md` -->
-
-# pto.tsync
-
-This compatibility page points to the canonical tile-surface reference page for [pto.tsync](./tile/ops/sync-and-config/tsync.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Sync And Config](./tile/sync-and-config.md)
-- Canonical per-op page: [pto.tsync](./tile/ops/sync-and-config/tsync.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TTRANS.md b/docs/mkdocs/src/docs/isa/TTRANS.md
deleted file mode 100644
index 39791017..00000000
--- a/docs/mkdocs/src/docs/isa/TTRANS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TTRANS.md` -->
-
-# pto.ttrans
-
-This compatibility page points to the canonical tile-surface reference page for [pto.ttrans](./tile/ops/layout-and-rearrangement/ttrans.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Layout And Rearrangement](./tile/layout-and-rearrangement.md)
-- Canonical per-op page: [pto.ttrans](./tile/ops/layout-and-rearrangement/ttrans.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TTRI.md b/docs/mkdocs/src/docs/isa/TTRI.md
deleted file mode 100644
index c91e46fe..00000000
--- a/docs/mkdocs/src/docs/isa/TTRI.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TTRI.md` -->
-
-# pto.ttri
-
-This compatibility page points to the canonical tile-surface reference page for [pto.ttri](./tile/ops/irregular-and-complex/ttri.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Irregular And Complex](./tile/irregular-and-complex.md)
-- Canonical per-op page: [pto.ttri](./tile/ops/irregular-and-complex/ttri.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TXOR.md b/docs/mkdocs/src/docs/isa/TXOR.md
deleted file mode 100644
index cac55a73..00000000
--- a/docs/mkdocs/src/docs/isa/TXOR.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TXOR.md` -->
-
-# pto.txor
-
-This compatibility page points to the canonical tile-surface reference page for [pto.txor](./tile/ops/elementwise-tile-tile/txor.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Elementwise Tile Tile](./tile/elementwise-tile-tile.md)
-- Canonical per-op page: [pto.txor](./tile/ops/elementwise-tile-tile/txor.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/TXORS.md b/docs/mkdocs/src/docs/isa/TXORS.md
deleted file mode 100644
index 4170ddc5..00000000
--- a/docs/mkdocs/src/docs/isa/TXORS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-<!-- Generated from `docs/isa/TXORS.md` -->
-
-# pto.txors
-
-This compatibility page points to the canonical tile-surface reference page for [pto.txors](./tile/ops/tile-scalar-and-immediate/txors.md).
-
-The PTO ISA manual now treats tile, vector, and scalar/control operations consistently: the canonical per-op pages live under `docs/isa/tile/ops/`, `docs/isa/vector/ops/`, and `docs/isa/scalar/ops/`.
-
-## Canonical Location
-
-- Family overview: [Tile Scalar And Immediate](./tile/tile-scalar-and-immediate.md)
-- Canonical per-op page: [pto.txors](./tile/ops/tile-scalar-and-immediate/txors.md)
-
-## Compatibility Note
-
-Old links into the root-level tile pages continue to resolve through this wrapper, but new PTO ISA documentation should link to the grouped tile op path.
diff --git a/docs/mkdocs/src/docs/isa/comm/README.md b/docs/mkdocs/src/docs/isa/comm/README.md
deleted file mode 100644
index b49fcfe3..00000000
--- a/docs/mkdocs/src/docs/isa/comm/README.md
+++ /dev/null
@@ -1,109 +0,0 @@
-<!-- Generated from `docs/isa/comm/README.md` -->
-
-# PTO Communication ISA Reference
-
-This directory contains the per-instruction reference for the PTO Communication ISA. Communication operations enable data movement and synchronization across execution agents and parallel ranks.
-
-## Naming Convention
-
-`pto.t*` is the IR/assembly spelling; the corresponding `T*` is the C++ intrinsic spelling. Both refer to the same operation. This manual documents both spellings on each page.
-
-## Point-to-Point Communication (Synchronous)
-
-- [**TGET / pto.tget**](./TGET.md): Remote read — copy data from remote NPU GM to local GM via UB staging tile
-- [**TPUT / pto.tput**](./TPUT.md): Remote write — copy data from local GM to remote NPU GM via UB staging tile
-
-## Point-to-Point Communication (Asynchronous)
-
-- [**TGET_ASYNC / pto.tget_async**](./TGET_ASYNC.md): Asynchronous remote read
-- [**TPUT_ASYNC / pto.tput_async**](./TPUT_ASYNC.md): Asynchronous remote write
-
-## Signal-Based Synchronization
-
-- [**TNOTIFY / pto.tnotify**](./TNOTIFY.md): Send notification to remote NPU
-- [**TWAIT / pto.twait**](./TWAIT.md): Blocking wait for signal condition
-- [**TTEST / pto.ttest**](./TTEST.md): Non-blocking test of signal condition
-
-## Collective Communication
-
-- [**TBROADCAST / pto.tbroadcast**](./TBROADCAST.md): Broadcast from root NPU to all ranks
-- [**TGATHER / pto.tgather**](./TGATHER.md): Gather data from all ranks to root
-- [**TSCATTER / pto.tscatter**](./TSCATTER.md): Scatter data from root to all ranks
-- [**TREDUCE / pto.treduce**](./TREDUCE.md): Collective reduction across all ranks to root
-
-## Type Definitions
-
-These are normative specifications, not implementation declarations. Actual values are defined by each target profile.
-
-### NotifyOp
-
-Operation type for `TNOTIFY`:
-
-| Value | Description |
-|-------|-------------|
-| `NotifyOp::Set` | Direct set (`signal = value`) |
-| `NotifyOp::AtomicAdd` | Atomic add (`signal += value`) |
-
-### WaitCmp
-
-Comparison operators for `TWAIT` and `TTEST`:
-
-| Value | Description |
-|-------|-------------|
-| `WaitCmp::EQ` | Equal (`==`) |
-| `WaitCmp::NE` | Not equal (`!=`) |
-| `WaitCmp::GT` | Greater than (`>`) |
-| `WaitCmp::GE` | Greater or equal (`>=`) |
-| `WaitCmp::LT` | Less than (`<`) |
-| `WaitCmp::LE` | Less or equal (`<=`) |
-
-### ReduceOp
-
-Reduction operators for `TREDUCE`:
-
-| Value | Description |
-|-------|-------------|
-| `ReduceOp::Sum` | Element-wise sum |
-| `ReduceOp::Max` | Element-wise maximum |
-| `ReduceOp::Min` | Element-wise minimum |
-
-### AtomicType
-
-Atomic operation type for `TPUT`:
-
-| Value | Description |
-|-------|-------------|
-| `AtomicType::AtomicNone` | No atomic operation (default) |
-| `AtomicType::AtomicAdd` | Atomic add operation |
-
-### DmaEngine
-
-DMA backend selection for `TPUT_ASYNC` and `TGET_ASYNC`:
-
-| Value | Description |
-|-------|-------------|
-| `DmaEngine::SDMA` | SDMA engine (supports 2D transfer) |
-| `DmaEngine::URMA` | URMA engine (supports 1D transfer; availability is **profile-specific** and **MUST** be verified against the target profile specification) |
-
-### AsyncEvent
-
-Returned by `TPUT_ASYNC` / `TGET_ASYNC`. Represents an outstanding asynchronous DMA transfer. Programs use `AsyncEvent` to poll or block until the transfer completes:
-
-- A valid event **MUST** be tested with the corresponding `AsyncSession`
-- An invalid event (e.g., handle value of zero) indicates the operation completed synchronously or failed
-
-### AsyncSession
-
-Engine-agnostic session for async DMA operations. Programs build a session once and pass it to all async calls. The session encapsulates the engine type, scratch buffer, and workspace needed for asynchronous progress.
-
-### ParallelGroup
-
-Wrapper for collective communication across multiple ranks. Encapsulates:
-
-- An array of `GlobalData` objects (each wraps a GM address; addresses may be local or remote depending on the collective operation)
-- The number of participating ranks
-- The root rank index for root-based collectives
-
-## Source Of Truth
-
-The authoritative specification for communication operation behavior is the PTO ISA manual. Backend implementations in `include/pto/comm/` are **informative** and may reflect implementation details that are not part of the ISA guarantee.
diff --git a/docs/mkdocs/src/docs/isa/comm/README_zh.md b/docs/mkdocs/src/docs/isa/comm/README_zh.md
deleted file mode 100644
index fbd14f4c..00000000
--- a/docs/mkdocs/src/docs/isa/comm/README_zh.md
+++ /dev/null
@@ -1,109 +0,0 @@
-<!-- Generated from `docs/isa/comm/README_zh.md` -->
-
-# PTO 通信 ISA 参考
-
-本目录包含 PTO 通信 ISA 的逐指令参考。通信操作用于跨执行代理和并行 rank 的数据传输和同步。
-
-## 命名约定
-
-`pto.t*` 是 IR/汇编语法；对应的 `T*` 是 C++ 内建语法。两者指同一操作。本手册在每个页面同时记录两种拼写。
-
-## 点对点通信（同步）
-
-- [**TGET / pto.tget**](./TGET_zh.md): 远程读 — 通过 UB 暂存 tile 将远端 NPU GM 数据复制到本地 GM
-- [**TPUT / pto.tput**](./TPUT_zh.md): 远程写 — 通过 UB 暂存 tile 将本地 GM 数据复制到远端 NPU GM
-
-## 点对点通信（异步）
-
-- [**TGET_ASYNC / pto.tget_async**](./TGET_ASYNC_zh.md): 异步远程读
-- [**TPUT_ASYNC / pto.tput_async**](./TPUT_ASYNC_zh.md): 异步远程写
-
-## 信号式同步
-
-- [**TNOTIFY / pto.tnotify**](./TNOTIFY_zh.md): 向远端 NPU 发送通知
-- [**TWAIT / pto.twait**](./TWAIT_zh.md): 阻塞等待信号条件
-- [**TTEST / pto.ttest**](./TTEST_zh.md): 非阻塞测试信号条件
-
-## 集合通信
-
-- [**TBROADCAST / pto.tbroadcast**](./TBROADCAST_zh.md): 从根 NPU 广播到所有 rank
-- [**TGATHER / pto.tgather**](./TGATHER_zh.md): 从所有 rank 聚集到根节点
-- [**TSCATTER / pto.tscatter**](./TSCATTER_zh.md): 从根节点散射数据到所有 rank
-- [**TREDUCE / pto.treduce**](./TREDUCE_zh.md): 跨所有 rank 的集合归约到根节点
-
-## 类型定义
-
-以下为规范性规范，非实现声明。实际值由各目标 profile 定义。
-
-### NotifyOp
-
-`TNOTIFY` 的操作类型：
-
-| 值 | 描述 |
-|-------|-------------|
-| `NotifyOp::Set` | 直接设置 (`signal = value`) |
-| `NotifyOp::AtomicAdd` | 原子加 (`signal += value`) |
-
-### WaitCmp
-
-`TWAIT` 和 `TTEST` 的比较操作符：
-
-| 值 | 描述 |
-|-------|-------------|
-| `WaitCmp::EQ` | 等于 (`==`) |
-| `WaitCmp::NE` | 不等于 (`!=`) |
-| `WaitCmp::GT` | 大于 (`>`) |
-| `WaitCmp::GE` | 大于或等于 (`>=`) |
-| `WaitCmp::LT` | 小于 (`<`) |
-| `WaitCmp::LE` | 小于或等于 (`<=`) |
-
-### ReduceOp
-
-`TREDUCE` 的归约操作符：
-
-| 值 | 描述 |
-|-------|-------------|
-| `ReduceOp::Sum` | 按元素求和 |
-| `ReduceOp::Max` | 按元素取最大值 |
-| `ReduceOp::Min` | 按元素取最小值 |
-
-### AtomicType
-
-`TPUT` 的原子操作类型：
-
-| 值 | 描述 |
-|-------|-------------|
-| `AtomicType::AtomicNone` | 无原子操作（默认） |
-| `AtomicType::AtomicAdd` | 原子加操作 |
-
-### DmaEngine
-
-`TPUT_ASYNC` 和 `TGET_ASYNC` 的 DMA 后端选择：
-
-| 值 | 描述 |
-|-------|-------------|
-| `DmaEngine::SDMA` | SDMA 引擎（支持 2D 传输） |
-| `DmaEngine::URMA` | URMA 引擎（支持 1D 传输；可用性**因 profile 而异**，必须对照目标 profile 规范验证） |
-
-### AsyncEvent
-
-`TPUT_ASYNC` / `TGET_ASYNC` 的返回值。表示一个待处理的异步 DMA 传输。程序使用 `AsyncEvent` 来轮询或阻塞直到传输完成：
-
-- 有效事件**必须**与对应的 `AsyncSession` 一起使用
-- 无效事件（例如句柄值为零）表示操作已同步完成或失败
-
-### AsyncSession
-
-异步 DMA 操作的引擎无关会话。程序构建一次会话并将其传递给所有异步调用。会话封装了引擎类型、暂存缓冲区和异步推进所需的工作区。
-
-### ParallelGroup
-
-跨多个 rank 的集合通信封装器。包含：
-
-- `GlobalData` 对象数组（每个封装一个 GM 地址；根据集合操作类型，地址可能是本地或远端）
-- 参与 rank 的数量
-- 基于根节点的集合操作的根节点索引
-
-## 规范来源
-
-通信操作行为的权威规范是 PTO ISA 手册。`include/pto/comm/` 中的后端实现是**参考性**的，可能反映不属于 ISA 保证范围的实现细节。
diff --git a/docs/mkdocs/src/docs/isa/comm/TBROADCAST.md b/docs/mkdocs/src/docs/isa/comm/TBROADCAST.md
deleted file mode 100644
index d7cae951..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TBROADCAST.md
+++ /dev/null
@@ -1,124 +0,0 @@
-<!-- Generated from `docs/isa/comm/TBROADCAST.md` -->
-
-# TBROADCAST
-
-## Introduction
-
-Broadcast data from current NPU to all ranks in the parallel group. The calling NPU is the root and its data is copied to all other NPUs.
-
-Only the root needs to execute `TBROADCAST`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `TBROADCAST` on non-root ranks is undefined behavior.
-
-**Large Tile Support**: When the GlobalTensor exceeds the UB (Unified Buffer) tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.
-
-## Math Interpretation
-
-After the operation:
-
-$$ \mathrm{dst}^{(k)}_{i,j} = \mathrm{src}^{(\text{root})}_{i,j} \quad \forall k \in [0, N) $$
-
-where $N$ is the number of ranks and `root` is the calling NPU.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-tbroadcast %group, %src : (!pto.group<...>, !pto.memref<...>)
-```
-Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`:
-
-```cpp
-// Basic broadcast (single staging tile)
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                                TileData &stagingTileData, WaitEvents&... events);
-
-// Ping-pong broadcast (double buffering with two staging tiles)
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                                TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Type constraints**:
-    - `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
-    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
-- **Memory constraints**:
-    - `srcGlobalData` must point to local memory (current NPU).
-    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
-- **ParallelGroup constraints**:
-    - `parallelGroup.tensors[k]` must refer to rank `k`'s destination buffer (remote GM as seen by the root).
-    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the broadcast root.
-    - All destination tensors are assumed to have the same shape and strides.
-- **Chunked mode constraints** (when data exceeds a single UB tile):
-    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
-    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
-
-## Examples
-
-### Basic Broadcast
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void broadcast(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
-    // Tile dimensions can differ from tensor dimensions.
-    // The 2D sliding chunked path automatically tiles both row and column.
-    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GTensor = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GTensor(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
-    GTensor srcG(my_data);
-    TileT stagingTile(TILE_ROWS, TILE_COLS);
-
-    // Current NPU broadcasts its data to all others
-    comm::TBROADCAST(group, srcG, stagingTile);
-}
-```
-
-### Ping-Pong Broadcast (Double Buffering)
-
-Uses two UB tiles to overlap TLOAD of the next chunk with TSTORE of the current chunk.
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void broadcast_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
-
-    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GPerRank(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GPerRank srcG(my_data);
-    TileT pingTile(TILE_ROWS, TILE_COLS);
-    TileT pongTile(TILE_ROWS, TILE_COLS);
-
-    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
-    comm::TBROADCAST(group, srcG, pingTile, pongTile);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TBROADCAST_zh.md b/docs/mkdocs/src/docs/isa/comm/TBROADCAST_zh.md
deleted file mode 100644
index d66a5d97..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TBROADCAST_zh.md
+++ /dev/null
@@ -1,125 +0,0 @@
-<!-- Generated from `docs/isa/comm/TBROADCAST_zh.md` -->
-
-# TBROADCAST
-
-## 简介
-
-将当前 NPU 的数据广播到并行组中所有 rank。调用方 NPU 为根节点，其数据将被复制到所有其他 NPU。
-
-只有根节点需要执行 `TBROADCAST`。非根节点只需确保在操作期间其目标缓冲区已分配且可写。在非根节点上调用 `TBROADCAST` 属于未定义行为。
-
-**大 Tile 支持**：当 GlobalTensor 在行和/或列方向超出 UB（统一缓冲区）Tile 容量时，传输将通过二维滑动自动分块。
-
-## 数学语义
-
-操作完成后：
-
-$$ \mathrm{dst}^{(k)}_{i,j} = \mathrm{src}^{(\text{root})}_{i,j} \quad \forall k \in [0, N) $$
-
-其中 $N$ 为 rank 总数，`root` 为调用方 NPU。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-tbroadcast %group, %src : (!pto.group<...>, !pto.memref<...>)
-```
-
-降级时会为 GM→UB→GM 数据路径引入 UB 暂存 Tile；C++ 内建接口需要显式传入 `stagingTileData`（或 `pingTile` / `pongTile`）操作数。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-// 基础广播（单暂存 Tile）
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                                TileData &stagingTileData, WaitEvents&... events);
-
-// 乒乓广播（使用两个暂存 Tile 实现双缓冲）
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                                TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `ParallelGroup::value_type::RawDType` 必须等于 `GlobalSrcData::RawDType`。
-    - `TileData::DType` 必须等于 `GlobalSrcData::RawDType`。
-- **内存约束**：
-    - `srcGlobalData` 必须指向本地内存（当前 NPU）。
-    - `stagingTileData`（或 `pingTile` / `pongTile`）必须预先在 UB 中分配。
-- **ParallelGroup 约束**：
-    - `parallelGroup.tensors[k]` 必须指向 rank `k` 的目标缓冲区（从根节点视角看到的远端 GM）。
-    - `parallelGroup.GetRootIdx()` 标识调用方 NPU 为广播根节点。
-    - 所有目标 tensor 假定具有相同的形状和步幅。
-- **分块模式约束**（数据超出单个 UB Tile 时）：
-    - 若 `TileData` 具有静态 `ValidRow`，则 `GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足一行的情况，请使用 `DYNAMIC` ValidRow 的 Tile。
-    - 若 `TileData` 具有静态 `ValidCol`，则 `GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足一列的情况，请使用 `DYNAMIC` ValidCol 的 Tile。
-
-## 示例
-
-### 基础广播
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void broadcast(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
-    // Tile 维度可以与 tensor 维度不同。
-    // 二维滑动分块路径会自动在行和列两个方向进行分块。
-    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GTensor = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GTensor(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
-    GTensor srcG(my_data);
-    TileT stagingTile(TILE_ROWS, TILE_COLS);
-
-    // 当前 NPU 将自身数据广播到所有其他 NPU
-    comm::TBROADCAST(group, srcG, stagingTile);
-}
-```
-
-### 乒乓广播（双缓冲）
-
-使用两个 UB Tile，将下一块的 TLOAD 与当前块的 TSTORE 重叠执行。
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void broadcast_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
-
-    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GPerRank(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GPerRank srcG(my_data);
-    TileT pingTile(TILE_ROWS, TILE_COLS);
-    TileT pongTile(TILE_ROWS, TILE_COLS);
-
-    // 乒乓模式：将 TLOAD 与 TSTORE 重叠执行以提升吞吐量
-    comm::TBROADCAST(group, srcG, pingTile, pongTile);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TGATHER.md b/docs/mkdocs/src/docs/isa/comm/TGATHER.md
deleted file mode 100644
index 1207b19b..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TGATHER.md
+++ /dev/null
@@ -1,130 +0,0 @@
-<!-- Generated from `docs/isa/comm/TGATHER.md` -->
-
-# TGATHER
-
-## Introduction
-
-Gather operation: the calling NPU (root) collects data from all ranks in the parallel group and concatenates the results along **DIM_3** (row dimension) into a local output buffer.
-
-
-Only the root needs to execute `TGATHER`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `TGATHER` on non-root ranks is undefined behavior.
-
-**Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding — the same mechanism used by other PTO-COMM instructions.
-
-## Math Interpretation
-
-Each rank $r$ has source data of shape $(D_0, D_1, D_2, H, W)$. The gather concatenates all $N$ ranks along DIM_3:
-
-$$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, d_1, d_2,\; i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
-
-The destination tensor has shape $(D_0, D_1, D_2, N \times H, W)$.
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-tgather %group, %dst : (!pto.group<...>, !pto.memref<...>)
-```
-Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`:
-
-```cpp
-// Basic gather (single staging tile)
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                             TileData &stagingTileData, WaitEvents&... events);
-
-// Ping-pong gather (double buffering with two staging tiles)
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                             TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Type constraints**:
-    - `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
-    - `TileData::DType` must equal `GlobalDstData::RawDType`.
-- **Memory constraints**:
-    - `dstGlobalData` must point to local memory (current NPU) and be large enough to hold the concatenated result from all ranks. Specifically, `dstGlobalData.GetShape(DIM_3)` must be $\geq N \times H$ where $H$ is each rank's `GetShape(DIM_3)`.
-    - If `dstGlobalData.GetShape(DIM_3) > N × H`, only the first `N × H` rows are written; remaining rows are left unchanged.
-    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
-- **ParallelGroup constraints**:
-    - `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen by the root).
-    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the gather root.
-    - All source tensors are assumed to have the same shape and strides; behavior is undefined if they differ.
-- **Chunked mode constraints** (when source data exceeds a single UB tile):
-    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's source must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
-    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
-
-## Examples
-
-### Basic Gather (Single Staging Tile)
-
-Each rank contributes `ROWS × COLS` data. The root collects them into `NRANKS * ROWS` rows.
-The tile size (`TILE_ROWS × TILE_COLS`) can be smaller than the per-rank data — when it is, the implementation automatically chunks the transfer along both DIM_3 and DIM_4 via 2D sliding.
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void gather(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GResult = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GPerRank(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GResult dstG(result);
-    TileT stagingTile(TILE_ROWS, TILE_COLS);
-
-    comm::TGATHER(group, dstG, stagingTile);
-}
-```
-
-### Ping-Pong Gather (Double Buffering)
-
-Uses two UB tiles to overlap TLOAD of the next chunk (MTE2) with TSTORE of the current chunk (MTE3).
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    // Tile can be smaller than the data in both dimensions
-    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GResult = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GPerRank(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GResult dstG(result);
-    TileT pingTile(TILE_ROWS, TILE_COLS);
-    TileT pongTile(TILE_ROWS, TILE_COLS);
-
-    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
-    comm::TGATHER(group, dstG, pingTile, pongTile);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TGATHER_zh.md b/docs/mkdocs/src/docs/isa/comm/TGATHER_zh.md
deleted file mode 100644
index ef2fc8ad..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TGATHER_zh.md
+++ /dev/null
@@ -1,123 +0,0 @@
-<!-- Generated from `docs/isa/comm/TGATHER_zh.md` -->
-
-# TGATHER
-
-## 简介
-
-Gather 操作：调用方 NPU（根节点）从并行组中所有 rank 收集数据，并沿 **DIM_3**（行维度）拼接到本地输出缓冲区。
-
-只有根节点需要执行 `TGATHER`。非根节点只需确保在操作期间其源缓冲区已就绪且保持有效。在非根节点上调用 `TGATHER` 属于未定义行为。
-
-**大 Tile 支持**：当 GlobalTensor 在行和/或列方向超出 UB Tile 容量时，传输将通过二维滑动自动分块——与其他 PTO-COMM 指令采用相同机制。
-
-## 数学语义
-
-每个 rank $r$ 的源数据形状为 $(D_0, D_1, D_2, H, W)$。gather 沿 DIM_3 拼接所有 $N$ 个 rank 的数据：
-
-$$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, d_1, d_2,\; i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
-
-目标 tensor 的形状为 $(D_0, D_1, D_2, N \times H, W)$。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-tgather %group, %dst : (!pto.group<...>, !pto.memref<...>)
-```
-
-降级时会为 GM→UB→GM 数据路径引入 UB 暂存 Tile；C++ 内建接口需要显式传入 `stagingTileData`（或 `pingTile` / `pongTile`）操作数。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-// 基础 gather（单暂存 Tile）
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                             TileData &stagingTileData, WaitEvents&... events);
-
-// 乒乓 gather（使用两个暂存 Tile 实现双缓冲）
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                             TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `ParallelGroup::value_type::RawDType` 必须等于 `GlobalDstData::RawDType`。
-    - `TileData::DType` 必须等于 `GlobalDstData::RawDType`。
-- **内存约束**：
-    - `dstGlobalData` 必须指向本地内存（当前 NPU），且足够容纳所有 rank 拼接后的结果。具体要求：`dstGlobalData.GetShape(DIM_3)` 必须 $\geq N \times H$，其中 $H$ 为每个 rank 的 `GetShape(DIM_3)`。
-    - 若 `dstGlobalData.GetShape(DIM_3) > N × H`，则只写入前 `N × H` 行，其余行保持不变。
-    - `stagingTileData`（或 `pingTile` / `pongTile`）必须预先在 UB 中分配。
-- **ParallelGroup 约束**：
-    - `parallelGroup.tensors[r]` 必须指向 rank `r` 的源缓冲区（从根节点视角看到的远端 GM）。
-    - `parallelGroup.GetRootIdx()` 标识调用方 NPU 为 gather 根节点。
-    - 所有源 tensor 假定具有相同的形状和步幅；否则行为未定义。
-- **分块模式约束**（源数据超出单个 UB Tile 时）：
-    - 若 `TileData` 具有静态 `ValidRow`，则每个 rank 源数据的 `GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足一行的情况，请使用 `DYNAMIC` ValidRow 的 Tile。
-    - 若 `TileData` 具有静态 `ValidCol`，则 `GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足一列的情况，请使用 `DYNAMIC` ValidCol 的 Tile。
-
-## 示例
-
-### 基础 Gather（单暂存 Tile）
-
-每个 rank 提供 `ROWS × COLS` 的数据，根节点将其收集到 `NRANKS * ROWS` 行中。
-Tile 大小（`TILE_ROWS × TILE_COLS`）可小于每 rank 的数据——此时实现会自动沿 DIM_3 和 DIM_4 通过二维滑动进行分块传输。
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void gather(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GResult  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GResult dstG(result);
-    TileT stagingTile(TILE_ROWS, TILE_COLS);
-    comm::TGATHER(group, dstG, stagingTile);
-}
-```
-
-### 乒乓 Gather（双缓冲）
-
-使用两个 UB Tile，将下一块的 TLOAD（MTE2）与当前块的 TSTORE（MTE3）重叠执行。
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GResult  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GResult dstG(result);
-    TileT pingTile(TILE_ROWS, TILE_COLS);
-    TileT pongTile(TILE_ROWS, TILE_COLS);
-    // 乒乓模式：将 TLOAD 与 TSTORE 重叠执行以提升吞吐量
-    comm::TGATHER(group, dstG, pingTile, pongTile);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TGET.md b/docs/mkdocs/src/docs/isa/comm/TGET.md
deleted file mode 100644
index 4a875898..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TGET.md
+++ /dev/null
@@ -1,145 +0,0 @@
-<!-- Generated from `docs/isa/comm/TGET.md` -->
-
-# pto.tget / TGET
-
-## Summary
-
-Remote read operation: copies data from a remote NPU's global memory (GM) to local GM. `pto.tget` is the IR spelling; `TGET` is the C++ intrinsic spelling — both refer to the same operation.
-
-Data is transferred via a staging tile in the Unified Buffer (UB) as an intermediate buffer. The complete data path is:
-
-```
-remote GM ──► staging tile (UB) ──► local GM
-```
-
-## Math Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \mathrm{src}^{\mathrm{remote}}_{i,j} $$
-
-## Auto-Chunking (2D Sliding)
-
-When the `GlobalTensor` exceeds the UB tile capacity in rows or columns, `TGET` automatically performs **2D sliding** — chunking along rows (DIM_3) and columns (DIM_4) to fit each chunk into the tile, iterating over all outer dimensions (DIM_0, DIM_1, DIM_2).
-
-The kernel author does not need to partition the transfer manually; the staging tile size determines the chunk granularity.
-
-## Assembly Syntax
-
-PTO-AS form (IR/assembly spelling):
-
-```text
-pto.tget %dst_local, %src_remote : (!pto.memref<...>, !pto.memref<...>)
-```
-
-The lowering introduces a UB staging tile for the GM→UB→GM data path. The C++ intrinsic requires an explicit `stagingTileData` (or `pingTile`/`pongTile`) operand.
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`.
-
-### Single-tile (auto-chunking)
-
-```cpp
-template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData,
-                          GlobalSrcData &srcGlobalData,
-                          TileData &stagingTileData,
-                          WaitEvents&... events);
-```
-
-### Ping-pong double buffering
-
-Uses two staging tiles to overlap the transfers of adjacent chunks, hiding DMA transfer latency.
-
-```cpp
-template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData,
-                          GlobalSrcData &srcGlobalData,
-                          TileData &pingTile,
-                          TileData &pongTile,
-                          WaitEvents&... events);
-```
-
-## Constraints
-
-### Type constraints
-
-- `GlobalSrcData::RawDType` must equal `GlobalDstData::RawDType`
-- `TileData::DType` must equal `GlobalSrcData::RawDType`
-- `GlobalSrcData::layout` must equal `GlobalDstData::layout`
-
-### Memory constraints
-
-- `srcGlobalData` must point to a remote address (on the source NPU)
-- `dstGlobalData` must point to a local address (on the current NPU)
-- Staging tile(s) must be pre-allocated in UB
-
-### Ping-pong constraints
-
-- `pingTile` and `pongTile` must have identical type and dimensions
-- They must reside at non-overlapping UB offsets
-
-## Examples
-
-### Basic usage
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void remote_read(__gm__ T* local_data, __gm__ T* remote_addr) {
-    using TileT   = Tile<TileType::Vec, T, 16, 16>;
-    using GShape  = Shape<1, 1, 1, 16, 16>;
-    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-    GTensor srcG(remote_addr);
-    GTensor dstG(local_data);
-    TileT stagingTile;
-    TASSIGN(stagingTile, 0);
-
-    // Remote read: remote GM -> staging tile -> local GM
-    comm::TGET(dstG, srcG, stagingTile);
-}
-```
-
-### Large tensor with auto-chunking
-
-```cpp
-// GlobalTensor larger than UB tile: 2D sliding is automatic
-using GShape  = Shape<1, 1, 1, 4096, 4096>;
-using GStride = BaseShape2D<T, 4096, 4096, Layout::ND>;
-using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-GTensor srcG(remote_addr);
-GTensor dstG(local_data);
-TileT stagingTile(64, 64);   // chunk size = 64x64
-TASSIGN(stagingTile, 0);
-
-// TGET automatically chunks the 4096x4096 transfer into 64x64 tiles
-comm::TGET(dstG, srcG, stagingTile);
-```
-
-### Ping-pong double buffering
-
-```cpp
-constexpr size_t tileUBBytes = ((64 * 64 * sizeof(T) + 1023) / 1024) * 1024;  // tile bytes rounded up to 1 KiB
-TileT pingTile(64, 64);
-TileT pongTile(64, 64);
-TASSIGN(pingTile, 0);
-TASSIGN(pongTile, tileUBBytes);  // non-overlapping UB offsets
-
-// Overlaps the transfer of chunk i+1 with that of chunk i for better pipeline utilization
-comm::TGET(dstG, srcG, pingTile, pongTile);
-```
-
-## See Also
-
-- [Communication And Runtime](../communication-and-runtime.md) — Family overview
-- [TPUT](./TPUT.md) — Remote write (inverse operation)
-- [TBROADCAST](./TBROADCAST.md) — Broadcast from one NPU to all
-- [TGATHER](./TGATHER.md) — Gather from all NPUs to one
diff --git a/docs/mkdocs/src/docs/isa/comm/TGET_ASYNC.md b/docs/mkdocs/src/docs/isa/comm/TGET_ASYNC.md
deleted file mode 100644
index a62f3016..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TGET_ASYNC.md
+++ /dev/null
@@ -1,169 +0,0 @@
-<!-- Generated from `docs/isa/comm/TGET_ASYNC.md` -->
-
-# TGET_ASYNC
-
-## Introduction
-
-`TGET_ASYNC` is an asynchronous remote read primitive. It starts a transfer from remote GM to local GM and returns an `AsyncEvent` immediately.
-
-Data flow:
-
-`srcGlobalData (remote GM) -> DMA engine -> dstGlobalData (local GM)`
-
-## Template Parameter
-
-- `engine`:
-    - `DmaEngine::SDMA` (default)
-    - `DmaEngine::URMA` (not yet implemented)
-
-> **Important (SDMA path)**
-> `TGET_ASYNC` with `DmaEngine::SDMA` currently supports **only flat contiguous logical 1D tensors**.
-> Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`.
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA,
-          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
-PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                               const AsyncSession &session, WaitEvents &... events);
-```
-
-`AsyncSession` is an engine-agnostic session object. Build it once with
-`BuildAsyncSession<engine>()`, then pass it to all async calls and event waits.
-The `engine` template parameter selects the DMA backend at compile time, which keeps the
-code forward-compatible with future engines (URMA, CCU, etc.).
-
-## AsyncSession Construction
-
-Use `BuildAsyncSession` from `include/pto/comm/async/async_event_impl.hpp`:
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
-PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
-                                    __gm__ uint8_t *workspace,
-                                    AsyncSession &session,
-                                    uint32_t syncId = 0,
-                                    const sdma::SdmaBaseConfig &baseConfig = {1024 * 1024, 0, 1},
-                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
-```
-
-The engine template parameter selects the backend (currently only SDMA).
-
-Parameters with defaults:
-
-| Parameter | Default | Description |
-|---|---|---|
-| `syncId` | `0` | MTE3/MTE2 pipe sync event ID (0-7). Override it if the kernel uses other pipe barriers on the same ID. |
-| `baseConfig` | `{1024*1024, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`. Suitable for most single-queue transfers. |
-| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA channel group index. By default it uses `get_block_idx()` internally, which maps to the current AI core. Override it for multi-block or custom channel mapping scenarios. |
-
-## Constraints
-
-- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
-- `GlobalSrcData::layout == GlobalDstData::layout`
-- The SDMA path requires the source tensor to be **flat, contiguous, and logically 1D**
-- `workspace` must be a valid GM pointer allocated by the host-side `SdmaWorkspaceManager`
-
-If the 1D contiguous requirement is not met, the current implementation returns an invalid async event (`handle == 0`).
-
-## scratchTile Role
-
-`scratchTile` is **not** used to hold transferred payload data.
-It is converted to `TmpBuffer` and used as temporary UB workspace for:
-
-- writing/reading SDMA control words (flag, sq_tail, channel_info)
-- polling event completion flags
-- committing queue tail during completion
-
-The real payload path remains remote GM -> DMA engine -> local GM; `scratchTile` is only for control/synchronization metadata.
-
-## scratchTile Type and Size Constraints
-
-- must be a `pto::Tile` type
-- must be UB/Vec tile (`ScratchTile::Loc == TileType::Vec`)
-- available bytes must be at least `sizeof(uint64_t)` (8 bytes)
-
-Recommended: `Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>` (256B).
-
-## Completion Semantics (Quiet Semantics)
-
-`TGET_ASYNC` only submits data transfer SQEs without submitting a flag SQE. The flag SQE submission is deferred to the `Wait` call.
-
-- `event.Wait(session)` — submits a flag SQE and blocks until **all async operations issued since the last Wait** are complete
-
-This means after multiple `TGET_ASYNC` calls, a single `Wait` on the last returned `AsyncEvent` drains all pending operations (similar to shmem's quiet semantics).
-
-After `Wait` succeeds, all issued reads into `dstGlobalData` are complete.
-
-## Example
-
-### Single Transfer
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
-
-using namespace pto;
-
-template <typename T>
-__global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
-                                 __gm__ uint8_t *sdmaWorkspace)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(localDst, shape, stride);
-    GT srcG(remoteSrc, shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
-
-### Batch Transfer (Quiet Semantics)
-
-```cpp
-template <typename T>
-__global__ AICORE void BatchGet(__gm__ T *localDstBase, __gm__ T *remoteSrcBase,
-                                __gm__ uint8_t *sdmaWorkspace, int nranks)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    comm::AsyncEvent lastEvent;
-    for (int rank = 0; rank < nranks; ++rank) {
-        GT dstG(localDstBase + rank * 1024, shape, stride);
-        GT srcG(remoteSrcBase + rank * 1024, shape, stride);
-        lastEvent = comm::TGET_ASYNC(dstG, srcG, session);
-    }
-    (void)lastEvent.Wait(session);  // single Wait drains all pending ops
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TGET_ASYNC_zh.md b/docs/mkdocs/src/docs/isa/comm/TGET_ASYNC_zh.md
deleted file mode 100644
index 61ac0968..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TGET_ASYNC_zh.md
+++ /dev/null
@@ -1,164 +0,0 @@
-<!-- Generated from `docs/isa/comm/TGET_ASYNC_zh.md` -->
-
-# TGET_ASYNC
-
-## 简介
-
-`TGET_ASYNC` 是异步远程读原语。它启动一次从远端 GM 到本地 GM 的传输，并立即返回 `AsyncEvent`。
-
-数据流：
-
-`srcGlobalData（远端 GM）` → DMA 引擎 → `dstGlobalData（本地 GM）`
-
-## 模板参数
-
-- `engine`：
-    - `DmaEngine::SDMA`（默认）
-    - `DmaEngine::URMA`（待实现）
-
-> **注意（SDMA 路径）**
-> `TGET_ASYNC` 配合 `DmaEngine::SDMA` 目前**仅支持扁平连续的逻辑一维 tensor**。
-> 当前 SDMA 异步实现不支持非一维或非连续布局。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA,
-          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
-PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                               const AsyncSession &session, WaitEvents &... events);
-```
-
-`AsyncSession` 是引擎无关的会话对象。使用 `BuildAsyncSession<engine>()` 构建一次后，传递给所有异步调用和事件等待。模板参数 `engine` 在编译期选择 DMA 后端，使代码对未来引擎（URMA、CCU 等）保持前向兼容。
-
-## AsyncSession 构建
-
-使用 `include/pto/comm/async/async_event_impl.hpp` 中的 `BuildAsyncSession`：
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
-PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
-                                    __gm__ uint8_t *workspace,
-                                    AsyncSession &session,
-                                    uint32_t syncId = 0,
-                                    const sdma::SdmaBaseConfig &baseConfig = {1024 * 1024, 0, 1},
-                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
-```
-
-带默认值的参数说明：
-
-| 参数 | 默认值 | 说明 |
-|---|---|---|
-| `syncId` | `0` | MTE3/MTE2 管道同步事件 ID（0-7）。若 kernel 在相同 ID 上使用了其他管道屏障，则需覆盖此值。|
-| `baseConfig` | `{1024*1024, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`。适用于大多数单队列传输场景。|
-| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA 通道组索引。默认内部使用 `get_block_idx()` 映射到当前 AI Core。多 block 或自定义通道映射场景下需覆盖此值。|
-
-## 约束
-
-- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
-- `GlobalSrcData::layout == GlobalDstData::layout`
-- SDMA 路径要求源 tensor 为**扁平连续的逻辑一维**
-- workspace 必须是由主机侧 `SdmaWorkspaceManager` 分配的有效 GM 指针
-
-若不满足一维连续要求，当前实现返回无效 async event（`handle == 0`）。
-
-## scratchTile 的作用
-
-`scratchTile` **不是**用于传输数据负载的暂存缓冲区。
-它被转换为 `TmpBuffer`，用作临时 UB 工作区，用于：
-
-- 写入/读取 SDMA 控制字（flag、sq_tail、channel_info）
-- 轮询事件完成标志
-- 完成时提交队列尾部
-
-实际数据路径为远端 GM → DMA 引擎 → 本地 GM；`scratchTile` 仅用于控制和同步元数据。
-
-## scratchTile 类型与大小约束
-
-- 必须是 `pto::Tile` 类型
-- 必须是 UB/Vec tile（`ScratchTile::Loc == TileType::Vec`）
-- 可用字节数至少为 `sizeof(uint64_t)`（8 字节）
-
-推荐使用：`Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>`（256B）。
-
-## 完成语义（Quiet 语义）
-
-`TGET_ASYNC` 仅提交数据传输 SQE，不提交 flag SQE。flag SQE 的提交延迟到 `Wait` 调用时进行。
-
-- `event.Wait(session)` — 提交 flag SQE 并阻塞，直到**自上次 Wait 以来所有已发出的异步操作**全部完成
-
-这意味着多次 `TGET_ASYNC` 调用后，只需对最后一个返回的 `AsyncEvent` 调用一次 `Wait`，即可等待所有 pending 操作完成（类似 shmem 的 quiet 语义）。
-
-wait 成功后，所有已发出的 `dstGlobalData` 读入数据均已全部就绪。
-
-## 示例
-
-### 单次传输
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
-
-using namespace pto;
-
-template <typename T>
-__global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
-                                 __gm__ uint8_t *sdmaWorkspace)
-{
-    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(localDst,  shape, stride);
-    GT srcG(remoteSrc, shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
-
-### 批量传输（Quiet 语义）
-
-```cpp
-template <typename T>
-__global__ AICORE void BatchGet(__gm__ T *localDstBase, __gm__ T *remoteSrcBase,
-                                __gm__ uint8_t *sdmaWorkspace, int nranks)
-{
-    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    comm::AsyncEvent lastEvent;
-    for (int rank = 0; rank < nranks; ++rank) {
-        GT dstG(localDstBase + rank * 1024, shape, stride);
-        GT srcG(remoteSrcBase + rank * 1024, shape, stride);
-        lastEvent = comm::TGET_ASYNC(dstG, srcG, session);
-    }
-    (void)lastEvent.Wait(session);  // 一次 Wait 等待所有 pending 操作
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TGET_zh.md b/docs/mkdocs/src/docs/isa/comm/TGET_zh.md
deleted file mode 100644
index 46734a31..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TGET_zh.md
+++ /dev/null
@@ -1,141 +0,0 @@
-<!-- Generated from `docs/isa/comm/TGET_zh.md` -->
-
-# pto.tget / TGET
-
-## 简介
-
-远程读操作：将远端 NPU 的数据读取到本地内存。`pto.tget` 是 IR 语法，`TGET` 是 C++ 内建语法 — 两者指同一操作。
-
-数据通过 UB 中的 staging tile 作为中间暂存缓冲区传输。完整数据流为：
-
-```
-远端 GM ──► staging tile (UB) ──► 本地 GM
-```
-
-当 GlobalTensor 超出 UB Tile 容量时，TGET 将自动执行**二维滑动**——沿行（DIM_3）和列（DIM_4）分块以适配 Tile，并遍历所有外层维度（DIM_0、DIM_1、DIM_2）。
-
-## 数学语义
-
-对有效区域内每个元素 `(i, j)`：
-
-$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \mathrm{src}^{\mathrm{remote}}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式（IR/汇编语法）：
-
-```text
-pto.tget %dst_local, %src_remote : (!pto.memref<...>, !pto.memref<...>)
-```
-
-降级时会为 GM→UB→GM 数据路径引入 UB 暂存 Tile；C++ 内建接口需要显式传入 `stagingTileData`（或 `pingTile` / `pongTile`）操作数。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`
-
-### 单 Tile（自动分块）
-
-```cpp
-template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData,
-                          GlobalSrcData &srcGlobalData,
-                          TileData &stagingTileData,
-                          WaitEvents&... events);
-```
-
-### 乒乓双缓冲
-
-使用两个暂存 Tile，将相邻块的传输重叠执行，隐藏 DMA 传输延迟。
-
-```cpp
-template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData,
-                          GlobalSrcData &srcGlobalData,
-                          TileData &pingTile,
-                          TileData &pongTile,
-                          WaitEvents&... events);
-```
-
-## 约束
-
-### 类型约束
-
-- `GlobalSrcData::RawDType` 必须等于 `GlobalDstData::RawDType`
-- `TileData::DType` 必须等于 `GlobalSrcData::RawDType`
-- `GlobalSrcData::layout` 必须等于 `GlobalDstData::layout`
-
-### 内存约束
-
-- `srcGlobalData` 必须指向远端地址（源 NPU）
-- `dstGlobalData` 必须指向本地地址（当前 NPU）
-- staging tile(s) 必须预先在 UB 中分配
-
-### 乒乓约束
-
-- `pingTile` 和 `pongTile` 必须具有相同的类型和维度
-- 必须位于不重叠的 UB 偏移处
-
-## 示例
-
-### 基础用法
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void remote_read(__gm__ T* local_data, __gm__ T* remote_addr) {
-    using TileT   = Tile<TileType::Vec, T, 16, 16>;
-    using GShape  = Shape<1, 1, 1, 16, 16>;
-    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-    GTensor srcG(remote_addr);
-    GTensor dstG(local_data);
-    TileT stagingTile;
-    TASSIGN(stagingTile, 0);
-
-    // 远程读：远端 GM -> staging tile -> 本地 GM
-    comm::TGET(dstG, srcG, stagingTile);
-}
-```
-
-### 大张量自动分块
-
-```cpp
-// GlobalTensor 大于 UB Tile：自动执行二维滑动
-using GShape  = Shape<1, 1, 1, 4096, 4096>;
-using GStride = BaseShape2D<T, 4096, 4096, Layout::ND>;
-using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-GTensor srcG(remote_addr);
-GTensor dstG(local_data);
-TileT stagingTile(64, 64);   // 分块大小 = 64x64
-TASSIGN(stagingTile, 0);
-
-// TGET 自动将 4096x4096 的传输分块为 64x64 的小块
-comm::TGET(dstG, srcG, stagingTile);
-```
-
-### 乒乓双缓冲
-
-```cpp
-constexpr size_t tileUBBytes = ((64 * 64 * sizeof(T) + 1023) / 1024) * 1024;
-TileT pingTile(64, 64);
-TileT pongTile(64, 64);
-TASSIGN(pingTile, 0);
-TASSIGN(pongTile, tileUBBytes);  // 不重叠的 UB 区域
-
-// 将 TGET[i+1] 与 TGET[i] 重叠执行以提升流水线利用率
-comm::TGET(dstG, srcG, pingTile, pongTile);
-```
-
-## 相关参考
-
-- [通信与运行时](../communication-and-runtime.md) — 族概览
-- [TPUT](./TPUT_zh.md) — 远程写（反向操作）
-- [TBROADCAST](./TBROADCAST_zh.md) — 从一个 NPU 广播到所有
-- [TGATHER](./TGATHER_zh.md) — 从所有 NPU 聚集到一个
diff --git a/docs/mkdocs/src/docs/isa/comm/TNOTIFY.md b/docs/mkdocs/src/docs/isa/comm/TNOTIFY.md
deleted file mode 100644
index 9ec44775..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TNOTIFY.md
+++ /dev/null
@@ -1,102 +0,0 @@
-<!-- Generated from `docs/isa/comm/TNOTIFY.md` -->
-
-# TNOTIFY
-
-## Introduction
-
-Sends a flag notification to a remote NPU. Used for lightweight synchronization between NPUs without transferring bulk data.
-
-## Math Interpretation
-
-For `NotifyOp::Set`:
-
-$$ \mathrm{signal}^{\mathrm{remote}} = \mathrm{value} $$
-
-For `NotifyOp::AtomicAdd`:
-
-$$ \mathrm{signal}^{\mathrm{remote}} \mathrel{+}= \mathrm{value} \quad (\text{atomic}) $$
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-```text
-tnotify %signal_remote, %value {op = #pto.notify_op<Set>} : (!pto.memref<i32>, i32)
-tnotify %signal_remote, %value {op = #pto.notify_op<AtomicAdd>} : (!pto.memref<i32>, i32)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`:
-
-```cpp
-template <typename GlobalSignalData, typename... WaitEvents>
-PTO_INST void TNOTIFY(GlobalSignalData &dstSignalData, int32_t value, NotifyOp op, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Type constraints**:
-    - `GlobalSignalData::DType` must be `int32_t` (32-bit signal).
-- **Memory constraints**:
-    - `dstSignalData` must point to a remote address (on the target NPU).
-    - `dstSignalData` should be 4-byte aligned.
-- **Operation semantics**:
-    - `NotifyOp::Set`: Direct store to remote memory.
-    - `NotifyOp::AtomicAdd`: Hardware atomic add using the `st_atomic` instruction.
-
-## Examples
-
-### Basic Set Notification
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void notify_set(__gm__ int32_t* remote_signal) {
-    comm::Signal sig(remote_signal);
-
-    // Set remote signal to 1
-    comm::TNOTIFY(sig, 1, comm::NotifyOp::Set);
-}
-```
-
-### Atomic Counter Increment
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void atomic_increment(__gm__ int32_t* remote_counter) {
-    comm::Signal counter(remote_counter);
-
-    // Atomically add 1 to remote counter
-    comm::TNOTIFY(counter, 1, comm::NotifyOp::AtomicAdd);
-}
-```
-
-### Producer-Consumer Pattern
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// Producer: notify when data is ready
-void producer(__gm__ int32_t* remote_flag) {
-    // ... produce data ...
-
-    comm::Signal flag(remote_flag);
-    comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
-}
-
-// Consumer: wait for data
-void consumer(__gm__ int32_t* local_flag) {
-    comm::Signal flag(local_flag);
-    comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
-
-    // ... consume data ...
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TNOTIFY_zh.md b/docs/mkdocs/src/docs/isa/comm/TNOTIFY_zh.md
deleted file mode 100644
index 3e3a51c2..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TNOTIFY_zh.md
+++ /dev/null
@@ -1,102 +0,0 @@
-<!-- Generated from `docs/isa/comm/TNOTIFY_zh.md` -->
-
-# TNOTIFY
-
-## 简介
-
-向远端 NPU 发送标志通知。用于 NPU 之间的轻量级同步，无需传输大量数据。
-
-## 数学语义
-
-`NotifyOp::Set` 时：
-
-$$\mathrm{signal}^{\mathrm{remote}} = \mathrm{value}$$
-
-`NotifyOp::AtomicAdd` 时：
-
-$$\mathrm{signal}^{\mathrm{remote}} \mathrel{+}= \mathrm{value} \quad (\text{原子操作})$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-```text
-tnotify %signal_remote, %value {op = #pto.notify_op<Set>} : (!pto.memref<i32>, i32)
-tnotify %signal_remote, %value {op = #pto.notify_op<AtomicAdd>} : (!pto.memref<i32>, i32)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-template <typename GlobalSignalData, typename... WaitEvents>
-PTO_INST void TNOTIFY(GlobalSignalData &dstSignalData, int32_t value, NotifyOp op, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `GlobalSignalData::DType` 必须为 `int32_t`（32 位信号）。
-- **内存约束**：
-    - `dstSignalData` 必须指向远端地址（目标 NPU）。
-    - `dstSignalData` 应 4 字节对齐。
-- **操作语义**：
-    - `NotifyOp::Set`：直接存储到远端内存。
-    - `NotifyOp::AtomicAdd`：使用 `st_atomic` 指令执行硬件原子加。
-
-## 示例
-
-### 基础 Set 通知
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void notify_set(__gm__ int32_t* remote_signal) {
-    comm::Signal sig(remote_signal);
-
-    // 将远端信号置为 1
-    comm::TNOTIFY(sig, 1, comm::NotifyOp::Set);
-}
-```
-
-### 原子计数器自增
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void atomic_increment(__gm__ int32_t* remote_counter) {
-    comm::Signal counter(remote_counter);
-
-    // 对远端计数器原子加 1
-    comm::TNOTIFY(counter, 1, comm::NotifyOp::AtomicAdd);
-}
-```
-
-### 生产者-消费者模式
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// 生产者：数据就绪后发送通知
-void producer(__gm__ int32_t* remote_flag) {
-    // ... 生产数据 ...
-
-    comm::Signal flag(remote_flag);
-    comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
-}
-
-// 消费者：等待数据就绪
-void consumer(__gm__ int32_t* local_flag) {
-    comm::Signal flag(local_flag);
-    comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
-
-    // ... 消费数据 ...
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TPUT.md b/docs/mkdocs/src/docs/isa/comm/TPUT.md
deleted file mode 100644
index 43a7fd33..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TPUT.md
+++ /dev/null
@@ -1,133 +0,0 @@
-<!-- Generated from `docs/isa/comm/TPUT.md` -->
-
-# TPUT
-
-## Introduction
-
-Remote write operation: write local data to a remote NPU's memory. Data is transferred through a UB tile that serves as an intermediate staging buffer.
-
-When the GlobalTensor exceeds the UB tile capacity, TPUT automatically performs **2D sliding** — chunking rows (DIM_3) and columns (DIM_4) to fit each chunk into the tile, iterating over all outer dimensions (DIM_0, DIM_1, DIM_2).
-
-## Math Interpretation
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}^{\mathrm{remote}}_{i,j} = \mathrm{src}^{\mathrm{local}}_{i,j} $$
-
-Data flow: `srcGlobalData (local GM)` → `stagingTileData (UB)` → `dstGlobalData (remote GM)`
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-tput %dst_remote, %src_local : (!pto.memref<...>, !pto.memref<...>)
-```
-Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`
-
-### Single-tile (auto-chunking)
-
-```cpp
-template <AtomicType atomicType = AtomicType::AtomicNone,
-          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                          TileData &stagingTileData, WaitEvents&... events);
-```
-
-### Ping-pong double buffering
-
-Uses two staging tiles to overlap TLOAD and TSTORE for adjacent chunks, hiding one DMA transfer behind the other.
-
-```cpp
-template <AtomicType atomicType = AtomicType::AtomicNone,
-          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-### Runtime atomic type
-
-```cpp
-template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                          TileData &stagingTileData, AtomicType atomicType, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Type constraints**:
-    - `GlobalSrcData::RawDType` must equal `GlobalDstData::RawDType`.
-    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
-    - `GlobalSrcData::layout` must equal `GlobalDstData::layout`.
-- **Memory constraints**:
-    - `dstGlobalData` must point to a remote address (on the target NPU).
-    - `srcGlobalData` must point to a local address (on the current NPU).
-    - `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in Unified Buffer.
-- **Valid region**:
-    - Transfer size is determined by `GlobalTensor` shape (auto-chunked to fit tile).
-- **Atomic operation**:
-    - `atomicType` supports `AtomicNone` and `AtomicAdd`.
-- **Ping-pong**:
-    - `pingTile` and `pongTile` must have the same type and dimensions.
-    - Must reside at non-overlapping UB offsets.
-
-## Examples
-
-### Basic Usage
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_tput(__gm__ T* local_data, __gm__ T* remote_addr) {
-    using TileT = Tile<TileType::Vec, T, 16, 16>;
-    using GShape = Shape<1, 1, 1, 16, 16>;
-    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-    /*
-    If the GlobalTensor is larger than the UB tile, TPUT performs 2D sliding automatically.
-    using GShape = Shape<1, 1, 1, 4096, 4096>;
-    using GStride = BaseShape2D<T, 4096, 4096, Layout::ND>;
-    */
-    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-    GTensor srcG(local_data);
-    GTensor dstG(remote_addr);
-    TileT stagingTile;
-    TASSIGN(stagingTile, 0);
-
-    // Basic remote write
-    comm::TPUT(dstG, srcG, stagingTile);
-
-    // Remote write with atomic add
-    comm::TPUT<AtomicType::AtomicAdd>(dstG, srcG, stagingTile);
-}
-```
-
-### Ping-pong Double Buffering
-
-```cpp
-constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;
-TileT pingTile(64, 64);
-TileT pongTile(64, 64);
-TASSIGN(pingTile, 0);
-TASSIGN(pongTile, tileUBBytes);  // Non-overlapping UB region
-
-// Overlaps TLOAD[i+1] with TSTORE[i] for better pipeline utilization
-comm::TPUT(dstG, srcG, pingTile, pongTile);
-```
-
-### Runtime Atomic Type
-
-```cpp
-// Select atomic type at runtime instead of compile-time template parameter
-comm::TPUT(dstG, srcG, stagingTile, AtomicType::AtomicAdd);
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TPUT_ASYNC.md b/docs/mkdocs/src/docs/isa/comm/TPUT_ASYNC.md
deleted file mode 100644
index 7f250acb..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TPUT_ASYNC.md
+++ /dev/null
@@ -1,171 +0,0 @@
-<!-- Generated from `docs/isa/comm/TPUT_ASYNC.md` -->
-
-# TPUT_ASYNC
-
-## Introduction
-
-`TPUT_ASYNC` is an asynchronous remote write primitive. It starts a transfer from local GM to remote GM and returns an `AsyncEvent` immediately.
-
-Data flow:
-
-`srcGlobalData (local GM) -> DMA engine -> dstGlobalData (remote GM)`
-
-
-## Template Parameter
-
-- `engine`:
-    - `DmaEngine::SDMA` (default)
-    - `DmaEngine::URMA` (todo)
-
-> **Important (SDMA path)**
-> `TPUT_ASYNC` with `DmaEngine::SDMA` currently supports **only flat contiguous logical 1D tensors**.
-> Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.
-
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`.
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA,
-          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
-PTO_INST AsyncEvent TPUT_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                               const AsyncSession &session, WaitEvents &... events);
-```
-
-`AsyncSession` is an engine-agnostic session object. Build it once with
-`BuildAsyncSession<engine>()`, then pass it to all async calls and event waits.
-The template `engine` parameter selects the DMA backend at compile time, making the
-code forward-compatible with future engines (URMA, CCU, etc.).
-
-## AsyncSession Construction
-
-Use `BuildAsyncSession` from `include/pto/comm/async/async_event_impl.hpp`:
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
-PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
-                                    __gm__ uint8_t *workspace,
-                                    AsyncSession &session,
-                                    uint32_t syncId = 0,
-                                    const sdma::SdmaBaseConfig &baseConfig = {1024 * 1024, 0, 1},
-                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
-```
-
-The engine template parameter selects the backend (currently only SDMA).
-
-Parameters with defaults:
-
-| Parameter | Default | Description |
-|---|---|---|
-| `syncId` | `0` | MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id. |
-| `baseConfig` | `{1024*1024, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`. Suitable for most single-queue transfers. |
-| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA channel group index. Default uses `get_block_idx()` internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios. |
-
-## Constraints
-
-- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
-- `GlobalSrcData::layout == GlobalDstData::layout`
-- The SDMA path requires the source tensor to be **flat, contiguous, and logically 1D**
-- `workspace` must be a valid GM pointer allocated by the host-side `SdmaWorkspaceManager`
-
-If the 1D contiguous requirement is not met, the current implementation returns an invalid async event (`handle == 0`).
-
-## scratchTile Role
-
-`scratchTile` is **not** the payload staging buffer for user data.
-It is converted to `TmpBuffer` and used as temporary UB workspace for:
-
-- writing/reading SDMA control words (flag, sq_tail, channel_info)
-- polling event completion flags
-- committing queue tail during completion
-
-Data payload moves between GM buffers directly; `scratchTile` only supports control and synchronization metadata.
-
-## scratchTile Type and Size Constraints
-
-- must be a `pto::Tile` type
-- must be UB/Vec tile (`ScratchTile::Loc == TileType::Vec`)
-- available bytes must be at least `sizeof(uint64_t)` (8 bytes)
-
-Recommended: `Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>` (256B).
-
-## Completion Semantics (Quiet Semantics)
-
-`TPUT_ASYNC` only submits data transfer SQEs without submitting a flag SQE. The flag SQE submission is deferred to the `Wait` call.
-
-- `event.Wait(session)` — submits a flag SQE and blocks until **all async operations issued since the last Wait** are complete
-
-This means after multiple `TPUT_ASYNC` calls, a single `Wait` on the last returned `AsyncEvent` drains all pending operations (similar to shmem's quiet semantics).
-
-After wait succeeds, all issued writes to `dstGlobalData` are complete.
-
-## Example
-
-### Single Transfer
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
-
-using namespace pto;
-
-template <typename T>
-__global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
-                                 __gm__ uint8_t *sdmaWorkspace)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(remoteDst, shape, stride);
-    GT srcG(localSrc, shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    auto event = comm::TPUT_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
-
-### Batch Transfer (Quiet Semantics)
-
-```cpp
-template <typename T>
-__global__ AICORE void BatchPut(__gm__ T *remoteDstBase, __gm__ T *localSrc,
-                                __gm__ uint8_t *sdmaWorkspace, int nranks)
-{
-    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT srcG(localSrc, shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    comm::AsyncEvent lastEvent;
-    for (int rank = 0; rank < nranks; ++rank) {
-        GT dstG(remoteDstBase + rank * 1024, shape, stride);
-        lastEvent = comm::TPUT_ASYNC(dstG, srcG, session);
-    }
-    (void)lastEvent.Wait(session);  // single Wait drains all pending ops
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TPUT_ASYNC_zh.md b/docs/mkdocs/src/docs/isa/comm/TPUT_ASYNC_zh.md
deleted file mode 100644
index c32b87bb..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TPUT_ASYNC_zh.md
+++ /dev/null
@@ -1,164 +0,0 @@
-<!-- Generated from `docs/isa/comm/TPUT_ASYNC_zh.md` -->
-
-# TPUT_ASYNC
-
-## 简介
-
-`TPUT_ASYNC` 是异步远程写原语。它启动一次从本地 GM 到远端 GM 的传输，并立即返回 `AsyncEvent`。
-
-数据流：
-
-`srcGlobalData（本地 GM）` → DMA 引擎 → `dstGlobalData（远端 GM）`
-
-## 模板参数
-
-- `engine`：
-    - `DmaEngine::SDMA`（默认）
-    - `DmaEngine::URMA`（待实现）
-
-> **注意（SDMA 路径）**
-> `TPUT_ASYNC` 配合 `DmaEngine::SDMA` 目前**仅支持扁平连续的逻辑一维 tensor**。
-> 当前 SDMA 异步实现不支持非一维或非连续布局。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA,
-          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
-PTO_INST AsyncEvent TPUT_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                               const AsyncSession &session, WaitEvents &... events);
-```
-
-`AsyncSession` 是引擎无关的会话对象。使用 `BuildAsyncSession<engine>()` 构建一次后，传递给所有异步调用和事件等待。模板参数 `engine` 在编译期选择 DMA 后端，使代码对未来引擎（URMA、CCU 等）保持前向兼容。
-
-## AsyncSession 构建
-
-使用 `include/pto/comm/async/async_event_impl.hpp` 中的 `BuildAsyncSession`：
-
-```cpp
-template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
-PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
-                                    __gm__ uint8_t *workspace,
-                                    AsyncSession &session,
-                                    uint32_t syncId = 0,
-                                    const sdma::SdmaBaseConfig &baseConfig = {1024 * 1024, 0, 1},
-                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
-```
-
-带默认值的参数说明：
-
-| 参数 | 默认值 | 说明 |
-|---|---|---|
-| `syncId` | `0` | MTE3/MTE2 管道同步事件 ID（0-7）。若 kernel 在相同 ID 上使用了其他管道屏障，则需覆盖此值。|
-| `baseConfig` | `{1024*1024, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`。适用于大多数单队列传输场景。|
-| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA 通道组索引。默认内部使用 `get_block_idx()` 映射到当前 AI Core。多 block 或自定义通道映射场景下需覆盖此值。|
-
-## 约束
-
-- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
-- `GlobalSrcData::layout == GlobalDstData::layout`
-- SDMA 路径要求源 tensor 为**扁平连续的逻辑一维**
-- workspace 必须是由主机侧 `SdmaWorkspaceManager` 分配的有效 GM 指针
-
-若不满足一维连续要求，当前实现返回无效 async event（`handle == 0`）。
-
-## scratchTile 的作用
-
-`scratchTile` **不是**用于存放用户数据负载的暂存缓冲区。
-它被转换为 `TmpBuffer`，用作临时 UB 工作区，用于：
-
-- 写入/读取 SDMA 控制字（flag、sq_tail、channel_info）
-- 轮询事件完成标志
-- 完成时提交队列尾部
-
-实际数据负载直接在 GM 缓冲区之间传输；`scratchTile` 仅用于控制和同步元数据。
-
-## scratchTile 类型与大小约束
-
-- 必须是 `pto::Tile` 类型
-- 必须是 UB/Vec tile（`ScratchTile::Loc == TileType::Vec`）
-- 可用字节数至少为 `sizeof(uint64_t)`（8 字节）
-
-推荐使用：`Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>`（256B）。
-
-## 完成语义（Quiet 语义）
-
-`TPUT_ASYNC` 仅提交数据传输 SQE，不提交 flag SQE。flag SQE 的提交延迟到 `Wait` 调用时进行。
-
-- `event.Wait(session)` — 提交 flag SQE 并阻塞，直到**自上次 Wait 以来所有已发出的异步操作**全部完成
-
-这意味着多次 `TPUT_ASYNC` 调用后，只需对最后一个返回的 `AsyncEvent` 调用一次 `Wait`，即可等待所有 pending 操作完成（类似 shmem 的 quiet 语义）。
-
-wait 成功后，所有已发出的 `dstGlobalData` 写入均已全部完成。
-
-## 示例
-
-### 单次传输
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/common/pto_tile.hpp>
-
-using namespace pto;
-
-template <typename T>
-__global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
-                                 __gm__ uint8_t *sdmaWorkspace)
-{
-    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT dstG(remoteDst, shape, stride);
-    GT srcG(localSrc,  shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    auto event = comm::TPUT_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
-    (void)event.Wait(session);
-}
-```
-
-### 批量传输（Quiet 语义）
-
-```cpp
-template <typename T>
-__global__ AICORE void BatchPut(__gm__ T *remoteDstBase, __gm__ T *localSrc,
-                                __gm__ uint8_t *sdmaWorkspace, int nranks)
-{
-    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
-    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
-    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
-
-    ShapeDyn shape(1, 1, 1, 1, 1024);
-    StrideDyn stride(1024, 1024, 1024, 1024, 1);
-    GT srcG(localSrc, shape, stride);
-
-    ScratchTile scratchTile;
-    TASSIGN(scratchTile, 0x0);
-
-    comm::AsyncSession session;
-    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
-        return;
-    }
-
-    comm::AsyncEvent lastEvent;
-    for (int rank = 0; rank < nranks; ++rank) {
-        GT dstG(remoteDstBase + rank * 1024, shape, stride);
-        lastEvent = comm::TPUT_ASYNC(dstG, srcG, session);
-    }
-    (void)lastEvent.Wait(session);  // 一次 Wait 等待所有 pending 操作
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TPUT_zh.md b/docs/mkdocs/src/docs/isa/comm/TPUT_zh.md
deleted file mode 100644
index 9609b534..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TPUT_zh.md
+++ /dev/null
@@ -1,129 +0,0 @@
-<!-- Generated from `docs/isa/comm/TPUT_zh.md` -->
-
-# TPUT
-
-## 简介
-
-远程写操作：将本地数据写入远端 NPU 的内存。数据通过 UB Tile 作为中间暂存缓冲区进行传输。
-
-当 GlobalTensor 超出 UB Tile 容量时，TPUT 将自动执行**二维滑动**——沿行（DIM_3）和列（DIM_4）分块以适配 Tile，并遍历所有外层维度（DIM_0、DIM_1、DIM_2）。
-
-## 数学语义
-
-对有效区域内每个元素 `(i, j)`：
-
-$$\mathrm{dst}^{\mathrm{remote}}_{i,j} = \mathrm{src}^{\mathrm{local}}_{i,j}$$
-
-数据流：`srcGlobalData（本地 GM）` → `stagingTileData（UB）` → `dstGlobalData（远端 GM）`
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-tput %dst_remote, %src_local : (!pto.memref<...>, !pto.memref<...>)
-```
-
-降级时会为 GM→UB→GM 数据路径引入 UB 暂存 Tile；C++ 内建接口需要显式传入 `stagingTileData`（或 `pingTile` / `pongTile`）操作数。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`
-
-### 单 Tile（自动分块）
-
-```cpp
-template <AtomicType atomicType = AtomicType::AtomicNone,
-          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                          TileData &stagingTileData, WaitEvents&... events);
-```
-
-### 乒乓双缓冲
-
-使用两个暂存 Tile，将相邻块的 TLOAD 与 TSTORE 重叠执行，隐藏 DMA 传输延迟。
-
-```cpp
-template <AtomicType atomicType = AtomicType::AtomicNone,
-          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-### 运行时原子类型
-
-```cpp
-template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
-                          TileData &stagingTileData, AtomicType atomicType, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `GlobalSrcData::RawDType` 必须等于 `GlobalDstData::RawDType`。
-    - `TileData::DType` 必须等于 `GlobalSrcData::RawDType`。
-    - `GlobalSrcData::layout` 必须等于 `GlobalDstData::layout`。
-- **内存约束**：
-    - `dstGlobalData` 必须指向远端地址（目标 NPU）。
-    - `srcGlobalData` 必须指向本地地址（当前 NPU）。
-    - `stagingTileData` / `pingTile` / `pongTile` 必须预先在统一缓冲区中分配。
-- **有效区域**：
-    - 传输大小由 `GlobalTensor` 的形状决定（自动分块以适配 Tile）。
-- **原子操作**：
-    - `atomicType` 支持 `AtomicNone` 和 `AtomicAdd`。
-- **乒乓约束**：
-    - `pingTile` 和 `pongTile` 必须具有相同的类型和维度。
-    - 必须位于不重叠的 UB 偏移处。
-
-## 示例
-
-### 基础用法
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_tput(__gm__ T* local_data, __gm__ T* remote_addr) {
-    using TileT   = Tile<TileType::Vec, T, 16, 16>;
-    using GShape  = Shape<1, 1, 1, 16, 16>;
-    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-    GTensor srcG(local_data);
-    GTensor dstG(remote_addr);
-    TileT stagingTile;
-    TASSIGN(stagingTile, 0);
-
-    // 基础远程写
-    comm::TPUT(dstG, srcG, stagingTile);
-
-    // 带原子加的远程写
-    comm::TPUT<AtomicType::AtomicAdd>(dstG, srcG, stagingTile);
-}
-```
-
-### 乒乓双缓冲
-
-```cpp
-constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;
-TileT pingTile(64, 64);
-TileT pongTile(64, 64);
-TASSIGN(pingTile, 0);
-TASSIGN(pongTile, tileUBBytes);  // 不重叠的 UB 区域
-
-// 将 TLOAD[i+1] 与 TSTORE[i] 重叠执行以提升流水线利用率
-comm::TPUT(dstG, srcG, pingTile, pongTile);
-```
-
-### 运行时原子类型
-
-```cpp
-// 在运行时而非编译期模板参数中选择原子类型
-comm::TPUT(dstG, srcG, stagingTile, AtomicType::AtomicAdd);
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TREDUCE.md b/docs/mkdocs/src/docs/isa/comm/TREDUCE.md
deleted file mode 100644
index 567fe264..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TREDUCE.md
+++ /dev/null
@@ -1,120 +0,0 @@
-<!-- Generated from `docs/isa/comm/TREDUCE.md` -->
-
-# TREDUCE
-
-## Introduction
-
-Reduce operation: gather data from multiple remote NPUs and perform element-wise reduction locally.
-
-
-Only the root needs to execute `TREDUCE`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `TREDUCE` on non-root ranks is undefined behavior.
-
-**Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the reduction is automatically chunked via 2D sliding.
-
-## Math Interpretation
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \bigoplus_{r=0}^{N-1} \mathrm{src}^{(r)}_{i,j} $$
-
-where $N$ is the number of ranks and $\oplus$ is the reduction operation (sum, max, min, etc.).
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-treduce %group, %dst {op = #pto.reduce_op<Sum>} : (!pto.group<...>, !pto.memref<...>)
-treduce %group, %dst {op = #pto.reduce_op<Max>} : (!pto.group<...>, !pto.memref<...>)
-```
-Lowering introduces internal accumulator and receive tiles for the reduce pipeline; the C++ intrinsic requires explicit `accTileData`, `recvTileData` (or `accTileData`, `pingTileData`, `pongTileData`) operand(s).
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`:
-
-```cpp
-// Basic reduce (accumulator + receive tile)
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                              TileData &accTileData, TileData &recvTileData, ReduceOp op, WaitEvents&... events);
-
-// Ping-pong reduce (accumulator + ping + pong tiles for double buffering)
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                              TileData &accTileData, TileData &pingTileData, TileData &pongTileData,
-                              ReduceOp op, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Type constraints**:
-    - `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
-    - `TileData::DType` must equal `GlobalDstData::RawDType`.
-- **Memory constraints**:
-    - `dstGlobalData` must point to a local address (on the current NPU).
-    - `accTileData`, `recvTileData` (or `accTileData`, `pingTileData`, `pongTileData`) must be pre-allocated UB tiles.
-- **ParallelGroup constraints**:
-    - `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen by the root).
-    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the reduce root.
-    - All source tensors are assumed to have the same shape and strides.
-- **Chunked mode constraints** (when data exceeds a single UB tile):
-    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
-    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
-
-## Examples
-
-### Basic Reduce Sum
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int SIZE, int NRANKS>
-void reduce_sum(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT = Tile<TileType::Vec, T, 1, SIZE>;
-    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
-                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
-
-    // Stack-allocated tensors
-    GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GTensor(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
-    GTensor dstG(result);
-    TileT accTile, recvTile;
-
-    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Sum);
-}
-```
-
-### Max Reduce
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int SIZE, int NRANKS>
-void reduce_max(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT = Tile<TileType::Vec, T, 1, SIZE>;
-    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
-                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
-
-    GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GTensor(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
-    GTensor dstG(result);
-    TileT accTile, recvTile;
-
-    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Max);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TREDUCE_zh.md b/docs/mkdocs/src/docs/isa/comm/TREDUCE_zh.md
deleted file mode 100644
index 6caed36b..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TREDUCE_zh.md
+++ /dev/null
@@ -1,113 +0,0 @@
-<!-- Generated from `docs/isa/comm/TREDUCE_zh.md` -->
-
-# TREDUCE
-
-## 简介
-
-Reduce 操作：从多个远端 NPU 收集数据并在本地执行逐元素归约。
-
-只有根节点需要执行 `TREDUCE`。非根节点只需确保在操作期间其源缓冲区已就绪且保持有效。在非根节点上调用 `TREDUCE` 属于未定义行为。
-
-**大 Tile 支持**：当 GlobalTensor 在行和/或列方向超出 UB Tile 容量时，归约操作将通过二维滑动自动分块。
-
-## 数学语义
-
-对有效区域内每个元素 `(i, j)`：
-
-$$\mathrm{dst}^{\mathrm{local}}_{i,j} = \bigoplus_{r=0}^{N-1} \mathrm{src}^{(r)}_{i,j}$$
-
-其中 $N$ 为 rank 总数，$\oplus$ 为归约运算（求和、取最大值、取最小值等）。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-treduce %group, %dst {op = #pto.reduce_op<Sum>} : (!pto.group<...>, !pto.memref<...>)
-treduce %group, %dst {op = #pto.reduce_op<Max>} : (!pto.group<...>, !pto.memref<...>)
-```
-
-降级时会为 reduce 流水线引入内部累加 Tile 和接收 Tile；C++ 内建接口需要显式传入 `accTileData`、`recvTileData`（或 `accTileData`、`pingTileData`、`pongTileData`）操作数。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-// 基础 reduce（累加 Tile + 接收 Tile）
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                              TileData &accTileData, TileData &recvTileData, ReduceOp op, WaitEvents&... events);
-
-// 乒乓 reduce（累加 Tile + ping/pong Tile 实现双缓冲）
-template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
-                              TileData &accTileData, TileData &pingTileData, TileData &pongTileData,
-                              ReduceOp op, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `ParallelGroup::value_type::RawDType` 必须等于 `GlobalDstData::RawDType`。
-    - `TileData::DType` 必须等于 `GlobalDstData::RawDType`。
-- **内存约束**：
-    - `dstGlobalData` 必须指向本地内存（当前 NPU）。
-    - `accTileData`、`recvTileData`（或 `accTileData`、`pingTileData`、`pongTileData`）必须为预先分配的 UB Tile。
-- **ParallelGroup 约束**：
-    - `parallelGroup.tensors[r]` 必须指向 rank `r` 的源缓冲区（从根节点视角看到的远端 GM）。
-    - `parallelGroup.GetRootIdx()` 标识调用方 NPU 为 reduce 根节点。
-    - 所有源 tensor 假定具有相同的形状和步幅。
-- **分块模式约束**（数据超出单个 UB Tile 时）：
-    - 若 `TileData` 具有静态 `ValidRow`，则 `GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足一行的情况，请使用 `DYNAMIC` ValidRow 的 Tile。
-    - 若 `TileData` 具有静态 `ValidCol`，则 `GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足一列的情况，请使用 `DYNAMIC` ValidCol 的 Tile。
-
-## 示例
-
-### 基础求和归约
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int SIZE, int NRANKS>
-void reduce_sum(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT   = Tile<TileType::Vec, T, 1, SIZE>;
-    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
-                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
-
-    GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GTensor(group_addrs[i]);
-
-    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
-    GTensor dstG(result);
-    TileT accTile, recvTile;
-    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Sum);
-}
-```
-
-### 最大值归约
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int SIZE, int NRANKS>
-void reduce_max(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
-    using TileT   = Tile<TileType::Vec, T, 1, SIZE>;
-    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
-                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
-
-    GTensor tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GTensor(group_addrs[i]);
-
-    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
-    GTensor dstG(result);
-    TileT accTile, recvTile;
-    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Max);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TSCATTER.md b/docs/mkdocs/src/docs/isa/comm/TSCATTER.md
deleted file mode 100644
index dbf03def..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TSCATTER.md
+++ /dev/null
@@ -1,128 +0,0 @@
-<!-- Generated from `docs/isa/comm/TSCATTER.md` -->
-
-# TSCATTER
-
-## Introduction
-
-Scatter operation: the calling NPU (root) distributes data to all ranks in the parallel group by splitting the local source tensor along **DIM_3** (row dimension). This is the inverse of `TGATHER`.
-
-
-Only the root needs to execute `TSCATTER`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `TSCATTER` on non-root ranks is undefined behavior.
-
-**Large Tile Support**: When the per-rank data exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.
-
-## Math Interpretation
-
-The local source tensor has shape $(D_0, D_1, D_2, N \times H, W)$, where $N$ is the number of ranks and each rank receives $H$ rows. After the operation:
-
-$$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-tscatter %group, %src : (!pto.group<...>, !pto.memref<...>)
-```
-Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`:
-
-```cpp
-// Basic scatter (single staging tile)
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                              TileData &stagingTileData, WaitEvents&... events);
-
-// Ping-pong scatter (double buffering with two staging tiles)
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                              TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Type constraints**:
-    - `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
-    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
-- **Memory constraints**:
-    - `srcGlobalData` must point to local memory (current NPU) and be large enough to hold data for all ranks. Specifically, `srcGlobalData.GetShape(DIM_3)` must be $\geq N \times H$ where $H$ is each rank's `GetShape(DIM_3)`.
-    - If `srcGlobalData.GetShape(DIM_3) > N × H`, only the first `N × H` rows are read; remaining rows are ignored.
-    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
-- **ParallelGroup constraints**:
-    - `parallelGroup.tensors[r]` must refer to rank `r`'s destination buffer (remote GM as seen by the root).
-    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the scatter root.
-    - All destination tensors are assumed to have the same shape and strides; behavior is undefined if they differ.
-- **Chunked mode constraints** (when per-rank data exceeds a single UB tile):
-    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's destination must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
-    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
-
-## Examples
-
-### Basic Scatter (Single Staging Tile)
-
-Root has `NRANKS * ROWS` rows of width `COLS`. Each rank receives `ROWS × COLS`, split along DIM_3.
-The tile size (`TILE_ROWS × TILE_COLS`) can be smaller than the per-rank data — when it is, the implementation automatically chunks the transfer along both DIM_3 and DIM_4 via 2D sliding.
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void scatter(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
-    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GSource = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GPerRank(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GSource srcG(local_data);
-    TileT stagingTile(TILE_ROWS, TILE_COLS);
-
-    comm::TSCATTER(group, srcG, stagingTile);
-}
-```
-
-### Ping-Pong Scatter (Double Buffering)
-
-Uses two UB tiles to overlap TLOAD of the next chunk (MTE2) with TSTORE of the current chunk (MTE3).
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
-    // Tile can be smaller than the data in both dimensions
-    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GSource = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) {
-        tensors[i] = GPerRank(group_addrs[i]);
-    }
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GSource srcG(local_data);
-    TileT pingTile(TILE_ROWS, TILE_COLS);
-    TileT pongTile(TILE_ROWS, TILE_COLS);
-
-    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
-    comm::TSCATTER(group, srcG, pingTile, pongTile);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TSCATTER_zh.md b/docs/mkdocs/src/docs/isa/comm/TSCATTER_zh.md
deleted file mode 100644
index e09b64b0..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TSCATTER_zh.md
+++ /dev/null
@@ -1,121 +0,0 @@
-<!-- Generated from `docs/isa/comm/TSCATTER_zh.md` -->
-
-# TSCATTER
-
-## 简介
-
-Scatter 操作：调用方 NPU（根节点）将本地源 tensor 沿 **DIM_3**（行维度）拆分后分发到并行组中所有 rank。该操作是 `TGATHER` 的逆操作。
-
-只有根节点需要执行 `TSCATTER`。非根节点只需确保在操作期间其目标缓冲区已分配且可写。在非根节点上调用 `TSCATTER` 属于未定义行为。
-
-**大 Tile 支持**：当每 rank 的数据在行和/或列方向超出 UB Tile 容量时，传输将通过二维滑动自动分块。
-
-## 数学语义
-
-本地源 tensor 的形状为 $(D_0, D_1, D_2, N \times H, W)$，其中 $N$ 为 rank 总数，每个 rank 接收 $H$ 行。操作完成后：
-
-$$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-tscatter %group, %src : (!pto.group<...>, !pto.memref<...>)
-```
-
-降级时会为 GM→UB→GM 数据路径引入 UB 暂存 Tile；C++ 内建接口需要显式传入 `stagingTileData`（或 `pingTile` / `pongTile`）操作数。
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-// 基础 scatter（单暂存 Tile）
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                              TileData &stagingTileData, WaitEvents&... events);
-
-// 乒乓 scatter（使用两个暂存 Tile 实现双缓冲）
-template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
-                              TileData &pingTile, TileData &pongTile, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `ParallelGroup::value_type::RawDType` 必须等于 `GlobalSrcData::RawDType`。
-    - `TileData::DType` 必须等于 `GlobalSrcData::RawDType`。
-- **内存约束**：
-    - `srcGlobalData` 必须指向本地内存（当前 NPU），且足够容纳所有 rank 的数据。具体要求：`srcGlobalData.GetShape(DIM_3)` 必须 $\geq N \times H$，其中 $H$ 为每个 rank 的 `GetShape(DIM_3)`。
-    - 若 `srcGlobalData.GetShape(DIM_3) > N × H`，则只读取前 `N × H` 行，其余行被忽略。
-    - `stagingTileData`（或 `pingTile` / `pongTile`）必须预先在 UB 中分配。
-- **ParallelGroup 约束**：
-    - `parallelGroup.tensors[r]` 必须指向 rank `r` 的目标缓冲区（从根节点视角看到的远端 GM）。
-    - `parallelGroup.GetRootIdx()` 标识调用方 NPU 为 scatter 根节点。
-    - 所有目标 tensor 假定具有相同的形状和步幅；否则行为未定义。
-- **分块模式约束**（每 rank 数据超出单个 UB Tile 时）：
-    - 若 `TileData` 具有静态 `ValidRow`，则每个 rank 目标数据的 `GetShape(DIM_3)` 必须能被 `ValidRow` 整除。如需支持不足一行的情况，请使用 `DYNAMIC` ValidRow 的 Tile。
-    - 若 `TileData` 具有静态 `ValidCol`，则 `GetShape(DIM_4)` 必须能被 `ValidCol` 整除。如需支持不足一列的情况，请使用 `DYNAMIC` ValidCol 的 Tile。
-
-## 示例
-
-### 基础 Scatter（单暂存 Tile）
-
-根节点拥有 `NRANKS * ROWS` 行、宽度为 `COLS` 的数据，每个 rank 接收 `ROWS × COLS`，沿 DIM_3 拆分。
-Tile 大小可小于每 rank 的数据——此时实现会自动通过二维滑动进行分块传输。
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void scatter(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
-    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GSource  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GSource srcG(local_data);
-    TileT stagingTile(TILE_ROWS, TILE_COLS);
-    comm::TSCATTER(group, srcG, stagingTile);
-}
-```
-
-### 乒乓 Scatter（双缓冲）
-
-使用两个 UB Tile，将下一块的 TLOAD（MTE2）与当前块的 TSTORE（MTE3）重叠执行。
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
-void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
-    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
-    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
-                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
-    using GSource  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
-                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
-
-    GPerRank tensors[NRANKS];
-    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
-
-    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
-    GSource srcG(local_data);
-    TileT pingTile(TILE_ROWS, TILE_COLS);
-    TileT pongTile(TILE_ROWS, TILE_COLS);
-    // 乒乓模式：将 TLOAD 与 TSTORE 重叠执行以提升吞吐量
-    comm::TSCATTER(group, srcG, pingTile, pongTile);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TTEST.md b/docs/mkdocs/src/docs/isa/comm/TTEST.md
deleted file mode 100644
index 50350e7f..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TTEST.md
+++ /dev/null
@@ -1,152 +0,0 @@
-<!-- Generated from `docs/isa/comm/TTEST.md` -->
-
-# TTEST
-
-## Introduction
-
-Non-blocking test of whether the signal(s) meet a comparison condition. Returns `true` if the condition is satisfied, `false` otherwise. Used for polling-based synchronization with a timeout or with interleaved work.
-
-Supports a single signal or a multi-dimensional signal tensor (up to 5-D, with the shape derived from the GlobalTensor). For a tensor, returns `true` only if ALL signals meet the condition.
-
-## Math Interpretation
-
-Test and return result:
-
-Single signal:
-
-$$ \mathrm{result} = (\mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue}) $$
-
-Signal tensor (all must satisfy):
-
-$$ \mathrm{result} = \bigwedge_{d_0, d_1, d_2, d_3, d_4} (\mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue}) $$
-
-where `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-```text
-%result = ttest %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32) -> i1
-%result = ttest %signal_matrix, %cmp_value {cmp = #pto.cmp<GE>} : (!pto.memref<i32, MxN>, i32) -> i1
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`:
-
-```cpp
-template <typename GlobalSignalData, typename... WaitEvents>
-PTO_INST bool TTEST(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Type constraints**:
-    - `GlobalSignalData::DType` must be `int32_t` (32-bit signal).
-- **Memory constraints**:
-    - `signalData` must point to a local address (on the current NPU).
-- **Return value**:
-    - Returns `true` if condition is satisfied, `false` otherwise.
-    - For signal tensor, returns `true` only if ALL signals satisfy the condition.
-- **Shape semantics**:
-    - For single signal: Shape is `<1,1,1,1,1>`.
-    - For signal tensor: Shape determines the multi-dimensional region (up to 5-D) to test.
-- **Comparison operators** (WaitCmp):
-  | Value | Condition |
-  |-------|-----------|
-  | `EQ` | `signal == cmpValue` |
-  | `NE` | `signal != cmpValue` |
-  | `GT` | `signal > cmpValue` |
-  | `GE` | `signal >= cmpValue` |
-  | `LT` | `signal < cmpValue` |
-  | `LE` | `signal <= cmpValue` |
-
-## Examples
-
-### Basic Test
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-bool check_ready(__gm__ int32_t* local_signal) {
-    comm::Signal sig(local_signal);
-
-    // Check if signal == 1
-    return comm::TTEST(sig, 1, comm::WaitCmp::EQ);
-}
-```
-
-### Test Signal Matrix
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// Test if all signals from a 4x8 dense grid of workers are ready
-bool check_worker_grid(__gm__ int32_t* signal_matrix) {
-    comm::Signal2D<4, 8> grid(signal_matrix);
-
-    // Returns true only if all 32 signals == 1
-    return comm::TTEST(grid, 1, comm::WaitCmp::EQ);
-}
-```
-
-### Polling with Timeout
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-bool poll_with_timeout(__gm__ int32_t* local_signal, int max_iterations) {
-    comm::Signal sig(local_signal);
-
-    for (int i = 0; i < max_iterations; ++i) {
-        if (comm::TTEST(sig, 1, comm::WaitCmp::EQ)) {
-            return true;  // Signal received
-        }
-        // Could do other work here between polls
-    }
-    return false;  // Timeout
-}
-```
-
-### Progress-Based Polling
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void process_with_progress(__gm__ int32_t* local_counter, int expected_count) {
-    comm::Signal counter(local_counter);
-
-    while (!comm::TTEST(counter, expected_count, comm::WaitCmp::GE)) {
-        // Do some useful work while waiting
-        // ...
-    }
-    // All expected signals received
-}
-```
-
-### Compare TWAIT vs TTEST
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void compare_wait_test(__gm__ int32_t* local_signal) {
-    comm::Signal sig(local_signal);
-
-    // Blocking: spins until signal == 1
-    comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
-
-    // Non-blocking: returns immediately with result
-    bool ready = comm::TTEST(sig, 1, comm::WaitCmp::EQ);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TTEST_zh.md b/docs/mkdocs/src/docs/isa/comm/TTEST_zh.md
deleted file mode 100644
index 244720c2..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TTEST_zh.md
+++ /dev/null
@@ -1,152 +0,0 @@
-<!-- Generated from `docs/isa/comm/TTEST_zh.md` -->
-
-# TTEST
-
-## 简介
-
-非阻塞检测信号是否满足比较条件。满足则返回 `true`，否则返回 `false`。适用于基于轮询的同步（含超时）或与其他工作交错执行的场景。
-
-支持单个信号或多维信号 tensor（最高 5 维，形状由 GlobalTensor 决定）。对于 tensor，仅当**所有**信号均满足条件时才返回 `true`。
-
-## 数学语义
-
-检测并返回结果：
-
-单个信号：
-
-$$\mathrm{result} = (\mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue})$$
-
-信号 tensor（所有元素均须满足）：
-
-$$\mathrm{result} = \bigwedge_{d_0, d_1, d_2, d_3, d_4} (\mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue})$$
-
-其中 `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-```text
-%result = ttest %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32) -> i1
-%result = ttest %signal_matrix, %cmp_value {cmp = #pto.cmp<GE>} : (!pto.memref<i32, MxN>, i32) -> i1
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-template <typename GlobalSignalData, typename... WaitEvents>
-PTO_INST bool TTEST(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `GlobalSignalData::DType` 必须为 `int32_t`（32 位信号）。
-- **内存约束**：
-    - `signalData` 必须指向本地地址（当前 NPU）。
-- **返回值**：
-    - 条件满足时返回 `true`，否则返回 `false`。
-    - 对于信号 tensor，仅当所有信号均满足条件时才返回 `true`。
-- **形状语义**：
-    - 单个信号：形状为 `<1,1,1,1,1>`。
-    - 信号 tensor：形状决定要检测的多维区域（最高 5 维）。
-- **比较运算符**（WaitCmp）：
-  | 值 | 条件 |
-  |-------|--------|
-  | `EQ` | `signal == cmpValue` |
-  | `NE` | `signal != cmpValue` |
-  | `GT` | `signal > cmpValue` |
-  | `GE` | `signal >= cmpValue` |
-  | `LT` | `signal < cmpValue` |
-  | `LE` | `signal <= cmpValue` |
-
-## 示例
-
-### 基础检测
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-bool check_ready(__gm__ int32_t* local_signal) {
-    comm::Signal sig(local_signal);
-
-    // 检测 signal == 1
-    return comm::TTEST(sig, 1, comm::WaitCmp::EQ);
-}
-```
-
-### 检测信号矩阵
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// 检测 4x8 网格中所有 worker 的信号是否就绪
-bool check_worker_grid(__gm__ int32_t* signal_matrix) {
-    comm::Signal2D<4, 8> grid(signal_matrix);
-
-    // 仅当所有 32 个信号均为 1 时返回 true
-    return comm::TTEST(grid, 1, comm::WaitCmp::EQ);
-}
-```
-
-### 带超时的轮询
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-bool poll_with_timeout(__gm__ int32_t* local_signal, int max_iterations) {
-    comm::Signal sig(local_signal);
-
-    for (int i = 0; i < max_iterations; ++i) {
-        if (comm::TTEST(sig, 1, comm::WaitCmp::EQ)) {
-            return true;  // 收到信号
-        }
-        // 两次轮询之间可执行其他工作
-    }
-    return false;  // 超时
-}
-```
-
-### 基于进度的轮询
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void process_with_progress(__gm__ int32_t* local_counter, int expected_count) {
-    comm::Signal counter(local_counter);
-
-    while (!comm::TTEST(counter, expected_count, comm::WaitCmp::GE)) {
-        // 等待期间执行其他有用工作
-        // ...
-    }
-    // 所有预期信号均已收到
-}
-```
-
-### TWAIT 与 TTEST 对比
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void compare_wait_test(__gm__ int32_t* local_signal) {
-    comm::Signal sig(local_signal);
-
-    // 阻塞：自旋直到 signal == 1
-    comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
-
-    // 非阻塞：立即返回结果
-    bool ready = comm::TTEST(sig, 1, comm::WaitCmp::EQ);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TWAIT.md b/docs/mkdocs/src/docs/isa/comm/TWAIT.md
deleted file mode 100644
index 2e793897..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TWAIT.md
+++ /dev/null
@@ -1,133 +0,0 @@
-<!-- Generated from `docs/isa/comm/TWAIT.md` -->
-
-# TWAIT
-
-## Introduction
-
-Blocking wait until the signal(s) meet a comparison condition. Used in conjunction with `TNOTIFY` for flag-based synchronization.
-
-Supports a single signal or a multi-dimensional signal tensor (up to 5-D, with the shape derived from the GlobalTensor).
-
-
-## Math Interpretation
-
-Wait (spin) until the following condition is satisfied:
-
-Single signal:
-
-$$ \mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
-
-Signal tensor (all elements must satisfy):
-
-$$ \forall d_0, d_1, d_2, d_3, d_4: \mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
-
-where `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
-
-## Assembly Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-```text
-twait %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32)
-twait %signal_matrix, %cmp_value {cmp = #pto.cmp<GE>} : (!pto.memref<i32, MxN>, i32)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/comm/pto_comm_inst.hpp`:
-
-```cpp
-template <typename GlobalSignalData, typename... WaitEvents>
-PTO_INST void TWAIT(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Type constraints**:
-    - `GlobalSignalData::DType` must be `int32_t` (32-bit signal).
-- **Memory constraints**:
-    - `signalData` must point to a local address (on the current NPU).
-- **Shape semantics**:
-    - For single signal: Shape is `<1,1,1,1,1>`.
-    - For signal tensor: Shape determines the multi-dimensional region (up to 5-D) to wait on. All signals in the tensor must satisfy the condition.
-- **Comparison operators** (WaitCmp):
-  | Value | Condition |
-  |-------|-----------|
-  | `EQ` | `signal == cmpValue` |
-  | `NE` | `signal != cmpValue` |
-  | `GT` | `signal > cmpValue` |
-  | `GE` | `signal >= cmpValue` |
-  | `LT` | `signal < cmpValue` |
-  | `LE` | `signal <= cmpValue` |
-
-## Examples
-
-### Wait for Single Signal
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void wait_for_ready(__gm__ int32_t* local_signal) {
-    comm::Signal sig(local_signal);
-
-    // Wait until signal == 1
-    comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
-}
-```
-
-### Wait for Signal Matrix
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// Wait for signals from a 4x8 dense grid of workers
-void wait_worker_grid(__gm__ int32_t* signal_matrix) {
-    comm::Signal2D<4, 8> grid(signal_matrix);
-
-    // Wait until all 32 signals == 1
-    comm::TWAIT(grid, 1, comm::WaitCmp::EQ);
-}
-```
-
-### Wait for Counter Threshold
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void wait_for_count(__gm__ int32_t* local_counter, int expected_count) {
-    comm::Signal counter(local_counter);
-
-    // Wait until counter >= expected_count
-    comm::TWAIT(counter, expected_count, comm::WaitCmp::GE);
-}
-```
-
-### Producer-Consumer Pattern
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// Producer: notify when data is ready
-void producer(__gm__ int32_t* remote_flag) {
-    // ... produce data ...
-
-    comm::Signal flag(remote_flag);
-    comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
-}
-
-// Consumer: wait for data
-void consumer(__gm__ int32_t* local_flag) {
-    comm::Signal flag(local_flag);
-    comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
-
-    // ... consume data ...
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/comm/TWAIT_zh.md b/docs/mkdocs/src/docs/isa/comm/TWAIT_zh.md
deleted file mode 100644
index 18e7d795..00000000
--- a/docs/mkdocs/src/docs/isa/comm/TWAIT_zh.md
+++ /dev/null
@@ -1,132 +0,0 @@
-<!-- Generated from `docs/isa/comm/TWAIT_zh.md` -->
-
-# TWAIT
-
-## 简介
-
-阻塞等待，直到信号满足比较条件。与 `TNOTIFY` 配合使用，实现基于标志的同步。
-
-支持单个信号或多维信号 tensor（最高 5 维，形状由 GlobalTensor 决定）。
-
-## 数学语义
-
-自旋等待，直到以下条件满足：
-
-单个信号：
-
-$$\mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue}$$
-
-信号 tensor（所有元素均须满足）：
-
-$$\forall d_0, d_1, d_2, d_3, d_4: \mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue}$$
-
-其中 `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
-
-```text
-twait %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32)
-twait %signal_matrix, %cmp_value {cmp = #pto.cmp<GE>} : (!pto.memref<i32, MxN>, i32)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/comm/pto_comm_inst.hpp`：
-
-```cpp
-template <typename GlobalSignalData, typename... WaitEvents>
-PTO_INST void TWAIT(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
-```
-
-## 约束
-
-- **类型约束**：
-    - `GlobalSignalData::DType` 必须为 `int32_t`（32 位信号）。
-- **内存约束**：
-    - `signalData` 必须指向本地地址（当前 NPU）。
-- **形状语义**：
-    - 单个信号：形状为 `<1,1,1,1,1>`。
-    - 信号 tensor：形状决定要等待的多维区域（最高 5 维）。tensor 中所有信号必须满足条件。
-- **比较运算符**（WaitCmp）：
-  | 值 | 条件 |
-  |-------|--------|
-  | `EQ` | `signal == cmpValue` |
-  | `NE` | `signal != cmpValue` |
-  | `GT` | `signal > cmpValue` |
-  | `GE` | `signal >= cmpValue` |
-  | `LT` | `signal < cmpValue` |
-  | `LE` | `signal <= cmpValue` |
-
-## 示例
-
-### 等待单个信号
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void wait_for_ready(__gm__ int32_t* local_signal) {
-    comm::Signal sig(local_signal);
-
-    // 等待 signal == 1
-    comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
-}
-```
-
-### 等待信号矩阵
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// 等待 4x8 网格中所有 worker 的信号就绪
-void wait_worker_grid(__gm__ int32_t* signal_matrix) {
-    comm::Signal2D<4, 8> grid(signal_matrix);
-
-    // 等待所有 32 个信号均为 1
-    comm::TWAIT(grid, 1, comm::WaitCmp::EQ);
-}
-```
-
-### 等待计数器阈值
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-void wait_for_count(__gm__ int32_t* local_counter, int expected_count) {
-    comm::Signal counter(local_counter);
-
-    // 等待 counter >= expected_count
-    comm::TWAIT(counter, expected_count, comm::WaitCmp::GE);
-}
-```
-
-### 生产者-消费者模式
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-
-using namespace pto;
-
-// 生产者：数据就绪后发送通知
-void producer(__gm__ int32_t* remote_flag) {
-    // ... 生产数据 ...
-
-    comm::Signal flag(remote_flag);
-    comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
-}
-
-// 消费者：等待数据就绪
-void consumer(__gm__ int32_t* local_flag) {
-    comm::Signal flag(local_flag);
-    comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
-
-    // ... 消费数据 ...
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/conventions.md b/docs/mkdocs/src/docs/isa/conventions.md
deleted file mode 100644
index f446bb9c..00000000
--- a/docs/mkdocs/src/docs/isa/conventions.md
+++ /dev/null
@@ -1,43 +0,0 @@
-<!-- Generated from `docs/isa/conventions.md` -->
-
-# PTO ISA Conventions
-
-This page defines shared conventions used by the per-instruction ISA reference pages in `docs/isa/` and the corresponding C++ intrinsics in `include/pto/common/pto_instr.hpp`.
-
-## Notation
-
-- **Tile**: A fixed-size on-chip tile object (e.g., `pto::Tile<...>`). Many instructions operate on tiles and use the tile’s valid region (`GetValidRow()`, `GetValidCol()`).
-- **GM (global memory)**: Off-chip memory accessed via `pto::GlobalTensor<...>`.
-- **Scalar / immediate**: A host-side scalar value or an encoded immediate used by `*S` / `*C` variants.
-
-For the detailed C++ programming model behind these terms, see:
-
-- Tiles: `docs/coding/Tile.md`
-- GlobalTensor: `docs/coding/GlobalTensor.md`
-- Scalars and enums: `docs/coding/Scalar.md`
-
-## Shapes and layouts
-
-- **Row-major vs. column-major**: Unless stated otherwise, CPU simulator kernels assume row-major tiles. Instructions that support multiple layouts will state supported layouts explicitly.
-- **Valid region**: The runtime compute region of a tile, expressed as `(valid_row, valid_col)` and queried via `GetValidRow()` / `GetValidCol()`.
-
-### Valid Region Semantics
-
-For instruction pages, when we say “for each element `(i, j)` in the valid region”, we mean:
-
-- `valid_row = dst.GetValidRow()` and `valid_col = dst.GetValidCol()` unless the instruction explicitly defines a different domain (e.g., some ops may use the source tile’s valid region).
-- The math interpretation defines `dst[i, j]` only for indices where `0 <= i < valid_row` and `0 <= j < valid_col`.
-- Elements outside the valid region are **unspecified** unless the instruction explicitly states otherwise (do not assume they are zeroed or preserved).
-
-For multi-operand instructions (e.g., `src0`, `src1`), the docs assume the input tiles are compatible with the iteration domain unless the constraints section states stricter requirements.
-
-## Types
-
-- The instruction page lists supported data types (e.g., `fp16`, `fp32`, `int8`, `int16`, `int32`, `uint8`, `uint16`, `uint32`). CPU simulator support may be a subset and is documented in `include/README.md`.
-
-## Events and synchronization
-
-- Instructions may require ordering between memory and vector pipelines. When examples show events (e.g., `set_flag(...)` / `wait_flag(...)`), they indicate the required ordering constraints on the target backend.
-- `TSYNC` is used for explicit synchronization when needed by a sequence of instructions.
-
-See `docs/coding/Event.md` for the event model used by PTO Tile Lib.
diff --git a/docs/mkdocs/src/docs/isa/conventions_zh.md b/docs/mkdocs/src/docs/isa/conventions_zh.md
deleted file mode 100644
index 030358bb..00000000
--- a/docs/mkdocs/src/docs/isa/conventions_zh.md
+++ /dev/null
@@ -1,44 +0,0 @@
-<!-- Generated from `docs/isa/conventions_zh.md` -->
-
-# PTO ISA 通用约定
-
-本页定义 `docs/isa/` 指令参考文档中使用的通用术语与写法，并与 `include/pto/common/pto_instr.hpp` 中的 C++ 内建接口保持一致。
-
-## 记号
-
-- **Tile**：片上二维操作数对象（例如 `pto::Tile<...>`）。大量指令以 Tile 作为输入/输出，并通过 `GetValidRow()` / `GetValidCol()` 使用 Tile 的有效区域（valid region）。
-- **GM（全局内存）**：通过 `pto::GlobalTensor<...>` 访问的片外内存视图。
-- **标量 / 立即数**：主机侧标量值，或在 `*S` / `*C` 等变体中编码的立即数参数。
-
-关于这些对象的 C++ 编程模型（类型、布局、枚举、约束等），可参考：
-
-- Tile：`docs/coding/Tile_zh.md`
-- GlobalTensor：`docs/coding/GlobalTensor_zh.md`
-- 标量与枚举：`docs/coding/Scalar_zh.md`
-
-## 形状与布局
-
-- **行主序 / 列主序**：除非指令页明确声明支持多种布局，否则示例与参考实现默认假设为行主序 Tile。支持多布局的指令会在约束小节中列出具体要求。
-- **有效区域（valid region）**：Tile 运行时计算域，通常写作 `(valid_row, valid_col)`，并通过 `GetValidRow()` / `GetValidCol()` 查询。
-
-### 有效区域语义
-
-在指令页中，当我们写“对有效区域内的每个元素 `(i, j)`”，含义为：
-
-- 除非指令显式定义不同的迭代域，否则默认使用 `valid_row = dst.GetValidRow()`、`valid_col = dst.GetValidCol()`。
-- 数学语义仅对 `0 <= i < valid_row` 且 `0 <= j < valid_col` 的 `dst[i, j]` 做出定义。
-- 有效区域之外元素的值为**未指定**，除非指令页明确说明（不要假设一定清零或保持不变）。
-
-对多输入指令（例如 `src0`、`src1`），除非约束小节有更严格的要求，文档默认输入 Tile 与迭代域在形状/有效区域上是兼容的。
-
-## 数据类型
-
-每条指令页会列出支持的数据类型（例如 `fp16`、`fp32`、`int8`、`int16`、`int32`、`uint8`、`uint16`、`uint32` 等）。
-不同后端/目标对数据类型与布局支持可能不同，具体以对应实现与编译期检查为准。
-
-## 事件与同步
-
-- 某些指令序列需要建立内存与向量流水线之间的顺序关系。示例中出现的事件（例如 `set_flag(...)` / `wait_flag(...)`）用于表达后端需要满足的顺序约束。
-- 在需要显式同步的场景，使用 `TSYNC` 建立阶段间的顺序关系。
-
-事件模型可参考：`docs/coding/Event_zh.md`。
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/README.md b/docs/mkdocs/src/docs/isa/instruction-families/README.md
deleted file mode 100644
index cbecbd17..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/README.md
+++ /dev/null
@@ -1,80 +0,0 @@
-<!-- Generated from `docs/isa/instruction-families/README.md` -->
-
-# Instruction Families
-
-Family pages describe shared contracts that apply across related PTO operations. They sit between the model chapters and the per-op reference pages. For how individual opcode pages are structured, see [format of instruction descriptions](../reference/format-of-instruction-descriptions.md).
-
-## Overview
-
-PTO ISA organizes its instruction set into four families, each corresponding to one instruction surface:
-
-| Family | Prefix | Surface | Description |
-|--------|--------|---------|-------------|
-| [Tile Families](./tile-families.md) | `pto.t*` | Tile | Primary tile-oriented compute, data movement, layout operations |
-| [Vector Families](./vector-families.md) | `pto.v*` | Vector | Micro-instructions for vector pipeline execution |
-| [Scalar And Control Families](./scalar-and-control-families.md) | `pto.*` | Scalar/Control | Configuration, synchronization, DMA, predicate operations |
-| [Other Families](./other-families.md) | `pto.*` | Communication | Collective communication and runtime support |
-
-## What A Family Contract Must State
-
-Each family page provides the following:
-
-1. **Mechanism** — What the family is for, explained in one short section.
-2. **Shared operand model** — Common input/output roles and how they interact.
-3. **Common side effects** — Synchronization, ordering, or configuration effects shared by all ops in the family.
-4. **Shared constraints** — Legality rules that apply across the family.
-5. **Cases that are not allowed** — Conditions that are illegal for all ops in the family.
-6. **Target-profile narrowing** — Where A2/A3 and A5 differ in what the family accepts.
-7. **Operation list** — Pointers to each per-op page under `ops/`.
-
-Family pages do not repeat per-op details; they set the contract for the group.
-
-## Navigation Map
-
-```
-Instruction Families
-├── Tile Families
-│   ├── Sync and Config            → pto.tassign, pto.tsync, pto.tsethf32mode, pto.tsetfmatrix, etc.
-│   ├── Elementwise Tile-Tile      → pto.tadd, pto.tmul, pto.tcmp, pto.tcvt, pto.tsel, etc.
-│   ├── Tile-Scalar and Immediate  → pto.tadds, pto.tmuls, pto.tmins, pto.texpands, etc.
-│   ├── Reduce and Expand          → pto.trowsum, pto.tcolmax, pto.trowexpand, pto.tcolexpand, etc.
-│   ├── Memory and Data Movement   → pto.tload, pto.tstore, pto.tstore_fp, pto.mgather, pto.mscatter
-│   ├── Matrix and Matrix-Vector    → pto.tgemv, pto.tgemv_mx, pto.tmatmul, pto.tmatmul_acc, pto.tmatmul_bias, etc.
-│   ├── Layout and Rearrangement   → pto.tmov, pto.ttrans, pto.textract, pto.tinsert, pto.timg2col, etc.
-│   └── Irregular and Complex      → pto.tmrgsort, pto.tsort32, pto.tquant, pto.tprint, pto.tci, pto.ttri, etc.
-│
-├── Vector Families
-│   ├── Vector Load Store          → pto.vlds, pto.vldas, pto.vgather2, pto.vsld, pto.vsst, pto.vscatter, etc.
-│   ├── Predicate and Materialization → pto.vbr, pto.vdup
-│   ├── Unary Vector Ops          → pto.vabs, pto.vneg, pto.vexp, pto.vsqrt, pto.vrec, pto.vrelu, pto.vnot, etc.
-│   ├── Binary Vector Ops          → pto.vadd, pto.vsub, pto.vmul, pto.vmax, pto.vmin, pto.vand, pto.vor, etc.
-│   ├── Vec-Scalar Ops            → pto.vadds, pto.vmuls, pto.vshls, pto.vlrelu, etc.
-│   ├── Conversion Ops             → pto.vci, pto.vcvt, pto.vtrc
-│   ├── Reduction Ops              → pto.vcadd, pto.vcmax, pto.vcmin, pto.vcgadd, pto.vcgmax, pto.vcpadd, etc.
-│   ├── Compare and Select         → pto.vcmp, pto.vcmps, pto.vsel, pto.vselr, pto.vselrv2
-│   ├── Data Rearrangement         → pto.vintlv, pto.vdintlv, pto.vslide, pto.vshift, pto.vpack, pto.vzunpack, etc.
-│   └── SFU and DSA Ops           → pto.vprelu, pto.vexpdiff, pto.vaxpy, pto.vtranspose, pto.vsort32, etc.
-│
-├── Scalar And Control Families
-│   ├── Control and Configuration  → pto.nop, pto.barrier, pto.yield, etc.
-│   ├── Pipeline Sync             → pto.set_flag, pto.wait_flag, pto.pipe_barrier, pto.mem_bar, etc.
-│   ├── DMA Copy                  → pto.copy_gm_to_ubuf, pto.copy_ubuf_to_gm, pto.copy_ubuf_to_ubuf, etc.
-│   ├── Predicate Load Store       → pto.pld, pto.plds, pto.pldi, pto.pst, pto.psts, pto.psti, pto.pstu
-│   ├── Predicate Generation       → pto.pset_b8, pto.pge_b8, pto.plt_b8, pto.pand, pto.por, pto.pxor, pto.pnot, etc.
-│   ├── Shared Arithmetic          → Scalar arithmetic ops shared across surfaces
-│   └── Shared SCF               → Scalar structured control flow
-│
-└── Other Families
-    ├── Communication and Runtime  → pto.tbroadcast, pto.tget, pto.tput, pto.treduce, pto.tscatter, pto.tgather, pto.tnotify, pto.ttest, pto.twait, etc.
-    └── Non-ISA Supporting Ops    → pto.talias, pto.tconcat, pto.tfree, pto.tquant, pto.tdequant, pto.tpack, pto.thistogram, pto.tpop, pto.tpush, pto.trandom, etc.
-```
-
-## Normative Language
-
-Family pages use **MUST**, **SHOULD**, and **MAY** only for rules that a test, verifier, or review can check. Prefer plain language for explanation.
-
-## See Also
-
-- [Instruction surfaces](./instruction-surfaces/README.md) — High-level surface descriptions
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page format standard
-- [Diagnostics and illegal cases](../reference/diagnostics-and-illegal-cases.md) — What makes a PTO program illegal
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/README_zh.md b/docs/mkdocs/src/docs/isa/instruction-families/README_zh.md
deleted file mode 100644
index 615269c5..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/README_zh.md
+++ /dev/null
@@ -1,32 +0,0 @@
-<!-- Generated from `docs/isa/instruction-families/README_zh.md` -->
-
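-# Instruction Families
-
-This chapter describes the instruction families of PTO ISA: groups of instructions that share constraints and behavior. Each family defines the rules that all of its instructions follow.
-
-## In This Chapter
-
-- [Instruction families overview](instruction-families/README.md): full navigation map and family contract template
-- [Tile families](instruction-families/tile-families.md): the 8 families of the Tile surface (elementwise, reduction, layout, and so on)
-- [Vector families](instruction-families/vector-families.md): the 9 families of the Vector surface
-- [Scalar and control families](instruction-families/scalar-and-control-families.md): the 6 families for scalar, control, and configuration
-- [Other families](instruction-families/other-families.md): communication and other supporting families
-
-## How Families Relate to Surfaces
-
-- A **surface** classifies instructions by functional role (Tile / Vector / Scalar&Control / Other).
-- A **family** shares constraints, behavior patterns, and normative language; instructions in the same family share the common constraints stated on the family overview page.
-
-## What Each Family Must Define
-
-1. **Mechanism**: what the family is for
-2. **Shared Operand Model**: the common operand model and how operands interact
-3. **Common Side Effects**: side effects shared by all operations in the family
-4. **Shared Constraints**: legality rules that apply across the family
-5. **Cases That Are Not Allowed**: conditions prohibited for the whole family
-6. **Target-Profile Narrowing**: differences between A2/A3 and A5
-7. **Operation List**: links to each per-op page
-
-## Where This Chapter Fits
-
-This chapter is part of Chapter 7 (Instruction Set) of the manual. Family documentation sits one level of abstraction above the per-op pages; instructions in the same family share the common constraints stated on the family overview page.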
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/other-families.md b/docs/mkdocs/src/docs/isa/instruction-families/other-families.md
deleted file mode 100644
index 38ab0b1a..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/other-families.md
+++ /dev/null
@@ -1,53 +0,0 @@
-<!-- Generated from `docs/isa/instruction-families/other-families.md` -->
-
-# Other Families
-
-Other-family documentation covers communication and residual supporting behavior that is architecture-visible but does not fit cleanly into the main tile, vector, or scalar/control buckets.
-
-## Overview
-
-| Family | Description | Availability |
-|--------|-------------|------------|
-| [Communication and Runtime](./other/communication-and-runtime.md) | Inter-NPU collective communication | A2/A3, A5 |
-| [Non-ISA Supporting Ops](./other/non-isa-and-supporting-ops.md) | Convenience operations over tile sequences | All profiles |
-
-### Communication and Runtime
-
-These operations span multiple NPUs in a parallel group and require a `ParallelGroup` handle:
-
-| Category | Operations |
-|----------|-----------|
-| Collective broadcast | `tbroadcast`, `tscatter`, `tgather` |
-| Point-to-point | `tget`, `tget_async`, `tput`, `tput_async` |
-| Collective reduction | `treduce` |
-| Notification | `tnotify`, `ttest`, `twait` |
-
-**CPU simulator**: These ops are **not available** on the CPU simulator. Programs using them on CPU will produce a runtime error.
-
-### Non-ISA Supporting Operations
-
-These provide higher-level semantics over tile sequences or memory management. Some are convenience wrappers that expand to multiple core ISA operations:
-
-| Category | Operations |
-|----------|-----------|
-| Tile sequence | `talias`, `tconcat`, `taxpy` |
-| Memory management | `tfree` |
-| Quantization | `tquant`, `tdequant` |
-| Counting | `tpop`, `tpush` |
-| A5-only | `thistogram`, `tpack`, `trandom` |
-
-## Shared Constraints
-
-- **Communication ops** require all participating NPUs to call the operation with matching `ParallelGroup` handles.
-- **Non-root ranks** for collective ops must have destination buffers allocated and writable for the operation duration.
-- **CPU simulator** does not support communication ops.
-
-## Navigation
-
-See the [Other ISA reference](./other/README.md) for the full per-op reference.
-
-## See Also
-
-- [Other instruction surface](./instruction-surfaces/other-instructions.md) — High-level surface description
-- [Instruction families](./README.md) — All family groups
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page standard
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/other-families_zh.md b/docs/mkdocs/src/docs/isa/instruction-families/other-families_zh.md
deleted file mode 100644
index ab343cbc..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/other-families_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Other Families
-
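-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](other-families.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-- [Chinese chapter manual: instruction set overview](../../../manual/07-instructions_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.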
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/scalar-and-control-families.md b/docs/mkdocs/src/docs/isa/instruction-families/scalar-and-control-families.md
deleted file mode 100644
index e51a27e7..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/scalar-and-control-families.md
+++ /dev/null
@@ -1,61 +0,0 @@
-<!-- Generated from `docs/isa/instruction-families/scalar-and-control-families.md` -->
-
-# Scalar And Control Families
-
-Scalar and control family documentation covers the state-setting and control-shell parts of PTO that surround tile and vector payload execution.
-
-## Overview
-
-| Family | Description | Examples |
-|--------|-------------|----------|
-| [Control and Configuration](./scalar/control-and-configuration.md) | NOP, barrier, yield, and control setup | `nop`, `barrier`, `yield` |
-| [Pipeline Sync](./scalar/pipeline-sync.md) | Event and barrier synchronization between pipelines | `set_flag`, `wait_flag`, `pipe_barrier`, `mem_bar` |
-| [DMA Copy](./scalar/dma-copy.md) | GM↔UB memory transfer configuration and initiation | `copy_gm_to_ubuf`, `copy_ubuf_to_gm`, `set_loop_size_outtoub` |
-| [Predicate Load/Store](./scalar/predicate-load-store.md) | Mask-based scalar memory access | `pld`, `plds`, `pldi`, `pst`, `psts`, `psti`, `pstu` |
-| [Predicate Generation](./scalar/predicate-generation-and-algebra.md) | Predicate construction and algebra | `pset_b8`, `pge_b8`, `plt_b8`, `pand`, `por`, `pxor`, `pnot` |
-| Shared Arithmetic | Scalar arithmetic ops shared across surfaces | Scalar integer/float ops |
-| Shared SCF | Scalar structured control flow | Loops, conditionals |
-
-## Shared Constraints
-
-All scalar/control families must state:
-
-1. **Architectural state produced or consumed** — What state the operations create or modify.
-2. **Pipe and event spaces** — Which pipe/event identifiers are supported by the target profile.
-3. **Target-profile narrowing** — Where A2/A3 and A5 differ from the portable ISA contract.
-4. **Cases that are not allowed** — Conditions that are illegal across the family.
-
-## Shared Dialect Surfaces
-
-Some scalar/control ops belong to shared dialect surfaces (e.g., `scf.if`, `scf.for`) that extend the core ISA with structured control flow. These ops are marked as part of the documented PTO source surface, not as PTO-specific mnemonics.
-
-## Pipe Spaces by Profile
-
-| Pipe | CPU Sim | A2/A3 | A5 |
-|------|:-------:|:------:|:--:|
-| `PIPE_MTE1` | Simulated | Supported | Supported |
-| `PIPE_MTE2` | Simulated | Supported | Supported |
-| `PIPE_MTE3` | Simulated | Supported | Supported |
-| `PIPE_V` | Emulated | Emulated | Native |
-| `PIPE_M` | Simulated | Supported | Supported |
-
-## Event Ordering
-
-Scalar/control sync ops use a matching `set_flag`/`wait_flag` protocol:
-
-```
-Producer:  set_flag(src_pipe=PIPE_MTE2, dst_pipe=PIPE_V, event_id=EID0)
-Consumer:  wait_flag(src_pipe=PIPE_MTE2, dst_pipe=PIPE_V, event_id=EID0)
-```
-
-Waiting on an event that was never established by a matching producer is **illegal**.
-
-## Navigation
-
-See the [Scalar ISA reference](./scalar/README.md) for the full per-op reference under `scalar/ops/`.
-
-## See Also
-
-- [Scalar and control instruction surface](./instruction-surfaces/scalar-and-control-instructions.md) — High-level surface description
-- [Instruction families](./README.md) — All family groups
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page standard
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/scalar-and-control-families_zh.md b/docs/mkdocs/src/docs/isa/instruction-families/scalar-and-control-families_zh.md
deleted file mode 100644
index 1a319e1a..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/scalar-and-control-families_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Scalar And Control Families
-
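-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](scalar-and-control-families.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-- [Chinese chapter manual: instruction set overview](../../../manual/07-instructions_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.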
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/tile-families.md b/docs/mkdocs/src/docs/isa/instruction-families/tile-families.md
deleted file mode 100644
index ed8be2f4..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/tile-families.md
+++ /dev/null
@@ -1,77 +0,0 @@
-<!-- Generated from `docs/isa/instruction-families/tile-families.md` -->
-
-# Tile Families
-
-Tile-family documentation explains how `pto.t*` groups behave. Each family describes the shared mechanism, operand model, constraints, and target-profile narrowing before the reader drops into the standalone per-op pages under `tile/ops/`.
-
-## Overview
-
-| Family | Prefix | Description |
-|--------|--------|-------------|
-| [Sync and Config](./sync-and-config.md) | `pto.tassign`, `pto.tsync`, `pto.tset*` | Resource binding, event setup, mode control |
-| [Elementwise Tile-Tile](./elementwise-tile-tile.md) | `pto.tadd`, `pto.tmul`, `pto.tcmp`, `pto.tcvt` | Lane-wise binary and unary operations |
-| [Tile-Scalar and Immediate](./tile-scalar-and-immediate.md) | `pto.tadds`, `pto.tmuls`, `pto.tmins` | Tile combined with scalar or immediate operand |
-| [Reduce and Expand](./reduce-and-expand.md) | `pto.trowsum`, `pto.tcolmax`, `pto.trowexpand` | Row/column reductions and expansions |
-| [Memory and Data Movement](./memory-and-data-movement.md) | `pto.tload`, `pto.tstore`, `pto.mgather` | GM↔tile transfer, gather/scatter |
-| [Matrix and Matrix-Vector](./matrix-and-matrix-vector.md) | `pto.tgemv`, `pto.tmatmul`, `pto.tmatmul_bias` | GEMV, matmul, and variants |
-| [Layout and Rearrangement](./layout-and-rearrangement.md) | `pto.tmov`, `pto.ttrans`, `pto.textract` | Reshape, transpose, extract, insert |
-| [Irregular and Complex](./irregular-and-complex.md) | `pto.tmrgsort`, `pto.tquant`, `pto.tprint` | Sort, quantize, histogram, print |
-
-## Shared Constraints
-
-All tile families must state:
-
-1. **Valid-region interaction** — How the family interprets source tile valid regions relative to the destination.
-2. **Layout and role restrictions** — Which tile layouts, TileTypes, and roles the family accepts.
-3. **Target-profile restrictions** — Where A2/A3 and A5 differ from each other and from the portable ISA contract.
-4. **Cases that are not allowed** — Conditions that are illegal across the family.
-
-## Valid Region Compatibility
-
-All elementwise tile-tile operations iterate over the **destination tile's valid region**. For each lane `(r, c)` in the destination's valid region:
-
-- The corresponding lane `(r, c)` from each source tile is read, **regardless of whether that lane is within the source tile's own valid region**
-- Source tiles whose valid region does not cover `(r, c)` read **implementation-defined values**
-- Programs MUST NOT rely on any particular value being read from an out-of-region source lane unless the operation explicitly documents the behavior
-
-Producers that need defined behavior when valid regions differ SHOULD either:
-
-- Ensure all operands have matching valid regions, or
-- Use a fill/pad operation to expand the source before the elementwise operation
-
-## Saturating Variants
-
-Operations with the `C` suffix (for example `TADDC`, `TSUBC`) perform saturating arithmetic:
-
-| Variant | Base Op | Overflow/Underflow Behavior |
-|---------|---------|---------------------------|
-| `TADD` | Addition | Wrapping: result wraps around the type's representable range |
-| `TADDC` | Addition | Saturating: result is clamped to the type's min/max representable value |
-| `TSUB` | Subtraction | Wrapping: result wraps around the type's representable range |
-| `TSUBC` | Subtraction | Saturating: result is clamped to the type's min/max representable value |
-
-Programs MUST NOT assume that `TADDC` and `TADD` produce identical results when overflow does not occur; they MAY differ even for in-range values due to implementation precision choices.
-
-## Type Support by Profile
-
-| Element Type | CPU Simulator | A2/A3 | A5 |
-|------------|:-------------:|:------:|:--:|
-| f32 (float) | Yes | Yes | Yes |
-| f16 (half) | Yes | Yes | Yes |
-| bf16 (bfloat16_t) | Yes | Yes | Yes |
-| i8/u8 | Yes | Yes | Yes |
-| i16/u16 | Yes | Yes | Yes |
-| i32/u32 | Yes | Yes | Yes |
-| i64/u64 | Yes | Yes | Yes |
-| f8e4m3 / f8e5m2 | No | No | Yes |
-| hifloat8_t / float4_e* | No | No | Yes |
-
-## Navigation
-
-See the [Tile ISA reference](./tile/README.md) for the full per-op reference under `tile/ops/`.
-
-## See Also
-
-- [Tile instruction surface](./instruction-surfaces/tile-instructions.md) — High-level surface description
-- [Instruction families](./README.md) — All family groups
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page standard
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/tile-families_zh.md b/docs/mkdocs/src/docs/isa/instruction-families/tile-families_zh.md
deleted file mode 100644
index 0d7acd5e..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/tile-families_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Tile Families
-
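-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tile-families.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-- [Chinese chapter manual: instruction set overview](../../../manual/07-instructions_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.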
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/vector-families.md b/docs/mkdocs/src/docs/isa/instruction-families/vector-families.md
deleted file mode 100644
index aabaf0d6..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/vector-families.md
+++ /dev/null
@@ -1,76 +0,0 @@
-<!-- Generated from `docs/isa/instruction-families/vector-families.md` -->
-
-# Vector Families
-
-Vector-family documentation follows the PTOAS VPTO family shape. Vector families cover the complete set of `pto.v*` micro-instructions that operate on vector registers.
-
-## Overview
-
-| Family | Description | Examples |
-|--------|-------------|----------|
-| [Vector Load/Store](./vector/vector-load-store.md) | UB↔vector register transfer with distribution modes | `vlds`, `vldas`, `vgather2`, `vsld`, `vsst`, `vscatter` |
-| [Predicate and Materialization](./vector/predicate-and-materialization.md) | Vector broadcast and duplication | `vbr`, `vdup` |
-| [Unary Vector Ops](./vector/unary-vector-ops.md) | Single-operand lane-wise operations | `vabs`, `vneg`, `vexp`, `vsqrt`, `vrec`, `vrelu`, `vnot` |
-| [Binary Vector Ops](./vector/binary-vector-ops.md) | Two-operand lane-wise operations | `vadd`, `vsub`, `vmul`, `vdiv`, `vmax`, `vmin` |
-| [Vec-Scalar Ops](./vector/vec-scalar-ops.md) | Vector combined with scalar operand | `vadds`, `vmuls`, `vshls`, `vlrelu` |
-| [Conversion Ops](./vector/conversion-ops.md) | Type conversion between numeric types | `vci`, `vcvt`, `vtrc` |
-| [Reduction Ops](./vector/reduction-ops.md) | Cross-lane reductions | `vcadd`, `vcmax`, `vcmin`, `vcgadd`, `vcgmax` |
-| [Compare and Select](./vector/compare-select.md) | Comparison and conditional lane selection | `vcmp`, `vcmps`, `vsel`, `vselr`, `vselrv2` |
-| [Data Rearrangement](./vector/data-rearrangement.md) | Lane permutation, interleaving, packing | `vintlv`, `vdintlv`, `vslide`, `vshift`, `vpack`, `vzunpack` |
-| [SFU and DSA Ops](./vector/sfu-and-dsa-ops.md) | Special function units and DSA-style operations | `vprelu`, `vexpdiff`, `vaxpy`, `vtranspose`, `vsort32` |
-
-## Shared Constraints
-
-All vector families must state:
-
-1. **Vector width** — `N` is determined by the element type (f32: N=64, f16/bf16: N=128, i8/u8: N=256).
-2. **Predicate behavior** — How masked-off lanes behave in compute and load/store ops.
-3. **Pointer space** — All vector load/store addresses are in UB space (`!pto.ptr<T, ub>`).
-4. **Pipeline handoff** — How data moves between DMA (GM↔UB) and vector register ops.
-5. **Target-profile narrowing** — A5-only ops (FP8 types, unaligned store, pair select).
-
-## Vector Width and Type Support by Profile
-
-| Element Type | Vector Width N | CPU Sim | A2/A3 | A5 |
-|-------------|:-------------:|:-------:|:------:|:--:|
-| f32 | 64 | Yes | Yes | Yes |
-| f16 | 128 | Yes | Yes | Yes |
-| bf16 | 128 | Yes | Yes | Yes |
-| i16 / u16 | 128 | Yes | Yes | Yes |
-| i8 / u8 | 256 | Yes | Yes | Yes |
-| f8e4m3 / f8e5m2 | 256 | No | No | Yes |
-| hifloat8_t / float4_e* | 256 | No | No | Yes |
-
-## Mask Behavior
-
-Vector ops gated by a predicate mask follow these rules:
-
-- Lanes where mask bit = **1**: operation executes normally.
-- Lanes where mask bit = **0**: result is defined but operation-specific; programs MUST NOT rely on any specific value.
-
-## Pipeline Handoff
-
-Vector surface data movement requires explicit synchronization between DMA and vector register ops:
-
-```
-copy_gm_to_ubuf ──(set_flag)──► wait_flag ──► vlds ──► vadd ──► vsts ──(set_flag)──► wait_flag ──► copy_ubuf_to_gm
-```
-
-## A5-Only Operations
-
-The following ops require A5 profile:
-
-- `vstu`, `vstus`, `vstur` — Vector unaligned store with alignment state
-- `vselr`, `vselrv2` — Pair select operations
-- `thistogram`, `tpack`, `trandom` — Tile ops but listed under SFU category
-- All ops using FP8 element types (e4m3, e5m2)
-
-## Navigation
-
-See the [Vector ISA reference](./vector/README.md) for the full per-op reference under `vector/ops/`.
-
-## See Also
-
-- [Vector instruction surface](./instruction-surfaces/vector-instructions.md) — High-level surface description
-- [Instruction families](./README.md) — All family groups
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page standard
diff --git a/docs/mkdocs/src/docs/isa/instruction-families/vector-families_zh.md b/docs/mkdocs/src/docs/isa/instruction-families/vector-families_zh.md
deleted file mode 100644
index 0071bc3a..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-families/vector-families_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Vector Families
-
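-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vector-families.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-- [Chinese chapter manual: instruction set overview](../../../manual/07-instructions_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.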
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/README.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/README.md
deleted file mode 100644
index b898b4f5..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/README.md
+++ /dev/null
@@ -1,121 +0,0 @@
-<!-- Generated from `docs/isa/instruction-surfaces/README.md` -->
-
-# Instruction Surfaces
-
-PTO ISA is organized into four instruction surfaces, each representing a distinct mechanism, programming model, and operand domain. Understanding the surface distinction is essential before reading the per-op reference pages.
-
-## Overview
-
-| Surface | Prefix | Pipeline | Primary Role | Operands |
-|---------|--------|----------|-------------|----------|
-| [Tile Instructions](./tile-instructions.md) | `pto.t*` | All (via tile buffers) | Tile-oriented compute, data movement, layout transforms, synchronization | `!pto.tile<...>`, `!pto.tile_buf<...>`, `!pto.partition_tensor_view<...>` |
-| [Vector Instructions](./vector-instructions.md) | `pto.v*` | Vector Pipe (V) | Vector micro-instructions: lane-level compute, masking, alignment state | `!pto.vreg<NxT>`, `!pto.mask`, `!pto.ptr<T, ub>` |
-| [Scalar And Control](./scalar-and-control-instructions.md) | `pto.*` | Scalar Unit, DMA | Configuration, control flow, DMA setup, synchronization, predicates | Scalar regs, pipe ids, event ids, buffer ids |
-| [Other Instructions](./other-instructions.md) | `pto.*` | Inter-NPU | Collective communication, runtime support, tile sequence operations | `!pto.group<N>`, tile sequences, allocation handles |
-
-## Why Surfaces Exist
-
-PTO has four surfaces because different parts of the architecture expose different kinds of state. Mixing tile-level and vector-level state in one opcode space would blur the ISA contract.
-
-### Tile Surface (`pto.t*`)
-
-Tile instructions reason about tiles: bounded multi-dimensional arrays with architecturally visible shape, layout, role, and valid-region metadata. The primary operands are tile registers (`!pto.tile<T, R, C>` or `!pto.tile_buf<...>`). Tile instructions produce destination tiles, change valid-region interpretations, or establish synchronization edges.
-
-```
-Input:   Tile operands, scalar modifiers, GlobalTensor views
-Output:  Tile payload, synchronization edges
-Domain:  Valid regions, tile layouts, tile shapes, location intents
-```
-
-### Vector Surface (`pto.v*`)
-
-Vector instructions expose the vector pipeline directly. Operands are vector registers (`!pto.vreg<NxT>`), scalar values, and predicate masks. Vector instructions are the fine-grained compute layer beneath tile instructions. The full register width is always meaningful — there is no valid-region abstraction at the vector level.
-
-```
-Input:   Vector registers, scalar registers, predicates, memory addresses
-Output:  Vector registers, scalar registers, memory writes
-Domain:  Vector length N, lane masks, alignment state, distribution modes
-```
-
-### Scalar And Control Surface (`pto.*`)
-
-Scalar/control instructions handle configuration, control flow, synchronization, DMA setup, and predicate state. They set up the execution shell around tile and vector payload regions. Most do not produce tile or vector payloads; they produce control effects, event tokens, or predicate masks.
-
-```
-Input:   Scalar registers, pipe ids, event ids, buffer ids, DMA loop parameters
-Output:  Control state, event tokens, predicate masks, configured DMA state
-Domain:  Configuration tuples, pipe/event spaces, DMA loop sizes and strides
-```
-
-### Other Surface (`pto.*`)
-
-Communication and supporting operations carry their own side effects and ordering rules that do not fit into the tile/vector/scalar model. Examples include collective broadcasts across NPUs and alias/concatenation operations on tile sequences.
-
-```
-Input:   Collective groups, tile sequences, allocation handles
-Output:  Collective results, modified tile sequences, allocation state
-Domain:  Parallel groups, tile sequences, memory allocation
-```
-
-## Surface Data Flow
-
-The four surfaces form a layered execution model:
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│  GM (off-chip device memory)                                │
-└──────────┬──────────────────────────────────────┬───────────┘
-           │                                      │
-           │  Tile Surface: TLOAD/TSTORE          │
-           │  Vector Surface: copy_gm_to_ubuf / copy_ubuf_to_gm
-           ▼                                      ▼
-┌─────────────────────────────────────────────────────────────┐
-│  Unified Buffer (UB, 256 KB on-chip)                      │
-│  !pto.ptr<T, ub> — shared staging area                    │
-└──────┬──────────────────────────────────────────┬──────────┘
-       │                                      │
-       │  Tile Surface: implicit tile↔UB       │
-       │  Vector Surface: vlds / vsts          │
-       ▼                                      ▼
-┌─────────────────┐              ┌─────────────────────────────┐
-│  Tile Buffers   │              │  Vector Registers          │
-│  !pto.tile_buf  │              │  !pto.vreg<NxT>           │
-│  (Vec/Mat/Acc/  │              │  (N lanes)                │
-│   Left/Right)   │              │                           │
-└────────┬─────────┘              └──────────────┬────────────┘
-         │                                     │
-         │  Tile Surface: pto.t* ops           │  Vector Surface: pto.v* ops
-         │  (TMATMUL via Mat/Acc slots)       │  (vadd, vmul, etc.)
-         │                                     │
-         │  ◄── Matrix Multiply Unit (M)       │  ◄── Vector Pipeline (V)
-         └─────────────────────────────────────┘
-                       │
-                       │  Tile Surface: TSTORE
-                       │  Vector Surface: vsts → copy_ubuf_to_gm
-                       ▼
-                  [UB → GM]
-```
-
-## Instruction Count Summary
-
-| Surface | Families | Operations | Notes |
-|---------|----------|------------|-------|
-| Tile | 8 | ~120 | Full matmul, elementwise, reduce, layout |
-| Vector | 9 | ~99 | Full vector compute, load/store, SFU |
-| Scalar/Control | 6 | ~60 | Sync, DMA, predicates |
-| Other/Communication | 2 | ~24 | Collective ops, supporting ops |
-
-## Normative Language
-
-Instruction text always means what happens in the declared valid region unless the page explicitly defines behavior outside it. PTO is **tile-first** and **valid-region-first**.
-
-Use **MUST**, **SHOULD**, and **MAY** only for rules that a test, verifier, or review can check. Prefer plain language for explanation.
-
-## See Also
-
-- [Instruction families](./instruction-families/README.md) — Group-level contracts
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page format standard
-- [Tile ISA reference](./tile/README.md) — Tile surface per-op reference
-- [Vector ISA reference](./vector/README.md) — Vector surface per-op reference
-- [Scalar ISA reference](./scalar/README.md) — Scalar surface per-op reference
-- [Other ISA reference](./other/README.md) — Communication and supporting ops
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/README_zh.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/README_zh.md
deleted file mode 100644
index 7f528160..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/README_zh.md
+++ /dev/null
@@ -1,24 +0,0 @@
-<!-- Generated from `docs/isa/instruction-surfaces/README_zh.md` -->
-
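-# Instruction Surfaces
-
-This chapter describes the instruction surfaces of PTO ISA: a classification of instructions by functional role. Each surface is the set of operations that corresponds to a distinct execution path.
-
-## In This Chapter
-
-- [Instruction surfaces overview](instruction-surfaces/README.md): overall description of the four surfaces, data-flow diagram, and operand comparison table
-- [Tile instruction surface](instruction-surfaces/tile-instructions.md): the `pto.t*` per-tile operation surface
-- [Vector instruction surface](instruction-surfaces/vector-instructions.md): the `pto.v*` vector micro-instruction surface
-- [Scalar and control instruction surface](instruction-surfaces/scalar-and-control-instructions.md): scalar, control, and configuration operation surface
-- [Other instruction surface](instruction-surfaces/other-instructions.md): communication, debugging, and other supporting operations
-
-## Suggested Reading Order
-
-Read in the following order:
-
-1. Start with the [instruction surfaces overview](instruction-surfaces/README.md) to understand the overall structure and data-flow relationships of the four surfaces: Tile / Vector / Scalar&Control / Other.
-2. Then go into the specific surface pages as needed to learn each surface's operand types, constraints, and normative language.
-
-## Where This Chapter Fits
-
-This chapter is part of Chapter 7 (Instruction Set) of the manual. A surface is a mid-level abstraction between the programming model and individual instructions that helps readers locate instructions by functional role.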
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/other-instructions.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/other-instructions.md
deleted file mode 100644
index 9e4f2f2e..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/other-instructions.md
+++ /dev/null
@@ -1,139 +0,0 @@
-<!-- Generated from `docs/isa/instruction-surfaces/other-instructions.md` -->
-
-# Other Instruction Surface
-
-The "other" surface covers operations that do not fit cleanly into the tile, vector, or scalar/control buckets. This includes inter-NPU communication, collective operations, and supporting operations that extend the core ISA.
-
-## Surface Overview
-
-Communication and supporting operations carry their own side effects and ordering rules that differ from the standard tile/vector/scalar model. These operations are architecturally visible but serve a different role:
-
-- **Communication operations** express inter-NPU data exchange and collective reduction across parallel groups.
-- **Supporting operations** provide convenience semantics over tile sequences or memory allocation.
-
-These operations are **NOT available on the CPU simulator**. They require A2/A3 or A5 profiles with inter-NPU interconnect hardware.
-
-## Instruction Classes
-
-| Class | Description | Availability |
-|-------|-------------|------------|
-| Communication and Runtime | Inter-NPU collective communication | A2/A3, A5 |
-| Non-ISA Supporting Ops | Convenience operations over tile sequences | All profiles |
-
-### Communication And Runtime
-
-These operations span multiple NPUs in a parallel group. They require a `ParallelGroup` handle and involve network or interconnect traffic.
-
-| Operation | Description |
-|-----------|-------------|
-| `tbroadcast` | Broadcast data from root NPU to all ranks in parallel group |
-| `tget` | Get data from a remote NPU |
-| `tget_async` | Asynchronous variant of `tget` |
-| `tput` | Put data to a remote NPU |
-| `tput_async` | Asynchronous variant of `tput` |
-| `treduce` | Collective reduction across all ranks in parallel group |
-| `tscatter` | Scatter data from root NPU to all ranks |
-| `tgather` | Gather data from all ranks to root NPU |
-| `tnotify` | Notify other ranks of an event |
-| `ttest` | Test if a notification has been received |
-| `twait` | Wait for a notification |
-
-### Non-ISA Supporting Operations
-
-These operations provide higher-level semantics over tile sequences or memory management. Some are convenience wrappers that expand to multiple core ISA operations.
-
-| Operation | Description | Target Profile |
-|-----------|-------------|------------|
-| `talias` | Create an alias view of a tile without copying data | All |
-| `taxpy` | Fused multiply-add: `dst = src0 * scalar + src1` | All |
-| `tconcat` | Concatenate two tiles along a specified dimension | All |
-| `tdequant` | Dequantize a tile from quantized format | All |
-| `tfree` | Free a previously allocated tile or buffer | All |
-| `thistogram` | Compute histogram of tile values | A5 |
-| `tpack` | Pack multiple tiles into a single tile buffer | A5 |
-| `tpop` | Population count of predicate mask | All |
-| `tpush` | Push count of predicate mask | All |
-| `trandom` | Fill tile with random values | A5 |
-| `tquant` | Quantize a tile to quantized format | All |
-
-## Inputs
-
-Other instructions consume combinations of:
-
-- Parallel group handles (`!pto.group<N>`)
-- Tile operands or tile sequences
-- Scalar parameters (reduction operator, axis, scale/zero-point for quant, etc.)
-- Allocation handles
-
-## Expected Outputs
-
-Other instructions produce:
-
-- Modified tiles or tile sequences
-- Scalar results (e.g., population count from `tpop`)
-- Allocation state changes (e.g., freed buffers from `tfree`)
-
-## Side Effects
-
-| Class | Side Effects |
-|-------|-------------|
-| Communication And Runtime | Network/interconnect traffic; ordering across NPUs |
-| Non-ISA Supporting Ops | May copy, allocate, or free memory; `tquant`/`tdequant` modify numeric representation |
-
-## Constraints
-
-- **Communication operations** require all participating NPUs to call the operation with matching `ParallelGroup` handles.
-- **Non-root ranks** for collective operations (broadcast, scatter) must ensure destination buffers are allocated and writable for the duration of the operation.
-- **Tile shape compatibility** for `tconcat` requires compatible dimensions along the concatenation axis.
-- **Quantization parameters** for `tquant`/`tdequant` must be valid scale/zero-point values.
-- **CPU simulator** does not support communication operations; using them produces a runtime error.
-
-## Cases That Are Not Allowed
-
-- Calling collective operations with mismatched `ParallelGroup` handles across ranks.
-- Using uninitialized or improperly sized destination buffers for communication operations.
-- Calling `tfree` on a tile that is still in use.
-- Relying on `taxpy` being expanded to separate `tmul`/`tadd` on backends that do not implement it natively.
-- Using A5-only operations (`thistogram`, `tpack`, `trandom`) on CPU or A2/A3 profiles.
-
-## Syntax
-
-### Assembly Form (PTO-AS)
-
-```asm
-tbroadcast %group, %src : (!pto.group<8>, !pto.tile<f32, 16, 16>)
-treduce %group, %src, %dst : (!pto.group<8>, !pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 16>) {op = "sum"}
-```
-
-### SSA Form (AS Level 1)
-
-```mlir
-%result = pto.tbroadcast %group, %src
-    : (!pto.group<8>, !pto.tile<f32, 16, 16>) -> !pto.tile<f32, 16, 16>
-```
-
-See [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md) for the full syntax specification.
-
-## C++ Intrinsic
-
-Communication and supporting operations are declared in `include/pto/comm/pto_comm_inst.hpp` and `include/pto/common/pto_instr.hpp`:
-
-```cpp
-#include <pto/comm/pto_comm_inst.hpp>
-using namespace pto::comm;
-
-// Broadcast across a parallel group
-template <typename GroupType, typename GlobalData, typename TileData>
-PTO_INST RecordEvent TBROADCAST(GroupType& group, GlobalData& src, TileData& stagingTile);
-
-// Collective reduction
-template <typename GroupType, typename GlobalData, typename TileData, ReduceOp Op>
-PTO_INST RecordEvent TREDUCE(GroupType& group, GlobalData& src, TileData& stagingTile);
-```
-
-## See Also
-
-- [Other ISA reference](./other/README.md) — Full communication and supporting ops reference
-- [Other families](./instruction-families/other-families.md) — Family-level contracts
-- [Instruction families](./instruction-families/README.md) — All family groups
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page standard
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/other-instructions_zh.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/other-instructions_zh.md
deleted file mode 100644
index 95beed41..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/other-instructions_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Other Instruction Surface
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](other-instructions.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-- [中文章节手册指令集概述](../../../manual/07-instructions_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/scalar-and-control-instructions.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/scalar-and-control-instructions.md
deleted file mode 100644
index 95192ec5..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/scalar-and-control-instructions.md
+++ /dev/null
@@ -1,166 +0,0 @@
-<!-- Generated from `docs/isa/instruction-surfaces/scalar-and-control-instructions.md` -->
-
-# Scalar And Control Instruction Surface
-
-`pto.*` (scalar/control) is the configuration, synchronization, and scalar-orchestration surface of PTO ISA. It sets up the execution shell around tile and vector payload regions.
-
-## Surface Overview
-
-Scalar/control instructions do not produce tile or vector payloads. Instead, they produce:
-
-- Control effects (pipeline barriers, control flow)
-- Event tokens for explicit producer-consumer ordering
-- Predicate masks for conditional execution
-- DMA configuration state for memory transfer
-- Scalar register updates
-
-**Scalar operands** are single-value registers. Scalar instructions are the glue that orchestrates tile and vector work.
-
-## Instruction Classes
-
-| Class | Description | Examples |
-|-------|-------------|----------|
-| Control and Configuration | NOP, barrier, yield, and control setup | `nop`, `barrier`, `yield` |
-| Pipeline Sync | Event and barrier synchronization between pipelines | `set_flag`, `wait_flag`, `pipe_barrier` |
-| DMA Copy | GM↔UB memory transfer configuration and initiation | `copy_gm_to_ubuf`, `copy_ubuf_to_gm`, `set_loop_size_outtoub` |
-| Predicate Load/Store | Mask-based scalar memory access | `pld`, `plds`, `pdi`, `pst`, `psts`, `psti`, `pstu` |
-| Predicate Generation | Predicate construction and algebra | `pset_b8`, `pge_b8`, `plt_b8`, `pand`, `por`, `pxor`, `pnot`, `pintlv_b16` |
-
-## Inputs
-
-Scalar/control instructions consume combinations of:
-
-- Scalar registers (`!pto.scalar<T>` or built-in C++ types)
-- Pipe identifiers: `PIPE_MTE1`, `PIPE_MTE2`, `PIPE_MTE3`, `PIPE_V`, `PIPE_M`
-- Event identifiers: `EVENT_ID0`–`EVENT_ID15` (profile-specific range)
-- Buffer identifiers: UB buffer slots
-- Memory addresses: `!pto.ptr<T, ub>` or `!pto.ptr<T, gm>`
-- DMA loop sizes and stride values
-
-## Expected Outputs
-
-Scalar/control instructions produce:
-
-- Control state changes (pipeline barriers, control flow)
-- Event tokens for explicit synchronization (`RecordEvent`)
-- Predicate masks (`!pto.mask`)
-- Configured DMA state ready for transfer
-- UB buffer handles
-
-## Side Effects
-
-Scalar/control instructions may have significant architectural side effects:
-
-| Class | Side Effects |
-|-------|-------------|
-| Pipeline Sync | Establishes producer-consumer ordering; may stall pipeline stages |
-| DMA Copy | Initiates memory transfer between GM and UB; may stall DMA engine |
-| Predicate Load/Store | Reads from or writes to scalar memory locations in UB |
-
-## Event Model
-
-Scalar/control sync operations use an event-based model. Events are identified by a triple `(src_pipe, dst_pipe, event_id)`:
-
-| Field | Values | Meaning |
-|-------|--------|---------|
-| `src_pipe` | `PIPE_MTE1`, `PIPE_MTE2`, `PIPE_MTE3`, `PIPE_V`, `PIPE_M` | Source pipeline that produces the event |
-| `dst_pipe` | `PIPE_MTE1`, `PIPE_MTE2`, `PIPE_MTE3`, `PIPE_V`, `PIPE_M` | Destination pipeline that consumes the event |
-| `event_id` | 0–15 (profile-specific) | Event slot identifier |
-
-```
-Producer pipeline                          Consumer pipeline
-  │                                          │
-  │  issue DMA or compute                    │
-  │  ▼                                       │
-  │  set_flag(src_pipe, dst_pipe, EVENT_ID)  │
-  │  (produces the event)                    │
-  │                                          │
-  │                            wait_flag(src_pipe, dst_pipe, EVENT_ID)
-  │                            (consumes the event)
-  │                                          │
-  │  data/result available                   │
-  ▼                                          ▼
-```
-
-## Pipe Spaces by Target Profile
-
-| Pipe | CPU Sim | A2/A3 | A5 |
-|------|:-------:|:------:|:--:|
-| `PIPE_MTE1` | Simulated | Supported | Supported |
-| `PIPE_MTE2` | Simulated | Supported | Supported |
-| `PIPE_MTE3` | Simulated | Supported | Supported |
-| `PIPE_V` | Emulated | Emulated | Native |
-| `PIPE_M` | Simulated | Supported | Supported |
-
-## Constraints
-
-- **Pipe/event spaces** differ between A2/A3 and A5 profiles; portable code must use the documented PTO contract plus the selected profile.
-- **Event ordering** requires matching `set_flag`/`wait_flag` pairs; waiting on an unestablished event is illegal.
-- **DMA parameters** must be configured before initiating transfer; incorrect loop sizes or strides produce undefined results.
-- **Predicate width** must match the expected mask width for the target profile.
-- **Pipe identifiers** not supported by the target profile produce verification errors.
-
-## Cases That Are Not Allowed
-
-- Using pipe or event identifiers not supported by the target profile.
-- Waiting on an event that was never established by a matching producer.
-- Configuring DMA with inconsistent loop sizes and strides.
-- Mixing predicate widths that do not match the target operation.
-- Issuing a vector load before `copy_gm_to_ubuf` completes without an intervening `wait_flag`.
-- Issuing `copy_ubuf_to_gm` before vector store completes without an intervening `wait_flag`.
-
-## Syntax
-
-### Assembly Form (PTO-AS)
-
-```asm
-set_flag PIPE_MTE2, PIPE_V, EVENT_ID0
-wait_flag PIPE_MTE2, PIPE_V, EVENT_ID0
-pipe_barrier PIPE_V
-```
-
-### SSA Form (AS Level 1)
-
-```mlir
-pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-pto.pipe_barrier["PIPE_V"]
-```
-
-See [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md) for the full syntax specification.
-
-## C++ Intrinsic
-
-Scalar/control instructions are available as C++ intrinsics declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Set synchronization flag
-PTO_INST void set_flag(pipe_t src_pipe, pipe_t dst_pipe, event_t event_id);
-
-// Wait on synchronization flag
-PTO_INST void wait_flag(pipe_t src_pipe, pipe_t dst_pipe, event_t event_id);
-
-// Pipeline barrier
-PTO_INST void pipe_barrier(pipe_t pipe);
-
-// DMA copy: GM → UB
-PTO_INST void copy_gm_to_ubuf(ub_ptr dst, gm_ptr src, uint64_t sid,
-                              uint64_t n_burst, uint64_t len_burst,
-                              uint64_t dst_stride, uint64_t src_stride);
-
-// DMA copy: UB → GM
-PTO_INST void copy_ubuf_to_gm(gm_ptr dst, ub_ptr src, uint64_t sid,
-                              uint64_t n_burst, uint64_t len_burst,
-                              uint64_t reserved, uint64_t dst_stride, uint64_t src_stride);
-```
-
-## See Also
-
-- [Scalar ISA reference](./scalar/README.md) — Full scalar family reference
-- [Scalar families](./instruction-families/scalar-and-control-families.md) — Family-level contracts
-- [Instruction families](./instruction-families/README.md) — All family groups
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page standard
-- [Ordering and Synchronization](../machine-model/ordering-and-synchronization.md) — PTO memory and execution ordering model
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/scalar-and-control-instructions_zh.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/scalar-and-control-instructions_zh.md
deleted file mode 100644
index 39d33979..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/scalar-and-control-instructions_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Scalar And Control Instruction Surface
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](scalar-and-control-instructions.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-- [中文章节手册指令集概述](../../../manual/07-instructions_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/tile-instructions.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/tile-instructions.md
deleted file mode 100644
index 3ab2a6aa..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/tile-instructions.md
+++ /dev/null
@@ -1,144 +0,0 @@
-<!-- Generated from `docs/isa/instruction-surfaces/tile-instructions.md` -->
-
-# Tile Instruction Surface
-
-`pto.t*` is the tile-oriented surface of PTO ISA. It defines how tile payloads are loaded from global memory, transformed element-wise, reduced, expanded, synchronized, and written back to global memory.
-
-## Surface Overview
-
-Tile instructions operate over tiles whose shapes, layouts, roles, and valid regions are architecturally visible. The result is usually another tile, a changed valid-region interpretation, or a synchronization edge.
-
-**Tile operands** (`!pto.tile<T, R, C>` or `!pto.tile_buf<...>`) are the primary operands of this surface. Unlike vector registers or scalar registers, tiles carry explicit metadata about which elements are valid and what layout the data uses.
-
-### Data Flow
-
-```
-GlobalMemory
-    │
-    │  TLOAD: GM → UB → Tile Buffer
-    ▼
-Tile Buffer ──► Tile Compute ──► Tile Buffer ──► TSTORE: Tile Buffer → UB → GM
-(Vec/Mat/Acc)    (pto.tadd, TMATMUL, etc.)       (Vec/Mat/Acc)
-```
-
-## Instruction Classes
-
-| Class | Description | Examples |
-|-------|-------------|----------|
-| Sync and Config | Resource binding, event setup, mode control | `tassign`, `tsync`, `tsethf32mode`, `tsetfmatrix` |
-| Elementwise Tile-Tile | Lane-wise binary and unary operations | `tadd`, `tmul`, `tcmp`, `tcvt`, `tsel`, `trelu` |
-| Tile-Scalar and Immediate | Tile combined with scalar or immediate operands | `tadds`, `tmuls`, `tlrelu`, `tcmps` |
-| Reduce and Expand | Row/column reductions and expansions | `trowsum`, `tcolmax`, `trowexpand`, `tcolexpand` |
-| Memory and Data Movement | GM↔tile transfer, gather/scatter | `tload`, `tstore`, `tstore_fp`, `mgather`, `mscatter` |
-| Matrix and Matrix-Vector | GEMV, matmul, and variants | `tgemv`, `tgemv_mx`, `tmatmul`, `tmatmul_acc`, `tmatmul_bias` |
-| Layout and Rearrangement | Reshape, transpose, extract, insert | `tmov`, `ttrans`, `treshape`, `textract`, `tinsert`, `timg2col` |
-| Irregular and Complex | Sort, quantize, histogram, print | `tmrgsort`, `tsort32`, `tquant`, `thistogram`, `tprint` |
-
-## Inputs
-
-Tile instructions consume combinations of:
-
-- Source tiles (read-only operands): `!pto.tile<...>`
-- Destination tiles (write-only or read-write operands): `!pto.tile_buf<...>`
-- Scalar modifiers or immediate operands
-- GM-facing views: `!pto.partition_tensor_view<...>`
-- Optional `RecordEvent` event tokens or `WaitEvents...` for chaining
-
-## Expected Outputs
-
-Tile instructions produce:
-
-- Destination tile payloads carrying the result
-- Changed valid-region interpretations
-- Explicit state updates (e.g., assigned addresses) or synchronization edges (`RecordEvent`)
-
-## Side Effects
-
-Some tile instructions have architectural side effects beyond the destination tile:
-
-| Class | Side Effects |
-|-------|-------------|
-| Memory and Data Movement | Reads from or writes to GM-visible storage (`TLOAD`, `TSTORE`) |
-| Sync and Config | Establishes synchronization edges or binds tile addresses (`TASSIGN`, `TSYNC`) |
-| Irregular | May produce debug output (`TPRINT`) or modify allocation state |
-
-## Valid Region Model
-
-All tile elementwise operations iterate over the **destination tile's valid region**. For each lane `(r, c)` in the destination's valid region:
-
-- The corresponding lane `(r, c)` from each source tile is read, **regardless of whether that lane is within the source tile's own valid region**
-- Source tiles whose valid region does not cover `(r, c)` read **implementation-defined values**
-- Programs MUST NOT rely on any particular value being read from an out-of-region source lane unless the operation explicitly documents the behavior
-
-See [Tiles And Valid Regions](../programming-model/tiles-and-valid-regions.md) for the full model.
-
-## Constraints
-
-- **Tile legality** depends on more than dtype; shape, layout, role, and valid region all matter.
-- Operations with multiple tiles must define valid-region interaction explicitly.
-- Some tile families are profile-gated: MX format tiles (`Left`/`Right`) are A5-only; FP8 types are A5-only.
-- Tile instructions do **not** inherit vector-register semantics; they operate on architecturally visible tile state.
-- No implicit broadcasting: all source tiles must have shapes compatible with the destination tile.
-
-## Cases That Are Not Allowed
-
-- Reading undefined out-of-valid-region data as if it were meaningful.
-- Assuming tile instructions inherit vector-register semantics.
-- Relying on target-specific support gaps as universal architecture rules.
-- Assuming implicit broadcasting, reshaping, or valid-region repair unless documented.
-- Using MX format tiles (`TileType::Left`/`Right`) on CPU or A2/A3 profiles.
-- Using FP8 element types on CPU or A2/A3 profiles.
-
-## Syntax
-
-### Assembly Form (PTO-AS)
-
-```asm
-tadd %dst, %src0, %src1 : !pto.tile<f32, 16, 16>
-```
-
-### SSA Form (AS Level 1)
-
-```mlir
-%dst = pto.tadd %src0, %src1
-    : (!pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 16>)
-    -> !pto.tile<f32, 16, 16>
-```
-
-### DPS Form (AS Level 2)
-
-```mlir
-pto.tadd ins(%src0, %src1 : !pto.tile_buf<f32, 16, 16>, !pto.tile_buf<f32, 16, 16>)
-          outs(%dst : !pto.tile_buf<f32, 16, 16>)
-```
-
-See [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md) for the full syntax specification.
-
-## C++ Intrinsic
-
-Tile instructions are available as C++ intrinsics declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Elementwise: tile = tile op tile
-template <typename TileDst, typename TileSrc0, typename TileSrc1>
-PTO_INST RecordEvent TADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1);
-
-// Memory transfer: tile = GM view
-template <typename TileData, typename GlobalData, typename... WaitEvents>
-PTO_INST RecordEvent TLOAD(TileData& dst, GlobalData& src, WaitEvents&... events);
-
-// Matmul: acc = matmul(lhs, rhs)
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL(TileRes& cMatrix, TileLeft& aMatrix, TileRight& bMatrix,
-                             WaitEvents&... events);
-```
-
-## See Also
-
-- [Tile ISA reference](./tile/README.md) — Full tile family reference
-- [Tile families](./instruction-families/tile-families.md) — Family-level contracts
-- [Instruction families](./instruction-families/README.md) — All family groups
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page standard
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/tile-instructions_zh.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/tile-instructions_zh.md
deleted file mode 100644
index 4a24b0dd..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/tile-instructions_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Tile Instruction Surface
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](tile-instructions.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-- [中文章节手册指令集概述](../../../manual/07-instructions_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/vector-instructions.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/vector-instructions.md
deleted file mode 100644
index 5b41b348..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/vector-instructions.md
+++ /dev/null
@@ -1,170 +0,0 @@
-<!-- Generated from `docs/isa/instruction-surfaces/vector-instructions.md` -->
-
-# Vector Instruction Surface
-
-`pto.v*` is the vector micro-instruction surface of PTO ISA. It exposes the vector pipeline directly for fine-grained control over lane-level operations, vector registers, predicates, and alignment state.
-
-## Surface Overview
-
-Vector instructions are the fine-grained compute layer beneath tile instructions. While tile instructions operate on architecturally visible tiles with valid-region semantics, vector instructions operate on vector registers (`!pto.vreg<NxT>`), scalar values, and predicate masks. The full register width is always meaningful — there is no valid-region abstraction at the vector level.
-
-**Vector operands** (`!pto.vreg<NxT>`) represent fixed-length vector registers. The width `N` is determined by the element type:
-
-| Element Type | Vector Width N | Register Size |
-|-------------|:-------------:|:------------:|
-| f32 | 64 | 256 B |
-| f16, bf16 | 128 | 256 B |
-| i16, u16 | 128 | 256 B |
-| i8, u8 | 256 | 256 B |
-| f8e4m3, f8e5m2 | 256 | 256 B |
-
-### Data Flow
-
-```
-UnifiedBuffer (UB)
-    │
-    │  vlds / vsld / vgather2 (UB → vreg)
-    ▼
-Vector Registers (!pto.vreg<NxT>) ──► Vector Compute (pto.v*) ──► Vector Registers
-    │                                                         │
-    │  vsts / vsst / vscatter (vreg → UB)                   │
-    └─────────────────────────────────────────────────────────┘
-                    │
-                    ▼
-             UnifiedBuffer (UB) ──► copy_ubuf_to_gm ──► GlobalMemory
-```
-
-## Instruction Classes
-
-| Class | Description | Examples |
-|-------|-------------|----------|
-| Vector Load/Store | UB↔vector register transfer with distribution modes | `vlds`, `vldas`, `vldus`, `vgather2`, `vsld`, `vsst`, `vscatter` |
-| Predicate and Materialization | Vector broadcast and duplication | `vbr`, `vdup` |
-| Unary Vector Ops | Single-operand lane-wise operations | `vabs`, `vneg`, `vexp`, `vsqrt`, `vrec`, `vrelu`, `vnot` |
-| Binary Vector Ops | Two-operand lane-wise operations | `vadd`, `vsub`, `vmul`, `vdiv`, `vmax`, `vmin` |
-| Vec-Scalar Ops | Vector combined with scalar operand | `vadds`, `vmuls`, `vshls`, `vlrelu` |
-| Conversion Ops | Type conversion between numeric types | `vci`, `vcvt`, `vtrc` |
-| Reduction Ops | Cross-lane reductions (channelled) | `vcadd`, `vcmax`, `vcmin`, `vcgadd`, `vcgmax` |
-| Compare and Select | Comparison and conditional lane selection | `vcmp`, `vcmps`, `vsel`, `vselr`, `vselrv2` |
-| Data Rearrangement | Lane permutation, interleaving, packing | `vintlv`, `vdintlv`, `vslide`, `vshift`, `vpack`, `vzunpack` |
-| SFU and DSA Ops | Special function units and DSA-style operations | `vprelu`, `vexpdiff`, `vaxpy`, `vtranspose`, `vsort32` |
-
-## Inputs
-
-Vector instructions consume combinations of:
-
-- Vector registers (`!pto.vreg<NxT>`)
-- Scalar registers or immediate operands
-- Predicate masks (`!pto.mask`) — selects which lanes participate
-- Memory addresses (`!pto.ptr<T, ub>`) — for load/store ops
-- Rounding-mode or distribution-mode attributes
-
-## Expected Outputs
-
-Vector instructions produce:
-
-- Vector register payloads
-- Scalar register values (e.g., reduction results)
-- Memory writes (via vector store)
-- Predicate masks (from compare operations)
-
-## Side Effects
-
-Most vector instructions are pure compute operations with no architectural side effects. Side-effecting vector instructions include:
-
-| Class | Side Effects |
-|-------|-------------|
-| Vector Load/Store | Reads from or writes to UB-visible memory |
-| Compare and Select | Produces predicate mask consumed by subsequent ops |
-
-## Mask Behavior
-
-Vector operations can be gated by a predicate mask. A predicate mask (`!pto.mask`) with width equal to the vector length `N` selects which lanes participate:
-
-- Lanes where the mask bit is **1**: the operation executes normally.
-- Lanes where the mask bit is **0**: the operation produces a **defined result** but the specific value depends on the operation:
-  - Arithmetic ops: masked lanes produce the identity element (e.g., 0 for add, 1 for mul).
-  - Load ops: masked lanes leave the destination register unchanged.
-  - Store ops: masked lanes do not write to memory.
-
-Programs MUST NOT rely on the identity-element behavior of masked lanes unless the operation explicitly documents it.
-
-## Alignment State
-
-Vector unaligned store operations (`vstu`, `vstus`, `vstur`) maintain an alignment state that evolves across successive stores. The state consists of an alignment offset that is updated after each store:
-
-```
-%align_out, %offset_out = pto.vstu %align_in, %offset_in, %value, %base
-    : !pto.align, index, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align, index
-```
-
-A trailing flush form (`vstar` / `vstas`) is required to commit any buffered tail bytes. These ops are **A5-only**.
-
-## Constraints
-
-- **Vector length** (`N`) is determined by the element type (see table above); programs do not choose `N` directly.
-- **Predicate width** must match the vector length `N` for predicate-gated operations.
-- **Alignment requirements** vary by operation and target profile:
-  - `vlds` / `vsld`: A5 requires 32B alignment for NORM mode; other profiles may be more permissive.
-  - `vstu` / `vstus` / `vstur`: **A5-only**; not supported on CPU or A2/A3.
-- **Type combinations** for conversion and arithmetic operations are defined per-op.
-- No implicit type promotion: all operands must have compatible types.
-
-## Cases That Are Not Allowed
-
-- Using a predicate mask whose width does not match the target vector length.
-- Accessing memory with an illegal alignment for the target profile.
-- Relying on undefined lane behavior when predicates mask some lanes (must not depend on the identity-element value).
-- Using `vstu` / `vstus` / `vstur` on CPU or A2/A3 (A5-only ops).
-- Mixing vector types within a single operation unless the operation explicitly supports it.
-- Issuing a vector store before the corresponding DMA copy is complete without an intervening `wait_flag`.
-
-## Syntax
-
-### Assembly Form (PTO-AS)
-
-```asm
-vadd %vdst, %vsrc0, %vsrc1 : !pto.vreg<f32, 64>
-vlds %vreg, %ub_ptr[%offset] {dist = "NORM"} : !pto.ptr<f32, ub>
-```
-
-### SSA Form (AS Level 1)
-
-```mlir
-%vdst = pto.vadd %vsrc0, %vsrc1, %mask
-    : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32>
-```
-
-### DPS Form (AS Level 2)
-
-```mlir
-pto.vadd ins(%vsrc0, %vsrc1, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask)
-          outs(%vdst : !pto.vreg<64xf32>)
-```
-
-See [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md) for the full syntax specification.
-
-## C++ Intrinsic
-
-Vector instructions are available as C++ intrinsics declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Binary vector addition
-PTO_INST void VADD(VecDst& dst, VecSrc0& src0, VecSrc1& src1);
-
-// Vector load
-PTO_INST void VLDS(VecData& dst, PtrType addr);
-
-// Masked vector load
-PTO_INST void VLDS(VecData& dst, PtrType addr, MaskType pred);
-```
-
-## See Also
-
-- [Vector ISA reference](./vector/README.md) — Full vector family reference
-- [Vector families](./instruction-families/vector-families.md) — Family-level contracts
-- [Instruction families](./instruction-families/README.md) — All family groups
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page standard
diff --git a/docs/mkdocs/src/docs/isa/instruction-surfaces/vector-instructions_zh.md b/docs/mkdocs/src/docs/isa/instruction-surfaces/vector-instructions_zh.md
deleted file mode 100644
index 6aa3a16a..00000000
--- a/docs/mkdocs/src/docs/isa/instruction-surfaces/vector-instructions_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Vector Instruction Surface
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vector-instructions.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-- [中文章节手册指令集概述](../../../manual/07-instructions_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/introduction/README_zh.md b/docs/mkdocs/src/docs/isa/introduction/README_zh.md
deleted file mode 100644
index 28cf0528..00000000
--- a/docs/mkdocs/src/docs/isa/introduction/README_zh.md
+++ /dev/null
@@ -1,26 +0,0 @@
-<!-- Generated from `docs/isa/introduction/README_zh.md` -->
-
-# 引言
-
-本章回答三个根本问题：PTO 是什么、为何存在、以及 PTO ISA 的版本基线与规范边界。
-
-## 本章内容
-
-- [什么是 PTO 虚拟 ISA](what-is-pto-visa.md) — PTO 的定位、tile-first 设计理念与层级抽象
-- [PTO 的设计目标](goals-of-pto.md) — PTO 追求的设计目标
-- [PTO ISA 版本 1.0](pto-isa-version-1-0.md) — 版本 1.0 基线的关键决策
-- [范围与边界](design-goals-and-boundaries.md) — PTO ISA 的范围、与 PTO-AS 和 PTOBC 的边界划分
-- [文档结构](document-structure.md) — 章节地图与阅读顺序
-
-## 阅读建议
-
-建议按以下顺序阅读：
-
-1. 先读[什么是 PTO 虚拟 ISA](what-is-pto-visa.md)，理解 PTO 的定位
-2. 再读[PTO 的设计目标](goals-of-pto.md)，理解设计意图
-3. 然后读[文档结构](document-structure.md)，了解整个手册的章节划分
-4. 最后读[范围与边界](design-goals-and-boundaries.md)和[PTO ISA 版本 1.0](pto-isa-version-1-0.md)，了解版本基线和规范边界
-
-## 章节定位
-
-本章属于手册的第 1 章，建立 PTO 的概念基础，为后续章节的阅读做好准备。
diff --git a/docs/mkdocs/src/docs/isa/introduction/design-goals-and-boundaries.md b/docs/mkdocs/src/docs/isa/introduction/design-goals-and-boundaries.md
deleted file mode 100644
index 69e90ea2..00000000
--- a/docs/mkdocs/src/docs/isa/introduction/design-goals-and-boundaries.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/introduction/design-goals-and-boundaries.md` -->
-
-# Scope And Boundaries
-
-This page defines the scope of the PTO ISA specification and the boundary between core ISA guarantees and neighboring layers.
-
-## What PTO ISA Defines
-
-PTO ISA defines the architecture-visible meaning of legal PTO programs. In this manual that includes:
-
-- the semantics of `pto.t*`, `pto.v*`, `pto.*`, and other architecture-visible operations
-- the programming model for tiles, GlobalTensor objects, events, and explicit synchronization
-- the machine model and memory-ordering rules that make execution visible to programmers, simulators, and backends
-- the legality surface that must remain stable across CPU simulation and supported Ascend NPU targets
-
-If two supported targets both accept the same legal PTO program, the architecture-visible meaning of that program shall come from PTO ISA and shall not be redefined by target-specific interpretation.
-
-## What Target Profiles May Narrow
-
-PTO ISA is stable, but it is not unlimited. Target profiles may narrow the accepted or efficient subset for a particular implementation.
-
-For example, a target profile may restrict:
-
-- tile shapes or tile ranks
-- data types and layout combinations
-- specific vector micro-instruction forms
-- synchronization variants or memory spaces
-- instruction subsets tied to a hardware generation
-
-Those restrictions narrow the accepted or efficient subset on that target. They do not change the meaning of PTO ISA itself.
-
-## What PTO-AS Adds
-
-PTO-AS is the textual syntax for PTO ISA. It adds exact spelling for:
-
-- instruction names
-- operand order
-- attributes and modifiers
-- textual conventions needed for parsing, assembly, and round-tripping
-
-PTO-AS is therefore part of the expression of PTO ISA. It is not a second architecture with different semantics.
-
-## What PTOBC Adds
-
-PTOBC is the distribution and transport form for PTO programs. It exists so PTO code can be packaged, cached, shipped in middleware, and handed between tools without collapsing directly to one hardware generation.
-
-PTOBC does not redefine the ISA. It carries PTO programs in serialized form.
-
-## What PTO ISA Does Not Freeze
-
-This manual does not define every compiler-internal stage or backend lowering step as part of the public contract. PTO ISA does not freeze:
-
-- compiler-internal IR structure
-- pass ordering
-- backend-specific scheduling strategy
-- hardware-private pipeline internals
-- binary encodings of native hardware instructions
-
-Those details belong to compilers, assemblers, runtime systems, and target-specific backend documentation.
-
-## Source Of Truth Order
-
-When the specification boundary is unclear, use the following order of authority:
-
-1. the PTO ISA manual and per-op ISA pages
-2. the legal instruction surface exposed by code and verification
-3. PTO-AS and PTOBC documentation for syntax and distribution rules
-4. backend-profile notes for target-specific narrowing
-
-If a backend depends on behavior that is not covered by that authority chain, that behavior is a backend requirement and not yet a PTO ISA guarantee.
diff --git a/docs/mkdocs/src/docs/isa/introduction/design-goals-and-boundaries_zh.md b/docs/mkdocs/src/docs/isa/introduction/design-goals-and-boundaries_zh.md
deleted file mode 100644
index 924d2b2a..00000000
--- a/docs/mkdocs/src/docs/isa/introduction/design-goals-and-boundaries_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Scope And Boundaries
-
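-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](design-goals-and-boundaries.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: overview](../../../manual/01-overview_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Selecting this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.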
diff --git a/docs/mkdocs/src/docs/isa/introduction/document-structure.md b/docs/mkdocs/src/docs/isa/introduction/document-structure.md
deleted file mode 100644
index 6f72d225..00000000
--- a/docs/mkdocs/src/docs/isa/introduction/document-structure.md
+++ /dev/null
@@ -1,29 +0,0 @@
-<!-- Generated from `docs/isa/introduction/document-structure.md` -->
-
-# Document Structure
-
-This manual is organized as a layered architecture reference: establish the programming and machine models first, then syntax and types, then memory rules, then the instruction set. The chapter roles stay fixed so readers can locate model rules before opcode detail; the content remains specific to PTO's tile-first Ascend model.
-
-## Chapter Map
-
-| Chapter | PTO manual section | What it covers |
-| --- | --- | --- |
-| 1. Introduction | [Introduction](./what-is-pto-visa.md), [Goals Of PTO](./goals-of-pto.md), [PTO ISA Version 1.0](./pto-isa-version-1-0.md), [Scope And Boundaries](./design-goals-and-boundaries.md) | What PTO is, why it exists, version baseline, and specification boundaries versus PTO-AS and PTOBC. |
-| 2. Programming model | [Tiles and valid regions](../programming-model/tiles-and-valid-regions.md), [GlobalTensor and data movement](../programming-model/globaltensor-and-data-movement.md), [Auto vs Manual](../programming-model/auto-vs-manual.md) | Tiles, valid regions, global tensor views, Auto versus Manual—the objects authors reason about. |
-| 3. Machine model | [Machine model](../machine-model/execution-agents.md) | Execution agents, pipelines, target profiles, ordering vocabulary. |
-| 4. Syntax | [Assembly model](../syntax-and-operands/assembly-model.md), [operands and attributes](../syntax-and-operands/operands-and-attributes.md), [common conventions](../conventions.md) | Textual spelling, operand shapes, attributes, and shared naming conventions. |
-| 5. State, types, and location | [Type system](../state-and-types/type-system.md), [location intent](../state-and-types/location-intent-and-legality.md) | Types, tile roles, location intent, legality. |
-| 6. Memory consistency | [Memory model](../memory-model/consistency-baseline.md) | Visibility and ordering rules for producers and consumers. |
-| 7. Instruction set | [Instruction surfaces](../instruction-surfaces/README.md), [instruction families](../instruction-families/README.md), per-op reference under `tile/`, `vector/`, `scalar/`, `other/` | Surfaces and families first, then opcode-level pages. |
-| 8. Supporting reference | [Reference notes](../reference/README.md) | Glossary, diagnostics, portability, source-of-truth order, [format of instruction descriptions](../reference/format-of-instruction-descriptions.md). |
-
-Chapters 4–6 are the usual “read before you deep-dive opcodes” path. Chapter 7 is the bulk of the opcode reference.
-
-## PTO-Specific Reading Notes
-
-PTO is built around **tiles**, **valid regions**, and **explicit synchronization**. Read model chapters first, then syntax and state, then memory, then per-op pages. This keeps architecture guarantees separate from backend profile narrowing and avoids treating examples as standalone contracts.
-
-## See Also
-
-- [PTO ISA hub](../README.md)
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md)
diff --git a/docs/mkdocs/src/docs/isa/introduction/goals-of-pto.md b/docs/mkdocs/src/docs/isa/introduction/goals-of-pto.md
deleted file mode 100644
index 91a2834e..00000000
--- a/docs/mkdocs/src/docs/isa/introduction/goals-of-pto.md
+++ /dev/null
@@ -1,21 +0,0 @@
-<!-- Generated from `docs/isa/introduction/goals-of-pto.md` -->
-
-# Goals Of PTO
-
-This page lists what PTO is trying to achieve in the Ascend stack. It complements the narrative introduction in [What Is PTO VISA](./what-is-pto-visa.md) and the normative scope statement in [Scope And Boundaries](./design-goals-and-boundaries.md).
-
-PTO is meant to solve a small set of important problems in the Ascend software stack.
-
-- Keep the instruction set stable across multiple Ascend NPU generations. Hardware changes from generation to generation, but low-level software still needs one instruction language that does not have to be reinvented every time the machine changes.
-- Preserve performance that is comparable to native NPU software. PTO is not meant to hide the machine behind a generic compute API. It keeps tile shape, data movement, synchronization, vector micro-instructions, and scalar control visible because those details often decide whether a kernel is merely correct or actually fast.
-- Give C, C++, Python, and other frontends one machine-independent target. The same applies to tile-based systems and code generators such as TileLang and PyPTO. They should be able to target PTO instead of learning a separate low-level contract for each NPU generation.
-- Provide a distribution form through PTOBC. Applications and middleware need a way to cache, package, and transport PTO programs without collapsing them immediately into one target-specific binary format.
-- Give optimizing code generators and translators a common source-level ISA. PTO is the place where legalization, transformation, specialization, and verification can be shared before the final mapping to a particular hardware generation.
-- Support hand-written libraries, performance kernels, and architecture tests. PTO is not only for compiler output. It also needs to be explicit and readable enough for people who write or inspect low-level code directly.
-- Scale from a single NPU unit to many parallel units. Parallel execution, explicit synchronization, and machine-visible data movement are part of the model from the start, not features bolted on later.
-
-## See Also
-
-- [What Is PTO VISA](./what-is-pto-visa.md)
-- [Scope And Boundaries](./design-goals-and-boundaries.md)
-- [PTO ISA Version 1.0](./pto-isa-version-1-0.md)
diff --git a/docs/mkdocs/src/docs/isa/introduction/pto-isa-version-1-0.md b/docs/mkdocs/src/docs/isa/introduction/pto-isa-version-1-0.md
deleted file mode 100644
index 24c0e3ea..00000000
--- a/docs/mkdocs/src/docs/isa/introduction/pto-isa-version-1-0.md
+++ /dev/null
@@ -1,126 +0,0 @@
-<!-- Generated from `docs/isa/introduction/pto-isa-version-1-0.md` -->
-
-# PTO ISA Version 1.0
-
-This page records the instruction inventory and architecture surface defined for PTO ISA Version 1.0. It is the release baseline for this manual and is intended to serve as the reference point for future release notes.
-
-## Version 1.0 Scope
-
-PTO ISA Version 1.0 defines three named instruction surfaces with explicit per-op reference pages:
-
-- **Tile instructions**: `pto.t*` operations together with `pto.mgather` and `pto.mscatter`
-- **Vector micro instructions**: `pto.v*` operations
-- **Scalar and control instructions**: `pto.*` operations used for synchronization, DMA control, predicate construction, and related machine-visible control
-
-Version 1.0 also includes supporting reference material in the `other/` tree for communication/runtime and non-ISA supporting operations.
-
-## Version 1.0 Inventory Summary
-
-PTO ISA Version 1.0 currently documents:
-
-- **120** tile instructions
-- **99** vector micro instructions
-- **44** scalar and control instructions
-
-That yields **263 named instructions** in the Version 1.0 reference set, excluding non-ISA/supporting reference pages.
-
-## Tile Instruction Inventory
-
-### Sync And Config
-
-`tsync`, `tassign`, `tsethf32mode`, `tsettf32mode`, `tsetfmatrix`, `tset_img2col_rpt`, `tset_img2col_padding`, `tsubview`, `tget_scale_addr`
-
-### Elementwise Tile-Tile
-
-`tabs`, `tadd`, `taddc`, `tand`, `tcmp`, `tcvt`, `tdiv`, `texp`, `tfmod`, `tlog`, `tmax`, `tmin`, `tmul`, `tneg`, `tnot`, `tor`, `tprelu`, `trecip`, `trelu`, `trem`, `trsqrt`, `tsel`, `tshl`, `tshr`, `tsqrt`, `tsub`, `tsubc`, `txor`
-
-### Tile-Scalar And Immediate
-
-`tadds`, `taddsc`, `tands`, `tcmps`, `tdivs`, `texpands`, `tfmods`, `tlrelu`, `tmaxs`, `tmins`, `tmuls`, `tors`, `trems`, `tsels`, `tshls`, `tshrs`, `tsubs`, `tsubsc`, `txors`
-
-### Reduce And Expand
-
-`tcolexpand`, `tcolexpandadd`, `tcolexpanddiv`, `tcolexpandexpdif`, `tcolexpandmax`, `tcolexpandmin`, `tcolexpandmul`, `tcolexpandsub`, `tcolmax`, `tcolmin`, `tcolprod`, `tcolsum`, `trowargmax`, `trowargmin`, `trowexpand`, `trowexpandadd`, `trowexpanddiv`, `trowexpandexpdif`, `trowexpandmax`, `trowexpandmin`, `trowexpandmul`, `trowexpandsub`, `trowmax`, `trowmin`, `trowsum`
-
-### Memory And Data Movement
-
-`tload`, `tprefetch`, `tstore`, `tstore_fp`, `mgather`, `mscatter`
-
-### Matrix And Matrix-Vector
-
-`tgemv`, `tgemv_acc`, `tgemv_bias`, `tgemv_mx`, `tmatmul`, `tmatmul_acc`, `tmatmul_bias`, `tmatmul_mx`
-
-### Layout And Rearrangement
-
-`textract`, `textract_fp`, `tfillpad`, `tfillpad_expand`, `tfillpad_inplace`, `timg2col`, `tinsert`, `tinsert_fp`, `tmov`, `tmov_fp`, `treshape`, `ttrans`
-
-### Irregular And Complex
-
-`tci`, `tgather`, `tgatherb`, `tmrgsort`, `tpartadd`, `tpartmax`, `tpartmin`, `tpartmul`, `tprint`, `tquant`, `tscatter`, `tsort32`, `ttri`
-
-## Vector Micro-Instruction Inventory
-
-### Vector Load-Store
-
-`vgather2`, `vgather2_bc`, `vgatherb`, `vldas`, `vlds`, `vldus`, `vldx2`, `vscatter`, `vsld`, `vsldb`, `vsst`, `vsstb`, `vsta`, `vstar`, `vstas`, `vsts`, `vstu`, `vstur`, `vstus`, `vstx2`
-
-### Predicate And Materialization
-
-`vbr`, `vdup`
-
-### Unary Vector Operations
-
-`vabs`, `vbcnt`, `vcls`, `vexp`, `vln`, `vmov`, `vneg`, `vnot`, `vrec`, `vrelu`, `vrsqrt`, `vsqrt`
-
-### Binary Vector Operations
-
-`vadd`, `vaddc`, `vand`, `vdiv`, `vmax`, `vmin`, `vmul`, `vor`, `vshl`, `vshr`, `vsub`, `vsubc`, `vxor`
-
-### Vector-Scalar Operations
-
-`vaddcs`, `vadds`, `vands`, `vlrelu`, `vmaxs`, `vmins`, `vmuls`, `vors`, `vshls`, `vshrs`, `vsubcs`, `vsubs`, `vxors`
-
-### Conversion Operations
-
-`vci`, `vcvt`, `vtrc`
-
-### Reduction Operations
-
-`vcadd`, `vcgadd`, `vcgmax`, `vcgmin`, `vcmax`, `vcmin`, `vcpadd`
-
-### Compare And Select
-
-`vcmp`, `vcmps`, `vsel`, `vselr`, `vselrv2`
-
-### Data Rearrangement
-
-`vdintlv`, `vdintlvv2`, `vintlv`, `vintlvv2`, `vpack`, `vperm`, `vshift`, `vslide`, `vsqz`, `vsunpack`, `vusqz`, `vzunpack`
-
-### SFU And DSA Operations
-
-`vaddrelu`, `vaddreluconv`, `vaxpy`, `vexpdiff`, `vmrgsort`, `vmula`, `vmulconv`, `vmull`, `vprelu`, `vsort32`, `vsubrelu`, `vtranspose`
-
-## Scalar And Control Instruction Inventory
-
-### Pipeline Sync
-
-`get_buf`, `mem_bar`, `pipe_barrier`, `rls_buf`, `set_cross_core`, `set_flag`, `set_intra_block`, `wait_flag`, `wait_flag_dev`, `wait_intra_core`
-
-### DMA Copy
-
-`copy_gm_to_ubuf`, `copy_ubuf_to_gm`, `copy_ubuf_to_ubuf`, `set_loop_size_outtoub`, `set_loop_size_ubtoout`, `set_loop1_stride_outtoub`, `set_loop1_stride_ubtoout`, `set_loop2_stride_outtoub`, `set_loop2_stride_ubtoout`
-
-### Predicate Load-Store
-
-`pld`, `pldi`, `plds`, `pst`, `psti`, `psts`, `pstu`
-
-### Predicate Generation And Algebra
-
-`pand`, `pdintlv_b8`, `pge_b16`, `pge_b32`, `pge_b8`, `pintlv_b16`, `plt_b16`, `plt_b32`, `plt_b8`, `pnot`, `por`, `ppack`, `psel`, `pset_b16`, `pset_b32`, `pset_b8`, `punpack`, `pxor`
-
-## Supporting Reference Groups
-
-The Version 1.0 manual also includes the following supporting groups under `docs/isa/other/`:
-
-- `communication-and-runtime`
-- `non-isa-and-supporting-ops`
diff --git a/docs/mkdocs/src/docs/isa/introduction/what-is-pto-visa.md b/docs/mkdocs/src/docs/isa/introduction/what-is-pto-visa.md
deleted file mode 100644
index e648a987..00000000
--- a/docs/mkdocs/src/docs/isa/introduction/what-is-pto-visa.md
+++ /dev/null
@@ -1,411 +0,0 @@
-<!-- Generated from `docs/isa/introduction/what-is-pto-visa.md` -->
-
-# Parallel Tile Operation ISA
-
-## Overview
-
-**PTO ISA** (Parallel Tile Operation Instruction Set Architecture) defines a machine-independent ISA for Huawei Ascend NPU software. PTO ISA provides a stable low-level programming contract above generation-specific hardware instruction sets, serving as the assembly-language layer of the PTO software stack.
-
-PTO ISA is not the native binary ISA of any single Ascend implementation. It defines the architecture-visible meaning of legal PTO programs and the instruction vocabulary shared by frontends, code generators, verifiers, simulators, and target backends.
-
-## Why Tile-First
-
-Most Ascend kernels are authored in terms of **tiles** — bounded multi-dimensional array fragments with layout and valid-region metadata — not anonymous lanes or opaque buffers. A generic SIMD or SIMT model can describe the hardware eventually, but it pushes the important questions into backend-specific folklore:
-
-- Shape and layout legality
-- Which elements are meaningful (valid regions)
-- When two tiles may alias
-- Where synchronization must appear
-
-PTO lifts these questions into the ISA so programs, verifiers, and backends share one testable, portable contract.
-
-See [Goals Of PTO](./goals-of-pto.md) for product goals and [Tiles And Valid Regions](../programming-model/tiles-and-valid-regions.md) for how tiles work in programs.
-
-## Two Compilation Flows
-
-PTO programs can be compiled to hardware through two supported paths. Both paths share the same PTO instruction semantics; they differ in how the final binary is produced.
-
-### Flow A: High-Level Compile (ptoas → C++ → bisheng → binary)
-
-High-level frontends (TileLang, PyPTO, custom DSLs) emit PTO programs as `.pto` text files. The `ptoas` tool parses, validates, and lowers these to C++ code that calls the `pto-isa` C++ library. A backend compiler (bisheng) then compiles this C++ to the target binary.
-
-```
-High-level Frontend
-(TileLang, PyPTO, C/C++, ...)
-        │
-        ▼
-   .pto file
-   (PTO program text)
-        │
-        ▼
-   ptoas
-   (PTO assembler & optimizer)
-   ┌─────────────────────────────────────┐
-   │ Parse, validate, optimize            │
-   │ Lower PTO instructions to C++ calls  │
-   │ Insert synchronization (auto-sync)   │
-   └─────────────────────────────────────┘
-        │
-        ▼
-   C++ kernel code
-   (calls pto-isa C++ intrinsics)
-        │
-        ▼
-   bisheng (or backend C++ compiler)
-   ┌─────────────────────────────────────┐
-   │ Compile to target binary             │
-   │ Target: A2/A3 (Ascend 2/3-class)   │
-   │ Target: A5 (Ascend 9xx-class)       │
-   │ Target: CPU simulator                │
-   └─────────────────────────────────────┘
-        │
-        ▼
-   Binary
-```
-
-**Who uses this flow:** Compiler developers, library authors, high-level framework integrators. The `.pto` text format is portable, and PTO programs can also be cached and distributed as PTOBC bytecode.
-
-### Flow B: Direct Assemble (ptoas → binary)
-
-PTO programs can also be assembled directly to binary via `ptoas` with an appropriate backend target. This bypasses the C++ intermediate step.
-
-```
-High-level Frontend
-(TileLang, PyPTO, C/C++, ...)
-        │
-        ▼
-   .pto file
-   (PTO program text)
-        │
-        ▼
-   ptoas --target=a3|a5|cpu
-   ┌─────────────────────────────────────┐
-   │ Parse, validate, lower to binary     │
-   │ Directly emit target instructions    │
-   └─────────────────────────────────────┘
-        │
-        ▼
-   Binary
-```
-
-**Who uses this flow:** Performance engineers who need direct control over the final instruction stream, or toolchains that embed `ptoas` as a pure assembler without a full C++ toolchain.
-
-### Which Flow to Use
-
-| Criterion | Flow A (ptoas → C++ → bisheng) | Flow B (ptoas → binary) |
-|-----------|--------------------------------|--------------------------|
-| Debugging | Full C++ debugging available | Binary only |
-| Portability | C++ code is source portable | Binary is target-specific |
-| Integration | Easy with existing C++ codebases | Requires custom binary packaging |
-| Performance | Depends on C++ compiler | Direct, predictable instruction stream |
-| Typical user | Library authors, compiler devs | Kernel engineers, performance tuners |
-
-## A Minimal Example
-
-The smallest end-to-end PTO program loads two tiles from global memory, adds them element-wise, and stores the result:
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void vec_add(GlobalTensor<float>& gc, const GlobalTensor<float>& ga,
-             const GlobalTensor<float>& gb) {
-    Tile<float, 16, 16> a, b, c;
-    TLOAD(a, ga);           // Load from global memory
-    TLOAD(b, gb);           // Load from global memory
-    TADD(c, a, b);          // Element-wise addition
-    TSTORE(gc, c);          // Store to global memory
-}
-```
-
-Even this fragment depends on valid regions, dtype and layout rules, and explicit data movement — ideas the manual unpacks in the programming model, machine model, and per-instruction reference.
-
-## Key Terms
-
-| Term | Definition |
-|------|------------|
-| **PTO** | The programming and instruction model built around tiles, explicit data movement, explicit synchronization, and machine-visible execution structure |
-| **PTO ISA** | The instruction set architecture defined by this manual |
-| **PTO-AS** | The textual assembly syntax for PTO ISA (e.g., `tadd %dst, %src0, %src1`) |
-| **ptoas** | The assembler and optimizer tool that parses `.pto` files and lowers them to C++ or directly to binary |
-| **PTOBC** | The bytecode representation used to package PTO programs for transport, caching, and distribution |
-| **Tile** | A bounded multi-dimensional array fragment with shape, layout, and valid-region metadata that is architecturally visible |
-| **Valid Region** | The subset of a tile's declared shape that contains meaningful data, expressed as `(Rv, Cv)` — valid rows and valid columns |
-| **Global Memory (GM)** | Off-chip device memory (`__gm__` address space) shared by all blocks and accessible via `GlobalTensor` views |
-| **Unified Buffer (UB)** | On-chip local memory (`!pto.ptr<T, ub>`) visible to a single AI Core; the staging ground for GM↔tile data movement |
-| **Tile Buffer** | On-chip storage for a single Tile, partitioned by `TileType`: `Vec` (vector compute), `Mat` (matrix/CUBE compute), `Acc` (accumulator), `Scalar` (scalar tile) |
-| **Location Intent** | The declared role of a tile operand: `Left` (LHS of matmul), `Right` (RHS), `Acc` (accumulator/output), `Vec` (general vector tile) |
-| **Block Layout (BLayout)** | The in-memory storage order of a tile: `RowMajor` (row-major, C-contiguous) or `ColMajor` (column-major, Fortran-contiguous) |
-| **Stripe Layout (SLayout)** | The layout of sub-elements within a tile: `NoneBox` (uniform rectangular), `RowMajor` (fractal/strided), `ColMajor` (fractal/strided) |
-| **Fractal Layout** | A strided layout encoding non-uniform strides for 2D tiles: `NZ` (row-major fractal), `ZN` (col-major fractal), `FR` (row-fractal), `RN` (row-N-fractal) |
-| **TileType** | Classification of tile buffer role: `Vec` (vector pipe), `Mat` (matrix/CUBE pipe), `Acc` (accumulator), `Scalar` (scalar tile), `Left`/`Right` (matmul operands) |
-| **MTE** | DMA engine sub-unit: `MTE1` (GM→UB, prefetch), `MTE2` (GM→UB, load staging), `MTE3` (UB→GM, store) |
-| **Target Profile** | A concrete instantiation of PTO ISA for a specific backend: `CPU` (reference simulator), `A2/A3` (Ascend 2/3-class), `A5` (Ascend 9xx-class) |
-| **Instruction Surface** | One of the four ISA surfaces: `pto.t*` (tile surface), `pto.v*` (vector micro-instruction surface), `pto.*` (scalar/control surface), collective ops (communication surface) |
-| **pto.t*** | The tile compute surface (`pto.tadd`, `pto.tmul`, etc.) that operates on tile buffers |
-| **pto.v*** | The low-level vector micro-instruction surface (`pto.vadd`, `pto.vmul`, etc.) that operates on vector registers after an explicit GM→UB→vector data flow |
-| **Element Type** | The dtype of a tile's elements: floating-point (`f16`, `bf16`, `f32`, `f8e4m3`, `f8e5m2`), integer (`i8`–`i64`, `u8`–`u64`), or specialized (`hifloat8_t`, `float4_e*`) |
-| **Auto Mode** | Execution mode where the compiler/runtime automatically inserts `TASSIGN`, `TSYNC`, and data-movement operations |
-| **Manual Mode** | Execution mode where the author explicitly binds tile resources with `TASSIGN` and manages synchronization explicitly |
-| **pto.tget / TGET** | Inter-NPU remote read: reads data from a remote NPU's GM to local GM. Both spellings (`pto.tget` in IR, `TGET` in C++) refer to the same operation. |
-
-## Position In The Software Stack
-
-PTO ISA sits between source-level frontends and target-specific lowering. Frontends and code generators target PTO ISA; target backends lower PTO ISA to CPU simulation or to supported Ascend NPU targets.
-
-```
-Source Languages
-(C/C++, Python, TileLang, PyPTO, code generators)
-        │
-        ▼
-   PTO instructions (.pto text)
-        │
-        ├──► ptoas ──► C++ ──► bisheng ──► binary  (Flow A)
-        │
-        └──► ptoas ──────────────────► binary        (Flow B)
-
-Targets: CPU simulation / A2-A3 / A5 / future Ascend NPUs
-```
-
-This structure gives the software stack one versioned instruction language even when native hardware instruction sets and low-level programming rules change across generations.
-
-## Hierarchical Abstractions
-
-PTO ISA uses **hierarchical abstractions** rather than one flat opcode space. The ISA is organized into four instruction surfaces:
-
-```
-PTO ISA
-├── Tile Surface (pto.t*)              Primary tile-oriented compute surface
-│   ├── Sync and Config                Resource binding, event setup, mode control
-│   ├── Elementwise Tile-Tile           Lane-wise binary and unary operations
-│   ├── Tile-Scalar and Immediate       Tile combined with scalar or immediate
-│   ├── Reduce and Expand             Row/column reductions and expansions
-│   ├── Memory and Data Movement       GM↔tile transfer, gather/scatter
-│   ├── Matrix and Matrix-Vector       GEMV, matmul, and variants
-│   ├── Layout and Rearrangement       Reshape, transpose, extract, insert
-│   └── Irregular and Complex          Sort, quantize, print, and others
-│
-├── Vector Surface (pto.v*)             Micro-instruction surface for vector pipe
-│   ├── Vector Load/Store              Predicate-based vector memory access
-│   ├── Unary Vector Ops              abs, neg, exp, sqrt, rec, relu, not, etc.
-│   ├── Binary Vector Ops             add, sub, mul, div, max, min, shl, shr, etc.
-│   ├── Vec-Scalar Ops                Vector combined with scalar operands
-│   ├── Conversion Ops                Type conversion between numeric types
-│   ├── Reduction Ops                 Cross-lane reductions (cadd, cmax, etc.)
-│   ├── Compare and Select            Comparison and conditional selection
-│   ├── Data Rearrangement            Interleave, slide, shift, permute, pack
-│   └── SFU and DSA Ops              Special function units and DSA ops
-│
-├── Scalar and Control Surface (pto.*)  State setup and control shell
-│   ├── Pipeline Sync                 Event and barrier synchronization
-│   ├── DMA Copy                     GM↔UB memory transfer configuration
-│   ├── Predicate Load/Store         Mask-based scalar memory access
-│   ├── Predicate Generation         pset, pge, plt, pand, por, pxor, pnot, etc.
-│   └── Shared Arithmetic/SCF         Scalar arithmetic and structured control flow
-│
-└── Communication Surface (pto.*)       Collective and runtime operations
-    ├── Collective Communication        TBROADCAST, TGET, TPUT, TREDUCE, etc.
-    └── Supporting Operations          TALIAS, TAXPY, TCONCAT, TFREE, etc.
-```
-
-The **tile surface** is the primary programming surface. The **vector surface** exists for fine-grained vector-pipe control. The **scalar/control surface** sets up the execution shell around tile payload regions. The **communication surface** handles inter-rank communication and runtime support.
-
-## Machine Model
-
-PTO programs run on a hierarchical execution structure:
-
-```
-Grid (whole kernel invocation)
-└── Block  (AI Core / NPU)
-    ├── Host Interface
-    ├── Scalar Unit          (control flow, address calculation)
-    ├── Unified Buffer (UB)  256 KB on-chip SRAM
-    ├── Tile Registers       (16×16 tile slots, typed by TileType)
-    │   ├── Vec slots   ──► Vector Pipeline (V)
-    │   ├── Mat slots   ──► Matrix Multiply Unit (M / CUBE)
-    │   └── Acc slots   ──► Accumulator output
-    ├── DMA Engine
-    │   ├── MTE1: GM ──► UB  (GM→UB, prefetch)
-    │   ├── MTE2: GM ──► UB  (GM→UB, load staging)
-    │   └── MTE3: UB ──► GM  (UB→GM, store)
-    └── Vector Pipeline (V)  (unary/binary/reduce on vector regs)
-```
-
-### Execution Hierarchy
-
-| Level | Description | PTO Visibility |
-|-------|-------------|---------------|
-| **Grid** | Entire kernel invocation across all participating AI Cores | `GetBlockNum()`, `GetBlockIdx()` |
-| **Block** | Single AI Core with local UB, tile regs, and compute units | `GetSubBlockNum()`, `GetSubBlockIdx()` |
-| **Tile Buffer** | Per-core on-chip storage for one tile (typed by `TileType`) | `!pto.tile_buf<...>` |
-| **Vector Register** | Per-lane on-chip storage for vector compute (N lanes) | `!pto.vreg<NxT>` |
-| **Unified Buffer (UB)** | On-chip staging area shared by all tile buffers and vector regs | `!pto.ptr<T, ub>` |
-| **Global Memory (GM)** | Off-chip device memory shared by all AI Cores | `__gm__ T*`, `!pto.partition_tensor_view<...>` |
-
-### Target Profiles
-
-PTO ISA is instantiated by concrete **target profiles** that narrow the ISA to the capabilities of a specific backend. Profiles do NOT introduce new ISA semantics; they only restrict which subsets are available.
-
-| Feature | CPU Simulator | A2/A3 Profile | A5 Profile |
-|---------|:------------:|:-------------:|:----------:|
-| Tile surface (`pto.t*`) | Full | Full | Full |
-| Vector surface (`pto.v*`) | Emulated | Emulated | Full |
-| Matmul / CUBE ops | Software fallback | Hardware | Hardware |
-| MX format (int8→acc int32) | Not applicable | Not applicable | Supported |
-| Fractal layout (NZ/ZN/FR/RN) | Simulated | Simulated | Full |
-| UB size | Configurable | 256 KB/core | 256 KB/core |
-| Vector width (f32 / f16,bf16 / i8) | N=64 / N=128 / N=256 | N=64 / N=128 / N=256 | N=64 / N=128 / N=256 |
-| FP8 types (e4m3 / e5m2) | Not supported | Not supported | Supported |
-| Vector unaligned store (`vstu`) | Not supported | Not supported | Supported |
-| Block-scoped collective comm | Not supported | Supported | Supported |
-
-## Instruction Syntax Overview
-
-PTO instructions use a consistent textual syntax. Three forms are commonly shown:
-
-### Assembly Form (PTO-AS)
-
-The human-readable assembly spelling — the preferred form for documentation and portable pseudocode:
-
-```asm
-# Scalar operand suffix: immediate added to each tile element
-tadds %dst, %src, 0x3F800000  : !pto.tile<f32, 16, 16>
-
-# Saturating carry variant
-taddc %dst, %src0, %src1       : !pto.tile<f32, 16, 16>
-
-# Tile with explicit memory operand: load from GlobalTensor view
-tload %tile, %gtensor[%r, %c]  : (!pto.tile<f32,16,16>, !pto.memref<f32,1x16x16x16>) -> !pto.tile<f32,16,16>
-```
-
-### SSA Form (AS Level 1)
-
-MLIR-style SSA form with explicit types and a named result:
-
-```mlir
-// Tile compute: element-wise addition
-%dst = pto.tadd %src0, %src1 : (!pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 16>) -> !pto.tile<f32, 16, 16>
-
-// Tile load: from GlobalTensor partition view
-%tile = pto.tload %mem : !pto.partition_tensor_view<1x1x1x16x16xf32> -> !pto.tile_buf<loc=vec, f32, 16, 16, RowMajor, NoneBox, None, Zero>
-
-// Scalar tile comparison
-%cmp = pto.tcmps %src, 0 : !pto.tile<f32, 16, 16>, i32 -> !pto.tile<predicate, 16, 16>
-```
-
-### DPS Form (AS Level 2)
-
-Functional-style form with explicit `ins(...)` and `outs(...)` blocks — closest to the C++ intrinsic surface:
-
-```mlir
-// Tile compute (DPS)
-pto.tadd ins(%src0, %src1 : !pto.tile_buf<f32, 16, 16>, !pto.tile_buf<f32, 16, 16>)
-          outs(%dst : !pto.tile_buf<f32, 16, 16>)
-
-// Tile load (DPS)
-pto.tload ins(%mem : !pto.partition_tensor_view<1x1x1x16x16xf32>)
-          outs(%tile : !pto.tile_buf<loc=vec, f32, 16, 16, RowMajor, NoneBox, None, Zero>)
-
-// Tile store (DPS)
-pto.tstore ins(%tile : !pto.tile_buf<f32, 16, 16>)
-          outs(%mem : !pto.partition_tensor_view<1x1x1x16x16xf32>)
-```
-
-See [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md) for the full syntax specification.
-
-## Tile Surface And Vector Surface
-
-PTO distinguishes two complementary data-flow paths from GM to computed result. Both are architecturally visible; neither is a backend-only detail.
-
-### Tile Surface (pto.t*)
-
-The tile surface operates on tile buffers directly. The complete data path is:
-
-```
-GM ──(MTE2)──► UB ──(implicit)──► Tile Buffer ──(Tile Compute)──► Tile Buffer ──(MTE3)──► GM
-                      │                                                              ▲
-                      └──(vlds/vsts on vector surface before/after tile ops)─────────┘
-```
-
-- `TLOAD` copies data from GM into a tile buffer (via MTE2 → UB → tile)
-- Tile compute (`TADD`, `TMATMUL`, etc.) operates directly on tile buffers
-- `TSTORE` copies data from a tile buffer to GM (via tile → UB → MTE3 → GM)
-- Valid regions, layout, and tile type are explicit at every step
-
-### Vector Surface (pto.v*)
-
-The vector surface operates on vector registers after an explicit UB staging step. The data path is:
-
-```
-GM ──(copy_gm_to_ubuf)──► UB ──(vlds)──► Vector Register ──(Vector Compute)──► Vector Register ──(vsts)──► UB ──(copy_ubuf_to_gm)──► GM
-```
-
-- `copy_gm_to_ubuf` / `copy_ubuf_to_gm`: DMA engine moves data between GM and UB
-- `vlds` / `vsld` / `vgather2`: Vector load brings data from UB into vector registers
-- Vector compute (`vadd`, `vmul`, etc.): operates on vector registers with predicate masking
-- `vsts` / `vsst` / `vscatter`: Vector store writes data from vector registers back to UB
-- An explicit `sync` or `set_flag` / `wait_flag` sequence establishes producer-consumer ordering between DMA and vector compute
-
-### When To Use Which Surface
-
-| Criteria | Tile Surface (`pto.t*`) | Vector Surface (`pto.v*`) |
-|----------|-------------------------|---------------------------|
-| Typical use | Dense tensor algebra, matmul, elementwise | Fine-grained vector-pipe control, per-lane masking |
-| Data movement | TLOAD/TSTORE (implicit tile↔UB) | copy_gm_to_ubuf / copy_ubuf_to_gm + vlds/vsts |
-| Synchronization | TSYNC, set_flag/wait_flag | set_flag/wait_flag on vector pipe, mem_bar |
-| Layout control | Via tile layout parameters | Via distribution mode (NORM, BRC, DS, etc.) |
-| Predicate support | No per-lane masking | Yes — `%mask : !pto.mask` on every vector op |
-| Target portability | All profiles | A5 hardware; emulated on CPU/A2/A3 |
-
-## Audience: Who Reads This Manual
-
-This manual serves two primary audiences with different needs:
-
-### Compiler Backend Developers
-
-You are building or maintaining a compiler that targets PTO ISA. You need to understand:
-
-- The complete instruction inventory and its legality rules
-- How PTO-AS maps to your backend's native instructions
-- Target profile restrictions (which ops are available on A2/A3 vs A5)
-- Layout constraints (which tile layouts are legal for which operations)
-- Synchronization contracts (when to insert `set_flag`/`wait_flag` pairs)
-- The two compilation flows and when to use each
-
-### Kernel Writers
-
-You are writing PTO programs directly, either in C++ (using `pto-isa` intrinsics) or in `.pto` text (using `ptoas`). You need to understand:
-
-- Tile and valid region semantics (what data is meaningful)
-- The tile surface programming model (TLOAD, TSTORE, TADD, TMATMUL, etc.)
-- GlobalTensor and memory layout (how data maps from GM to tiles)
-- Auto vs. Manual mode (when the compiler helps vs. when you control everything)
-- The synchronization model (TSYNC, set_flag/wait_flag, RecordEvent)
-- Collective communication (`pto.tbroadcast`, `pto.tget`, `pto.tput`) for multi-NPU kernels
-
-## Scope Of This Manual
-
-This manual defines:
-
-- The architecture-visible meaning of PTO instructions
-- The programming model, machine model, and memory model of PTO ISA
-- The distinction between tile, vector, scalar/control, and communication surfaces
-- The boundary between core ISA guarantees and target-profile restrictions
-
-This manual is written for:
-
-- Library and kernel authors
-- Compiler and code generator developers
-- Backend and runtime implementers
-- Performance engineers
-- Architecture and conformance test authors
-
-## See Also
-
-- [Document structure](./document-structure.md) — Full chapter map
-- [Goals Of PTO](./goals-of-pto.md) — Design objectives
-- [Scope And Boundaries](./design-goals-and-boundaries.md) — ISA scope and boundaries
-- [PTO ISA Version 1.0](./pto-isa-version-1-0.md) — Version baseline decisions
-- [Tiles And Valid Regions](../programming-model/tiles-and-valid-regions.md) — Tile semantics
-- [Auto Vs Manual](../programming-model/auto-vs-manual.md) — Execution modes
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — How individual opcode pages are structured
diff --git a/docs/mkdocs/src/docs/isa/introduction/what-is-pto-visa_zh.md b/docs/mkdocs/src/docs/isa/introduction/what-is-pto-visa_zh.md
deleted file mode 100644
index 7c551db7..00000000
--- a/docs/mkdocs/src/docs/isa/introduction/what-is-pto-visa_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Parallel Tile Operation ISA
-
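-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](what-is-pto-visa.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: Overview](../../../manual/01-overview_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been laid out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.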
diff --git a/docs/mkdocs/src/docs/isa/machine-model/README_zh.md b/docs/mkdocs/src/docs/isa/machine-model/README_zh.md
deleted file mode 100644
index 4f2757ee..00000000
--- a/docs/mkdocs/src/docs/isa/machine-model/README_zh.md
+++ /dev/null
@@ -1,21 +0,0 @@
-<!-- Generated from `docs/isa/machine-model/README_zh.md` -->
-
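-# Machine Model
-
-This chapter describes PTO's execution model: execution agents, pipelines, target profiles (A2/A3 vs A5), and the vocabulary of ordering and synchronization.
-
-## In This Chapter
-
-- [Execution Agents](machine-model/execution-agents.md): the three-level Host-Device-Core execution architecture, the per-profile difference table, and execution unit specifications
-- [Ordering and Synchronization](machine-model/ordering-and-synchronization.md): the four classes of synchronization primitives (Tile, Vector, DMA, Communication), the event model, and the pipeline dependency graph
-
-## Reading Guide
-
-Read the pages in the following order:
-
-1. Start with [Execution Agents](machine-model/execution-agents.md) to understand PTO's three execution levels (Host / Device / Core) and the differences between target profiles
-2. Then read [Ordering and Synchronization](machine-model/ordering-and-synchronization.md) to understand the synchronization primitives, event chains, and the producer-consumer dependency graph
-
-## Where This Chapter Fits
-
-This is Chapter 3 of the manual. Understanding the machine model is a prerequisite for understanding how PTO programs execute on target hardware. Learn the synchronization and ordering rules before moving on to the instruction set chapters, because the behavior of many instructions depends directly on the machine model's execution semantics.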
diff --git a/docs/mkdocs/src/docs/isa/machine-model/execution-agents.md b/docs/mkdocs/src/docs/isa/machine-model/execution-agents.md
deleted file mode 100644
index e09bd491..00000000
--- a/docs/mkdocs/src/docs/isa/machine-model/execution-agents.md
+++ /dev/null
@@ -1,218 +0,0 @@
-<!-- Generated from `docs/isa/machine-model/execution-agents.md` -->
-
-# Execution Agents And Target Profiles
-
-PTO uses an architecture-visible three-level execution hierarchy: host, device, and core. This structure is not a direct hardware block diagram — it is an abstraction that makes explicit where work is prepared, dispatched, and executed, and where target profiles may differ in capability.
-
-## Execution Hierarchy
-
-```
-┌─────────────────────────────────────────────────────────┐
-│                        HOST                              │
-│  CPU: prepares kernel arguments, submits graphs,         │
-│  manages runtime orchestration and memory allocation       │
-└───────────────────────┬─────────────────────────────────┘
-                        │ RPC / AOE / custom transport
-                        ▼
-┌─────────────────────────────────────────────────────────┐
-│                       DEVICE                            │
-│  Scheduler: dispatches legal PTO work to cores in       │
-│  dependence order, manages device-level memory (GM)     │
-└───────────────────────┬─────────────────────────────────┘
-                        │ Block dispatch
-                        ▼
-┌─────────────────────────────────────────────────────────┐
-│          BLOCK / AI CORE (one per physical core)       │
-│                                                         │
-│  ┌────────────────────────────────────────────────────┐ │
-│  │  Scalar Unit                                       │ │
-│  │  - Control flow, address calculation               │ │
-│  │  - System query: GetBlockIdx, GetSubBlockIdx, ...│ │
-│  ├────────────────────────────────────────────────────┤ │
-│  │  Unified Buffer (UB) — 256 KB on-chip SRAM         │ │
-│  │  - GM↔tile DMA staging area                      │ │
-│  │  - Shared by all tile buffers and vector regs       │ │
-│  ├────────────────────────────────────────────────────┤ │
-│  │  Tile Register File                                │ │
-│  │  ┌──────────┬──────────┬──────────┬──────────┐   │ │
-│  │  │ Vec slots│ Mat slots│ Acc slots│Scalar slt│   │ │
-│  │  │ 16×16×N │ 16×16×N │ 16×16×N │   1×1    │   │ │
-│  │  └────┬─────┴────┬─────┴────┬─────┴──────────┘   │ │
-│  ├───────┼──────────┼──────────┼───────────────────┤ │
-│  │  ┌────▼────┐ ┌───▼───┐ ┌────▼────┐              │ │
-│  │  │ Vector  │ │Matrix │ │  DMA    │              │ │
-│  │  │Pipeline │ │  M /  │ │ Engine  │              │ │
-│  │  │   (V)   │ │ CUBE  │ │MTE1/2/3 │              │ │
-│  │  └────┬────┘ └───┬───┘ └─────────┘              │ │
-│  └───────┼──────────┼────────────────────────────────┘ │
-└──────────┼──────────┼────────────────────────────────────┘
-           │          │
-           ▼          ▼
-        GM (off-chip device memory, shared by all blocks)
-```
-
-## Host
-
-The **host** (typically a CPU or the host portion of a heterogeneous SoC):
-
-- Prepares kernel arguments and memory descriptors
-- Submits PTO programs to the device scheduler
-- Manages graph-level or runtime orchestration (stream queuing, event tracking)
-- Owns host-side memory used for argument staging
-
-The host does NOT execute PTO instructions directly. It prepares and submits.
-
-## Device
-
-The **device** is the architecture-visible scheduling layer. A backend may implement it differently, but it is responsible for:
-
-- Dispatching legal PTO work units to AI Core blocks
-- Maintaining device-level memory (GM) and coherency with host memory
-- Enforcing dependence order across blocks when required
-- Managing device-side memory allocation
-
-## Core (AI Core)
-
-The **core** (one physical AI Core / NPU) is where PTO instructions execute. It contains:
-
-| Component | Description | PTO Visibility |
-|-----------|-------------|---------------|
-| **Scalar Unit** | Control flow, address calculation, system queries | `GetBlockIdx()`, `GetBlockNum()`, `GetSubBlockIdx()` |
-| **Unified Buffer (UB)** | 256 KB on-chip SRAM; shared staging area for GM↔tile DMA | `!pto.ptr<T, ub>` |
-| **Tile Register File** | On-chip tile buffer storage, typed by `TileType` | `!pto.tile_buf<...>` |
-| **Vector Pipeline (V)** | Executes `pto.v*` vector micro-instructions on vector registers | `!pto.vreg<NxT>` |
-| **Matrix Multiply Unit (M/CUBE)** | Executes `pto.tmatmul` and `pto.tgemv` | Via `TileType::Mat`, `TileType::Left`, `TileType::Right`, `TileType::Acc` |
-| **DMA Engine (MTE1/MTE2/MTE3)** | Moves data between GM and UB; coordinates with pipelines | `copy_gm_to_ubuf`, `copy_ubuf_to_gm`, `TLOAD`, `TSTORE` |
-
-## Vector Register Architecture (VLane)
-
-On A5 (Ascend 9xx-class), the vector register is organized as **8 VLanes** of 32 bytes each. A VLane is the atomic unit for group reduction operations.
-
-```
-vreg (256 bytes total):
-┌─────────┬─────────┬─────────┬─────┬─────────┬─────────┐
-│ VLane 0 │ VLane 1 │ VLane 2 │ ... │ VLane 6 │ VLane 7 │
-│   32B   │   32B   │   32B   │     │   32B   │   32B   │
-└─────────┴─────────┴─────────┴─────┴─────────┴─────────┘
-```
-
-Elements per VLane by data type:
-
-| Data Type | Elements/VLane | Total Elements/vreg |
-|-----------|---------------|-------------------|
-| i8 / u8 | 32 | 256 |
-| i16 / u16 / f16 / bf16 | 16 | 128 |
-| i32 / u32 / f32 | 8 | 64 |
-| i64 / u64 | 4 | 32 |
-
-The VLane concept is architecturally visible: group reduction operations (`vcgadd`, `vcgmax`, `vcgmin`) reduce within each VLane independently, producing one result per VLane.
-
-## MTE Pipeline Detail
-
-The DMA engine uses three sub-units that operate concurrently with compute pipelines:
-
-| MTE | Direction | Role in Tile Surface | Role in Vector Surface |
-|-----|-----------|---------------------|----------------------|
-| `MTE1` | GM → UB | Optional: explicit prefetch | Pre-stage data before vector load |
-| `MTE2` | GM → UB | Load staging: GM→UB→tile buffer (via TLOAD) | DMA copy: GM→UB (via `copy_gm_to_ubuf`) |
-| `MTE3` | UB → GM | Store: tile→UB→GM (via TSTORE) | DMA copy: UB→GM (via `copy_ubuf_to_gm`) |
-
-MTE1, MTE2, and MTE3 can operate in parallel with the Vector Pipeline and Matrix Multiply Unit when proper `set_flag`/`wait_flag` synchronization is used.
-
-## System Query Operations
-
-The following operations query the position of the current block within the grid:
-
-| Operation | Return | Description |
-|-----------|--------|-------------|
-| `GetBlockIdx(dim)` | `i32` | 0-based index of current block along dimension `dim` |
-| `GetSubBlockIdx(dim)` | `i32` | 0-based index of current sub-block within its parent block |
-| `GetBlockNum(dim)` | `i32` | Total number of blocks along dimension `dim` |
-| `GetSubBlockNum(dim)` | `i32` | Total number of sub-blocks within the parent block |
-
-These are the only operations that depend on the grid topology. All other tile/vector/scalar operations are block-local.
-
-## Target Profiles
-
-PTO ISA is instantiated by **target profiles** that narrow the ISA to the capabilities of a specific backend. A profile does NOT introduce new ISA semantics — it only documents which subsets are available and may add implementation-defined variation points.
-
-Three target profiles are currently defined:
-
-### CPU Simulator
-
-The **CPU simulator** (also called the reference simulator) executes PTO programs on the host CPU. Its goals are correctness and debuggability, not performance.
-
-- All `pto.t*` tile surface operations are emulated in software
-- All `pto.v*` vector surface operations are emulated with scalar loops
-- Matmul operations use a reference GEMM implementation
-- Fractal layouts are simulated with strided memory access
-- UB is allocated from heap memory
-- The UB size is configurable via build flags
-
-### A2/A3 Profile
-
-The **A2/A3 profile** targets Ascend 2/3-class NPUs. These targets support:
-
-- Full `pto.t*` tile surface on hardware
-- `pto.v*` vector surface emulated through a tile-vector bridge (`SimdTileToMemrefOp`, `SimdVecScopeOp`)
-- Hardware matmul via the Matrix Multiply Unit (CUBE)
-- Fractal layout support on hardware, but with software fallback paths
-- UB: 256 KB per AI Core
-- Vector width: N=64 (f32), N=128 (f16/bf16), N=256 (i8)
-- Support for `textract` compact modes (ND2NZ, NZ2ND, ND, ND2NZ2)
-
-### A5 Profile
-
-The **A5 profile** targets Ascend 9xx-class NPUs (Ascend 910, 910B, 920, etc.). These targets support:
-
-- Full `pto.t*` tile surface on hardware
-- Full native `pto.v*` vector surface on the vector pipeline
-- Hardware matmul with MX format support (int8 input → int32 accumulator)
-- Full fractal layout support (NZ, ZN, FR, RN) on hardware
-- UB: 256 KB per AI Core
-- FP8 support: `float8_e4m3_t` (E4M3) and `float8_e5m2_t` (E5M2)
-- Native vector unaligned store (`vstu` / `vstus`) and alignment state threading
-- Block-scoped collective communication primitives (`TBROADCAST`, `TGET`, `TPUT`, etc.)
-- 8 VLanes per vector register (group reduction atomic unit)
-
-### Target Profile Comparison
-
-| Feature | CPU Simulator | A2/A3 Profile | A5 Profile |
-|---------|:-------------:|:-------------:|:----------:|
-| Tile surface (`pto.t*`) | Full (emulated) | Full (hardware) | Full (hardware) |
-| Vector surface (`pto.v*`) | Emulated (scalar loops) | Emulated (tile-vector bridge) | Full native |
-| Matmul (`TMATMUL`) | Software fallback | Hardware CUBE | Hardware CUBE |
-| MX format (int8→int32 acc) | Not applicable | Not applicable | Supported |
-| Fractal layouts (NZ/ZN/FR/RN) | Simulated | Simulated | Full hardware |
-| UB size | Configurable | 256 KB/core | 256 KB/core |
-| Vector width (f32 / f16,bf16 / i8) | N=64 / N=128 / N=256 | N=64 / N=128 / N=256 | N=64 / N=128 / N=256 |
-| FP8 types (e4m3 / e5m2) | Not supported | Not supported | Supported |
-| Vector unaligned store (`vstu`) | Not supported | Not supported | Supported |
-| Vector alignment state (`vstu`/`vstas`) | Not supported | Not supported | Supported |
-| `hifloat8_t`, `float4_e*` types | Not supported | Not supported | Supported |
-| Block-scoped collective comm | Not supported | Supported | Supported |
-| Atomic store variants | Not supported | Supported | Supported |
-| `vselr`, `vselrv2` (pair select) | Not supported | Not supported | Supported |
-| TEXTRACT compact modes | Simulated | Supported | Supported |
-| VLane group reduction | Not applicable | Not applicable | Supported |
-
-## Constraints
-
-- Architecture-visible dependence order MUST survive target scheduling
-- Target profiles may narrow support, but MUST NOT redefine legal PTO semantics
-- Machine-model documentation MUST state clearly which facts are portable and which are profile-specific
-- Programs that depend on profile-specific features (e.g., MX format, FP8, unaligned vector store) are NOT portable across profiles
-
-## Cases That Are Not Allowed
-
-- Documenting A5-only features as general PTO guarantees
-- Assuming the CPU simulator's emulation behavior matches hardware performance or cycle-accurate timing
-- Treating a profile restriction as a contradiction of the ISA (profiles only narrow, never contradict)
-
-## See Also
-
-- [Ordering And Synchronization](./ordering-and-synchronization.md)
-- [Vector Instruction Surface](../instruction-surfaces/vector-instructions.md)
-- [Tile Instruction Surface](../instruction-surfaces/tile-instructions.md)
-- [Portability And Target Profiles](../reference/portability-and-target-profiles.md)
-- [PTO ISA Version 1.0](../introduction/pto-isa-version-1-0.md)
diff --git a/docs/mkdocs/src/docs/isa/machine-model/execution-agents_zh.md b/docs/mkdocs/src/docs/isa/machine-model/execution-agents_zh.md
deleted file mode 100644
index bc7213e4..00000000
--- a/docs/mkdocs/src/docs/isa/machine-model/execution-agents_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Execution Agents And Target Profiles
-
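-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](execution-agents.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: Execution Model](../../../manual/02-machine-model_zh.md)
-- [Chinese chapter manual: Synchronization](../../../manual/05-synchronization_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been laid out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.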
diff --git a/docs/mkdocs/src/docs/isa/machine-model/ordering-and-synchronization.md b/docs/mkdocs/src/docs/isa/machine-model/ordering-and-synchronization.md
deleted file mode 100644
index 68b86fa8..00000000
--- a/docs/mkdocs/src/docs/isa/machine-model/ordering-and-synchronization.md
+++ /dev/null
@@ -1,198 +0,0 @@
-<!-- Generated from `docs/isa/machine-model/ordering-and-synchronization.md` -->
-
-# Ordering And Synchronization
-
-PTO does not assume that all execution resources are implicitly serialized. The machine model makes ordering visible where data or state moves across surfaces, pipelines, or shared resources. This page describes the synchronization primitives, the event model, and the producer-consumer ordering contracts.
-
-## Synchronization Primitives
-
-PTO defines four categories of synchronization-related primitives: tile surface, vector surface, DMA, and communication surface.
-
-### Tile Surface Primitives
-
-| Primitive | Syntax | Description |
-|-----------|--------|-------------|
-| `TSYNC` | `pto.tsync %events...` or `pto.tsync<Op>` | Wait on explicit `RecordEvent` tokens; or insert a pipeline barrier for a single op class |
-| `set_flag` | `pto.set_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]` | Signal an event from one pipeline to another |
-| `wait_flag` | `pto.wait_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]` | Wait for a previously-signaled event |
-
-`TSYNC` is the primary tile-surface synchronization primitive. The event-wait form `TSYNC(events...)` establishes a **happens-before** edge on each `RecordEvent` token, ensuring that all prior tile operations that produced those events are complete. The barrier form `TSYNC<Op>()` inserts a pipeline barrier for all operations of class `Op`.
-
-> **Note:** `pipe_barrier` (`pto.pipe_barrier`) is a scalar/control surface primitive, not a tile surface primitive. It appears in the [Scalar Pipeline Sync](../scalar/pipeline-sync.md) family.
-
-### Vector Surface Primitives
-
-| Primitive | Syntax | Description |
-|-----------|--------|-------------|
-| `set_flag` / `wait_flag` | `pto.set_flag[...]` / `pto.wait_flag[...]` | Event-based handoff between DMA and vector compute pipelines |
-| `mem_bar` | `pto.mem_bar` | Memory fence; ordering boundary for GM↔UB traffic |
-
-On the vector surface, `set_flag(PIPE_MTE2, PIPE_V, ID)` is issued by the DMA engine (MTE2) to signal the vector pipeline that data is ready. The vector pipeline issues `wait_flag(PIPE_MTE2, PIPE_V, ID)` before consuming the data.
-
-### DMA Primitives
-
-| Primitive | Syntax | Description |
-|-----------|--------|-------------|
-| `copy_gm_to_ubuf` | `pto.copy_gm_to_ubuf ...` | DMA: GM → UB |
-| `copy_ubuf_to_gm` | `pto.copy_ubuf_to_gm ...` | DMA: UB → GM |
-| `copy_ubuf_to_ubuf` | `pto.copy_ubuf_to_ubuf ...` | DMA: UB → UB (double-buffering) |
-
-DMA operations do not implicitly synchronize with the compute pipeline. Explicit `set_flag`/`wait_flag` pairs (or equivalent `RecordEvent` chaining) are required wherever a DMA transfer and a compute operation share data.
-
-### Communication Surface Primitives
-
-| Primitive | Description |
-|-----------|-------------|
-| `TBROADCAST` | Broadcast data to all participating blocks |
-| `TGET` / `TPUT` | Point-to-point communication between blocks |
-| `TWAIT` / `TTEST` | Barrier synchronization across blocks |
-| `TNOTIFY` / `TREDUCE` | Notification and reduction operations |
-
-## Event Model
-
-PTO uses an **event-based** synchronization model. Events carry ordering information between pipelines.
-
-### Event Lifecycle
-
-```
-Producer                                  Consumer
-  │                                         │
-  │  issue DMA / compute                    │
-  │  ▼                                      │
-  │  set_flag(SRC_PIPE, DST_PIPE, EVENT_ID)│
-  │  (produces the event)                   │
-  │                                         │
-  │                              wait_flag(SRC_PIPE, DST_PIPE, EVENT_ID)
-  │                              (consumes the event)
-  │                                         │
-  │  data/result available                  │
-  ▼                                         ▼
-```
-
-An **event** is identified by a triple `(src_pipe, dst_pipe, event_id)`:
-
-| Field | Values | Meaning |
-|-------|--------|---------|
-| `src_pipe` | `PIPE_MTE1`, `PIPE_MTE2`, `PIPE_MTE3`, `PIPE_V`, `PIPE_M` | Source pipeline that produces the event |
-| `dst_pipe` | `PIPE_MTE1`, `PIPE_MTE2`, `PIPE_MTE3`, `PIPE_V`, `PIPE_M` | Destination pipeline that consumes the event |
-| `event_id` | 0–15 (profile-specific) | Event slot identifier |
-
-Events are **fire-and-forget** in the ISA contract: producing a flag makes it available to all subsequent waiters on the same `(src_pipe, dst_pipe, event_id)` triple.
-
-### Events and RecordEvent
-
-The C++ intrinsics for tile operations (e.g., `TLOAD`, `TSTORE`, `TMATMUL`) return a `RecordEvent` value. This event can be passed as a `WaitEvents...` argument to subsequent operations, establishing a **happens-before** edge:
-
-```cpp
-RecordEvent e0 = TLOAD(a, ga);     // produces event
-RecordEvent e1 = TLOAD(b, gb);     // produces event
-TMATMUL(c, a, b, e0, e1);          // waits for both e0 and e1 before executing
-```
-
-The `RecordEvent` return value is the **ISA-visible mechanism** for chaining tile-surface dependencies. This is equivalent to inserting explicit `set_flag`/`wait_flag` pairs but expressed at a higher level.
-
-## Pipeline Dependency Graph
-
-The AI Core contains multiple execution units that operate concurrently. The following diagram shows the dependency relationships:
-
-```
-         ┌──────────────────────────────────────────────────────┐
-         │                   AI CORE                            │
-         │                                                      │
-  GM ────│─── MTE1 ──► UB ──┬─────────────────────────────┐    │
-         │                  │                             │    │
-         │                  │  MTE2 ──► UB ──┐            │    │
-         │                  │                  │            │    │
-         │    ┌─────────────┴──────────────────┴─────────┐  │    │
-         │    │                                       │  │    │
-         │    │   ┌───────────────────────────────────┘  │    │
-         │    │   │                                      │  │    │
-         │    │   │   Tile Register File                 │  │    │
-         │    │   │   ┌───────┐  ┌───────┐  ┌───────┐  │  │    │
-         │    │   │   │ Vec   │  │ Mat   │  │ Acc   │  │  │    │
-         │    │   │   └───────┘  └───────┘  └───────┘  │  │    │
-         │    │   │      │          │          │       │  │    │
-         │    │   │      ▼          ▼          ▼       │  │    │
-         │    │   │   ┌─────────────────────────┐      │  │    │
-         │    │   │   │     Vector Pipeline     │      │  │    │
-         │    │   │   │   (pto.v* ops)          │      │  │    │
-         │    │   │   └─────────────────────────┘      │  │    │
-         │    │   │            │                       │  │    │
-         │    │   │            │  ┌───────────────────┘  │    │
-         │    │   │            ▼  ▼                      │    │
-         │    │   │   ┌─────────────────────┐            │    │
-         │    │   │   │  Matrix Multiply (M)│            │    │
-         │    │   │   │  (pto.tmatmul*)     │            │    │
-         │    │   │   └─────────────────────┘            │    │
-         │    │   │              │                     │    │
-         │    │   └──────────────┼─────────────────────┘    │
-         │    │                  │                          │
-         │    │    ┌─────────────┴────────────┐             │
-         │    │    │                           │             │
-         │    │    ▼                           ▼             │
-         │    │ MTE3 ──► UB ────────────────────────────────┼───► GM
-         │    └──────────────────────────────────────────────┘
-         │
-         └── Scalar Unit (control flow, address gen, system queries)
-```
-
-### Dependency Types
-
-| Producer | Consumer | Synchronization Required |
-|----------|----------|------------------------|
-| MTE2 (DMA GM→UB) | Vector pipeline (vlds) | `set_flag(PIPE_MTE2, PIPE_V, ID)` → `wait_flag` |
-| Vector pipeline | MTE3 (store) | `set_flag(PIPE_V, PIPE_MTE3, ID)` → `wait_flag` |
-| TLOAD | Tile compute | `RecordEvent` chaining or `TSYNC` |
-| Tile compute | TSTORE | `RecordEvent` chaining or `TSYNC` |
-| TLOAD | TMATMUL | `RecordEvent` chaining or `set_flag`/`wait_flag` |
-| Tile compute (Mat) | Tile compute (Vec) | `set_flag`/`wait_flag` or `TSYNC` |
-
-## Ordering Rules
-
-### Tile Surface Ordering
-
-Tile-surface operations are ordered by **program order** within a single tile buffer, and by **event ordering** across tile buffers. The following rules apply:
-
-1. **Tile-local order**: Within a single tile buffer, operations execute in program order. `TSYNC` establishes a barrier within that tile's ordering stream.
-2. **Event ordering**: A `set_flag`/`wait_flag` pair establishes a **happens-before** edge between the producer pipeline and the consumer pipeline.
-3. **RecordEvent chaining**: When an operation's `WaitEvents...` arguments include events from prior operations, those prior operations must complete before the current operation begins.
-
-### Vector Surface Ordering
-
-Vector-surface ordering follows these rules:
-
-1. **DMA ordering**: `copy_gm_to_ubuf` must complete (via `set_flag`) before any `vlds` that consumes the copied data.
-2. **Compute ordering**: Vector operations within a `SimdVecScopeOp` execute in program order.
-3. **Store ordering**: `vsts` must complete (via `set_flag` to MTE3) before `copy_ubuf_to_gm` begins copying the data back to GM.
-
-### GM Visibility
-
-Data written to GM by `TSTORE` or `copy_ubuf_to_gm` is guaranteed visible to subsequent GM reads by other blocks only after:
-
-1. All prior store operations on that block have completed (program order).
-2. Any required `mem_bar` or `pipe_barrier` has been issued.
-3. The operation has been synchronized with the host runtime (event completion).
-
-## Constraints
-
-- Synchronization is required wherever the architecture does not already guarantee ordering.
-- A target may add stronger internal ordering, but the manual must not rely on undocumented strength.
-- Vector-pipe synchronization rules must be documented separately from tile-surface synchronization rules when the mechanisms differ.
-- Events are fire-and-forget; the ISA does not provide a "test-and-clear" event flag.
-- `TSYNC` is tile-buffer-scoped; it does not synchronize across tile buffers.
-
-## Cases That Are Not Allowed
-
-- Writing the manual as if synchronization were optional when the architecture requires it.
-- Assuming vector-pipe hazards are covered by tile-surface rules without saying so.
-- Documenting target-specific barriers as architecture-wide unless the PTO surface guarantees them.
-- Issuing `vlds` before `copy_gm_to_ubuf` completes without an intervening `wait_flag`.
-- Issuing `copy_ubuf_to_gm` before `vsts` completes without an intervening `wait_flag`.
-
-## See Also
-
-- [Consistency Baseline](../memory-model/consistency-baseline.md)
-- [Producer-Consumer Ordering](../memory-model/producer-consumer-ordering.md)
-- [Tile Families: Sync And Config](../tile/sync-and-config.md)
-- [Vector Pipeline Sync](../vector/pipeline-sync.md)
-- [Scalar Pipeline Sync](../scalar/pipeline-sync.md)
diff --git a/docs/mkdocs/src/docs/isa/machine-model/ordering-and-synchronization_zh.md b/docs/mkdocs/src/docs/isa/machine-model/ordering-and-synchronization_zh.md
deleted file mode 100644
index 72a84dbb..00000000
--- a/docs/mkdocs/src/docs/isa/machine-model/ordering-and-synchronization_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Ordering And Synchronization
-
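-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](ordering-and-synchronization.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: Execution Model](../../../manual/02-machine-model_zh.md)
-- [Chinese chapter manual: Synchronization](../../../manual/05-synchronization_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been laid out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.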
diff --git a/docs/mkdocs/src/docs/isa/memory-model/README_zh.md b/docs/mkdocs/src/docs/isa/memory-model/README_zh.md
deleted file mode 100644
index fa47358c..00000000
--- a/docs/mkdocs/src/docs/isa/memory-model/README_zh.md
+++ /dev/null
@@ -1,21 +0,0 @@
-<!-- Generated from `docs/isa/memory-model/README_zh.md` -->
-
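-# Memory Model
-
-This chapter describes PTO's memory consistency model: visibility and ordering rules, covering producer-consumer ordering and its relationship to other ISA operations.
-
-## In This Chapter
-
-- [Consistency Baseline](memory-model/consistency-baseline.md): the three memory spaces (GM / UB / Tile Buffer), the classification table for the three ordering levels (Program Order / Event Order / Barrier Order), and the precise distinction between undefined, unspecified, and implementation-defined behavior
-- [Producer-Consumer Ordering](memory-model/producer-consumer-ordering.md): the full state machine diagram (IDLE → IN_PROGRESS → COMPLETE), the ordering chains for the Tile Surface and Vector Surface, and the cross-surface transfer rules
-
-## Reading Guide
-
-Read the pages in the following order:
-
-1. Start with [Consistency Baseline](memory-model/consistency-baseline.md) to understand PTO's three memory spaces and three ordering levels
-2. Then read [Producer-Consumer Ordering](memory-model/producer-consumer-ordering.md) to understand the concrete ordering mechanisms of the Tile Surface (RecordEvent / TSYNC) and the Vector Surface (set_flag / wait_flag)
-
-## Where This Chapter Fits
-
-This is Chapter 6 of the manual. Read it before the instruction set chapter (Chapter 7), because the behavior of many instructions is tied directly to memory ordering, in particular the synchronization semantics of `TLOAD` / `TSTORE` and `copy_gm_to_ubuf` / `copy_ubuf_to_gm`.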
diff --git a/docs/mkdocs/src/docs/isa/memory-model/consistency-baseline.md b/docs/mkdocs/src/docs/isa/memory-model/consistency-baseline.md
deleted file mode 100644
index 815fe1e7..00000000
--- a/docs/mkdocs/src/docs/isa/memory-model/consistency-baseline.md
+++ /dev/null
@@ -1,128 +0,0 @@
-<!-- Generated from `docs/isa/memory-model/consistency-baseline.md` -->
-
-# Consistency Baseline
-
-PTO's memory model is built around **explicit movement** and **explicit ordering**. The baseline guarantee is intentionally narrower than "everything is globally ordered." PTO requires the program or the selected surface to express when data becomes visible across stages, surfaces, and blocks.
-
-## Memory Spaces
-
-PTO defines three architecturally distinct memory spaces:
-
-| Space | Address Qualifier | Scope | Visibility |
-|-------|-----------------|-------|-----------|
-| **Global Memory (GM)** | `__gm__` | All AI Cores | Shared |
-| **Unified Buffer (UB)** | `!pto.ptr<T, ub>` | Single AI Core | Core-local |
-| **Tile Buffer** | `!pto.tile_buf<...>` | Single AI Core | Core-local, pipeline-specific |
-
-These are NOT interchangeable. Data must be explicitly moved between them, and each space has different visibility semantics.
-
-## Ordering Levels
-
-PTO defines three levels of ordering guarantee:
-
-| Ordering Level | Description | Scope | How to Establish |
-|---------------|-------------|-------|-----------------|
-| **Program Order** | Operations within a single tile buffer or vector register file execute in program order | Single core | Implicit within a buffer |
-| **Event Order** | Ordering between operations on different pipelines or buffers | Within a block | `set_flag`/`wait_flag` or `RecordEvent` chaining |
-| **Barrier Order** | Ordering across multiple blocks | Grid-wide | `TBARRIER` / collective ops |
-
-### Program Order
-
-Within a single tile buffer or a single vector register, operations are ordered by program order:
-
-```c
-TLOAD(a, ga);  // 1. Load a
-TLOAD(b, gb);  // 2. Load b (ordered after 1, same buffer)
-TADD(c, a, b); // 3. Compute (ordered after 1 and 2, same buffer)
-```
-
-No explicit synchronization is needed between operations on the same tile buffer.
-
-### Event Order
-
-When data moves between different buffers or different pipelines, explicit event ordering is required:
-
-```c
-RecordEvent e0 = TLOAD(a, ga);     // produces event
-RecordEvent e1 = TLOAD(b, gb);     // produces event
-TMATMUL(c, a, b, e0, e1);          // waits for e0 and e1 before starting
-```
-
-The `RecordEvent` return value is a handle to the ordering guarantee. Passing it as a `WaitEvents` argument to a subsequent operation establishes a **happens-before** edge.
-
-### Barrier Order
-
-When multiple blocks must synchronize (e.g., after a collective operation), a grid-wide barrier is required:
-
-```mlir
-pto.tbroadcast %tensor, %src : !pto.tile<f32,16,16> -> ()
-pto.twait // block until all blocks have received the broadcast
-```
-
-## What PTO Does NOT Guarantee Automatically
-
-PTO does not automatically guarantee:
-
-| Guarantee NOT Given | Reason |
-|--------------------|--------|
-| That every cross-pipeline write is immediately visible to every consumer | Requires explicit `set_flag`/`wait_flag` |
-| That vector register, tile buffer, and GM traffic share one implicit fence model | Each space has distinct visibility rules |
-| That target-specific stronger ordering is portable | Stronger ordering on A5 does not apply to CPU/A2/A3 |
-| That UB writes are visible to GM reads without explicit `TSTORE` or `copy_ubuf_to_gm` | UB→GM requires explicit data movement |
-| That GM writes are visible to UB reads without explicit `TLOAD` or `copy_gm_to_ubuf` | GM→UB requires explicit data movement |
-
-## GM Visibility
-
-Data written to GM by `TSTORE` or `copy_ubuf_to_gm` is guaranteed visible to subsequent GM reads by other blocks only after:
-
-1. All prior store operations on that block have completed (program order within block).
-2. Any required `mem_bar`, `pipe_barrier`, or collective synchronization has been issued.
-3. The host runtime has confirmed completion (via event or runtime call).
-
-The exact moment of GM visibility across blocks is **implementation-defined** — the ISA guarantees that the ordering contract is satisfied, but the exact timing of cross-block visibility depends on the target profile.
-
-## UB Visibility
-
-UB is **core-local**: only the AI Core that owns the UB can read or write it. UB data is NOT visible to other cores.
-
-UB visibility within a core follows program order plus event order. The following are guaranteed by the ISA:
-
-- UB reads within a core see all prior UB writes within that core (program order).
-- UB reads see data from `copy_gm_to_ubuf` only after the corresponding `wait_flag` returns.
-
-## Undefined, Unspecified, and Implementation-Defined
-
-PTO uses these terms precisely:
-
-| Term | Meaning | Example |
-|------|---------|---------|
-| **Undefined** | The ISA places no requirements on the behavior; any outcome is permitted | Reading a tile's out-of-valid-region element |
-| **Unspecified** | The ISA allows more than one behavior; an implementation may choose any of them and need not document the choice | Exact cycle count of an operation |
-| **Implementation-Defined** | The implementation chooses the behavior and must document its choice | FTZ behavior for denormals on A5 |
-
-## Target Refinement
-
-CPU, A2/A3, and A5 may differ in implementation detail and supported subsets, but the baseline manual must still state clearly which ordering facts are portable and which are target-specific:
-
-| Category | Portability |
-|----------|------------|
-| Program order within a tile buffer | Portable across all profiles |
-| Event order via `set_flag`/`wait_flag` | Portable across all profiles (same semantics, different pipe spaces) |
-| `RecordEvent` chaining | Portable across all profiles |
-| Barrier order via collective ops | Not available on CPU; available on A2/A3 and A5 |
-| UB→GM implicit visibility | Not portable; requires explicit `TSTORE`/`copy_ubuf_to_gm` |
-| A5-specific stronger ordering | A5-specific, not portable to CPU/A2/A3 |
-
-## Cases That Are Not Allowed
-
-- Documenting implementation detail as though it were the portable memory model.
-- Hiding visibility requirements inside vague words like "usually ordered."
-- Mixing memory-model guarantees with scheduling heuristics.
-- Claiming that data is visible across blocks without an explicit synchronization operation.
-- Assuming that "the hardware does it automatically" without specifying which operation provides the guarantee.
-
-## See Also
-
-- [Producer-Consumer Ordering](./producer-consumer-ordering.md)
-- [Ordering And Synchronization](../machine-model/ordering-and-synchronization.md)
-- [Portability And Target Profiles](../reference/portability-and-target-profiles.md)
diff --git a/docs/mkdocs/src/docs/isa/memory-model/consistency-baseline_zh.md b/docs/mkdocs/src/docs/isa/memory-model/consistency-baseline_zh.md
deleted file mode 100644
index 9c1d919c..00000000
--- a/docs/mkdocs/src/docs/isa/memory-model/consistency-baseline_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Consistency Baseline
-
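-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](consistency-baseline.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: Memory Ordering and Consistency](../../../manual/11-memory-ordering-and-consistency_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA is already in place, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.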
diff --git a/docs/mkdocs/src/docs/isa/memory-model/producer-consumer-ordering.md b/docs/mkdocs/src/docs/isa/memory-model/producer-consumer-ordering.md
deleted file mode 100644
index 2ad5f830..00000000
--- a/docs/mkdocs/src/docs/isa/memory-model/producer-consumer-ordering.md
+++ /dev/null
@@ -1,191 +0,0 @@
-<!-- Generated from `docs/isa/memory-model/producer-consumer-ordering.md` -->
-
-# Producer-Consumer Ordering
-
-Producer-consumer ordering is the most useful way to explain PTO visibility rules. A program is legal when each consumer sees the writes or state changes its producer is required to make visible, using the synchronization and movement rules of the active surface.
-
-## Producer-Consumer State Machine
-
-Every data movement or compute operation participates in a producer-consumer chain. The state machine for each operation is:
-
-```
-┌─────────────────┐
-│   IDLE          │  Operation not yet issued
-└────────┬────────┘
-         │ issue
-         ▼
-┌─────────────────┐
-│  IN_PROGRESS    │  Operation executing (may be on different pipeline)
-└────────┬────────┘
-         │ completion (produces event)
-         ▼
-┌─────────────────┐
-│   COMPLETE      │  Result visible to consumers who have
-└────────┬────────┘    established the ordering edge
-         │ (consumed by next operation)
-         ▼
-    [Consumer]
-```
-
-An operation is **consumed** by a subsequent operation when the consumer either:
-
-1. Passes the producer's `RecordEvent` as a `WaitEvents` argument.
-2. Issues a `wait_flag` with the same pipe pair and event ID as the `set_flag` issued after the producer.
-
-## Tile Surface Ordering
-
-For `pto.t*` programs, the common pattern is:
-
-```
-┌─────────────────────────────────────────────────────┐
-│  TLOAD(tile, gtensor)                               │
-│  (produces tile state)                              │
-└─────────────────┬───────────────────────────────────┘
-                  │ RecordEvent or implicit TSYNC
-                  ▼
-┌─────────────────────────────────────────────────────┐
-│  Tile Compute (TADD, TMATMUL, etc.)                │
-│  (consumes tile state; produces tile state)         │
-└─────────────────┬───────────────────────────────────┘
-                  │ RecordEvent or explicit TSYNC
-                  ▼
-┌─────────────────────────────────────────────────────┐
-│  TSTORE(gtensor, tile)                              │
-│  (consumes tile state; produces GM write)           │
-└─────────────────────────────────────────────────────┘
-```
-
-### RecordEvent Chaining
-
-The `RecordEvent` return value of each tile operation can be passed to the next operation as a `WaitEvents...` argument:
-
-```cpp
-RecordEvent e0 = TLOAD(a, ga);              // e0 signals completion of the first TLOAD
-RecordEvent e1 = TLOAD(b, gb);              // e1 signals completion of the second TLOAD
-// TMATMUL waits for e0 and e1 before starting; e2 signals its completion
-RecordEvent e2 = TMATMUL(c, a, b, e0, e1);
-TSTORE(gc, c, e2);                          // waits for e2 before starting
-```
-
-When an operation has multiple `WaitEvents...` arguments, it waits for ALL of them before beginning execution.
-
-### TSYNC
-
-`TSYNC` provides a lightweight tile-buffer-scoped barrier when fine-grained event chaining is not needed:
-
-```cpp
-TLOAD(a, ga);
-TLOAD(b, gb);
-TSYNC();        // ensures both loads are complete before compute
-TADD(c, a, b);
-TSYNC();        // ensures compute is complete before store
-TSTORE(gc, c);
-```
-
-`TSYNC` is equivalent to chaining all prior `RecordEvent` values for the same tile buffer.
-
-## Vector Surface Ordering
-
-For `pto.v*` programs, the ordering chain involves explicit DMA synchronization:
-
-```
-┌──────────────────────────────────────────────────────────┐
-│  copy_gm_to_ubuf(%ub, %gm, ...)                         │
-│  (DMA: GM → UB)                                          │
-└────────────────┬─────────────────────────────────────────┘
-                 │ set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0)
-                 ▼
-┌──────────────────────────────────────────────────────────┐
-│  wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0)                 │
-│  (UB data now visible to Vector pipeline)                 │
-└────────────────┬─────────────────────────────────────────┘
-                 │ (implicit on vlds)
-                 ▼
-┌──────────────────────────────────────────────────────────┐
-│  vlds %vreg, %ub[...] {dist = "NORM"}                   │
-│  (UB → Vector Register)                                  │
-└────────────────┬─────────────────────────────────────────┘
-                 │
-                 ▼
-┌──────────────────────────────────────────────────────────┐
-│  vadd %result, %vreg, %vreg                             │
-│  (Vector Compute on Vector Register)                     │
-└────────────────┬─────────────────────────────────────────┘
-                 │
-                 ▼
-┌──────────────────────────────────────────────────────────┐
-│  vsts %result, %ub[...]                                 │
-│  (Vector Register → UB)                                   │
-└────────────────┬─────────────────────────────────────────┘
-                 │ set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1)
-                 ▼
-┌──────────────────────────────────────────────────────────┐
-│  wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1)                 │
-│  (Vector result now staged for DMA)                      │
-└────────────────┬─────────────────────────────────────────┘
-                 │
-                 ▼
-┌──────────────────────────────────────────────────────────┐
-│  copy_ubuf_to_gm(%gm, %ub, ...)                         │
-│  (DMA: UB → GM)                                          │
-└──────────────────────────────────────────────────────────┘
-```
-
-### Vector Surface vs Tile Surface Ordering
-
-| Aspect | Tile Surface | Vector Surface |
-|--------|-------------|---------------|
-| Synchronization mechanism | `RecordEvent`, `TSYNC` | `set_flag`/`wait_flag` on pipe pairs |
-| Data path | GM ↔ Tile Buffer (via MTE2/MTE3) | GM ↔ UB ↔ Vector Register (via DMA + vlds/vsts) |
-| Visibility model | Producer-consumer chain via events | DMA signal → wait → vlds → compute → vsts → DMA signal |
-| Implicit ordering | Within same tile buffer | None — explicit flag required between DMA and compute |
-| Store path | Tile → MTE3 → GM | Vector Register → vsts → UB → MTE3 → GM |
-
-## Cross-Surface Handoff
-
-When a tile-surface result is consumed by a vector-surface operation (or vice versa), the handoff is never implicit; the data must be staged explicitly through memory:
-
-```
-Tile Surface                              Vector Surface
-    │                                         ▲
-    │  TLOAD/TSTORE handles GM ↔ Tile Buffer  │
-    │                                         │
-    └──── TSTORE → UB → copy_ubuf_to_gm ──────┘
-         (via copy_gm_to_ubuf on vector side)
-```
-
-The cross-surface handoff goes through GM or through an explicit UB double-buffering pattern:
-
-```cpp
-// Tile surface produces result in tile c
-TSTORE(gc, c);
-
-// Vector surface consumes from gc
-copy_gm_to_ubuf(%ub, %gm_out, ...);
-set_flag(PIPE_MTE2, PIPE_V, ID);
-wait_flag(PIPE_MTE2, PIPE_V, ID);
-%v = pto.vlds %ub[...] {dist = "NORM"};
-```
-
-## Constraints
-
-- A consumer may only rely on visibility after the required producer-consumer edge is established.
-- The exact synchronization mechanism may vary by surface or target profile.
-- Family docs and per-op pages must state the relevant ordering expectations explicitly.
-- An operation's `RecordEvent` return value is only valid for chaining to operations that execute AFTER the current operation in program order.
-
-## Cases That Are Not Allowed
-
-- Describing a consumer as legal without saying how producer visibility is established.
-- Assuming a target's convenient scheduling behavior is the architecture contract.
-- Leaving cross-surface handoff rules implicit.
-- Issuing `vlds` before `copy_gm_to_ubuf` completes without an intervening `wait_flag`.
-- Issuing `copy_ubuf_to_gm` before `vsts` completes without an intervening `wait_flag`.
-- Passing a `RecordEvent` from a later operation to an earlier operation (wrong direction) — this is illegal and produces a verification error.
-
-## See Also
-
-- [Consistency Baseline](./consistency-baseline.md)
-- [Ordering And Synchronization](../machine-model/ordering-and-synchronization.md)
-- [Tile Instruction Surface](../instruction-surfaces/tile-instructions.md)
-- [Vector Instruction Surface](../instruction-surfaces/vector-instructions.md)
diff --git a/docs/mkdocs/src/docs/isa/memory-model/producer-consumer-ordering_zh.md b/docs/mkdocs/src/docs/isa/memory-model/producer-consumer-ordering_zh.md
deleted file mode 100644
index 2a420969..00000000
--- a/docs/mkdocs/src/docs/isa/memory-model/producer-consumer-ordering_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Producer-Consumer Ordering
-
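-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](producer-consumer-ordering.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: Memory Ordering and Consistency](../../../manual/11-memory-ordering-and-consistency_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA is already in place, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.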
diff --git a/docs/mkdocs/src/docs/isa/other/README.md b/docs/mkdocs/src/docs/isa/other/README.md
deleted file mode 100644
index d896cd79..00000000
--- a/docs/mkdocs/src/docs/isa/other/README.md
+++ /dev/null
@@ -1,50 +0,0 @@
-<!-- Generated from `docs/isa/other/README.md` -->
-
-# Other Families
-
-This section covers operations that do not fit cleanly into the tile, vector, or scalar/control buckets.
-
-## Communication And Runtime
-
-Inter-NPU collective communication and synchronization.
-
-| Family | Description |
-|--------|-------------|
-| [TBROADCAST](./comm/TBROADCAST.md) | Broadcast data from root NPU to all ranks |
-| [TGET](./comm/TGET.md) | Get data from a remote NPU |
-| [TGET_ASYNC](./comm/TGET_ASYNC.md) | Asynchronous variant of TGET |
-| [TNOTIFY](./comm/TNOTIFY.md) | Notify other ranks of an event |
-| [TPUT](./comm/TPUT.md) | Put data to a remote NPU |
-| [TPUT_ASYNC](./comm/TPUT_ASYNC.md) | Asynchronous variant of TPUT |
-| [TREDUCE](./comm/TREDUCE.md) | Collective reduction across all ranks |
-| [TSCATTER](./comm/TSCATTER.md) | Scatter data from root NPU to all ranks |
-| [TGATHER](./comm/TGATHER.md) | Gather data from all ranks to root NPU |
-| [TTEST](./comm/TTEST.md) | Test if a notification has been received |
-| [TWAIT](./comm/TWAIT.md) | Wait for a notification |
-
-See [Communication and Runtime](./communication-and-runtime.md) for the family contract.
-
-## Non-ISA Supporting Operations
-
-Convenience operations over tile sequences or memory management.
-
-| Operation | Description | Category |
-|-----------|-------------|----------|
-| [TALIAS](./TALIAS.md) | Create an alias view of a tile without copying | Alias |
-| [TAXPY](./TAXPY.md) | Fused multiply-add: `dst = src0 * scalar + src1` | Fused compute |
-| [TCONCAT](./TCONCAT.md) | Concatenate two tiles along a dimension | Tile sequence |
-| [TDEQUANT](./TDEQUANT.md) | Dequantize a tile from quantized format | Quantize |
-| [TFREE](./TFREE.md) | Free a previously allocated tile or buffer | Memory |
-| [THISTOGRAM](./THISTOGRAM.md) | Compute histogram of tile values | Statistics |
-| [TPACK](./TPACK.md) | Pack multiple tiles into a single tile buffer | Tile sequence |
-| [TPOP](./TPOP.md) | Population count of predicate mask | Predicate |
-| [TPUSH](./TPUSH.md) | Push count of predicate mask | Predicate |
-| [TRANDOM](./TRANDOM.md) | Fill tile with random values | Generation |
-| [TQUANT](./TQUANT.md) | Quantize a tile to integer format | Quantize |
-
-See [Non-ISA and Supporting Ops](./non-isa-and-supporting-ops.md) for the family contract.
-
-## See Also
-
-- [Other instruction surface](../instruction-surfaces/other-instructions.md) — High-level surface description
-- [Instruction families](./README.md) — All family groups
diff --git a/docs/mkdocs/src/docs/isa/other/README_zh.md b/docs/mkdocs/src/docs/isa/other/README_zh.md
deleted file mode 100644
index c19c14bb..00000000
--- a/docs/mkdocs/src/docs/isa/other/README_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-<!-- Generated from `docs/isa/other/README_zh.md` -->
-
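-# Other And Communication
-
-This section covers the remaining instructions and communication operations that do not belong to the Tile, Vector, or scalar/control mainline.
-
-## In This Chapter
-
-- [Communication and Runtime](other/communication-and-runtime.md): point-to-point communication, collective operations, and runtime support
-- [Non-ISA and Supporting Operations](other/non-isa-and-supporting-ops.md): supporting operations outside the ISA boundary
-
-## Where This Chapter Fits
-
-This chapter supplements Chapter 7 (Instruction Set) of the manual. An operation that does not belong to the Tile, Vector, or scalar mainline is placed in this section.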
diff --git a/docs/mkdocs/src/docs/isa/other/communication-and-runtime.md b/docs/mkdocs/src/docs/isa/other/communication-and-runtime.md
deleted file mode 100644
index cd2d26a1..00000000
--- a/docs/mkdocs/src/docs/isa/other/communication-and-runtime.md
+++ /dev/null
@@ -1,109 +0,0 @@
-<!-- Generated from `docs/isa/other/communication-and-runtime.md` -->
-
-# Communication And Runtime
-
-Communication operations span multiple NPUs in a parallel group. They express inter-NPU data exchange and collective reduction using a `ParallelGroup` handle.
-
-> **Note on naming:** `pto.tget` (IR form) and `TGET` (C++ intrinsic form) refer to the same operation — remote read from a peer NPU's GM. Both spellings appear in this manual depending on context.
-
-## Operations
-
-| Operation | Description | Collective Type | IR Spelling | C++ Spelling |
-|-----------|-------------|----------------|-------------|--------------|
-| [TBROADCAST](./TBROADCAST.md) | Broadcast data from root NPU to all ranks | One-to-all | `pto.tbroadcast` | `TBROADCAST` |
-| [TGET](./TGET.md) | Get data from a remote NPU | Point-to-point | `pto.tget` | `TGET` |
-| [TGET_ASYNC](./TGET_ASYNC.md) | Asynchronous variant of TGET | Point-to-point | `pto.tget_async` | `TGET_ASYNC` |
-| [TNOTIFY](./TNOTIFY.md) | Notify other ranks of an event | Synchronization | `pto.tnotify` | `TNOTIFY` |
-| [TPUT](./TPUT.md) | Put data to a remote NPU | Point-to-point | `pto.tput` | `TPUT` |
-| [TPUT_ASYNC](./TPUT_ASYNC.md) | Asynchronous variant of TPUT | Point-to-point | `pto.tput_async` | `TPUT_ASYNC` |
-| [TREDUCE](./TREDUCE.md) | Collective reduction across all ranks | All-to-one | `pto.treduce` | `TREDUCE` |
-| [TSCATTER](./TSCATTER.md) | Scatter data from root to all ranks | One-to-all | `pto.tscatter` | `TSCATTER` |
-| [TGATHER](./TGATHER.md) | Gather data from all ranks to root | All-to-one | `pto.tgather` | `TGATHER` |
-| [TTEST](./TTEST.md) | Test if a notification has been received | Synchronization | `pto.ttest` | `TTEST` |
-| [TWAIT](./TWAIT.md) | Wait for a notification | Synchronization | `pto.twait` | `TWAIT` |
-
-## Mechanism
-
-Communication operations use a `ParallelGroup` handle (`!pto.group<N>`) to identify the set of participating NPUs. The group defines:
-
-- **Size**: Number of ranks `N` in the parallel group
-- **Root**: The designated NPU for broadcast/scatter operations (typically rank 0)
-- **Tensors**: Per-rank destination/source buffers
-
-### Data Flow
-
-All collective communication operations share a common data flow pattern:
-
-```
-Local GM ──► UB (staging tile) ──► Inter-NPU interconnect ──► UB ──► Local GM
-```
-
-A **staging tile** in UB is always required as an intermediate buffer. For large tensors that exceed the UB tile capacity, the operation automatically performs **2D sliding** — chunking along rows and columns to fit each chunk into the tile, iterating over all outer dimensions.
-
-### Broadcast
-
-All non-root NPUs receive data from the root:
-
-$$ \mathrm{dst}^{(k)} = \mathrm{src}^{(\text{root})} \quad \forall k \in [0, N) $$
-
-Only the root calls `TBROADCAST`. Non-root ranks must ensure their destination buffers are allocated and writable for the duration of the operation.
-
-### Reduce
-
-All ranks contribute data to a reduction operation, with the result delivered to the root:
-
-$$ \mathrm{result}^{(\text{root})} = \bigoplus_{k=0}^{N-1} \mathrm{src}^{(k)} $$
-
-where $\bigoplus$ is the reduction operator (sum, max, min, etc.).
-
-### Scatter/Gather
-
-Scatter distributes slices of the root's data to each rank. Gather collects per-rank data back to the root.
-
-### Point-to-Point (TGET/TPUT)
-
-Point-to-point operations transfer data between two specific NPUs without involving the entire group:
-
-- **`TGET`** (`pto.tget`): Read remote GM → local GM. Data flows from the source NPU to the current NPU.
-- **`TPUT`** (`pto.tput`): Write local GM → remote GM. Data flows from the current NPU to the destination NPU.
-
-Both use a staging tile in UB as the intermediate buffer. For `TGET`, the data path is: `remote GM → staging tile → local GM`. For `TPUT`, the data path is: `local GM → staging tile → remote GM`.
-
-## ParallelGroup Handle
-
-```mlir
-// Define a parallel group of 8 NPUs
-%tensors = "pto.make_group"(%addrs0, %addrs1, ..., %addrs7)
-    : (!pto.memref<f32, 16x16>, ..., !pto.memref<f32, 16x16>) -> !pto.group<8>
-```
-
-In C++, the `ParallelGroup<GTensor>` template manages the group handle. See the per-op pages for C++ usage examples.
-
-## Large Tile Support
-
-When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, transfers are automatically chunked via 2D sliding:
-
-- If `ValidRow` is static, `GetShape(DIM_3)` must be divisible by `ValidRow`
-- If `ValidCol` is static, `GetShape(DIM_4)` must be divisible by `ValidCol`
-- To handle non-divisible cases, use tiles with `DYNAMIC` valid row/column
-
-## Constraints
-
-- Every NPU that calls a collective operation must pass a `ParallelGroup` handle that matches those of the other participants
-- Non-root ranks must not call broadcast/scatter operations
-- Root rank is identified by `parallelGroup.GetRootIdx()`
-- Destination/source tensors are assumed to have the same shape and strides across ranks
-- The staging tile must be pre-allocated in UB at non-overlapping offsets for ping-pong variants
-
-## Cases That Are Not Allowed
-
-- Calling collective operations with mismatched `ParallelGroup` handles across ranks
-- Calling broadcast/scatter on non-root ranks (undefined behavior)
-- Using uninitialized or improperly sized destination buffers
-- Using overlapping UB offsets for ping/pong staging tiles
-
-## See Also
-
-- [Other families](../instruction-families/other-families.md) — Family overview
-- [Other instruction surface](../instruction-surfaces/other-instructions.md) — Surface description
-- [Ordering and Synchronization](../machine-model/ordering-and-synchronization.md) — PTO synchronization model
diff --git a/docs/mkdocs/src/docs/isa/other/communication-and-runtime_zh.md b/docs/mkdocs/src/docs/isa/other/communication-and-runtime_zh.md
deleted file mode 100644
index 31595fb7..00000000
--- a/docs/mkdocs/src/docs/isa/other/communication-and-runtime_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Communication And Runtime
-
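-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](communication-and-runtime.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA is already in place, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.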
diff --git a/docs/mkdocs/src/docs/isa/other/non-isa-and-supporting-ops.md b/docs/mkdocs/src/docs/isa/other/non-isa-and-supporting-ops.md
deleted file mode 100644
index 4fda8bd6..00000000
--- a/docs/mkdocs/src/docs/isa/other/non-isa-and-supporting-ops.md
+++ /dev/null
@@ -1,73 +0,0 @@
-<!-- Generated from `docs/isa/other/non-isa-and-supporting-ops.md` -->
-
-# Non-ISA And Supporting Operations
-
-Supporting operations provide convenience semantics over tile sequences, memory allocation, quantization, and random generation. Some expand to multiple core ISA operations on backends that do not implement them natively.
-
-## Operations
-
-| Operation | Description | Category |
-|-----------|-------------|----------|
-| [TALIAS](./TALIAS.md) | Create an alias view of a tile without copying data | Alias |
-| [TAXPY](./TAXPY.md) | Fused multiply-add: `dst = src0 * scalar + src1` | Fused compute |
-| [TCONCAT](./TCONCAT.md) | Concatenate two tiles along a specified dimension | Tile sequence |
-| [TDEQUANT](./TDEQUANT.md) | Dequantize a tile from quantized format | Quantize |
-| [TFREE](./TFREE.md) | Free a previously allocated tile or buffer | Memory |
-| [THISTOGRAM](./THISTOGRAM.md) | Compute histogram of tile values | Statistics |
-| [TPACK](./TPACK.md) | Pack multiple tiles into a single tile buffer | Tile sequence |
-| [TPOP](./TPOP.md) | Population count of predicate mask | Predicate |
-| [TPUSH](./TPUSH.md) | Push count of predicate mask | Predicate |
-| [TRANDOM](./TRANDOM.md) | Fill tile with random values | Generation |
-| [TQUANT](./TQUANT.md) | Quantize a tile to integer format | Quantize |
-
-## Mechanism
-
-### Alias (TALIAS)
-
-Creates a new tile view that references the same underlying storage as the source tile, without copying data. The alias and source share the same UB buffer but may have different shapes, layouts, or valid regions.
-
-### Fused Compute (TAXPY)
-
-Fused multiply-add: `dst = src0 * scalar + src1`. This is a convenience operation that may be implemented as a single hardware instruction or expanded to `TMUL` + `TADD`.
-
-### Tile Sequence (TCONCAT, TPACK)
-
-`TCONCAT` concatenates two tiles along a specified axis. `TPACK` packs multiple tiles into a single buffer for storage.
-
-### Quantization (TQUANT, TDEQUANT)
-
-Convert between floating-point and quantized low-precision representations. Quantized formats include INT8, UINT8, INT4, UINT4, FP4, FP8, etc.
-
-Quantization (`TQUANT`) requires scale and zero-point tensors:
-
-$$ \mathrm{dst} = \mathrm{round}(\mathrm{src} \times \mathrm{scale} + \mathrm{zero\_point}) $$
-
-### Memory (TFREE)
-
-Free a previously allocated tile or buffer. The freed storage may be reused by subsequent allocations.
-
-### Predicate (TPOP, TPUSH)
-
-`TPOP` computes the population count (number of set bits) in a predicate mask. `TPUSH` computes the push count (number of leading zeros before the first set bit).
-
-### Generation (TRANDOM)
-
-Fill a tile with random values from a specified distribution.
-
-## Constraints
-
-- Quantization requires a valid (non-zero) scale and a zero-point within the representable range
-- `TFREE` must not be called on a tile that is still in use
-- Tile concatenation requires compatible dimensions along the concatenation axis
-- `TAXPY` may be expanded to separate operations on backends that do not implement it natively
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** use `TFREE` on a tile still in use by another operation
-- **MUST NOT** use an invalid (zero) scale or an out-of-range zero-point for quantization
-- **MUST NOT** rely on `TAXPY` being a single hardware instruction on all backends
-
-## See Also
-
-- [Other families](../instruction-families/other-families.md) — Family overview
-- [Other instruction surface](../instruction-surfaces/other-instructions.md) — Surface description
diff --git a/docs/mkdocs/src/docs/isa/other/non-isa-and-supporting-ops_zh.md b/docs/mkdocs/src/docs/isa/other/non-isa-and-supporting-ops_zh.md
deleted file mode 100644
index 832397e0..00000000
--- a/docs/mkdocs/src/docs/isa/other/non-isa-and-supporting-ops_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Non-ISA And Supporting Operations
-
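-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](non-isa-and-supporting-ops.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA is already in place, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.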
diff --git a/docs/mkdocs/src/docs/isa/programming-model/README_zh.md b/docs/mkdocs/src/docs/isa/programming-model/README_zh.md
deleted file mode 100644
index 03530b2e..00000000
--- a/docs/mkdocs/src/docs/isa/programming-model/README_zh.md
+++ /dev/null
@@ -1,23 +0,0 @@
-<!-- Generated from `docs/isa/programming-model/README_zh.md` -->
-
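-# Programming Model
-
-This chapter describes the objects a programmer reasons about and manipulates in PTO: tiles, valid regions, GlobalTensor, and the execution models of Auto mode and Manual mode.
-
-## In This Chapter
-
-- [Tiles and Valid Regions](programming-model/tiles-and-valid-regions.md): tile types and roles, and the concept and constraints of valid regions
-- [GlobalTensor and Data Movement](programming-model/globaltensor-and-data-movement.md): GlobalTensor views and data movement between GM and tiles
-- [Auto Mode vs Manual Mode](programming-model/auto-vs-manual.md): a comparison of the two execution modes and when each applies
-
-## Suggested Reading Order
-
-Read the pages in the following order:
-
-1. Start with [Tiles and Valid Regions](programming-model/tiles-and-valid-regions.md) to understand PTO's core abstraction.
-2. Then read [GlobalTensor and Data Movement](programming-model/globaltensor-and-data-movement.md) to understand how data flows into and out of tiles.
-3. Finish with [Auto Mode vs Manual Mode](programming-model/auto-vs-manual.md) to choose a suitable execution mode.
-
-## Where This Chapter Fits
-
-This chapter is Chapter 2 of the manual. Understand the programming model before moving on to instruction details, because every PTO instruction is built around tiles and valid regions.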
diff --git a/docs/mkdocs/src/docs/isa/programming-model/auto-vs-manual.md b/docs/mkdocs/src/docs/isa/programming-model/auto-vs-manual.md
deleted file mode 100644
index b30bb9ce..00000000
--- a/docs/mkdocs/src/docs/isa/programming-model/auto-vs-manual.md
+++ /dev/null
@@ -1,141 +0,0 @@
-<!-- Generated from `docs/isa/programming-model/auto-vs-manual.md` -->
-
-# Auto Vs Manual
-
-PTO supports both Auto and Manual programming styles because they solve different problems. The ISA manual describes the shared architecture contract; this page explains how each mode delegates responsibilities between author and tooling, and which audience benefits most from each.
-
-## Audience Decision Tree
-
-```
-Who are you?
-│
-├─ Compiler / toolchain developer ──► Auto mode is a contract your tool must implement
-│                                      Manual mode is what your tool emits for users
-│
-└─ Kernel author ──► Do you need precise pipeline control?
-                       │
-                       ├─ YES ──► Manual mode
-                       │          (explicit TASSIGN, TSYNC, set_flag/wait_flag)
-                       │
-                       └─ NO  ──► Auto mode
-                                   (compiler/runtime manages placement and scheduling)
-```
-
-## Auto Mode
-
-In Auto mode, the compiler or runtime infrastructure inserts resource binding (`TASSIGN`) and synchronization (`TSYNC`) automatically. The author writes only the loads, the compute payload, and the stores.
-
-### What the Author Writes
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void vec_add(Tile<float, 16, 16>& c,
-             const GlobalTensor<float>& ga,
-             const GlobalTensor<float>& gb,
-             const GlobalTensor<float>& gc) {
-    Tile<float, 16, 16> a, b;
-    TLOAD(a, ga);   // compiler inserts TASSIGN before TLOAD
-    TLOAD(b, gb);   // compiler inserts TASSIGN before TLOAD
-    TADD(c, a, b);  // compiler inserts TSYNC between TLOAD and TADD
-    TSTORE(gc, c);  // compiler inserts TSYNC between TADD and TSTORE
-}
-```
-
-### What the Compiler/Runtime Inserts
-
-```
-TASSIGN(a, @tile(slot))   // auto-assign tile buffer address
-TSYNC()                    // sync before next producer
-TLOAD(a, ga)
-TSYNC()
-TASSIGN(b, @tile(slot))
-TSYNC()
-TLOAD(b, gb)
-TSYNC()
-TADD(c, a, b)
-TSYNC()
-TSTORE(gc, c)
-```
-
-Auto mode does NOT change PTO ISA semantics. The inserted operations are standard PTO operations, not backend-specific magic.
-
-### Constraints
-
-- The compiler must ensure that auto-inserted operations satisfy the same legality rules as explicit operations
-- Auto mode assumes the tile shape and valid region are fully determined at compile time
-- `ptoas` can insert synchronization automatically via the `--enable-insert-sync` flag
-
-## Manual Mode
-
-In Manual mode, the author explicitly binds tile resources and manages synchronization. This gives full control over tile placement, double-buffering, and pipeline overlap.
-
-### What the Author Writes
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void vec_add_manual(Tile<float, 16, 16>& c,
-                    const GlobalTensor<float>& ga,
-                    const GlobalTensor<float>& gb, const GlobalTensor<float>& gc) {
-    Tile<float, 16, 16> a, b;
-    TASSIGN(a, 0x1000);        // explicit tile buffer address
-    TASSIGN(b, 0x2000);
-    TASSIGN(c, 0x3000);
-    TLOAD(a, ga);
-    TLOAD(b, gb);
-    TSYNC();                    // explicit synchronization
-    TADD(c, a, b);
-    TSYNC();
-    TSTORE(gc, c);
-}
-```
-
-### Double-Buffering Example
-
-Manual mode enables double-buffering — overlapping DMA and compute on alternating tile slots:
-
-```cpp
-// Tile slot 0 and slot 1 alternate between compute and DMA
-TASSIGN(tile[0], 0x1000);
-TASSIGN(tile[1], 0x2000);
-
-// Iteration i: compute on slot 0, DMA-load next tile on slot 1
-TLOAD(tile[1], gm_next);       // start DMA for next iteration
-set_flag(PIPE_MTE2, PIPE_V, ID0);
-wait_flag(PIPE_MTE2, PIPE_V, ID0);
-TADD(c, tile[0], src[0]);      // compute on current tile
-TSTORE(gm_out, c);
-TSYNC();
-```
-
-## Shared Contract
-
-Both modes still share the same ISA contract:
-
-| Aspect | Auto | Manual |
-|--------|------|--------|
-| PTO ISA semantics | Identical | Identical |
-| Valid-region rules | Same | Same |
-| Movement semantics (TLOAD/TSTORE) | Same | Same |
-| Synchronization contract | Compiler-inserted | Author-controlled |
-| Resource binding | Compiler-inserted | Author-controlled |
-| Tile type/layout constraints | Same | Same |
-| Target profile restrictions | Same | Same |
-
-## Cases That Are Not Allowed
-
-- Documenting Auto mode as if it made illegal programs legal
-- Treating Manual mode details as global guarantees for every PTO program
-- Treating Auto and Manual as separate ISAs instead of two ways to author PTO programs
-- Relying on auto-inserted synchronization in code that requires precise pipeline ordering (use Manual instead)
-
-## See Also
-
-- [Execution Agents And Target Profiles](../machine-model/execution-agents.md)
-- [Synchronization And Ordering](../machine-model/ordering-and-synchronization.md)
-- [Portability And Target Profiles](../reference/portability-and-target-profiles.md)
-- [GlobalTensor And Data Movement](./globaltensor-and-data-movement.md)
-- [Tile Sync And Config](../tile/sync-and-config.md)
diff --git a/docs/mkdocs/src/docs/isa/programming-model/auto-vs-manual_zh.md b/docs/mkdocs/src/docs/isa/programming-model/auto-vs-manual_zh.md
deleted file mode 100644
index e5e539ee..00000000
--- a/docs/mkdocs/src/docs/isa/programming-model/auto-vs-manual_zh.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# Auto Vs Manual
-
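-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](auto-vs-manual.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: State and Types](../../../manual/03-state-and-types_zh.md)
-- [Chinese chapter manual: Tiles and GlobalTensor](../../../manual/04-tiles-and-globaltensor_zh.md)
-- [Chinese chapter manual: Programming Guide](../../../manual/08-programming_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page in the Chinese navigation keeps you under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.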
diff --git a/docs/mkdocs/src/docs/isa/programming-model/globaltensor-and-data-movement.md b/docs/mkdocs/src/docs/isa/programming-model/globaltensor-and-data-movement.md
deleted file mode 100644
index 6cc896b4..00000000
--- a/docs/mkdocs/src/docs/isa/programming-model/globaltensor-and-data-movement.md
+++ /dev/null
@@ -1,235 +0,0 @@
-<!-- Generated from `docs/isa/programming-model/globaltensor-and-data-movement.md` -->
-
-# GlobalTensor And Data Movement
-
-PTO does not hide movement between global memory and local execution state. `GlobalTensor` is the architecture-visible GM-facing object, and movement operations define when data enters or leaves the local payload surfaces. This page describes the GM-facing types and the complete data paths to tile buffers and vector registers.
-
-## GlobalTensor
-
-### GlobalTensor Template Signature
-
-```
-GlobalTensor<DType, Shape, Stride, Layout>
-```
-
-| Parameter | Type | Description |
-|-----------|------|-------------|
-| `DType` | C++ type | Element type matching the target tile |
-| `Shape` | `Shape<ND>()` | N-dimensional shape: `Shape<B, H, W, R, C>` |
-| `Stride` | `Stride<ND>()` | Per-dimension strides in elements |
-| `Layout` | enum | Memory layout: `ND` (row-major), `DN` (col-major), `NZ` (row-major fractal) |
-
-`GlobalTensor` represents a view of `__gm__` (off-chip device) memory. It is not the storage itself — it is a descriptor that pairs a pointer with shape and stride metadata.
-
-### GlobalTensor vs PartitionTensorView
-
-Three GM-facing types appear in PTO programs:
-
-| Type | Description | Usage |
-|------|-------------|-------|
-| `GlobalTensor` | C++ API type; wraps a `__gm__ T*` with shape/stride | C++ kernel code |
-| `!pto.partition_tensor_view<MxNxdtype>` | SSA/IR type; GM partition descriptor | PTO-AS and MLIR IR |
-| `!pto.memref<dtype, Nd>` | MLIR standard memref | Lowered form |
-
-The `partition_tensor_view` describes a sub-partition of GM visible to a specific block or sub-block. Its shape is always 5D: `(B, H, W, R, C)` — batch, height, width, tile rows, tile columns.
-
-### Supported Layouts
-
-| Layout | Stride Pattern | Description |
-|--------|---------------|-------------|
-| `ND` (default) | `stride[R] = C, stride[W] = R*C, stride[H] = W*R*C, ...` | Row-major, C-contiguous |
-| `DN` | `stride[C] = B, stride[R] = B*C, stride[W] = B*C*R, ...` | Column-major, Fortran-contiguous |
-| `NZ` | Row-major fractal stride | Used with fractal tile layouts |
-
-## Tile Surface Data Path
-
-The tile surface (`pto.t*`) moves data between GM and tile buffers through MTE2/MTE3:
-
-```
-GM
-  │
-  │  copy via DMA engine
-  ▼
-UB (Unified Buffer, 256 KB on-chip)
-  │
-  │  implicit tile buffer fill
-  ▼
-Tile Buffer  ──►  Tile Compute  ──►  Tile Buffer
-                                     │
-                                     │  copy via DMA engine
-                                     ▼
-                                   GM
-```
-
-### TLOAD
-
-`TLOAD` moves data from a `GlobalTensor` into a tile buffer:
-
-```
-dst[i, j] = src[ r0 + i, c0 + j ]
-```
-
-Where `r0` and `c0` are the base offsets derived from the `GlobalTensor` shape/stride and the tile's declared valid region `(Rv, Cv)`.
-
-**Transfer size**: `TLOAD` transfers exactly `dst.GetValidRow() × dst.GetValidCol()` elements.
-
-**Constraints**:
-- Source dtype size MUST equal destination dtype size.
-- Layout compatibility MUST be satisfied:
-  - `TileType::Vec`: ND→ND, DN→DN, NZ→NZ
-  - `TileType::Mat`: ND→ND, DN→DN, NZ→NZ, ND→NZ, DN→ZN
-
-### TSTORE
-
-`TSTORE` moves data from a tile buffer to a `GlobalTensor`:
-
-```
-dst[ r0 + i, c0 + j ] = src[i, j]
-```
-
-Where `i ∈ [0, src.GetValidRow())`, `j ∈ [0, src.GetValidCol())`.
-
-**Transfer size**: `TSTORE` transfers exactly `src.GetValidRow() × src.GetValidCol()` elements.
-
-### Atomic Store Variants
-
-`TSTORE` supports atomic store modes via the `AtomicType` attribute:
-
-| AtomicType | Behavior |
-|------------|----------|
-| `AtomicNone` | Normal store (overwrite) |
-| `AtomicAdd` | Atomic add to GM location |
-| `AtomicMax` | Atomic max |
-| `AtomicMin` | Atomic min |
-
-## Vector Surface Data Path
-
-The vector surface (`pto.v*`) requires an explicit GM↔UB DMA step before vector loads and after vector stores:
-
-```
-GM
-  │
-  │  copy_ubuf_to_gm / copy_gm_to_ubuf (DMA, MTE2/MTE3)
-  ▼
-UB (Unified Buffer, 256 KB on-chip)
-  │
-  │  vlds / vsld / vgather2 (vector load, from UB to vreg)
-  ▼
-Vector Registers  ──►  Vector Compute  ──►  Vector Registers
-                                                │
-                                                │  vsts / vsst / vscatter (vector store)
-                                                ▼
-                                            UB ──► GM
-```
-
-### DMA Copy Operations
-
-The following scalar/control operations configure and execute GM↔UB DMA:
-
-| Operation | Direction | Description |
-|-----------|-----------|-------------|
-| `copy_gm_to_ubuf` | GM → UB | Move data from GM to UB staging area |
-| `copy_ubuf_to_gm` | UB → GM | Move data from UB to GM |
-| `copy_ubuf_to_ubuf` | UB → UB | Copy within UB (e.g., double-buffering) |
-
-These are `pto.*` control-surface operations. They do NOT implicitly synchronize — a `set_flag`/`wait_flag` sequence or explicit `TSYNC` is required before the data is consumed by subsequent vector compute.
-
-### Vector Load/Store (pto.v*)
-
-After DMA staging, `vlds`/`vsld` bring data from UB into vector registers, and `vsts`/`vsst` write data from vector registers back to UB:
-
-| Operation | Path | Description |
-|-----------|------|-------------|
-| `vlds` | UB → vreg | Standard vector load with distribution mode |
-| `vsts` | vreg → UB | Standard vector store |
-| `vgather2` | UB → vreg | Strided/gather load from UB |
-| `vscatter` | vreg → UB | Strided/scatter store to UB |
-
-**Distribution modes** (for `vlds`):
-
-| Mode | Meaning |
-|------|---------|
-| `NORM` | Contiguous 256-byte load |
-| `BRC_B8/B16/B32` | Broadcast: all lanes read the same address |
-| `US_B8/B16` | Upsample: duplicate every Nth element |
-| `DS_B8/B16` | Downsample: keep every Nth element |
-| `UNPK_B8/B16/B32` | Unpack: zero-extend to wider type |
-| `DINTLV_B32` | Deinterleave: extract even/odd lanes |
-| `SPLT2CHN_B8/B16` | Split 2-channel |
-| `SPLT4CHN_B8` | Split 4-channel (RGBA→R) |
-
-## MTE Pipeline
-
-The DMA engine uses three sub-units that operate in a pipeline:
-
-| MTE | Direction | Role in Tile Surface | Role in Vector Surface |
-|-----|-----------|---------------------|----------------------|
-| `MTE1` | GM → UB | Optional: explicit prefetch | Pre-stage data before vector load |
-| `MTE2` | GM → UB | Load staging: GM→UB→tile buffer (via TLOAD) | DMA copy: GM→UB (via `copy_gm_to_ubuf`) |
-| `MTE3` | Tile → GM | Store: tile→UB→GM (via TSTORE) | DMA copy: UB→GM (via `copy_ubuf_to_gm`) |
-
-MTE1, MTE2, and MTE3 can operate in parallel with the Vector Pipeline and Matrix Multiply Unit when proper `set_flag`/`wait_flag` synchronization is used.
-
-## Constraints
-
-- Movement legality depends on source surface, destination surface, layout, and target profile.
-- Movement ops do not erase valid-region rules; they carry or define them.
-- Vector-surface loads and stores obey their own buffer/register rules and are NOT interchangeable with tile movement.
-- DMA copy operations require explicit synchronization before their data is consumed by vector compute.
-- `TLOAD`/`TSTORE` carry valid-region information implicitly; the transfer size is determined by the destination/source tile's valid region.
-
-## Cases That Are Not Allowed
-
-- Documenting data movement as though it were implicit when the ISA requires an explicit move.
-- Assuming vector-buffer traffic and tile-buffer traffic share the same legality contract.
-- Silently relying on target-specific movement shortcuts as if they were architecture-wide.
-- Issuing a `vlds` before the corresponding `copy_gm_to_ubuf` has completed without an intervening `set_flag`/`wait_flag`.
-
-## Examples
-
-### Tile Surface: Elementwise Add
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void vec_add(Tile<float, 16, 16>& c, const GlobalTensor<float>& ga,
-             const GlobalTensor<float>& gb, const GlobalTensor<float>& gc) {
-    Tile<float, 16, 16> a, b;
-    TLOAD(a, ga);           // GM → UB → Tile Buffer A
-    TLOAD(b, gb);           // GM → UB → Tile Buffer B
-    TADD(c, a, b);          // c = a + b, iterated over c's valid region
-    TSTORE(gc, c);          // Tile Buffer C → UB → GM
-}
-```
-
-### Vector Surface: Fine-Grained Vector Load/Store
-
-```c
-// 1. DMA copy from GM to UB staging area
-copy_gm_to_ubuf(%ub_ptr, %gm_ptr, %sid, %n_burst, %len_burst, %stride_dst, %stride_src);
-
-// 2. Signal Vector pipe that data is ready
-set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
-
-// 3. Wait for data, then vector load
-wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
-%vreg = pto.vlds %ub[%offset] {dist = "NORM"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>;
-
-// 4. Vector compute
-%result = pto.vadd %vreg, %vreg : !pto.vreg<64xf32> -> !pto.vreg<64xf32>;
-
-// 5. Vector store
-pto.vsts %result, %ub_out[%offset] : !pto.vreg<64xf32>, !pto.ptr<f32, ub> -> ();
-
-// 6. DMA copy from UB back to GM
-copy_ubuf_to_gm(%ub_out, %gm_out, %sid, %n_burst, %len_burst, %reserved, %stride_dst, %stride_src);
-```
-
-## See Also
-
-- [Tiles And Valid Regions](./tiles-and-valid-regions.md)
-- [Vector Instruction Surface](../instruction-surfaces/vector-instructions.md)
-- [Tile Memory And Data Movement Families](../tile/memory-and-data-movement.md)
-- [Vector Load/Store Reference](../vector/vector-load-store.md)
-- [Scalar DMA Copy Reference](../scalar/dma-copy.md)
diff --git a/docs/mkdocs/src/docs/isa/programming-model/globaltensor-and-data-movement_zh.md b/docs/mkdocs/src/docs/isa/programming-model/globaltensor-and-data-movement_zh.md
deleted file mode 100644
index 3c5bcde5..00000000
--- a/docs/mkdocs/src/docs/isa/programming-model/globaltensor-and-data-movement_zh.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# GlobalTensor And Data Movement
-
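-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](globaltensor-and-data-movement.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: State and Types](../../../manual/03-state-and-types_zh.md)
-- [Chinese chapter manual: Tiles and GlobalTensor](../../../manual/04-tiles-and-globaltensor_zh.md)
-- [Chinese chapter manual: Programming Guide](../../../manual/08-programming_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page in the Chinese navigation keeps you under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.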
diff --git a/docs/mkdocs/src/docs/isa/programming-model/tiles-and-valid-regions.md b/docs/mkdocs/src/docs/isa/programming-model/tiles-and-valid-regions.md
deleted file mode 100644
index 33840262..00000000
--- a/docs/mkdocs/src/docs/isa/programming-model/tiles-and-valid-regions.md
+++ /dev/null
@@ -1,202 +0,0 @@
-<!-- Generated from `docs/isa/programming-model/tiles-and-valid-regions.md` -->
-
-# Tiles And Valid Regions
-
-Tiles are the primary payload objects in PTO. Most `pto.t*` semantics are defined over tiles, which is why tile shape, layout, location role, and valid-region metadata are architecture-visible.
-
-Real kernels rarely fill an entire physical rectangle: edge tiles, partial blocks, and padding are normal. If the ISA pretended that every element of the stored rectangle were meaningful, backends and authors would silently disagree about results. PTO instead carries **valid rows and columns** (`Rv`, `Cv`) so legality and semantics are defined on the meaningful domain first.
-
-## Mechanism
-
-### Tile Template Signature
-
-A PTO tile is declared with the following template parameters:
-
-```
-Tile<TileType, DType, Rows, Cols, BLayout, SLayout, Fractal, Pad>
-```
-
-| Parameter | Type | Description |
-|-----------|------|-------------|
-| `TileType` | enum | Storage role: `Vec`, `Mat`, `Acc`, `Scalar`, `Left`, `Right` |
-| `DType` | C++ type | Element type: `half`, `bfloat16_t`, `float`, `int8_t`, etc. |
-| `Rows` | positive integer | Physical row count of the tile buffer |
-| `Cols` | positive integer | Physical column count of the tile buffer |
-| `BLayout` | enum | Block layout: `RowMajor` (C-contiguous) or `ColMajor` (Fortran-contiguous) |
-| `SLayout` | enum | Stripe layout: `NoneBox` (uniform rectangular), `RowMajor` (fractal/strided), `ColMajor` (fractal/strided) |
-| `Fractal` | enum | Fractal encoding: `None`, `NZ`, `ZN`, `FR`, `RN` (valid only when `SLayout != NoneBox`) |
-| `Pad` | enum | Padding value for out-of-valid-region elements: `Zero`, `Null`, `Invalid` |
-
-### TileType
-
-Every tile buffer carries a `TileType` that determines which execution pipeline processes it:
-
-| TileType | Pipeline | Typical Use |
-|----------|----------|-------------|
-| `Vec` | Vector Pipe (V) | General elementwise, unary, binary, reduce operations |
-| `Mat` | Matrix/CUBE Pipe (M) | Matmul input operands (`TMATMUL`, `TGEMV`) |
-| `Acc` | Matrix Pipe (accumulator) | Matmul output accumulator; may accumulate across iterations |
-| `Scalar` | Scalar Unit | Scalar tile with `Rows = Cols = 1` |
-| `Left` | Matrix Pipe | Left-hand operand of `TMATMUL_MX` (A matrix, must be `NZ` layout) |
-| `Right` | Matrix Pipe | Right-hand operand of `TMATMUL_MX` (B matrix, must be `NN` layout) |
-
-### Valid Region
-
-The valid region is the architecture-visible statement of which elements are meaningful. It is expressed as a pair `(Rv, Cv)` — valid rows and valid columns — accessible at runtime via `tile.GetValidRow()` and `tile.GetValidCol()`.
-
-**Semantics**: For any tile operation, `dst[i, j]` is defined if and only if `0 ≤ i < dst.Rv` and `0 ≤ j < dst.Cv`. Elements outside this domain have **no architectural meaning** unless a specific instruction page explicitly defines their behavior.
-
-**Formula**:
-```
-Domain(dst) = { (i, j) | 0 ≤ i < dst.Rv  and  0 ≤ j < dst.Cv }
-```
-
-**Per-instruction iteration domain**: Unless a specific instruction states otherwise, the iteration domain is the **destination tile's valid region**:
-```
-for i in [0, dst.Rv):
-    for j in [0, dst.Cv):
-        dst[i, j] = f(src0[i, j], src1[i, j], ...)
-```
-For source tiles, `src[i, j]` is read regardless of whether `(i, j)` falls within the source's own valid region; the value read for out-of-region lanes is **implementation-defined**.
-
-### Block Layout (BLayout)
-
-`BLayout` describes the in-memory stride between adjacent elements in the row and column directions. Full reference: [Layout Reference](../state-and-types/layout.md).
-
-| BLayout | Stride in Row Direction | Stride in Col Direction |
-|---------|------------------------|------------------------|
-| `RowMajor` (default) | `Cols` (elements per row) | `1` (contiguous in memory) |
-| `ColMajor` | `1` (contiguous in memory) | `Rows` (elements per column) |
-
-`RowMajor` is the CPU/GPU conventional layout (row 0 is contiguous in memory). `ColMajor` is the Fortran/matrix-convention layout (column 0 is contiguous).
-
-### Stripe Layout (SLayout)
-
-`SLayout` describes whether the tile's sub-elements use a uniform rectangular layout or a fractal/strided layout:
-
-| SLayout | Description | Requires |
-|---------|-------------|----------|
-| `NoneBox` | Uniform rectangular tile: all elements equally spaced | Default for most ops |
-| `RowMajor` | Strided/fractal row layout | Fractal encoding (`NZ`, `FR`) |
-| `ColMajor` | Strided/fractal column layout | Fractal encoding (`ZN`, `RN`) |
-
-### Fractal Layout
-
-When `SLayout != NoneBox`, the `Fractal` parameter encodes the striding pattern for matrix multiplication or other strided access patterns:
-
-| Fractal | Layout | Typical Use |
-|---------|--------|-------------|
-| `None` | Not fractal (standard rectangular tile) | Elementwise ops, general compute |
-| `NZ` | Row-major fractal (`NZ` = Z-order row-major) | LHS matmul operand on A5; `SLayout::RowMajor` |
-| `ZN` | Column-major fractal | Symmetric variant of NZ |
-| `FR` | Row-fractal | CUBE-specific strided pattern |
-| `RN` | Row-N-fractal | CUBE-specific strided pattern |
-
-### Layout Combinations by TileType
-
-| TileType | Supported BLayout | Supported SLayout | Supported Fractal | Typical Ops |
-|----------|------------------|-------------------|-------------------|-------------|
-| `Vec` | `RowMajor`, `ColMajor` | `NoneBox` | `None` | `TADD`, `TMUL`, `TCVT`, `TLOAD/TSTORE` |
-| `Mat` | `RowMajor`, `ColMajor` | `NoneBox` | `None` | `TGEMV`, `TGEMV_ACC`, `TGEMV_BIAS` |
-| `Acc` | `RowMajor`, `ColMajor` | `NoneBox` | `None` | `TMATMUL`, `TMATMUL_ACC` output |
-| `Left` | `RowMajor` | `RowMajor` | `NZ` | LHS of `TMATMUL_MX` |
-| `Right` | `RowMajor` | `NoneBox` | `NN` (implicit) | RHS of `TMATMUL_MX` |
-| `Scalar` | `RowMajor` | `NoneBox` | `None` | Single-element scalar tiles |
-
-### Padding
-
-Elements outside the valid region may be initialized with a padding value. The `Pad` parameter controls this:
-
-| Pad Value | Meaning |
-|-----------|---------|
-| `Zero` | Out-of-valid-region elements are initialized to zero |
-| `Null` | Out-of-valid-region elements are undefined; must not be read |
-| `Invalid` | Elements are marked invalid; reading is undefined |
-
-## Compact Mode
-
-When a tile's physical dimensions exceed the valid region (common at matrix edges), compact mode determines how padding elements are handled. This is especially important for matmul and `TEXTRACT`/`TINSERT` operations.
-
-### Compact Mode in TEXTRACT
-
-`TEXTRACT` supports four compact modes for layout conversion between normal and fractal tiles:
-
-| Mode | Description |
-|------|-------------|
-| `ND2NZ` | Normal row-major → NZ fractal. Valid data is packed contiguously in Z-order; padding is excluded. |
-| `NZ2ND` | NZ fractal → Normal row-major. Valid data is unpacked from Z-order to row-major. |
-| `ND` | Straight copy, no layout transformation. |
-| `ND2NZ2` | Like `ND2NZ` but groups rows in blocks of 2 for specific CUBE access patterns. |
-
-### Compact Mode in TMATMUL_MX
-
-For MX-format matmul, the Left tile uses NZ fractal layout with compact addressing. When `M % tile_M ≠ 0` or `N % tile_N ≠ 0`, the fractal address generator produces addresses only for valid rows, excluding padding from CUBE processing.
-
-## Inputs
-
-The programming model expects the author or the frontend to supply:
-
-- tiles with a legal type and layout combination (see layout combinations table above)
-- valid-row and valid-column information when edge tiles or partial tiles exist
-- instruction operands whose tile roles make sense together (e.g., `Left` + `Right` → `Acc` for matmul)
-
-## Expected Outputs
-
-Tile-producing operations yield a destination tile whose payload, valid region, and legality are defined by the selected instruction family and the interaction of the input tiles. The destination `TileType` and layout must be compatible with the instruction.
-
-## Constraints
-
-- Semantics are defined only inside the declared valid region unless an instruction page says otherwise
-- Multi-input tile ops iterate over the **destination** valid region, reading source tiles lane-by-lane at the corresponding indices regardless of the source's own valid region (implementation-defined values for out-of-region source lanes)
-- A legal tile type is not enough by itself; shape, layout, location intent, and target profile also matter
-- The combination of `TileType`, `BLayout`, `SLayout`, and `Fractal` MUST match one of the supported combinations in the layout table above
-
-## Cases That Are Not Allowed
-
-- Treating out-of-valid-region elements as architecturally meaningful data
-- Assuming every backend will silently repair mismatched valid-region use
-- Using tile roles or layouts that a family or target profile does not permit
-- Relying on any specific implementation-defined value from a source tile lane outside its valid region
-
-## Examples
-
-### Example 1: Edge Tile with Partial Valid Region
-
-An edge tile may have a physical shape of `16 x 16` while only `5 x 9` values are valid:
-
-```cpp
-using EdgeTile = Tile<TileType::Vec, half, 16, 16, RowMajor, NoneBox, None, Zero>;
-EdgeTile tile;
-tile.SetValidRegion(5, 9);
-// Only tile[0..4][0..8] is architecturally meaningful
-```
-
-### Example 2: Matmul Tile Roles
-
-```cpp
-using A = Tile<TileType::Left, int8_t, 16, 16, RowMajor, RowMajor, NZ, Null>;
-using B = Tile<TileType::Right, int8_t, 16, 16, RowMajor, NoneBox, NN, Null>;
-using C = Tile<TileType::Acc, int32_t, 16, 16, RowMajor, NoneBox, None, Zero>;
-A a; B b; C c;
-TMATMUL(c, a, b);  // c[i,j] = sum_k a[i,k] * b[k,j]
-```
-
-### Example 3: Elementwise Operation Over Valid Region
-
-```cpp
-// TADD iterates over dst's valid region:
-// for i in [0, dst.Rv), for j in [0, dst.Cv):
-//     dst[i,j] = src0[i,j] + src1[i,j]
-Tile<TileType::Vec, float, 16, 16> dst, src0, src1;
-dst.SetValidRegion(8, 8);
-// Only dst[0..7][0..7], src0[0..7][0..7], src1[0..7][0..7] participate
-TADD(dst, src0, src1);
-```
-
-## See Also
-
-- [Introduction: what PTO is](../introduction/what-is-pto-visa.md)
-- [GlobalTensor And Data Movement](./globaltensor-and-data-movement.md)
-- [Type System](../state-and-types/type-system.md)
-- [Layout Reference](../state-and-types/layout.md)
-- [Tile Instruction Surface](../instruction-surfaces/tile-instructions.md)
diff --git a/docs/mkdocs/src/docs/isa/programming-model/tiles-and-valid-regions_zh.md b/docs/mkdocs/src/docs/isa/programming-model/tiles-and-valid-regions_zh.md
deleted file mode 100644
index 7c5d3668..00000000
--- a/docs/mkdocs/src/docs/isa/programming-model/tiles-and-valid-regions_zh.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# Tiles And Valid Regions
-
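-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tiles-and-valid-regions.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: State and Types](../../../manual/03-state-and-types_zh.md)
-- [Chinese chapter manual: Tiles and GlobalTensor](../../../manual/04-tiles-and-globaltensor_zh.md)
-- [Chinese chapter manual: Programming Guide](../../../manual/08-programming_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page in the Chinese navigation keeps you under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.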
diff --git a/docs/mkdocs/src/docs/isa/reference/README.md b/docs/mkdocs/src/docs/isa/reference/README.md
deleted file mode 100644
index 43d3605f..00000000
--- a/docs/mkdocs/src/docs/isa/reference/README.md
+++ /dev/null
@@ -1,11 +0,0 @@
-<!-- Generated from `docs/isa/reference/README.md` -->
-
-# Reference Notes
-
-These notes support the main PTO ISA manual.
-
-- [Format Of Instruction Descriptions](./format-of-instruction-descriptions.md)
-- [Glossary](./glossary.md)
-- [Diagnostics And Illegal Cases](./diagnostics-and-illegal-cases.md)
-- [Portability And Target Profiles](./portability-and-target-profiles.md)
-- [Source Of Truth](./source-of-truth.md)
diff --git a/docs/mkdocs/src/docs/isa/reference/README_zh.md b/docs/mkdocs/src/docs/isa/reference/README_zh.md
deleted file mode 100644
index 843d6eb2..00000000
--- a/docs/mkdocs/src/docs/isa/reference/README_zh.md
+++ /dev/null
@@ -1,17 +0,0 @@
-<!-- Generated from `docs/isa/reference/README_zh.md` -->
-
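-# Reference Notes
-
-These notes support the main PTO ISA manual and cover format conventions, the glossary, diagnostics, portability, and the source of truth.
-
-## In This Chapter
-
-- [Format of Instruction Descriptions](reference/format-of-instruction-descriptions.md): the standard format for per-op pages
-- [Glossary](reference/glossary.md): definitions of key terms in the PTO ISA
-- [Diagnostics and Illegal Cases](reference/diagnostics-and-illegal-cases.md): handling of operation failures and illegal cases
-- [Portability and Target Profiles](reference/portability-and-target-profiles.md): portability of PTO across target profiles
-- [Source of Truth](reference/source-of-truth.md): the authoritative sources of the PTO ISA specification and their precedence
-
-## Chapter Placement
-
-This chapter is Chapter 8 of the manual (a supporting reference chapter) and can be consulted as needed.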
diff --git a/docs/mkdocs/src/docs/isa/reference/diagnostics-and-illegal-cases.md b/docs/mkdocs/src/docs/isa/reference/diagnostics-and-illegal-cases.md
deleted file mode 100644
index 8cada30b..00000000
--- a/docs/mkdocs/src/docs/isa/reference/diagnostics-and-illegal-cases.md
+++ /dev/null
@@ -1,12 +0,0 @@
-<!-- Generated from `docs/isa/reference/diagnostics-and-illegal-cases.md` -->
-
-# Diagnostics And Illegal Cases
-
-PTO manual pages should distinguish:
-
-- type-class errors
-- legality failures
-- target-profile restrictions
-- unsupported behavior that should not be documented as legal
-
-Every family and per-op page should name the cases that are not allowed instead of leaving them implicit.
diff --git a/docs/mkdocs/src/docs/isa/reference/diagnostics-and-illegal-cases_zh.md b/docs/mkdocs/src/docs/isa/reference/diagnostics-and-illegal-cases_zh.md
deleted file mode 100644
index f4f61f9e..00000000
--- a/docs/mkdocs/src/docs/isa/reference/diagnostics-and-illegal-cases_zh.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Diagnostics And Illegal Cases
-
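-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](diagnostics-and-illegal-cases.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page in the Chinese navigation keeps you under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.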
diff --git a/docs/mkdocs/src/docs/isa/reference/format-of-instruction-descriptions.md b/docs/mkdocs/src/docs/isa/reference/format-of-instruction-descriptions.md
deleted file mode 100644
index 757bbd42..00000000
--- a/docs/mkdocs/src/docs/isa/reference/format-of-instruction-descriptions.md
+++ /dev/null
@@ -1,52 +0,0 @@
-<!-- Generated from `docs/isa/reference/format-of-instruction-descriptions.md` -->
-
-# Format Of Instruction Descriptions
-
-This section defines how **per-instruction** and **family** pages in this manual are written. Readers should know what to expect from every opcode page, and authors should keep pages comparable across families.
-
-PTO is **tile-first** and **valid-region-first**. Instruction text always means what happens in the declared valid region unless the page explicitly defines behavior outside it.
-
-## Family Pages
-
-A **family** page (for example sync and config, elementwise tile–tile, vector load/store) states:
-
-- what the family is for, in one short opening section
-- shared legality rules, operand roles, and interaction with valid regions
-- pointers into the per-op pages
-
-Family pages do not need to repeat every opcode; they set the contract for the group.
-
-## Per-Op Pages
-
-Each `pto.*` operation page should make the following easy to find. Section titles may vary if a different shape reads better, but the information should be present.
-
-1. **Name and surface** — Mnemonic (`pto.tadd`, `pto.vlds`, …) and which instruction surface it belongs to (tile, vector, scalar/control).
-
-2. **Summary** — One or two sentences: what the operation does on the meaningful domain.
-
-3. **Mechanism** — Precise mathematical or dataflow description over the valid region (and any documented exceptions).
-
-4. **Syntax** — Reference to PTO-AS spelling where relevant; optional **AS** and **IR** patterns when they help interchange and tooling (many pages use SSA and DPS-style examples).
-
-5. **C++ intrinsic** — When the public C++ API is normative for authors, the `pto_instr.hpp` declaration is cited.
-
-6. **Inputs and outputs** — Operands, including tile roles and immediate operands.
-
-7. **Side effects** — Synchronization edges, configuration state, or “none beyond the destination tile” as appropriate.
-
-8. **Constraints and illegal cases** — What verifiers and backends reject; target-profile narrowing may be called out here or under a dedicated subsection.
-
-9. **Examples** — At least one concrete snippet or pseudocode where it clarifies use.
-
-10. **Related links** — Family overview, neighbors in the nav, and cross-links to the programming or memory model when ordering matters.
-
-## Normative Language
-
-Use **MUST**, **SHOULD**, and **MAY** only for rules that a test, verifier, or review can check. Prefer plain language for explanation.
-
-## See Also
-
-- [Instruction surfaces](../instruction-surfaces/README.md)
-- [Instruction families](../instruction-families/README.md)
-- [Diagnostics and illegal cases](./diagnostics-and-illegal-cases.md)
-- [Document structure](../introduction/document-structure.md)
diff --git a/docs/mkdocs/src/docs/isa/reference/glossary.md b/docs/mkdocs/src/docs/isa/reference/glossary.md
deleted file mode 100644
index 9a759c99..00000000
--- a/docs/mkdocs/src/docs/isa/reference/glossary.md
+++ /dev/null
@@ -1,11 +0,0 @@
-<!-- Generated from `docs/isa/reference/glossary.md` -->
-
-# Glossary
-
-- **PTO VISA**: the architecture-visible virtual ISA contract for PTO.
-- **Target profile**: a target-specific narrowing of the portable PTO ISA surface, such as CPU simulation, A2/A3-class, or A5-class support.
-- **Tile surface**: the `pto.t*` operation surface.
-- **Vector surface**: the `pto.v*` operation surface.
-- **Scalar/control surface**: the `pto.*` supporting scalar, control, and configuration surface.
-- **Valid region**: the subset of a physical tile whose values are architecturally meaningful.
-- **Location intent**: the role or storage intent that affects legality.
diff --git a/docs/mkdocs/src/docs/isa/reference/glossary_zh.md b/docs/mkdocs/src/docs/isa/reference/glossary_zh.md
deleted file mode 100644
index 992865fa..00000000
--- a/docs/mkdocs/src/docs/isa/reference/glossary_zh.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Glossary
-
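-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](glossary.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.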
diff --git a/docs/mkdocs/src/docs/isa/reference/portability-and-target-profiles.md b/docs/mkdocs/src/docs/isa/reference/portability-and-target-profiles.md
deleted file mode 100644
index 5fdcce72..00000000
--- a/docs/mkdocs/src/docs/isa/reference/portability-and-target-profiles.md
+++ /dev/null
@@ -1,24 +0,0 @@
-<!-- Generated from `docs/isa/reference/portability-and-target-profiles.md` -->
-
-# Portability And Target Profiles
-
-PTO is portable at the virtual-ISA level, not at the level of every target-specific optimization or support subset.
-
-## Portable PTO Contract
-
-Portable PTO documentation should describe:
-
-- architecture-visible semantics of legal programs
-- the required synchronization and visibility edges
-- the meaning of tile, vector, scalar/control, and communication surfaces
-
-## Target Narrowing
-
-Target profiles may narrow:
-
-- supported data types
-- supported layouts or tile roles
-- supported vector forms and pipeline features
-- supported performance-oriented or irregular families
-
-These restrictions must be documented as target-profile restrictions, not as redefinitions of PTO itself.
diff --git a/docs/mkdocs/src/docs/isa/reference/portability-and-target-profiles_zh.md b/docs/mkdocs/src/docs/isa/reference/portability-and-target-profiles_zh.md
deleted file mode 100644
index 60c2978f..00000000
--- a/docs/mkdocs/src/docs/isa/reference/portability-and-target-profiles_zh.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Portability And Target Profiles
-
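-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](portability-and-target-profiles.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.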
diff --git a/docs/mkdocs/src/docs/isa/reference/source-of-truth.md b/docs/mkdocs/src/docs/isa/reference/source-of-truth.md
deleted file mode 100644
index 5880233f..00000000
--- a/docs/mkdocs/src/docs/isa/reference/source-of-truth.md
+++ /dev/null
@@ -1,52 +0,0 @@
-<!-- Generated from `docs/isa/reference/source-of-truth.md` -->
-
-# Source Of Truth
-
-Use this order when rewriting or validating PTO ISA documentation:
-
-1. `include/pto/common/pto_instr.hpp` — C++ intrinsic declarations; the public API contract
-2. Current PTO ISA docs in this repo — authoritative prose descriptions
-3. PTO-AS docs ([PTO-AS Specification](../assembly/PTO-AS.md)) — syntax, assembly spelling, assembly-level forms
-4. Older manual prose only as migration background
-
-If a prose source conflicts with the code-visible PTO surface, do not document unsupported behavior as architecture.
-
-## Source Order
-
-When the specification boundary is unclear, use the following order of authority:
-
-1. **PTO ISA manual** and per-op ISA pages — architecture-visible semantics
-2. **Code** (C++ headers, backend implementations) — legal instruction surface
-3. **PTO-AS docs** ([PTO-AS Specification](../assembly/PTO-AS.md)) — syntax, assembly spelling, assembly-level forms
-4. **Target profile notes** — backend-specific narrowing
-
-## Two Compilation Flows
-
-PTO programs flow through the toolchain in two ways. Both paths share the same PTO ISA semantics:
-
-```
-PTO program (.pto text)
-        │
-        ├──► ptoas ──► C++ ──► bisheng ──► binary  (Flow A)
-        │
-        └──► ptoas ──────────────────► binary           (Flow B)
-```
-
-The `ptoas` tool is the authoritative assembler. When documentation describes what "PTO" does, it refers to the semantics defined by PTO ISA, regardless of which flow is used to produce the final binary.
-
-## What the Source Order Means for Authors
-
-- If the manual says an operation is legal and the code rejects it, file a bug — the code should match the manual.
-- If the manual is silent and the code accepts it, the code is authoritative — the manual should be updated.
-- If the manual and the code disagree, the code is authoritative — the manual is wrong.
-- If the manual is silent and the code rejects it, the code is authoritative — the behavior is backend-specific.
-
-## PTOAS as the Authoritative Assembler
-
-`ptoas` is the reference implementation of the PTO assembler. It defines:
-
-- The PTO-AS grammar and syntax
-- The parsing and validation rules
-- The lowering semantics from PTO-AS to C++ or binary
-
-When the PTO ISA manual specifies syntax forms (SSA, DPS), it refers to what `ptoas` accepts. See [PTO-AS Specification](../assembly/PTO-AS.md) for the full grammar reference.
diff --git a/docs/mkdocs/src/docs/isa/reference/source-of-truth_zh.md b/docs/mkdocs/src/docs/isa/reference/source-of-truth_zh.md
deleted file mode 100644
index ed8c8e92..00000000
--- a/docs/mkdocs/src/docs/isa/reference/source-of-truth_zh.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Source Of Truth
-
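-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](source-of-truth.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.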
diff --git a/docs/mkdocs/src/docs/isa/scalar/README.md b/docs/mkdocs/src/docs/isa/scalar/README.md
deleted file mode 100644
index 1d1a3cbf..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/README.md
+++ /dev/null
@@ -1,17 +0,0 @@
-<!-- Generated from `docs/isa/scalar/README.md` -->
-
-# Scalar And Control Reference
-
-This tree documents the `pto.*` scalar/control surface of PTO ISA: synchronization, DMA configuration, predicate-state movement, predicate construction, and the shared scalar source shell around tile and vector payload execution.
-
-The key distinction is architectural role, not only spelling. `pto.*` pages live here when they expose control, DMA, predicate, or other non-payload state directly. When a family exists only to summarize how those forms interact with vector execution, the vector family overviews remain linked as related material rather than acting as the primary per-op reference.
-
-## Families
-
-- [Control and configuration](./control-and-configuration.md)
-- [Pipeline sync](./pipeline-sync.md)
-- [DMA copy](./dma-copy.md)
-- [Predicate load store](./predicate-load-store.md)
-- [Predicate generation and algebra](./predicate-generation-and-algebra.md)
-- [Shared scalar arithmetic](./shared-arith.md)
-- [Shared structured control flow](./shared-scf.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/README_zh.md b/docs/mkdocs/src/docs/isa/scalar/README_zh.md
deleted file mode 100644
index af06661e..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/README_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Scalar And Control Reference
-
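-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](README.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Existing Chinese instruction notes](../README_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.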
diff --git a/docs/mkdocs/src/docs/isa/scalar/control-and-configuration.md b/docs/mkdocs/src/docs/isa/scalar/control-and-configuration.md
deleted file mode 100644
index da179bad..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/control-and-configuration.md
+++ /dev/null
@@ -1,26 +0,0 @@
-<!-- Generated from `docs/isa/scalar/control-and-configuration.md` -->
-
-# Scalar And Control Families: Control And Configuration
-
-This page is the control-shell overview for the `pto.*` surface. It explains how PTO programs establish ordering, configure DMA, and manipulate predicate-visible state around tile and vector payload work.
-
-## Summary
-
-Scalar and control operations do not carry tile payload semantics themselves. They set up the execution environment in which `pto.t*` and `pto.v*` work becomes legal and well ordered.
-
-## Main Subfamilies
-
-- [Pipeline sync](./pipeline-sync.md): explicit producer-consumer edges, buffer-token protocols, and vector-scope memory barriers.
-- [DMA copy](./dma-copy.md): loop-size and stride configuration plus GM↔UB and UB↔UB copy operations.
-- [Predicate load store](./predicate-load-store.md): moving `!pto.mask` state through UB and handling unaligned predicate-store streams.
-- [Predicate generation and algebra](./predicate-generation-and-algebra.md): mask creation, tail masks, boolean combination, and predicate rearrangement.
-
-## Architectural Role
-
-The `pto.*` surface is where PTO exposes stateful setup and synchronization explicitly. These forms are still part of the virtual ISA contract, but their visible outputs are control, mask, or configuration state rather than tile or vector payload results.
-
-## Related Material
-
-- [Scalar and control instruction surface](../instruction-surfaces/scalar-and-control-instructions.md)
-- [Scalar and control family overview](../instruction-families/scalar-and-control-families.md)
-- [Vector ISA reference](../vector/README.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/control-and-configuration_zh.md b/docs/mkdocs/src/docs/isa/scalar/control-and-configuration_zh.md
deleted file mode 100644
index 8dffa826..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/control-and-configuration_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Scalar And Control Families: Control And Configuration
-
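-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](control-and-configuration.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.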
diff --git a/docs/mkdocs/src/docs/isa/scalar/dma-copy.md b/docs/mkdocs/src/docs/isa/scalar/dma-copy.md
deleted file mode 100644
index 6e789f07..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/dma-copy.md
+++ /dev/null
@@ -1,29 +0,0 @@
-<!-- Generated from `docs/isa/scalar/dma-copy.md` -->
-
-# DMA Copy
-
-These `pto.*` forms configure and execute scalar-side DMA movement between GM and UB or inside UB. They are part of the scalar/control surface because they define configuration and copy behavior, not vector-register compute.
-
-## What This Family Covers
-
-- nested-loop size and stride registers for GM↔UB transfers
-- GM to UB copies
-- UB to GM copies
-- UB to UB copies
-
-## Per-Op Pages
-
-- [pto.set_loop_size_outtoub](./ops/dma-copy/set-loop-size-outtoub.md)
-- [pto.set_loop2_stride_outtoub](./ops/dma-copy/set-loop2-stride-outtoub.md)
-- [pto.set_loop1_stride_outtoub](./ops/dma-copy/set-loop1-stride-outtoub.md)
-- [pto.set_loop_size_ubtoout](./ops/dma-copy/set-loop-size-ubtoout.md)
-- [pto.set_loop2_stride_ubtoout](./ops/dma-copy/set-loop2-stride-ubtoout.md)
-- [pto.set_loop1_stride_ubtoout](./ops/dma-copy/set-loop1-stride-ubtoout.md)
-- [pto.copy_gm_to_ubuf](./ops/dma-copy/copy-gm-to-ubuf.md)
-- [pto.copy_ubuf_to_gm](./ops/dma-copy/copy-ubuf-to-gm.md)
-- [pto.copy_ubuf_to_ubuf](./ops/dma-copy/copy-ubuf-to-ubuf.md)
-
-## Related Material
-
-- [Control and configuration](./control-and-configuration.md)
-- [Vector Families: DMA Copy](../vector/dma-copy.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/dma-copy_zh.md b/docs/mkdocs/src/docs/isa/scalar/dma-copy_zh.md
deleted file mode 100644
index dfc2039d..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/dma-copy_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# DMA Copy
-
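-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](dma-copy.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.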
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md
deleted file mode 100644
index acc47c8a..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md
+++ /dev/null
@@ -1,85 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md` -->
-
-# pto.copy_gm_to_ubuf
-
-Standalone reference page for `pto.copy_gm_to_ubuf`. This page belongs to the [DMA Copy](../../dma-copy.md) family in the PTO ISA manual.
-
-## Summary
-
-DMA transfer from Global Memory (`!pto.ptr<T, gm>`) to Unified Buffer (`!pto.ptr<T, ub>`).
-
-## Mechanism
-
-`pto.copy_gm_to_ubuf` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-
-## Inputs
-
-The inputs are the architecture-visible control operands shown in the syntax: pipe ids, event ids, buffer ids, loop/stride values, pointers, or configuration words used to drive later execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```mlir
-pto.copy_gm_to_ubuf %gm_src, %ub_dst,
-    %sid, %n_burst, %len_burst, %left_padding, %right_padding,
-    %data_select_bit, %l2_cache_ctl, %src_stride, %dst_stride
-    : !pto.ptr<T, gm>, !pto.ptr<T, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-## Detailed Notes
-
-```mlir
-pto.copy_gm_to_ubuf %gm_src, %ub_dst,
-    %sid, %n_burst, %len_burst, %left_padding, %right_padding,
-    %data_select_bit, %l2_cache_ctl, %src_stride, %dst_stride
-    : !pto.ptr<T, gm>, !pto.ptr<T, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-**Parameters:**
-
-| Parameter | Description |
-|-----------|-------------|
-| `%gm_src` | GM source pointer (`!pto.ptr<T, gm>`) |
-| `%ub_dst` | UB destination pointer (`!pto.ptr<T, ub>`, 32B-aligned) |
-| `%sid` | Stream ID (usually 0) |
-| `%n_burst` | Number of burst rows (innermost loop count) |
-| `%len_burst` | Contiguous bytes transferred per burst row |
-| `%left_padding` | Left padding count (bytes) |
-| `%right_padding` | Right padding count (bytes) |
-| `%data_select_bit` | Padding / data-select control bit (`i1`) |
-| `%l2_cache_ctl` | L2 cache allocate control (TBD — controls whether DMA allocates in L2 cache) |
-| `%src_stride` | GM source stride: start-to-start distance between consecutive burst rows (bytes) |
-| `%dst_stride` | UB destination stride: start-to-start distance between consecutive burst rows (bytes, 32B-aligned) |
-
-## Related Ops / Family Links
-
-- Family overview: [DMA Copy](../../dma-copy.md)
-- Previous op in family: [pto.set_loop1_stride_ubtoout](./set-loop1-stride-ubtoout.md)
-- Next op in family: [pto.copy_ubuf_to_gm](./copy-ubuf-to-gm.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf_zh.md
deleted file mode 100644
index d1ff61d2..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.copy_gm_to_ubuf
-
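-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](copy-gm-to-ubuf.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.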
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md
deleted file mode 100644
index c9128c79..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md
+++ /dev/null
@@ -1,78 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md` -->
-
-# pto.copy_ubuf_to_gm
-
-Standalone reference page for `pto.copy_ubuf_to_gm`. This page belongs to the [DMA Copy](../../dma-copy.md) family in the PTO ISA manual.
-
-## Summary
-
-DMA transfer from Unified Buffer (`!pto.ptr<T, ub>`) to Global Memory (`!pto.ptr<T, gm>`). MTE3 reads only `len_burst` bytes from each UB row (de-padding).
-
-## Mechanism
-
-`pto.copy_ubuf_to_gm` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-
-## Inputs
-
-The inputs are the architecture-visible control operands shown in the syntax: pipe ids, event ids, buffer ids, loop/stride values, pointers, or configuration words used to drive later execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```mlir
-pto.copy_ubuf_to_gm %ub_src, %gm_dst,
-    %sid, %n_burst, %len_burst, %reserved, %dst_stride, %src_stride
-    : !pto.ptr<T, ub>, !pto.ptr<T, gm>, i64, i64, i64, i64, i64, i64
-```
-
-## Detailed Notes
-
-```mlir
-pto.copy_ubuf_to_gm %ub_src, %gm_dst,
-    %sid, %n_burst, %len_burst, %reserved, %dst_stride, %src_stride
-    : !pto.ptr<T, ub>, !pto.ptr<T, gm>, i64, i64, i64, i64, i64, i64
-```
-
-**Parameters:**
-
-| Parameter | Description |
-|-----------|-------------|
-| `%ub_src` | UB source pointer (`!pto.ptr<T, ub>`, 32B-aligned) |
-| `%gm_dst` | GM destination pointer (`!pto.ptr<T, gm>`) |
-| `%sid` | Stream ID (usually 0) |
-| `%n_burst` | Number of burst rows |
-| `%len_burst` | Contiguous bytes transferred per burst row |
-| `%reserved` | Reserved field (set to 0) |
-| `%dst_stride` | GM destination stride: start-to-start distance between consecutive burst rows (bytes) |
-| `%src_stride` | UB source stride: start-to-start distance between consecutive burst rows (bytes, 32B-aligned) |
-
-## Related Ops / Family Links
-
-- Family overview: [DMA Copy](../../dma-copy.md)
-- Previous op in family: [pto.copy_gm_to_ubuf](./copy-gm-to-ubuf.md)
-- Next op in family: [pto.copy_ubuf_to_ubuf](./copy-ubuf-to-ubuf.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm_zh.md
deleted file mode 100644
index 1e403147..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.copy_ubuf_to_gm
-
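-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](copy-ubuf-to-gm.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.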
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md
deleted file mode 100644
index 5b920b31..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md
+++ /dev/null
@@ -1,824 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md` -->
-
-# pto.copy_ubuf_to_ubuf
-
-Standalone reference page for `pto.copy_ubuf_to_ubuf`. This page belongs to the [DMA Copy](../../dma-copy.md) family in the PTO ISA manual.
-
-## Summary
-
-Copy within Unified Buffer.
-
-## Mechanism
-
-`pto.copy_ubuf_to_ubuf` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-
-## Inputs
-
-The inputs are the architecture-visible control operands shown in the syntax: pipe ids, event ids, buffer ids, loop/stride values, pointers, or configuration words used to drive later execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```mlir
-pto.copy_ubuf_to_ubuf %source, %dest, %sid, %n_burst, %len_burst, %src_stride, %dst_stride
-    : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64, i64, i64, i64, i64
-```
-
-```
-burst    = len_burst contiguous bytes transferred per row
-stride   = distance (bytes) from start of row[r] to start of row[r+1]
-pad      = ub_stride - len_burst, padded to the 32B alignment boundary
-```
-
-```
-GM (source, `!pto.ptr<T, gm>`):
-
-          |<--- src_stride (start-to-start) --->|
-          |<- len_burst ->|                     |
-Row 0:    [##DATA########]......................|
-Row 1:    [##DATA########]......................|
-Row 2:    [##DATA########]......................|
-          ...
-Row N-1:  [##DATA########]
-
-UB (destination, `!pto.ptr<T, ub>`, 32B-aligned):
-
-          |<---------- dst_stride (32B-aligned) ---------->|
-          |<- len_burst ->|<- pad (to 32B boundary) ->|    |
-Row 0:    [##DATA########][000000 PAD 000000000000000]
-Row 1:    [##DATA########][000000 PAD 000000000000000]
-Row 2:    [##DATA########][000000 PAD 000000000000000]
-          ...
-Row N-1:  [##DATA########][000000 PAD 000000000000000]
-
-N = n_burst
-stride = start of row[r] to start of row[r+1]
-pad    = filled with pad_val to 32B boundary (data_select_bit=true)
-[DATA] = valid data transferred by DMA
-[PAD]  = pad_val fill (set via set_mov_pad_val)
-```
-
-```
-UB (source, `!pto.ptr<T, ub>`, 32B-aligned start addr):
-
-          |<---------- src_stride (32B-aligned) --------->|
-          |<- len_burst ->|<-- pad (ignored on read) -->| |
-Row 0:    [##DATA########][000 pad 000000000000000000]
-Row 1:    [##DATA########][000 pad 000000000000000000]
-Row 2:    [##DATA########][000 pad 000000000000000000]
-          ...
-Row N-1:  [##DATA########][000 pad 000000000000000000]
-
-GM (destination, `!pto.ptr<T, gm>`):
-
-          |<--- dst_stride (start-to-start) --->|
-          |<- len_burst ->|                     |
-Row 0:    [##DATA########]......................|
-Row 1:    [##DATA########]......................|
-Row 2:    [##DATA########]......................|
-          ...
-Row N-1:  [##DATA########]
-
-N = n_burst
-MTE3 reads only len_burst bytes from each UB row (de-padding).
-Only len_burst bytes are written to each GM row.
-```
-
-```c
-// C equivalent of what the HW executes:
-for (int j = 0; j < loop2_count; j++) {                // HW outer loop
-    uint8_t *gm1 = gm_src + j * loop2_src_stride;
-    uint8_t *ub1 = ub_dst + j * loop2_dst_stride;
-
-    for (int k = 0; k < loop1_count; k++) {            // HW inner loop
-        uint8_t *gm2 = gm1 + k * loop1_src_stride;
-        uint8_t *ub2 = ub1 + k * loop1_dst_stride;
-
-        for (int r = 0; r < n_burst; r++) {            // burst engine
-            memcpy(ub2 + r * dst_stride,               //   UB dest row
-                   gm2 + r * src_stride,               //   GM src row
-                   len_burst);                          //   contiguous bytes
-            if (data_select_bit)
-                memset(ub2 + r * dst_stride + len_burst,
-                       pad_val, dst_stride - len_burst);
-        }
-    }
-}
-```
-
-```c
-// C equivalent:
-for (int j = 0; j < loop2_count; j++) {
-    uint8_t *ub1 = ub_src + j * loop2_src_stride;
-    uint8_t *gm1 = gm_dst + j * loop2_dst_stride;
-
-    for (int k = 0; k < loop1_count; k++) {
-        uint8_t *ub2 = ub1 + k * loop1_src_stride;
-        uint8_t *gm2 = gm1 + k * loop1_dst_stride;
-
-        for (int r = 0; r < n_burst; r++) {
-            memcpy(gm2 + r * dst_stride,               //   GM dest row
-                   ub2 + r * src_stride,               //   UB src row
-                   len_burst);                          //   contiguous bytes
-        }
-    }
-}
-```
-
-```
-GM layout (32 × 32 f32, contiguous):
-
-    |<- len_burst = 128B (32 × 4) ->|
-    |<- src_stride = 128B --------->|
-    +--[#######TILE#######]--+  row 0
-    +--[#######TILE#######]--+  row 1
-    ...
-    +--[#######TILE#######]--+  row 31
-
-UB layout (32 × 32 f32, 32B-aligned, contiguous):
-
-    |<- dst_stride = 128B (32B-aligned) ->|
-    +--[#######TILE#######]--+  row 0
-    +--[#######TILE#######]--+  row 1
-    ...
-    +--[#######TILE#######]--+  row 31
-
-    len_burst   = 32 × 4 = 128 bytes
-    src_stride  = 128 bytes (contiguous rows)
-    dst_stride  = 128 bytes (already 32B-aligned, no padding)
-```
-
-```mlir
-// Simple 2D load — no multi-level loops needed
-pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %arg0, %ub_in,
-    %c0_i64,       // sid = 0
-    %c32_i64,      // n_burst = 32 (32 rows)
-    %c128_i64,     // len_burst = 128 bytes per row
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %false,        // data_select_bit = false
-    %c0_i64,       // l2_cache_ctl = 0
-    %c128_i64,     // src_stride = 128 bytes
-    %c128_i64      // dst_stride = 128 bytes
-    : !pto.ptr<f32, gm>, !pto.ptr<f32, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-```
-GM layout (1024 × 512 f16):
-
-    col 0          col 128               col 512
-    |              |                     |
-    +--[###TILE###]+.....................+  row R
-    +--[###TILE###]+.....................+  row R+1
-    ...
-    +--[###TILE###]+.....................+  row R+63
-
-    |<--------- src_stride = 1024B ----------->|
-    |<-len_burst=256B->|
-
-    len_burst   = 128 × 2 = 256 bytes (128 f16 elements)
-    src_stride  = 512 × 2 = 1024 bytes (start-to-start, full GM row)
-
-UB layout (64 × 128 f16, 32B-aligned, contiguous):
-
-    +--[###TILE###]--+  row 0  (256 bytes, 32B-aligned, no pad)
-    +--[###TILE###]--+  row 1
-    ...
-    +--[###TILE###]--+  row 63
-
-    dst_stride = 256 bytes (= len_burst, already 32B-aligned, no padding)
-```
-
-```mlir
-// Simple 2D load — no multi-level loops needed
-pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
-pto.set_loop1_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-pto.set_loop2_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr,
-    %c0_i64,       // sid = 0
-    %c64_i64,      // n_burst = 64 (64 rows)
-    %c256_i64,     // len_burst = 256 bytes per row
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %false,        // data_select_bit = false
-    %c0_i64,       // l2_cache_ctl = 0
-    %c1024_i64,    // src_stride = 1024 bytes (full matrix row)
-    %c256_i64      // dst_stride = 256 bytes (tile row)
-    : !pto.ptr<f16, gm>, !pto.ptr<f16, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-```
-GM (100 cols valid, contiguous):
-
-    |<-len_burst=200B->|
-    |<- src_stride=200B (start-to-start) ->|
-    +--[####DATA####]-+  row 0
-    +--[####DATA####]-+  row 1
-    ...
-    +--[####DATA####]-+  row 63
-
-UB (128 cols wide, 32B-aligned, padded):
-
-    |<--------- dst_stride = 256B (32B-aligned) --------->|
-    |<-len_burst=200B->|<---- pad = 56B to 32B boundary ->|
-    +--[####DATA####]-+[0000000 PAD 0000000000000000000000]+  row 0
-    +--[####DATA####]-+[0000000 PAD 0000000000000000000000]+  row 1
-    ...
-    +--[####DATA####]-+[0000000 PAD 0000000000000000000000]+  row 63
-
-    len_burst   = 100 × 2 = 200 bytes
-    src_stride  = 200 bytes (start-to-start, contiguous in GM)
-    dst_stride  = 128 × 2 = 256 bytes (32B-aligned tile width in UB)
-    pad         = 256 - 200 = 56 bytes (padded to 32B boundary with pad_val)
-```
-
-```mlir
-pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
-pto.set_loop1_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-pto.set_loop2_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr,
-    %c0_i64,       // sid = 0
-    %c64_i64,      // n_burst = 64
-    %c200_i64,     // len_burst = 200 bytes
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %true,         // data_select_bit = true (enable padding)
-    %c0_i64,       // l2_cache_ctl = 0
-    %c200_i64,     // src_stride = 200 bytes
-    %c256_i64      // dst_stride = 256 bytes (32B-aligned)
-    : !pto.ptr<f16, gm>, !pto.ptr<f16, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-```
-UB (source, 32B-aligned, 32 × 32 f32):
-
-    |<- src_stride = 128B (32B-aligned) ->|
-    |<- len_burst = 128B ->|
-    +--[#######TILE#######]---+  row 0
-    +--[#######TILE#######]---+  row 1
-    ...
-    +--[#######TILE#######]---+  row 31
-
-    (no padding here — len_burst == src_stride)
-
-GM (dest, 32 × 32 f32):
-
-    |<- dst_stride = 128B ->|
-    |<- len_burst = 128B -->|
-    +--[#######TILE#######]---+  row 0
-    +--[#######TILE#######]---+  row 1
-    ...
-    +--[#######TILE#######]---+  row 31
-```
-
-```mlir
-// Simple 2D store: no multi-level loops needed
-pto.set_loop_size_ubtoout %c1_i64, %c1_i64 : i64, i64
-
-pto.copy_ubuf_to_gm %ub_out, %arg1,
-    %c0_i64,       // sid = 0
-    %c32_i64,      // n_burst = 32
-    %c128_i64,     // len_burst = 128 bytes
-    %c0_i64,       // reserved = 0
-    %c128_i64,     // dst_stride = 128 bytes
-    %c128_i64      // src_stride = 128 bytes
-    : !pto.ptr<f32, ub>, !pto.ptr<f32, gm>, i64, i64, i64, i64, i64, i64
-```
-
-```
-UB (source, 32B-aligned, 64 × 128 f16):
-
-    |<- src_stride = 256B (32B-aligned) ->|
-    |<- len_burst = 256B ->|
-    +--[#####TILE#####]---+  row 0
-    +--[#####TILE#####]---+  row 1
-    ...
-    +--[#####TILE#####]---+  row 63
-
-    (no padding here — len_burst == src_stride)
-
-GM (dest, into 1024 × 512 matrix):
-
-    |<----------- dst_stride = 1024B (start-to-start) --------->|
-    |<- len_burst = 256B ->|                                    |
-    col 0          col 128                              col 512
-    +--[#####TILE#####]---+.............................+  row R
-    +--[#####TILE#####]---+.............................+  row R+1
-    ...
-    +--[#####TILE#####]---+.............................+  row R+63
-
-    MTE3 reads len_burst bytes from each 32B-aligned UB row,
-    writes only len_burst bytes per GM row (stride controls row spacing).
-```
-
-```mlir
-// Configure MTE3 strides
-pto.set_loop_size_ubtoout %c1_i64, %c1_i64 : i64, i64
-pto.set_loop1_stride_ubtoout %c0_i64, %c0_i64 : i64, i64
-pto.set_loop2_stride_ubtoout %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_ubuf_to_gm %ub_ptr, %gm_ptr,
-    %c0_i64,       // sid = 0
-    %c64_i64,      // n_burst = 64
-    %c256_i64,     // len_burst = 256 bytes
-    %c0_i64,       // reserved = 0
-    %c1024_i64,    // dst_stride = 1024 bytes (GM row)
-    %c256_i64      // src_stride = 256 bytes (UB row)
-    : !pto.ptr<f16, ub>, !pto.ptr<f16, gm>, i64, i64, i64, i64, i64, i64
-```
-
-```
-GM [4, 8, 128] f16 (contiguous):        UB (4 tiles laid out sequentially):
-
-    batch 0: 8 rows × 256 bytes          [batch 0: 8×128][batch 1: 8×128]
-    batch 1: 8 rows × 256 bytes          [batch 2: 8×128][batch 3: 8×128]
-    batch 2: 8 rows × 256 bytes
-    batch 3: 8 rows × 256 bytes          loop1 src_stride = 2048 bytes (8 × 256)
-                                          loop1 dst_stride = 2048 bytes (8 × 256)
-    Each batch = 8 × 256 = 2048 bytes     loop1_count = 4 (iterate over batches)
-```
-
-```mlir
-// loop1_count = 4 batches, loop2_count = 1 (not used)
-pto.set_loop_size_outtoub %c4_i64, %c1_i64 : i64, i64
-
-// loop1 stride: advance by one batch (2048 bytes) in both GM and UB
-pto.set_loop1_stride_outtoub %c2048_i64, %c2048_i64 : i64, i64
-pto.set_loop2_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr,
-    %c0_i64,       // sid = 0
-    %c8_i64,       // n_burst = 8 rows per batch
-    %c256_i64,     // len_burst = 256 bytes per row
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %false,        // data_select_bit = false
-    %c0_i64,       // l2_cache_ctl = 0
-    %c256_i64,     // src_stride = 256 (contiguous rows)
-    %c256_i64      // dst_stride = 256 (contiguous rows)
-    : !pto.ptr<f16, gm>, !pto.ptr<f16, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-```
-loop1 iter 0: gm_ptr + 0×2048 → ub_ptr + 0×2048, DMA 8 rows × 256B
-loop1 iter 1: gm_ptr + 1×2048 → ub_ptr + 1×2048, DMA 8 rows × 256B
-loop1 iter 2: gm_ptr + 2×2048 → ub_ptr + 2×2048, DMA 8 rows × 256B
-loop1 iter 3: gm_ptr + 3×2048 → ub_ptr + 3×2048, DMA 8 rows × 256B
-```
-
-## Detailed Notes
-
-```mlir
-pto.copy_ubuf_to_ubuf %source, %dest, %sid, %n_burst, %len_burst, %src_stride, %dst_stride
-    : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64, i64, i64, i64, i64
-```
-
-**Parameters:**
-
-| Parameter | Description |
-|-----------|-------------|
-| `%source` | UB source pointer |
-| `%dest` | UB destination pointer |
-| `%sid` | Stream ID |
-| `%n_burst` | Number of bursts |
-| `%len_burst` | Length per burst |
-| `%src_stride` | Source stride |
-| `%dst_stride` | Destination stride |
-
-## Burst / Stride / Pad Model
-
-All A5 DMA transfers use **stride-based** addressing: a stride is the distance in bytes from the start of one row to the start of the next row (`stride >= lenBurst`). There is no separate "gap" parameter.
-
-### Key Terms
-
-```
-burst    = lenBurst contiguous bytes transferred per row
-stride   = distance (bytes) from start of row[r] to start of row[r+1]
-pad      = ub_stride - lenBurst bytes per UB row, filled with pad_val up to the 32B-aligned ub_stride
-```
-
-### Alignment Constraints
-
-- **UB addresses** (both source and destination) must be **32-byte aligned**.
-- **GM→UB padding**: When `data_select_bit = true`, the bytes of each UB row between `lenBurst` and the **32B-aligned** `ub_stride` are filled with `pad_val` (set via `set_mov_pad_val`). Because `ub_stride` is 32B-aligned, every UB row starts at a 32B-aligned offset.
-- **UB→GM de-padding**: MTE3 reads `lenBurst` bytes from each 32B-aligned UB row (skipping any padding that was added during load), writing only valid data to GM. This effectively strips padding on store.
-
-### 2D Diagram: GM→UB (pto.copy_gm_to_ubuf)
-
-```
-GM (source, `!pto.ptr<T, gm>`):
-
-          |<--- src_stride (start-to-start) --->|
-          |<- len_burst ->|                     |
-Row 0:    [##DATA########]......................|
-Row 1:    [##DATA########]......................|
-Row 2:    [##DATA########]......................|
-          ...
-Row N-1:  [##DATA########]
-
-UB (destination, `!pto.ptr<T, ub>`, 32B-aligned):
-
-          |<---------- dst_stride (32B-aligned) ---------->|
-          |<- len_burst ->|<- pad (to 32B boundary) ->|    |
-Row 0:    [##DATA########][000000 PAD 000000000000000]
-Row 1:    [##DATA########][000000 PAD 000000000000000]
-Row 2:    [##DATA########][000000 PAD 000000000000000]
-          ...
-Row N-1:  [##DATA########][000000 PAD 000000000000000]
-
-N = n_burst
-stride = start of row[r] to start of row[r+1]
-pad    = filled with pad_val to 32B boundary (data_select_bit=true)
-[DATA] = valid data transferred by DMA
-[PAD]  = pad_val fill (set via set_mov_pad_val)
-```
-
-### 2D Diagram: UB→GM (pto.copy_ubuf_to_gm)
-
-```
-UB (source, `!pto.ptr<T, ub>`, 32B-aligned start addr):
-
-          |<---------- src_stride (32B-aligned) --------->|
-          |<- len_burst ->|<-- pad (ignored on read) -->| |
-Row 0:    [##DATA########][000 pad 000000000000000000]
-Row 1:    [##DATA########][000 pad 000000000000000000]
-Row 2:    [##DATA########][000 pad 000000000000000000]
-          ...
-Row N-1:  [##DATA########][000 pad 000000000000000000]
-
-GM (destination, `!pto.ptr<T, gm>`):
-
-          |<--- dst_stride (start-to-start) --->|
-          |<- len_burst ->|                     |
-Row 0:    [##DATA########]......................|
-Row 1:    [##DATA########]......................|
-Row 2:    [##DATA########]......................|
-          ...
-Row N-1:  [##DATA########]
-
-N = n_burst
-MTE3 reads only len_burst bytes from each UB row (de-padding).
-Only len_burst bytes are written to each GM row.
-```
-
-## Multi-Level Loop Semantics (C Code)
-
-The full DMA transfer is a nested loop. The HW loop registers (set before the copy) control the outer levels, and the copy instruction parameters control the innermost burst level.
-
-### GM→UB Full Loop
-
-```c
-// C equivalent of what the HW executes:
-for (int j = 0; j < loop2_count; j++) {                // HW outer loop
-    uint8_t *gm1 = gm_src + j * loop2_src_stride;
-    uint8_t *ub1 = ub_dst + j * loop2_dst_stride;
-
-    for (int k = 0; k < loop1_count; k++) {            // HW inner loop
-        uint8_t *gm2 = gm1 + k * loop1_src_stride;
-        uint8_t *ub2 = ub1 + k * loop1_dst_stride;
-
-        for (int r = 0; r < n_burst; r++) {            // burst engine
-            memcpy(ub2 + r * dst_stride,               //   UB dest row
-                   gm2 + r * src_stride,               //   GM src row
-                   len_burst);                          //   contiguous bytes
-            if (data_select_bit)
-                memset(ub2 + r * dst_stride + len_burst,
-                       pad_val, dst_stride - len_burst);
-        }
-    }
-}
-```
-
-### UB→GM Full Loop
-
-```c
-// C equivalent:
-for (int j = 0; j < loop2_count; j++) {
-    uint8_t *ub1 = ub_src + j * loop2_src_stride;
-    uint8_t *gm1 = gm_dst + j * loop2_dst_stride;
-
-    for (int k = 0; k < loop1_count; k++) {
-        uint8_t *ub2 = ub1 + k * loop1_src_stride;
-        uint8_t *gm2 = gm1 + k * loop1_dst_stride;
-
-        for (int r = 0; r < n_burst; r++) {
-            memcpy(gm2 + r * dst_stride,               //   GM dest row
-                   ub2 + r * src_stride,               //   UB src row
-                   len_burst);                          //   contiguous bytes
-        }
-    }
-}
-```
-
-## Example 1: GM→UB — Load a 32×32 f32 Tile (Simple Case)
-
-Load a 32×32 f32 tile from GM into UB. This matches the `abs_kernel_2d` test case.
-
-```
-GM layout (32 × 32 f32, contiguous):
-
-    |<- len_burst = 128B (32 × 4) ->|
-    |<- src_stride = 128B --------->|
-    +--[#######TILE#######]--+  row 0
-    +--[#######TILE#######]--+  row 1
-    ...
-    +--[#######TILE#######]--+  row 31
-
-UB layout (32 × 32 f32, 32B-aligned, contiguous):
-
-    |<- dst_stride = 128B (32B-aligned) ->|
-    +--[#######TILE#######]--+  row 0
-    +--[#######TILE#######]--+  row 1
-    ...
-    +--[#######TILE#######]--+  row 31
-
-    len_burst   = 32 × 4 = 128 bytes
-    src_stride  = 128 bytes (contiguous rows)
-    dst_stride  = 128 bytes (already 32B-aligned, no padding)
-```
-
-```mlir
-// Simple 2D load — no multi-level loops needed
-pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %arg0, %ub_in,
-    %c0_i64,       // sid = 0
-    %c32_i64,      // n_burst = 32 (32 rows)
-    %c128_i64,     // len_burst = 128 bytes per row
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %false,        // data_select_bit = false
-    %c0_i64,       // l2_cache_ctl = 0
-    %c128_i64,     // src_stride = 128 bytes
-    %c128_i64      // dst_stride = 128 bytes
-    : !pto.ptr<f32, gm>, !pto.ptr<f32, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-## Example 2: GM→UB — Load a 2D Tile from a Larger Matrix
-
-Load a 64×128 tile (f16) from a 1024×512 matrix in GM into UB.
-
-```
-GM layout (1024 × 512 f16):
-
-    col 0          col 128               col 512
-    |              |                     |
-    +--[###TILE###]+.....................+  row R
-    +--[###TILE###]+.....................+  row R+1
-    ...
-    +--[###TILE###]+.....................+  row R+63
-
-    |<--------- src_stride = 1024B ----------->|
-    |<-len_burst=256B->|
-
-    len_burst   = 128 × 2 = 256 bytes (128 f16 elements)
-    src_stride  = 512 × 2 = 1024 bytes (start-to-start, full GM row)
-
-UB layout (64 × 128 f16, 32B-aligned, contiguous):
-
-    +--[###TILE###]--+  row 0  (256 bytes, 32B-aligned, no pad)
-    +--[###TILE###]--+  row 1
-    ...
-    +--[###TILE###]--+  row 63
-
-    dst_stride = 256 bytes (= len_burst, already 32B-aligned, no padding)
-```
-
-```mlir
-// Simple 2D load — no multi-level loops needed
-pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
-pto.set_loop1_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-pto.set_loop2_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr,
-    %c0_i64,       // sid = 0
-    %c64_i64,      // n_burst = 64 (64 rows)
-    %c256_i64,     // len_burst = 256 bytes per row
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %false,        // data_select_bit = false
-    %c0_i64,       // l2_cache_ctl = 0
-    %c1024_i64,    // src_stride = 1024 bytes (full matrix row)
-    %c256_i64      // dst_stride = 256 bytes (tile row)
-    : !pto.ptr<f16, gm>, !pto.ptr<f16, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-## Example 3: GM→UB — Load with Padding
-
-Load 100 valid columns from GM into a 128-wide UB tile (f16). The remaining 28 columns (56 bytes per row) are filled with `pad_val`, drawn as zeros in the diagram.
-
-```
-GM (100 cols valid, contiguous):
-
-    |<-len_burst=200B->|
-    |<- src_stride=200B (start-to-start) ->|
-    +--[####DATA####]-+  row 0
-    +--[####DATA####]-+  row 1
-    ...
-    +--[####DATA####]-+  row 63
-
-UB (128 cols wide, 32B-aligned, padded):
-
-    |<--------- dst_stride = 256B (32B-aligned) --------->|
-    |<-len_burst=200B->|<---- pad = 56B to 32B boundary ->|
-    +--[####DATA####]-+[0000000 PAD 0000000000000000000000]+  row 0
-    +--[####DATA####]-+[0000000 PAD 0000000000000000000000]+  row 1
-    ...
-    +--[####DATA####]-+[0000000 PAD 0000000000000000000000]+  row 63
-
-    len_burst   = 100 × 2 = 200 bytes
-    src_stride  = 200 bytes (start-to-start, contiguous in GM)
-    dst_stride  = 128 × 2 = 256 bytes (32B-aligned tile width in UB)
-    pad         = 256 - 200 = 56 bytes (padded to 32B boundary with pad_val)
-```
-
-```mlir
-pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
-pto.set_loop1_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-pto.set_loop2_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr,
-    %c0_i64,       // sid = 0
-    %c64_i64,      // n_burst = 64
-    %c200_i64,     // len_burst = 200 bytes
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %true,         // data_select_bit = true (enable padding)
-    %c0_i64,       // l2_cache_ctl = 0
-    %c200_i64,     // src_stride = 200 bytes
-    %c256_i64      // dst_stride = 256 bytes (32B-aligned)
-    : !pto.ptr<f16, gm>, !pto.ptr<f16, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-## Example 4: UB→GM — Store a 32×32 f32 Tile (Simple Case)
-
-Store a 32×32 f32 tile from UB back to GM. This matches the `abs_kernel_2d` test case.
-
-```
-UB (source, 32B-aligned, 32 × 32 f32):
-
-    |<- src_stride = 128B (32B-aligned) ->|
-    |<- len_burst = 128B ->|
-    +--[#######TILE#######]---+  row 0
-    +--[#######TILE#######]---+  row 1
-    ...
-    +--[#######TILE#######]---+  row 31
-
-    (no padding here — len_burst == src_stride)
-
-GM (dest, 32 × 32 f32):
-
-    |<- dst_stride = 128B ->|
-    |<- len_burst = 128B -->|
-    +--[#######TILE#######]---+  row 0
-    +--[#######TILE#######]---+  row 1
-    ...
-    +--[#######TILE#######]---+  row 31
-```
-
-```mlir
-// Simple 2D store: no multi-level loops needed
-pto.set_loop_size_ubtoout %c1_i64, %c1_i64 : i64, i64
-
-pto.copy_ubuf_to_gm %ub_out, %arg1,
-    %c0_i64,       // sid = 0
-    %c32_i64,      // n_burst = 32
-    %c128_i64,     // len_burst = 128 bytes
-    %c0_i64,       // reserved = 0
-    %c128_i64,     // dst_stride = 128 bytes
-    %c128_i64      // src_stride = 128 bytes
-    : !pto.ptr<f32, ub>, !pto.ptr<f32, gm>, i64, i64, i64, i64, i64, i64
-```
-
-## Example 5: UB→GM — Store a 2D Tile Back to a Larger Matrix
-
-Store a 64×128 tile (f16) from UB back to a 1024×512 GM matrix at an offset.
-
-```
-UB (source, 32B-aligned, 64 × 128 f16):
-
-    |<- src_stride = 256B (32B-aligned) ->|
-    |<- len_burst = 256B ->|
-    +--[#####TILE#####]---+  row 0
-    +--[#####TILE#####]---+  row 1
-    ...
-    +--[#####TILE#####]---+  row 63
-
-    (no padding here — len_burst == src_stride)
-
-GM (dest, into 1024 × 512 matrix):
-
-    |<----------- dst_stride = 1024B (start-to-start) --------->|
-    |<- len_burst = 256B ->|                                    |
-    col 0          col 128                              col 512
-    +--[#####TILE#####]---+.............................+  row R
-    +--[#####TILE#####]---+.............................+  row R+1
-    ...
-    +--[#####TILE#####]---+.............................+  row R+63
-
-    MTE3 reads len_burst bytes from each 32B-aligned UB row,
-    writes only len_burst bytes per GM row (stride controls row spacing).
-```
-
-```mlir
-// Configure MTE3 strides
-pto.set_loop_size_ubtoout %c1_i64, %c1_i64 : i64, i64
-pto.set_loop1_stride_ubtoout %c0_i64, %c0_i64 : i64, i64
-pto.set_loop2_stride_ubtoout %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_ubuf_to_gm %ub_ptr, %gm_ptr,
-    %c0_i64,       // sid = 0
-    %c64_i64,      // n_burst = 64
-    %c256_i64,     // len_burst = 256 bytes
-    %c0_i64,       // reserved = 0
-    %c1024_i64,    // dst_stride = 1024 bytes (GM row)
-    %c256_i64      // src_stride = 256 bytes (UB row)
-    : !pto.ptr<f16, ub>, !pto.ptr<f16, gm>, i64, i64, i64, i64, i64, i64
-```
-
-## Example 6: GM→UB with Multi-Level Loop (Batch of Tiles)
-
-Load 4 batches of 8×128 tiles from a [4, 8, 128] f16 tensor using loop1.
-
-```
-GM [4, 8, 128] f16 (contiguous):        UB (4 tiles laid out sequentially):
-
-    batch 0: 8 rows × 256 bytes          [batch 0: 8×128][batch 1: 8×128]
-    batch 1: 8 rows × 256 bytes          [batch 2: 8×128][batch 3: 8×128]
-    batch 2: 8 rows × 256 bytes
-    batch 3: 8 rows × 256 bytes          loop1 src_stride = 2048 bytes (8 × 256)
-                                          loop1 dst_stride = 2048 bytes (8 × 256)
-    Each batch = 8 × 256 = 2048 bytes     loop1_count = 4 (iterate over batches)
-```
-
-```mlir
-// loop1_count = 4 batches, loop2_count = 1 (not used)
-pto.set_loop_size_outtoub %c4_i64, %c1_i64 : i64, i64
-
-// loop1 stride: advance by one batch (2048 bytes) in both GM and UB
-pto.set_loop1_stride_outtoub %c2048_i64, %c2048_i64 : i64, i64
-pto.set_loop2_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr,
-    %c0_i64,       // sid = 0
-    %c8_i64,       // n_burst = 8 rows per batch
-    %c256_i64,     // len_burst = 256 bytes per row
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %false,        // data_select_bit = false
-    %c0_i64,       // l2_cache_ctl = 0
-    %c256_i64,     // src_stride = 256 (contiguous rows)
-    %c256_i64      // dst_stride = 256 (contiguous rows)
-    : !pto.ptr<f16, gm>, !pto.ptr<f16, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-Execution trace:
-
-```
-loop1 iter 0: gm_ptr + 0×2048 → ub_ptr + 0×2048, DMA 8 rows × 256B
-loop1 iter 1: gm_ptr + 1×2048 → ub_ptr + 1×2048, DMA 8 rows × 256B
-loop1 iter 2: gm_ptr + 2×2048 → ub_ptr + 2×2048, DMA 8 rows × 256B
-loop1 iter 3: gm_ptr + 3×2048 → ub_ptr + 3×2048, DMA 8 rows × 256B
-```
-
-## Related Ops / Family Links
-
-- Family overview: [DMA Copy](../../dma-copy.md)
-- Previous op in family: [pto.copy_ubuf_to_gm](./copy-ubuf-to-gm.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf_zh.md
deleted file mode 100644
index 2a162422..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.copy_ubuf_to_ubuf
-
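-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](copy-ubuf-to-ubuf.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA is already in place, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page in the Chinese navigation keeps you on the Chinese path; for complete details, use the English page linked above or the existing Chinese reference pages.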
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub.md
deleted file mode 100644
index db28a200..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub.md` -->
-
-# pto.set_loop_size_outtoub
-
-Standalone reference page for `pto.set_loop_size_outtoub`. This page belongs to the [DMA Copy](../../dma-copy.md) family in the PTO ISA manual.
-
-## Summary
-
-Configure HW loop iteration counts for GM→UB DMA.
-
-## Mechanism
-
-`pto.set_loop_size_outtoub` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.set_loop_size_outtoub %loop1_count, %loop2_count : i64, i64
-```
-
-## Inputs
-
-The inputs are the architecture-visible control operands shown in the syntax: pipe ids, event ids, buffer ids, loop/stride values, pointers, or configuration words used to drive later execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```mlir
-pto.set_loop_size_outtoub %loop1_count, %loop2_count : i64, i64
-```
-
-## Detailed Notes
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%loop1_count` | 21 bits | Inner HW loop iteration count |
-| `%loop2_count` | 21 bits | Outer HW loop iteration count |
-
-When not using multi-level looping, set both to 1.
-
-## Related Ops / Family Links
-
-- Family overview: [DMA Copy](../../dma-copy.md)
-- Next op in family: [pto.set_loop2_stride_outtoub](./set-loop2-stride-outtoub.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub_zh.md
deleted file mode 100644
index d504d089..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.set_loop_size_outtoub
-
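-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](set-loop-size-outtoub.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA is already in place, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page in the Chinese navigation keeps you on the Chinese path; for complete details, use the English page linked above or the existing Chinese reference pages.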
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md
deleted file mode 100644
index 04dbd022..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md` -->
-
-# pto.set_loop_size_ubtoout
-
-Standalone reference page for `pto.set_loop_size_ubtoout`. This page belongs to the [DMA Copy](../../dma-copy.md) family in the PTO ISA manual.
-
-## Summary
-
-Configure HW loop iteration counts for UB→GM DMA.
-
-## Mechanism
-
-`pto.set_loop_size_ubtoout` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.set_loop_size_ubtoout %loop1_count, %loop2_count : i64, i64
-```
-
-## Inputs
-
-The inputs are the architecture-visible control operands shown in the syntax: pipe ids, event ids, buffer ids, loop/stride values, pointers, or configuration words used to drive later execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```mlir
-pto.set_loop_size_ubtoout %loop1_count, %loop2_count : i64, i64
-```
-
-## Detailed Notes
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%loop1_count` | 21 bits | Inner HW loop iteration count |
-| `%loop2_count` | 21 bits | Outer HW loop iteration count |
-
-## Related Ops / Family Links
-
-- Family overview: [DMA Copy](../../dma-copy.md)
-- Previous op in family: [pto.set_loop1_stride_outtoub](./set-loop1-stride-outtoub.md)
-- Next op in family: [pto.set_loop2_stride_ubtoout](./set-loop2-stride-ubtoout.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout_zh.md
deleted file mode 100644
index e42d1d3f..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.set_loop_size_ubtoout
-
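-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](set-loop-size-ubtoout.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA is already in place, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page in the Chinese navigation keeps you on the Chinese path; for complete details, use the English page linked above or the existing Chinese reference pages.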
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md
deleted file mode 100644
index 96d558b5..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md
+++ /dev/null
@@ -1,73 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md` -->
-
-# pto.set_loop1_stride_outtoub
-
-Standalone reference page for `pto.set_loop1_stride_outtoub`. This page belongs to the [DMA Copy](../../dma-copy.md) family in the PTO ISA manual.
-
-## Summary
-
-Configure inner loop (loop1) pointer advance for GM→UB DMA.
-
-## Mechanism
-
-`pto.set_loop1_stride_outtoub` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.set_loop1_stride_outtoub %src_stride, %dst_stride : i64, i64
-```
-
-## Inputs
-
-The inputs are the two `i64` control operands shown in the syntax: `%src_stride`, the GM source pointer advance, and `%dst_stride`, the UB destination pointer advance, both in bytes per loop1 iteration. They configure later GM→UB DMA execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```mlir
-pto.set_loop1_stride_outtoub %src_stride, %dst_stride : i64, i64
-```
-
-## Detailed Notes
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%src_stride` | 40 bits | GM source pointer advance per loop1 iteration (bytes) |
-| `%dst_stride` | 21 bits | UB destination pointer advance per loop1 iteration (bytes) |
-
-## Loop Stride Configuration (UB→GM)
-
-These ops configure the MTE3 DMA engine's hardware loops for UB→GM transfers. They must be set **before** calling `pto.copy_ubuf_to_gm`.
-
-Note: UB stride fields are 21 bits (sufficient for the 256 KB UB address space); GM stride fields are 40 bits (full GM address range).
-
-## Related Ops / Family Links
-
-- Family overview: [DMA Copy](../../dma-copy.md)
-- Previous op in family: [pto.set_loop2_stride_outtoub](./set-loop2-stride-outtoub.md)
-- Next op in family: [pto.set_loop_size_ubtoout](./set-loop-size-ubtoout.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub_zh.md
deleted file mode 100644
index 8e467954..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.set_loop1_stride_outtoub
-
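-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](set-loop1-stride-outtoub.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.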
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md
deleted file mode 100644
index fea97d28..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md` -->
-
-# pto.set_loop1_stride_ubtoout
-
-Standalone reference page for `pto.set_loop1_stride_ubtoout`. This page belongs to the [DMA Copy](../../dma-copy.md) family in the PTO ISA manual.
-
-## Summary
-
-Configure inner loop (loop1) pointer advance for UB→GM DMA.
-
-## Mechanism
-
-`pto.set_loop1_stride_ubtoout` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.set_loop1_stride_ubtoout %src_stride, %dst_stride : i64, i64
-```
-
-## Inputs
-
-The inputs are the two `i64` control operands shown in the syntax: `%src_stride`, the UB source pointer advance, and `%dst_stride`, the GM destination pointer advance, both in bytes per loop1 iteration. They configure later UB→GM DMA execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```mlir
-pto.set_loop1_stride_ubtoout %src_stride, %dst_stride : i64, i64
-```
-
-## Detailed Notes
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%src_stride` | 21 bits | UB source pointer advance per loop1 iteration (bytes) |
-| `%dst_stride` | 40 bits | GM destination pointer advance per loop1 iteration (bytes) |
-
-## DMA Transfer Execution
-
-## Related Ops / Family Links
-
-- Family overview: [DMA Copy](../../dma-copy.md)
-- Previous op in family: [pto.set_loop2_stride_ubtoout](./set-loop2-stride-ubtoout.md)
-- Next op in family: [pto.copy_gm_to_ubuf](./copy-gm-to-ubuf.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout_zh.md
deleted file mode 100644
index 0c6158df..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.set_loop1_stride_ubtoout
-
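-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](set-loop1-stride-ubtoout.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.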
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md
deleted file mode 100644
index 2c1d3796..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md` -->
-
-# pto.set_loop2_stride_outtoub
-
-Standalone reference page for `pto.set_loop2_stride_outtoub`. This page belongs to the [DMA Copy](../../dma-copy.md) family in the PTO ISA manual.
-
-## Summary
-
-Configure outer loop (loop2) pointer advance for GM→UB DMA.
-
-## Mechanism
-
-`pto.set_loop2_stride_outtoub` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.set_loop2_stride_outtoub %src_stride, %dst_stride : i64, i64
-```
-
-## Inputs
-
-The inputs are the two `i64` control operands shown in the syntax: `%src_stride`, the GM source pointer advance, and `%dst_stride`, the UB destination pointer advance, both in bytes per loop2 iteration. They configure later GM→UB DMA execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```mlir
-pto.set_loop2_stride_outtoub %src_stride, %dst_stride : i64, i64
-```
-
-## Detailed Notes
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%src_stride` | 40 bits | GM source pointer advance per loop2 iteration (bytes) |
-| `%dst_stride` | 21 bits | UB destination pointer advance per loop2 iteration (bytes) |
-
-After each loop2 iteration, the DMA engine advances the GM read pointer by `%src_stride` and the UB write pointer by `%dst_stride`.
-
-## Related Ops / Family Links
-
-- Family overview: [DMA Copy](../../dma-copy.md)
-- Previous op in family: [pto.set_loop_size_outtoub](./set-loop-size-outtoub.md)
-- Next op in family: [pto.set_loop1_stride_outtoub](./set-loop1-stride-outtoub.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub_zh.md
deleted file mode 100644
index 530a4540..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.set_loop2_stride_outtoub
-
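-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](set-loop2-stride-outtoub.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.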
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md
deleted file mode 100644
index 86275b35..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md` -->
-
-# pto.set_loop2_stride_ubtoout
-
-Standalone reference page for `pto.set_loop2_stride_ubtoout`. This page belongs to the [DMA Copy](../../dma-copy.md) family in the PTO ISA manual.
-
-## Summary
-
-Configure outer loop (loop2) pointer advance for UB→GM DMA.
-
-## Mechanism
-
-`pto.set_loop2_stride_ubtoout` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.set_loop2_stride_ubtoout %src_stride, %dst_stride : i64, i64
-```
-
-## Inputs
-
-The inputs are the two `i64` control operands shown in the syntax: `%src_stride`, the UB source pointer advance, and `%dst_stride`, the GM destination pointer advance, both in bytes per loop2 iteration. They configure later UB→GM DMA execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```mlir
-pto.set_loop2_stride_ubtoout %src_stride, %dst_stride : i64, i64
-```
-
-## Detailed Notes
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%src_stride` | 21 bits | UB source pointer advance per loop2 iteration (bytes) |
-| `%dst_stride` | 40 bits | GM destination pointer advance per loop2 iteration (bytes) |
-
-## Related Ops / Family Links
-
-- Family overview: [DMA Copy](../../dma-copy.md)
-- Previous op in family: [pto.set_loop_size_ubtoout](./set-loop-size-ubtoout.md)
-- Next op in family: [pto.set_loop1_stride_ubtoout](./set-loop1-stride-ubtoout.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout_zh.md
deleted file mode 100644
index da1625be..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.set_loop2_stride_ubtoout
-
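-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](set-loop2-stride-ubtoout.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.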
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/get-buf.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/get-buf.md
deleted file mode 100644
index e0f6046e..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/get-buf.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/get-buf.md` -->
-
-# pto.get_buf
-
-Standalone reference page for `pto.get_buf`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Acquire a buffer slot in a double-buffering protocol. Implicitly signals readiness to the consuming pipeline via the buffer-token event system.
-
-## Mechanism
-
-`pto.get_buf` acquires a named buffer slot for the calling (consumer) pipeline. It is the acquiring half of the `get_buf`/`rls_buf` double-buffering protocol.
-
-The operation:
-
-1. **Checks availability**: If the named buffer ID is held by another pipeline, the calling pipeline **blocks** until the holder releases it.
-2. **Acquires the slot**: Marks the buffer as held by the calling pipeline.
-3. **Implicitly signals**: Issues `set_flag` from the consumer pipeline to the producer pipeline on the buffer's associated event ID, allowing the producer to proceed.
-
-The consumer pipeline holds the buffer until a matching `rls_buf` is issued. Buffer IDs use program order and the double-buffering protocol to implicitly resolve RAW and WAR dependencies — no explicit event IDs are required.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-get_buf %buf_id, "PIPE_*", %mode : i64, i64
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-pto.get_buf %buf_id, "PIPE_*", %mode : i64, i64
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void GET_BUF(int64_t buf_id,
-                      pipe_t consumer_pipe,
-                      int64_t mode);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%buf_id` | `i64` | Buffer slot identifier in `[0, B)`, where `B` is the profile-defined maximum |
-| `"PIPE_*"` | pipe identifier | The consumer pipeline that is acquiring this slot |
-| `%mode` | `i64` | Protocol mode; controls how the release signals the next stage |
-
-## Expected Outputs
-
-None. This form is defined by its side effect on buffer state and synchronization.
-
-## Side Effects
-
-- Marks the buffer slot as held by the calling pipeline.
-- Implicitly issues `set_flag` from the consumer pipeline to the producer pipeline on the buffer's associated event ID.
-- May block if the buffer is not currently available.
-
-## Constraints
-
-- **Buffer ID uniqueness per pipeline**: Each pipeline may hold at most one slot per buffer ID at a time. Acquiring the same buffer ID twice on the same pipeline without an intervening `rls_buf` is **illegal**.
-- **Producer must release**: If the producer pipeline holds the buffer ID, it must issue `rls_buf` on it before this acquire can succeed. Acquiring a buffer that has never been acquired (first iteration) succeeds immediately, since all slots start free.
-- **No explicit event IDs**: Unlike `set_flag`/`wait_flag`, buffer ID management requires no explicit event naming. The hardware maps buffer IDs to internal event IDs.
-- **Buffer ID range**: Buffer IDs MUST be in the range `[0, B)` where `B` is the profile-defined maximum. Out-of-range IDs are **illegal**.
-
-## Exceptions
-
-- Illegal if `%buf_id` is not in the valid range for the target profile.
-- Illegal if the same pipeline acquires the same buffer ID twice without an intervening `rls_buf`.
-- Illegal if the current holder of the buffer ID never issues a matching `rls_buf` (the acquire blocks indefinitely, which is treated as a protocol error).
-- Illegal on the CPU simulator if the buffer state is inconsistent.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Buffer acquire | Simulated | Supported | Supported |
-| Implicit set_flag | Simulated | Supported | Supported |
-| Blocking on unavailable slot | Simulated | Supported | Supported |
-| Maximum buffer IDs | Implementation-defined | 32 (global pool) | 32 (global pool) |
-
-## Examples
-
-### Acquire buffer for computation
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void compute_loop(int64_t buf_id,
-                 Ptr<ub_space_t, ub_t> ub_in,
-                 Ptr<ub_space_t, ub_t> ub_out) {
-    // Consumer (Vector pipe) acquires input buffer
-    GET_BUF(buf_id, PIPE_V, 0);
-
-    // Vector load, compute, store ...
-    RegBuf<predicate_t> mask;
-    PSET_B32(mask, "PAT_ALL");
-
-    // Release input buffer so MTE2 can reuse it
-    RLS_BUF(buf_id, PIPE_V, 0);
-}
-```
-
-### SSA form — acquire in loop
-
-```mlir
-scf.for %i = %c0 to %N step %c1 {
-    // Acquire input buffer slot i%2
-    pto.get_buf %bufid_in[%pp], "PIPE_V", %c0 : i64, i64
-
-    // Acquire output buffer slot i%2
-    pto.get_buf %bufid_out[%pp], "PIPE_V", %c0 : i64, i64
-
-    // Compute (loads from ub_in[%pp], stores to ub_out[%pp])
-    // ...
-
-    // Release both slots
-    pto.rls_buf %bufid_in[%pp], "PIPE_V", %c0 : i64, i64
-    pto.rls_buf %bufid_out[%pp], "PIPE_V", %c0 : i64, i64
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Previous op in family: [pto.pipe_barrier](./pipe-barrier.md)
-- Next op in family: [pto.rls_buf](./rls-buf.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/get-buf_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/get-buf_zh.md
deleted file mode 100644
index c01edae8..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/get-buf_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.get_buf
-
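-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](get-buf.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.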
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/mem-bar.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/mem-bar.md
deleted file mode 100644
index 35478b44..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/mem-bar.md
+++ /dev/null
@@ -1,548 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/mem-bar.md` -->
-
-# pto.mem_bar
-
-Standalone reference page for `pto.mem_bar`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Intra-vector-pipe memory fence within `__VEC_SCOPE__`. Required when UB addresses alias between vector load/store operations.
-
-## Mechanism
-
-`pto.mem_bar` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.mem_bar "BARRIER_TYPE"    // BARRIER_TYPE ∈ { "VV_ALL", "VST_VLD", "VLD_VST" }
-```
-
-## Inputs
-
-The only input is the barrier-type attribute shown in the syntax, which must be one of `"VV_ALL"`, `"VST_VLD"`, or `"VLD_VST"`.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```c
-mem_bar(barrier_type);
-```
-
-```mlir
-pto.vsts %v0, %ub[%c0] : !pto.vreg<64xf32>, !pto.ptr<f32, ub>
-pto.mem_bar "VST_VLD"
-%v1 = pto.vlds %ub[%c0] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-```
-
-```mlir
-// ─── Stage 1: MTE2 loads data from GM into UB ───
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
-
-// MTE2 signals: "UB data is ready for Vector pipe"
-pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-
-// ─── Stage 2: Vector pipe consumes UB data ───
-// Vector waits until MTE2's signal arrives
-pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-
-scf.for %dummy = %c0 to %c1 step %c1 {
-  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-} {llvm.loop.aivector_scope}
-
-// Vector signals: "UB output is ready for MTE3"
-pto.set_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
-
-// ─── Stage 3: MTE3 stores result from UB back to GM ───
-// MTE3 waits until Vector's signal arrives
-pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
-
-pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
-```
-
-```mlir
-// ─── Stage 1: MTE2 loads data into UB ───
-// MTE2 acquires ub_ptr — blocks if Vector hasn't released it from a prior iteration
-pto.get_buf %bufid_ub_ptr, "PIPE_MTE2", %mode : i64, i64
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
-// MTE2 done writing ub_ptr — release it so Vector can consume
-pto.rls_buf %bufid_ub_ptr, "PIPE_MTE2", %mode : i64, i64
-
-// ─── Stage 2: Vector computation ───
-// Vector acquires ub_ptr (input) — blocks until MTE2 releases it (RAW: MTE2 write → V read)
-pto.get_buf %bufid_ub_ptr, "PIPE_V", %mode : i64, i64
-// Vector acquires ub_out (output) — blocks until MTE3 releases it from a prior iteration (WAR: MTE3 read → V write)
-pto.get_buf %bufid_ub_out, "PIPE_V", %mode : i64, i64
-
-scf.for %dummy = %c0 to %c1 step %c1 {
-  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-} {llvm.loop.aivector_scope}
-
-// Vector done reading ub_ptr — release so MTE2 can reuse it in next iteration
-pto.rls_buf %bufid_ub_ptr, "PIPE_V", %mode : i64, i64
-// Vector done writing ub_out — release so MTE3 can consume
-pto.rls_buf %bufid_ub_out, "PIPE_V", %mode : i64, i64
-
-// ─── Stage 3: MTE3 stores result to GM ───
-// MTE3 acquires ub_out — blocks until Vector releases it (RAW: V write → MTE3 read)
-pto.get_buf %bufid_ub_out, "PIPE_MTE3", %mode : i64, i64
-pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
-// MTE3 done reading ub_out — release so Vector can reuse it in next iteration
-pto.rls_buf %bufid_ub_out, "PIPE_MTE3", %mode : i64, i64
-```
-
-```mlir
-// ═══ Pre-loop: prime ALL reverse-dependency signals ═══
-// Both input and output buffers start unused. We must pre-send
-// reverse-dep signals so the first iteration's wait_flags don't deadlock.
-pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_0"]   // ◀ PRIME: buf_in[0] "free"
-pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_1"]   // ◀ PRIME: buf_in[1] "free"
-pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_0"]  // ◀ PRIME: buf_out[0] "free"
-pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_1"]  // ◀ PRIME: buf_out[1] "free"
-
-scf.for %i = %c0 to %N step %c1 {
-  // ── All 3 stages in same iteration, indexed by i%2 ──
-  // %pp = i % 2  (ping/pong selector for buffer & event IDs)
-
-  // ── MTE2: load tile[i] into buf_in[i%2] ──
-  // WAR: wait until Vector has released buf_in[i%2] from iteration i-2
-  pto.wait_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
-  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_in[%pp], ...
-  // RAW: signal Vector that buf_in[i%2] data is ready
-  pto.set_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
-
-  // ── Vector: compute buf_in[i%2] → buf_out[i%2] ──
-  // RAW: wait for MTE2 to finish loading buf_in[i%2]
-  pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
-  // WAR: wait for MTE3 to finish reading buf_out[i%2] from iteration i-2
-  pto.wait_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
-  scf.for %dummy = %c0 to %c1 step %c1 {
-    %v   = pto.vlds %ub_in[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-  } {llvm.loop.aivector_scope}
-  // WAR: tell MTE2 "done reading buf_in[i%2]"
-  pto.set_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
-  // RAW: tell MTE3 "buf_out[i%2] result ready"
-  pto.set_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
-
-  // ── MTE3: store result from buf_out[i%2] to GM ──
-  // RAW: wait for Vector to finish writing buf_out[i%2]
-  pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
-  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
-  // WAR: tell Vector "done reading buf_out[i%2]"
-  pto.set_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
-}
-
-// ═══ Post-loop: drain — match every pre-loop prime with a wait ═══
-// Every set_flag must be paired with a wait_flag. The reverse-dependency
-// set_flags issued in the last two iterations have no matching wait_flag
-// inside the loop (there is no iteration i+2), so drain them here.
-pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-1)%2}"]  // ◀ DRAIN
-pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-2)%2}"]  // ◀ DRAIN
-pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-1)%2}"] // ◀ DRAIN
-pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-2)%2}"] // ◀ DRAIN
-```
-
-```mlir
-scf.for %i = %c0 to %N step %c1 {
-  // %pp = i % 2  (ping/pong selector)
-
-  // ── MTE2: load tile[i] into buf[i%2] ──
-  // Acquires buf[i%2] — on first iteration, buffer is free so proceeds immediately.
-  // On later iterations, blocks until Vector releases buf[i%2] (WAR: automatic).
-  pto.get_buf %bufid_buf[%pp], "PIPE_MTE2"
-  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_buf[%pp], ...
-  pto.rls_buf %bufid_buf[%pp], "PIPE_MTE2"
-
-  // ── Vector: compute on buf[i%2] ──
-  // Acquires buf[i%2] — blocks until MTE2 releases it (RAW: automatic)
-  pto.get_buf %bufid_buf[%pp], "PIPE_V"
-  pto.get_buf %bufid_out[%pp], "PIPE_V"
-  scf.for %dummy = %c0 to %c1 step %c1 {
-    %v   = pto.vlds %ub_buf[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-  } {llvm.loop.aivector_scope}
-  // Release buf[i%2] — MTE2 can reuse in iteration i+2 (WAR resolved)
-  pto.rls_buf %bufid_buf[%pp], "PIPE_V"
-  pto.rls_buf %bufid_out[%pp], "PIPE_V"
-
-  // ── MTE3: store result ──
-  // Acquires out[i%2] — blocks until Vector releases it (RAW: automatic)
-  pto.get_buf %bufid_out[%pp], "PIPE_MTE3"
-  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
-  pto.rls_buf %bufid_out[%pp], "PIPE_MTE3"
-}
-// No post-loop drain needed — last rls_buf completes the pipeline.
-```
-
-```text
-Core Cluster (1:2 ratio)
-┌─────────────────────────────────────────────┐
-│  ┌──────────────┐    ┌──────────────┐       │
-│  │  AIC (Cube)  │    │  AIV0 (Vec)  │       │
-│  │  ┌────────┐  │    │  ┌────────┐  │       │
-│  │  │   SU   │──┼────┼──│   SU   │  │       │
-│  │  └────────┘  │    │  └────────┘  │       │
-│  │  CUBE pipe   │    │  MTE2/V/MTE3 │       │
-│  │  L0C buffer  │    │  UB (256KB)  │       │
-│  └──────────────┘    └──────────────┘       │
-│                      ┌──────────────┐       │
-│                      │  AIV1 (Vec)  │       │
-│                      │  ┌────────┐  │       │
-│                      │  │   SU   │  │       │
-│                      │  └────────┘  │       │
-│                      │  MTE2/V/MTE3 │       │
-│                      │  UB (256KB)  │       │
-│                      └──────────────┘       │
-└─────────────────────────────────────────────┘
-```
-
-```c
-// mode2 broadcast/reduce semantics for 1:2 cluster
-set_cross_core(pipe, semaphore_id);   // pipe: VEC/MTE2/CUBE/FIX
-wait_flag_dev(semaphore_id);          // SU-level blocking
-```
-
-```
-C→V Broadcast (one set reaches both):
-    AIC ──set_cross_core──┬──> AIV0 sema++
-                          └──> AIV1 sema++
-
-V→C Reduce (one wait for both):
-    AIV0 ──set_cross_core──┐
-                           ├──> AIC wait_flag_dev (blocks until BOTH)
-    AIV1 ──set_cross_core──┘
-```
-
-## Detailed Notes
-
-```c
-mem_bar(barrier_type);
-```
-
-**Barrier types:**
-
-| Type | Semantics |
-|------|-----------|
-| `VV_ALL` | All prior vector ops complete before subsequent |
-| `VST_VLD` | All prior vector stores visible before subsequent loads |
-| `VLD_VST` | All prior vector loads complete before subsequent stores |
-
-**Example:** Ensure stores are visible before loads to same UB region:
-```mlir
-pto.vsts %v0, %ub[%c0] : !pto.vreg<64xf32>, !pto.ptr<f32, ub>
-pto.mem_bar "VST_VLD"
-%v1 = pto.vlds %ub[%c0] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-```
-
-## Intra-Core Sync Patterns & Examples
-
-### Example 1: `set_flag` / `wait_flag` (Explicit Events)
-
-Each cross-pipeline data dependency requires an explicit signal/wait pair. The programmer must manually insert `set_flag` after the producer and `wait_flag` before the consumer.
-
-```mlir
-// ─── Stage 1: MTE2 loads data from GM into UB ───
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
-
-// MTE2 signals: "UB data is ready for Vector pipe"
-pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-
-// ─── Stage 2: Vector pipe consumes UB data ───
-// Vector waits until MTE2's signal arrives
-pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-
-scf.for %dummy = %c0 to %c1 step %c1 {
-  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-} {llvm.loop.aivector_scope}
-
-// Vector signals: "UB output is ready for MTE3"
-pto.set_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
-
-// ─── Stage 3: MTE3 stores result from UB back to GM ───
-// MTE3 waits until Vector's signal arrives
-pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
-
-pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
-```
-
-**Key property:** Every cross-pipeline edge is an explicit `(set_flag, wait_flag)` pair. Simple for straight-line code, but gets verbose in loops (see Example 3).
-
-### Example 2: `get_buf` / `rls_buf` (Resource-Based)
-
-Instead of naming events, each pipeline declares when it **acquires** (`get_buf`) and **releases** (`rls_buf`) a shared UB buffer. Cross-pipeline RAW/WAR dependencies are resolved implicitly by program order — if MTE2 releases `buf_A` and Vector later acquires `buf_A`, the hardware ensures the acquire cannot proceed until the release completes.
-
-```mlir
-// ─── Stage 1: MTE2 loads data into UB ───
-// MTE2 acquires ub_ptr — blocks if Vector hasn't released it from a prior iteration
-pto.get_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
-// MTE2 done writing ub_ptr — release it so Vector can consume
-pto.rls_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64
-
-// ─── Stage 2: Vector computation ───
-// Vector acquires ub_ptr (input) — blocks until MTE2 releases it (RAW: MTE2 write → V read)
-pto.get_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
-// Vector acquires ub_out (output) — blocks until MTE3 releases it from a prior iteration (WAR: MTE3 read → V write)
-pto.get_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64
-
-scf.for %dummy = %c0 to %c1 step %c1 {
-  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-} {llvm.loop.aivector_scope}
-
-// Vector done reading ub_ptr — release so MTE2 can reuse it in next iteration
-pto.rls_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
-// Vector done writing ub_out — release so MTE3 can consume
-pto.rls_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64
-
-// ─── Stage 3: MTE3 stores result to GM ───
-// MTE3 acquires ub_out — blocks until Vector releases it (RAW: V write → MTE3 read)
-pto.get_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64
-pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
-// MTE3 done reading ub_out — release so Vector can reuse it in next iteration
-pto.rls_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64
-```
-
-**Key property:** No event IDs needed. Dependencies are implicit from program order of `get_buf`/`rls_buf` on the same buffer ID. This becomes much more convenient in multi-iteration loops (see Example 3).
-
-### Example 3: Ping/Pong Double-Buffering Loop
-
-Double-buffering overlaps DMA and compute by using two UB buffers alternately. All three stages (MTE2, Vector, MTE3) appear in the **same iteration** — the hardware pipelines them across iterations because different iterations operate on different buffers (`buf[i%2]`).
-
-#### Event ID scheme (`set_flag` / `wait_flag`)
-
-With 2 ping/pong buffers and 2 pipeline pairs (MTE2↔V, V↔MTE3), `set_flag`/`wait_flag` needs **8 event IDs** = 2 pipe-pairs × 2 buffers × (forward + reverse):
-
-**MTE2 ↔ Vector (input buffers):**
-
-| Event ID | Direction | Purpose |
-|----------|-----------|---------|
-| `EVT_IN_FWD_0` | MTE2 → V | RAW: buf_in[0] data ready |
-| `EVT_IN_FWD_1` | MTE2 → V | RAW: buf_in[1] data ready |
-| `EVT_IN_REV_0` | V → MTE2 | WAR: Vector done reading buf_in[0] |
-| `EVT_IN_REV_1` | V → MTE2 | WAR: Vector done reading buf_in[1] |
-
-**Vector ↔ MTE3 (output buffers):**
-
-| Event ID | Direction | Purpose |
-|----------|-----------|---------|
-| `EVT_OUT_FWD_0` | V → MTE3 | RAW: buf_out[0] result ready |
-| `EVT_OUT_FWD_1` | V → MTE3 | RAW: buf_out[1] result ready |
-| `EVT_OUT_REV_0` | MTE3 → V | WAR: MTE3 done reading buf_out[0] |
-| `EVT_OUT_REV_1` | MTE3 → V | WAR: MTE3 done reading buf_out[1] |
-
-#### 3a. `set_flag` / `wait_flag` version
-
-```mlir
-// ═══ Pre-loop: prime ALL reverse-dependency signals ═══
-// Both input and output buffers start unused. We must pre-send
-// reverse-dep signals so the first iteration's wait_flags don't deadlock.
-pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_0"]   // ◀ PRIME: buf_in[0] "free"
-pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_1"]   // ◀ PRIME: buf_in[1] "free"
-pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_0"]  // ◀ PRIME: buf_out[0] "free"
-pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_1"]  // ◀ PRIME: buf_out[1] "free"
-
-scf.for %i = %c0 to %N step %c1 {
-  // ── All 3 stages in same iteration, indexed by i%2 ──
-  // %pp = i % 2  (ping/pong selector for buffer & event IDs)
-
-  // ── MTE2: load tile[i] into buf_in[i%2] ──
-  // WAR: wait until Vector has released buf_in[i%2] from iteration i-2
-  pto.wait_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
-  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_in[%pp], ...
-  // RAW: signal Vector that buf_in[i%2] data is ready
-  pto.set_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
-
-  // ── Vector: compute buf_in[i%2] → buf_out[i%2] ──
-  // RAW: wait for MTE2 to finish loading buf_in[i%2]
-  pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
-  // WAR: wait for MTE3 to finish reading buf_out[i%2] from iteration i-2
-  pto.wait_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
-  scf.for %dummy = %c0 to %c1 step %c1 {
-    %v   = pto.vlds %ub_in[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-  } {llvm.loop.aivector_scope}
-  // WAR: tell MTE2 "done reading buf_in[i%2]"
-  pto.set_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
-  // RAW: tell MTE3 "buf_out[i%2] result ready"
-  pto.set_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
-
-  // ── MTE3: store result from buf_out[i%2] to GM ──
-  // RAW: wait for Vector to finish writing buf_out[i%2]
-  pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
-  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
-  // WAR: tell Vector "done reading buf_out[i%2]"
-  pto.set_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
-}
-
-// ═══ Post-loop: drain — match every pre-loop prime with a wait ═══
-// Each priming set_flag must be paired with a wait. The reverse-dep
-// set_flags issued by the last two iterations have no matching wait_flag
-// inside the loop (there is no iteration i+2), so drain them here.
-pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-1)%2}"]  // ◀ DRAIN
-pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-2)%2}"]  // ◀ DRAIN
-pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-1)%2}"] // ◀ DRAIN
-pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-2)%2}"] // ◀ DRAIN
-```
-
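-**What `set_flag`/`wait_flag` requires outside the loop:** four priming `set_flag`s before the loop and four draining `wait_flag`s after it, one per reverse-dependency event ID.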
-
-#### 3b. `get_buf` / `rls_buf` version
-
-Same ping/pong double-buffering, but **no pre-loop priming or post-loop draining needed.** Buffer acquire/release semantics handle everything.
-
-```mlir
-scf.for %i = %c0 to %N step %c1 {
-  // %pp = i % 2  (ping/pong selector)
-
-  // ── MTE2: load tile[i] into buf[i%2] ──
-  // Acquires buf[i%2] — on first iteration, buffer is free so proceeds immediately.
-  // On later iterations, blocks until Vector releases buf[i%2] (WAR: automatic).
-  pto.get_buf %bufid_buf[%pp], "PIPE_MTE2"
-  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_buf[%pp], ...
-  pto.rls_buf %bufid_buf[%pp], "PIPE_MTE2"
-
-  // ── Vector: compute on buf[i%2] ──
-  // Acquires buf[i%2] — blocks until MTE2 releases it (RAW: automatic)
-  pto.get_buf %bufid_buf[%pp], "PIPE_V"
-  pto.get_buf %bufid_out[%pp], "PIPE_V"
-  scf.for %dummy = %c0 to %c1 step %c1 {
-    %v   = pto.vlds %ub_buf[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-  } {llvm.loop.aivector_scope}
-  // Release buf[i%2] — MTE2 can reuse in iteration i+2 (WAR resolved)
-  pto.rls_buf %bufid_buf[%pp], "PIPE_V"
-  pto.rls_buf %bufid_out[%pp], "PIPE_V"
-
-  // ── MTE3: store result ──
-  // Acquires out[i%2] — blocks until Vector releases it (RAW: automatic)
-  pto.get_buf %bufid_out[%pp], "PIPE_MTE3"
-  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
-  pto.rls_buf %bufid_out[%pp], "PIPE_MTE3"
-}
-// No post-loop drain needed — last rls_buf completes the pipeline.
-```
-
-**No priming, no draining, no event IDs.** The acquire/release protocol on buffer IDs indexed by `i%2` implicitly resolves all cross-pipeline dependencies:
-- **RAW** (MTE2→V): Vector's `get_buf` blocks until MTE2's `rls_buf` on `buf[i%2]`
-- **WAR** (V→MTE2): MTE2's `get_buf` in iteration `i+2` blocks until Vector's `rls_buf` in iteration `i` (same buffer)
-
-## Comparison Summary
-
-| Aspect | `set_flag` / `wait_flag` | `get_buf` / `rls_buf` |
-|--------|--------------------------|------------------------|
-| Dependency model | Explicit event signals | Implicit via buffer acquire/release |
-| IDs per pipe-pair | **4** = 2 buffers × 2 directions (fwd + rev); 8 in total across the 2 pipe-pairs | 1 fwd + 1 rev per buffer (shared global pool) |
-| Total HW IDs | 8 per pipe-pair, grows with buffers | **32 global** across all pipes |
-| Reverse (WAR) deps | Extra `set_flag`/`wait_flag` pair per buffer | Handled automatically |
-| Pre-loop setup | `set_flag` to prime each reverse dep | None |
-| Post-loop teardown | `wait_flag` to drain all primed signals | None |
-| Straight-line code | Simple, clear | Slightly more verbose (bracket each stage) |
-| Ping/pong loops | 8 event IDs + 4 prime + 4 drain | Same pattern, no overhead |
-| Best used for | Simple pipelines, fine-grained control | Double/multi-buffering, complex loops |
-
-## Inter-Core Sync
-
-> **Note:** Inter-core sync is only needed for **mixed Cube+Vector tasks** where Cube produces data that Vector consumes (or vice versa). **Vec-only tasks can ignore this section entirely.**
-
-These ops coordinate execution across the Cube block and Vector subblocks within a cluster. Each core cluster consists of **1 Cube block : 2 Vector subblocks**, each with its own **SU (Sequencer Unit)** running independent instruction streams.
-
-```
-Core Cluster (1:2 ratio)
-┌─────────────────────────────────────────────┐
-│  ┌──────────────┐    ┌──────────────┐       │
-│  │  AIC (Cube)  │    │  AIV0 (Vec)  │       │
-│  │  ┌────────┐  │    │  ┌────────┐  │       │
-│  │  │   SU   │──┼────┼──│   SU   │  │       │
-│  │  └────────┘  │    │  └────────┘  │       │
-│  │  CUBE pipe   │    │  MTE2/V/MTE3 │       │
-│  │  L0C buffer  │    │  UB (256KB)  │       │
-│  └──────────────┘    └──────────────┘       │
-│                      ┌──────────────┐       │
-│                      │  AIV1 (Vec)  │       │
-│                      │  ┌────────┐  │       │
-│                      │  │   SU   │  │       │
-│                      │  └────────┘  │       │
-│                      │  MTE2/V/MTE3 │       │
-│                      │  UB (256KB)  │       │
-│                      └──────────────┘       │
-└─────────────────────────────────────────────┘
-```
-
-### Platform Comparison
-
-| Aspect | A2A3 (Ascend 910) | A5 |
-|--------|-------------------|-----------------|
-| **Signal op** | `set_cross_core` (mode2) | `set_intra_block` |
-| **Wait op** | `wait_flag_dev` | `wait_intra_core` |
-| **Wait behavior** | SU-level blocking (entire core stalls) | Per-pipeline (only named pipe stalls) |
-| **Semaphore pool** | 16 IDs per cluster, 4-bit counter | 16 IDs, but 32-ID address space (see below) |
-| **C→V** | **Broadcast**: one `set` reaches both AIV0+AIV1 | **1:1**: separate `set` per subblock required |
-| **V→C** | **Reduce**: Cube waits for both subblocks in one `wait` | **1:1**: Cube needs separate `wait` per subblock |
-
-### A2A3: `set_cross_core` / `wait_flag_dev`
-
-```c
-// mode2 broadcast/reduce semantics for 1:2 cluster
-set_cross_core(pipe, semaphore_id);   // pipe: VEC/MTE2/CUBE/FIX
-wait_flag_dev(semaphore_id);          // SU-level blocking
-```
-
-```
-C→V Broadcast (one set reaches both):
-    AIC ──set_cross_core──┬──> AIV0 sema++
-                          └──> AIV1 sema++
-
-V→C Reduce (one wait for both):
-    AIV0 ──set_cross_core──┐
-                           ├──> AIC wait_flag_dev (blocks until BOTH)
-    AIV1 ──set_cross_core──┘
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Previous op in family: [pto.rls_buf](./rls-buf.md)
-- Next op in family: [pto.set_cross_core](./set-cross-core.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/mem-bar_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/mem-bar_zh.md
deleted file mode 100644
index 2c38b9de..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/mem-bar_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.mem_bar
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](mem-bar.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/pipe-barrier.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/pipe-barrier.md
deleted file mode 100644
index 4cc6c49b..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/pipe-barrier.md
+++ /dev/null
@@ -1,88 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/pipe-barrier.md` -->
-
-# pto.pipe_barrier
-
-Standalone reference page for `pto.pipe_barrier`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Drain all pending ops in the specified pipe. All previously issued operations on that pipe complete before any subsequent operation begins.
-
-## Mechanism
-
-`pto.pipe_barrier` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.pipe_barrier "PIPE_*"
-```
-
-## Inputs
-
-The inputs are the architecture-visible control operands shown in the syntax: pipe ids, event ids, buffer ids, loop/stride values, pointers, or configuration words used to drive later execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```c
-pipe_barrier(pipe);
-```
-
-```mlir
-// Both stores target the same GM address — order matters!
-pto.copy_ubuf_to_gm %ub_partial_0, %gm_result, ...
-// Without pipe_barrier, MTE3 could execute the second copy before the first
-// completes, producing a non-deterministic result at %gm_result.
-pto.pipe_barrier "PIPE_MTE3"
-// After barrier: first copy is guaranteed complete. Second copy overwrites deterministically.
-pto.copy_ubuf_to_gm %ub_partial_1, %gm_result, ...
-```
-
-## Detailed Notes
-
-```c
-pipe_barrier(pipe);
-```
-
-**Pipe identifiers:** `PIPE_MTE2`, `PIPE_V`, `PIPE_MTE3`
-
-**Example:** Two back-to-back `copy_ubuf_to_gm` calls writing to the same GM address. Without a barrier, MTE3 may reorder them and the final GM value is non-deterministic:
-
-```mlir
-// Both stores target the same GM address — order matters!
-pto.copy_ubuf_to_gm %ub_partial_0, %gm_result, ...
-// Without pipe_barrier, MTE3 could execute the second copy before the first
-// completes, producing a non-deterministic result at %gm_result.
-pto.pipe_barrier "PIPE_MTE3"
-// After barrier: first copy is guaranteed complete. Second copy overwrites deterministically.
-pto.copy_ubuf_to_gm %ub_partial_1, %gm_result, ...
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Previous op in family: [pto.wait_flag](./wait-flag.md)
-- Next op in family: [pto.get_buf](./get-buf.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/pipe-barrier_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/pipe-barrier_zh.md
deleted file mode 100644
index dd06d967..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/pipe-barrier_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pipe_barrier
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](pipe-barrier.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/rls-buf.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/rls-buf.md
deleted file mode 100644
index 433c1099..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/rls-buf.md
+++ /dev/null
@@ -1,126 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/rls-buf.md` -->
-
-# pto.rls_buf
-
-Standalone reference page for `pto.rls_buf`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Release a buffer slot, implicitly signaling the consuming pipeline to proceed.
-
-## Mechanism
-
-`pto.rls_buf` releases a previously acquired buffer slot for the calling (producer) pipeline. It is the releasing half of the `get_buf`/`rls_buf` double-buffering protocol.
-
-The operation:
-
-1. **Releases the slot**: Marks the buffer as free for the calling (producer) pipeline.
-2. **Implicitly signals**: Issues `set_flag` from the producer pipeline to the consumer pipeline on the buffer's associated event ID, unblocking the consumer.
-
-After `rls_buf`, the producer pipeline no longer holds the buffer and MUST NOT access it until it re-acquires it in a future iteration. The consumer pipeline is unblocked by the implicit `set_flag` and may proceed.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-rls_buf %buf_id, "PIPE_*", %mode : i64, i64
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-pto.rls_buf %buf_id, "PIPE_*", %mode : i64, i64
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void RLS_BUF(int64_t buf_id,
-                      pipe_t producer_pipe,
-                      int64_t mode);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%buf_id` | `i64` | Buffer slot identifier (must match a prior `get_buf`) |
-| `"PIPE_*"` | pipe identifier | The producer pipeline that is releasing this slot |
-| `%mode` | `i64` | Protocol mode; controls how the next stage is signaled |
-
-## Expected Outputs
-
-None. This form is defined by its side effect on buffer state and synchronization.
-
-## Side Effects
-
-- Marks the buffer slot as free for the calling pipeline.
-- Implicitly issues `set_flag` from the producer pipeline to the consumer pipeline on the buffer's associated event ID.
-- Does **not** block.
-
-## Constraints
-
-- **Must match prior acquire**: The calling pipeline MUST have previously acquired the named buffer ID via `get_buf`. Releasing a buffer that was never acquired is **illegal**.
-- **Release-after-produce order**: `rls_buf` MUST be issued only after the producer has completed all work on the buffer. Releasing before the data is ready produces **implementation-defined** results.
-- **One release per acquire**: Each `get_buf` MUST be matched by exactly one `rls_buf` before the next `get_buf` on the same pipeline and buffer ID. Extra releases or missing releases are **illegal**.
-- **Producer-consumer pairing**: The pipeline named in `rls_buf` is the producer pipeline (the one that wrote to the buffer). The matching `get_buf` names the consumer pipeline.
-
-## Exceptions
-
-- Illegal if `%buf_id` was not previously acquired by the calling pipeline.
-- Illegal if an extra `rls_buf` is issued without a matching prior `get_buf`.
-- Illegal if `rls_buf` is issued before the producer has finished writing to the buffer (data hazard).
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Buffer release | Simulated | Supported | Supported |
-| Implicit set_flag | Simulated | Supported | Supported |
-| Maximum buffer IDs | Implementation-defined | 32 (global pool) | 32 (global pool) |
-
-## Examples
-
-### Release after DMA load
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void load_and_release(int64_t buf_id,
-                      Ptr<ub_space_t, ub_t> gm_src,
-                      Ptr<ub_space_t, ub_t> ub_dst) {
-    // Acquire buffer slot (MTE2 acquires to write)
-    GET_BUF(buf_id, PIPE_MTE2, 0);
-
-    // MTE2 DMA load: GM → UB
-    COPY_GM_TO_UBUF(gm_src, ub_dst, /* ... */);
-
-    // Release: MTE2 signals Vector that data is ready
-    RLS_BUF(buf_id, PIPE_MTE2, 0);
-}
-```
-
-### SSA form — matching acquire/release
-
-```mlir
-// Producer (MTE2) acquires, loads, releases
-pto.get_buf %bufid_in[%pp], "PIPE_MTE2", %c0 : i64, i64
-pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_in[%pp], ...
-pto.rls_buf %bufid_in[%pp], "PIPE_MTE2", %c0 : i64, i64
-
-// Consumer (Vector) acquires, computes, releases
-pto.get_buf %bufid_in[%pp], "PIPE_V", %c0 : i64, i64
-// ... vector compute on ub_in[%pp] ...
-pto.rls_buf %bufid_in[%pp], "PIPE_V", %c0 : i64, i64
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Previous op in family: [pto.get_buf](./get-buf.md)
-- Next op in family: [pto.mem_bar](./mem-bar.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/rls-buf_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/rls-buf_zh.md
deleted file mode 100644
index b1f4b3f8..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/rls-buf_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.rls_buf
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](rls-buf.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-cross-core.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-cross-core.md
deleted file mode 100644
index 654f3667..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-cross-core.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/set-cross-core.md` -->
-
-# pto.set_cross_core
-
-Standalone reference page for `pto.set_cross_core`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Signal an event to another core in a cluster (A2A3). Uses **mode2 broadcast semantics**: one signal reaches both vector subblocks simultaneously; the Cube blocks until both subblocks have signaled back.
-
-## Mechanism
-
-`pto.set_cross_core` signals an event between execution units in a Core Cluster on the A2A3 platform.
-
-**Mode2 semantics (A2A3 cluster, 1 Cube : 2 Vector subblocks):**
-
-- **C→V broadcast**: One `set_cross_core` from the Cube (AIC) atomically increments the semaphore for **both** AIV0 and AIV1 subblocks simultaneously.
-- **V→C reduce**: When the Cube calls `wait_flag_dev`, it blocks until **both** AIV0 and AIV1 have called `set_cross_core` on the same semaphore. Only then does the Cube unblock.
-
-This is a hardware reduce operation: the Cube need only issue one `wait_flag_dev` to synchronize with both subblocks, rather than one per subblock.
-
-The semaphore is a counter: `set_cross_core` increments it and `wait_flag_dev` decrements it. A `wait_flag_dev` blocks while the counter is zero and unblocks once the required signals have incremented it.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-set_cross_core %core_id, %event_id : i64, i64
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-pto.set_cross_core %core_id, %event_id : i64, i64
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void SET_CROSS_CORE(int64_t core_id, int64_t event_id);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%core_id` | `i64` | Target core identifier (subblock selector: 0 = AIV0, 1 = AIV1) |
-| `%event_id` | `i64` | Semaphore/event identifier |
-
-## Expected Outputs
-
-None. This form is defined by its side effect on inter-core synchronization state.
-
-## Side Effects
-
-- Atomically signals the named event to the target core.
-- On mode2: increments semaphore for both subblocks simultaneously.
-
-## Constraints
-
-- **A2A3 only**: `set_cross_core` is only available on the A2A3 profile. Programs that use this operation MUST provide a fallback path for other profiles.
-- **Semaphore pool**: The pool has 16 physical semaphore IDs per cluster. The hardware implements a 4-bit counter (0–15). `set_cross_core` increments the counter; `wait_flag_dev` decrements it. If the counter would overflow past 15, the behavior is **implementation-defined**.
-- **Broadcast vs. per-subblock**: The broadcast behavior is specific to mode2. Other modes (if supported) may have different semantics.
-- **core_id meaning**: `core_id = 0` targets AIV0 subblock; `core_id = 1` targets AIV1 subblock. Other values are **illegal**.
-
-## Exceptions
-
-- Illegal on non-A2A3 profiles.
-- Illegal if `%event_id` is outside the valid range (0–15).
-- Illegal if the hardware counter would overflow.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| `set_cross_core` | Not available | Supported (mode2) | Use `set_intra_block` |
-| Mode2 broadcast semantics | Not applicable | Supported | Not applicable |
-| Semaphore pool size | Not applicable | 16 IDs | Not applicable |
-| Per-subblock signaling | Not applicable | 1 set reaches both | 1 set per subblock |
-
-CPU simulator does not implement `set_cross_core`. Portable programs MUST guard this operation with profile checks or provide CPU-sim fallback.
-
-## Examples
-
-### A2A3: Cube broadcasts to both vector subblocks
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Cube signals both AIV subblocks simultaneously
-SET_CROSS_CORE(/* core_id */ 0, /* event_id */ 0);  // broadcast to both AIV0 and AIV1
-```
-
-### SSA form — Cube→Vector broadcast
-
-```mlir
-// Cube: broadcast completion signal to both AIV0 and AIV1
-pto.set_cross_core %c0_i64, %c0_i64 : i64, i64
-
-// Both AIV subblocks receive the signal (atomic broadcast)
-
-// AIV0: signals back when its work is done
-pto.set_cross_core %c0_i64, %c1_i64 : i64, i64  // signals event 1
-
-// AIV1: signals back when its work is done
-pto.set_cross_core %c1_i64, %c1_i64 : i64, i64  // signals event 1
-
-// Cube: waits for BOTH AIV0 and AIV1 (reduce)
-pto.wait_flag_dev %c1_i64 : i64
-// Unblocks only when both subblocks have signaled
-```
-
-### SSA form — Vector→Cube reduce
-
-```mlir
-// AIV0: signal Cube that vector work segment is done
-pto.set_cross_core %c0_i64, %c2_i64 : i64, i64
-
-// AIV1: signal Cube that vector work segment is done
-pto.set_cross_core %c1_i64, %c2_i64 : i64, i64
-
-// Cube: waits for both subblocks on one semaphore (reduce)
-pto.wait_flag_dev %c2_i64 : i64
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Previous op in family: [pto.mem_bar](./mem-bar.md)
-- Next op in family: [pto.wait_flag_dev](./wait-flag-dev.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-cross-core_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-cross-core_zh.md
deleted file mode 100644
index 1c9229ca..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-cross-core_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.set_cross_core
-
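-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on Chinese paths.
-
-## Current status
-
-- [Corresponding English page](set-cross-core.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.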
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-flag.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-flag.md
deleted file mode 100644
index 8c33066e..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-flag.md
+++ /dev/null
@@ -1,72 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/set-flag.md` -->
-
-# pto.set_flag
-
-Standalone reference page for `pto.set_flag`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Signal an event from the source pipe to the destination pipe.
-
-## Mechanism
-
-`pto.set_flag` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.set_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]
-```
-
-## Inputs
-
-The inputs are the architecture-visible control operands shown in the syntax: pipe ids, event ids, buffer ids, loop/stride values, pointers, or configuration words used to drive later execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```c
-set_flag(src_pipe, dst_pipe, event_id);
-```
-
-```mlir
-pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-```
-
-## Detailed Notes
-
-```c
-set_flag(src_pipe, dst_pipe, event_id);
-```
-
-**Example:** After MTE2 completes GM→UB transfer, signal Vector pipe:
-```mlir
-pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Next op in family: [pto.wait_flag](./wait-flag.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-flag_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-flag_zh.md
deleted file mode 100644
index 8b8f3a05..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-flag_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.set_flag
-
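-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on Chinese paths.
-
-## Current status
-
-- [Corresponding English page](set-flag.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.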
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-intra-block.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-intra-block.md
deleted file mode 100644
index e078ef74..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-intra-block.md
+++ /dev/null
@@ -1,129 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/set-intra-block.md` -->
-
-# pto.set_intra_block
-
-Standalone reference page for `pto.set_intra_block`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Signal an event within a cluster (A5). Uses **1:1 per-subblock semantics**: each call targets exactly one subblock. No broadcast; separate calls are required for each subblock.
-
-## Mechanism
-
-`pto.set_intra_block` signals an event within a Core Cluster on the A5 platform.
-
-**1:1 semantics (A5 cluster, 1 Cube : 2 Vector subblocks):**
-
-- Each `set_intra_block` call targets **exactly one** subblock (determined by the semaphore ID).
-- IDs 0–15 target AIV0; IDs 16–31 (base + 16) target AIV1.
-- There is **no broadcast**: to signal both subblocks, two separate `set_intra_block` calls are required.
-
-This contrasts with A2A3's `set_cross_core`, which broadcasts to both subblocks with one call.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-set_intra_block %pipe, %sem_id : i64, i64
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-pto.set_intra_block %pipe, %sem_id : i64, i64
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void SET_INTRA_BLOCK(pipe_t trigger_pipe, int64_t sem_id);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%pipe` | `pipe_t` | The triggering pipeline on the calling core |
-| `%sem_id` | `i64` | Semaphore ID: 0–15 for AIV0, 16–31 (base + 16) for AIV1 |
-
-## Expected Outputs
-
-None. This form is defined by its side effect on intra-block synchronization state.
-
-## Side Effects
-
-- Signals the named semaphore to the target subblock.
-- The target subblock's `wait_intra_core` unblocks when the count reaches zero.
-
-## Constraints
-
-- **A5 only**: `set_intra_block` is only available on the A5 profile.
-- **Semaphore ID mapping**: IDs 0–15 target AIV0; IDs 16–31 target AIV1. Programs MUST use the correct ID for the target subblock.
-- **No broadcast**: Unlike A2A3's `set_cross_core`, one `set_intra_block` does NOT reach both subblocks. Separate calls are required for each subblock.
-- **Semaphore pool**: 16 physical IDs with a 32-ID address space. IDs outside 0–31 are **illegal**.
-
-## Exceptions
-
-- Illegal on non-A5 profiles.
-- Illegal if `%sem_id` is outside the range 0–31.
-- Illegal if the target subblock is not reachable (invalid core ID encoding in sem_id).
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| `set_intra_block` | Not available | Use `set_cross_core` | Supported |
-| Broadcast semantics | Not applicable | One set → both subblocks | One set → one subblock |
-| Per-subblock control | Not applicable | Not available | Supported |
-| Semaphore pool | Not applicable | 16 IDs, 4-bit counter | 16 IDs, 32-ID address space |
-
-## Examples
-
-### A5: C→V — separate signals per subblock
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// AIC signals AIV0 on semaphore 0
-SET_INTRA_BLOCK(PIPE_MTE2, /* sem_id */ 0);
-
-// AIC signals AIV1 on semaphore 16 (0 + 16 offset)
-SET_INTRA_BLOCK(PIPE_MTE2, /* sem_id */ 16);
-```
-
-### SSA form — C→V with 1:1 semantics
-
-```mlir
-// AIC: signal AIV0 that data is ready
-pto.set_intra_block "PIPE_MTE2", %c0_i64 : i64, i64
-
-// AIC: signal AIV1 that data is ready
-pto.set_intra_block "PIPE_MTE2", %c16_i64 : i64, i64
-```
-
-### SSA form — V→C with 1:1 semantics
-
-```mlir
-// AIV0: signal AIC that segment 0 is done
-pto.set_intra_block "PIPE_V", %c0_i64 : i64, i64
-
-// AIV1: signal AIC that segment 1 is done
-pto.set_intra_block "PIPE_V", %c0_i64 : i64, i64
-
-// AIC: wait for AIV0 on sem 0
-pto.wait_intra_core "PIPE_MTE2", %c0_i64 : i64, i64
-
-// AIC: wait for AIV1 on sem 16
-pto.wait_intra_core "PIPE_MTE2", %c16_i64 : i64, i64
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Previous op in family: [pto.wait_flag_dev](./wait-flag-dev.md)
-- Next op in family: [pto.wait_intra_core](./wait-intra-core.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-intra-block_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-intra-block_zh.md
deleted file mode 100644
index 08660e20..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/set-intra-block_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.set_intra_block
-
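-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on Chinese paths.
-
-## Current status
-
-- [Corresponding English page](set-intra-block.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.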
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag-dev.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag-dev.md
deleted file mode 100644
index 71c620e1..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag-dev.md
+++ /dev/null
@@ -1,138 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/wait-flag-dev.md` -->
-
-# pto.wait_flag_dev
-
-Standalone reference page for `pto.wait_flag_dev`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Block the entire SU (all pipelines) until a remote core signals an event (A2A3). Uses **mode2 reduce semantics**: one wait unblocks only when **all** subblocks in the cluster have signaled.
-
-## Mechanism
-
-`pto.wait_flag_dev` blocks the entire SU of the calling core until the named event is signaled by the remote core.
-
-**Mode2 reduce semantics (A2A3):**
-
-- The calling core (typically the Cube) waits on a single semaphore. The semaphore counter is decremented by each `set_cross_core` from remote subblocks.
-- The SU is **fully blocked**: all pipelines (PIPE_MTE2, PIPE_V, PIPE_MTE3) are stalled.
-- The wait unblocks when the semaphore counter reaches zero. In mode2 with 1:2 topology, this means both AIV0 and AIV1 must have called `set_cross_core` before the Cube unblocks.
-
-This is the Cube's counterpart to `set_cross_core`. The pattern is:
-
-```text
-Cube:         set_cross_core → (broadcast to AIV0+AIV1)
-AIV0/AIV1:   [do work] → set_cross_core
-Cube:         wait_flag_dev → unblocks when BOTH subblocks signaled
-```
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-wait_flag_dev %event_id : i64
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-pto.wait_flag_dev %event_id : i64
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void WAIT_FLAG_DEV(int64_t event_id);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%event_id` | `i64` | Semaphore/event identifier to wait on |
-
-## Expected Outputs
-
-None. This form is defined by its side effect (blocking) on the calling core.
-
-## Side Effects
-
-- **Blocks the entire SU**: All pipelines (MTE2, V, MTE3) on the calling core are stalled until the event fires.
-- Decrements the semaphore counter when the event is signaled.
-
-## Constraints
-
-- **A2A3 only**: `wait_flag_dev` is only available on the A2A3 profile.
-- **SU-level blocking**: Unlike `wait_flag` (intra-core), which stalls only the named destination pipeline, `wait_flag_dev` stalls **all** pipelines on the core. This is more restrictive than A5's `wait_intra_core`.
-- **Semaphore pool**: The pool has 16 physical semaphore IDs per cluster with a 4-bit counter (0–15). The wait unblocks when the counter reaches zero. If the counter is already zero (premature wait), the behavior is **implementation-defined**.
-- **Event must be set**: Waiting on an event that was never set by a matching remote `set_cross_core` is **illegal**.
-
-## Exceptions
-
-- Illegal on non-A2A3 profiles.
-- Illegal if `%event_id` is outside the valid range (0–15).
-- Illegal if the event was never set by a remote core.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| `wait_flag_dev` | Not available | Supported | Use `wait_intra_core` |
-| SU-level blocking | Not applicable | All pipelines blocked | Only named pipe blocked |
-| Semaphore pool size | Not applicable | 16 IDs, 4-bit counter | 16 IDs, 32-ID address space |
-| Reduce semantics | Not applicable | One wait unblocks on N signals | One wait per signal |
-
-## Examples
-
-### A2A3: Cube waits for both vector subblocks
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Cube: broadcast to both AIV subblocks
-SET_CROSS_CORE(/* core_id */ 0, /* event_id */ 0);
-
-// AIV0: do work, then signal back on event 1
-// AIV1: do work, then signal back on event 1
-
-// Cube: block until BOTH AIV0 and AIV1 have signaled (reduce)
-WAIT_FLAG_DEV(/* event_id */ 1);
-```
-
-### SSA form — complete C↔V handshake
-
-```mlir
-// === Cube (Producer) ===
-// Signal both AIV subblocks: data is ready
-pto.set_cross_core %c0_i64, %c0_i64 : i64, i64
-
-// === AIV0 (Consumer) ===
-// Wait for Cube's signal
-pto.wait_flag_dev %c0_i64 : i64
-// [process data]
-// Signal back to Cube: work on AIV0 done
-pto.set_cross_core %c0_i64, %c1_i64 : i64, i64
-
-// === AIV1 (Consumer) ===
-// Wait for Cube's signal
-pto.wait_flag_dev %c0_i64 : i64
-// [process data]
-// Signal back to Cube: work on AIV1 done
-pto.set_cross_core %c1_i64, %c1_i64 : i64, i64
-
-// === Cube (Producer) ===
-// Block until BOTH AIV0 and AIV1 have signaled (reduce)
-pto.wait_flag_dev %c1_i64 : i64
-// Both signaled — Cube can proceed
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Previous op in family: [pto.set_cross_core](./set-cross-core.md)
-- Next op in family: [pto.set_intra_block](./set-intra-block.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag-dev_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag-dev_zh.md
deleted file mode 100644
index 733cc32b..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag-dev_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.wait_flag_dev
-
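-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on Chinese paths.
-
-## Current status
-
-- [Corresponding English page](wait-flag-dev.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.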
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag.md
deleted file mode 100644
index 8b172e22..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag.md
+++ /dev/null
@@ -1,73 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/wait-flag.md` -->
-
-# pto.wait_flag
-
-Standalone reference page for `pto.wait_flag`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Block the destination pipe until the source pipe signals the event.
-
-## Mechanism
-
-`pto.wait_flag` is a `pto.*` control/configuration operation. It changes ordering, buffer, event, or DMA-visible state that later payload work depends on. The portable guarantee is the dependency/configuration effect, while concrete pipe/event spaces remain target-profile details.
-
-## Syntax
-
-```mlir
-pto.wait_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]
-```
-
-## Inputs
-
-The inputs are the architecture-visible control operands shown in the syntax: pipe ids, event ids, buffer ids, loop/stride values, pointers, or configuration words used to drive later execution.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation updates control, synchronization, or DMA configuration state. Depending on the form, it may stall a stage, establish a producer-consumer edge, reserve or release a buffer token, or configure later copy behavior.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- It is illegal to use unsupported pipe ids, event ids, buffer ids, or configuration tuples for the selected target profile.
-- Waiting on state that was never established by a matching producer or prior configuration is an illegal PTO program.
-
-## Target-Profile Restrictions
-
-- CPU simulation preserves the visible dependency/configuration contract, but it may not expose every low-level hazard that motivates the form on hardware targets.
-- A2/A3 and A5 profiles may use different concrete pipe, DMA, predicate, or event spaces. Portable code must rely on the documented PTO contract plus the selected target profile.
-
-## Examples
-
-```c
-wait_flag(src_pipe, dst_pipe, event_id);
-```
-
-```mlir
-pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-```
-
-## Detailed Notes
-
-```c
-wait_flag(src_pipe, dst_pipe, event_id);
-```
-
-**Example:** Vector pipe waits for MTE2 data to arrive:
-```mlir
-pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Previous op in family: [pto.set_flag](./set-flag.md)
-- Next op in family: [pto.pipe_barrier](./pipe-barrier.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag_zh.md
deleted file mode 100644
index c0fba961..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-flag_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.wait_flag
-
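-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on Chinese paths.
-
-## Current status
-
-- [Corresponding English page](wait-flag.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.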
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-intra-core.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-intra-core.md
deleted file mode 100644
index 661182bf..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-intra-core.md
+++ /dev/null
@@ -1,130 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/pipeline-sync/wait-intra-core.md` -->
-
-# pto.wait_intra_core
-
-Standalone reference page for `pto.wait_intra_core`. This page belongs to the [Pipeline Sync](../../pipeline-sync.md) family in the PTO ISA manual.
-
-## Summary
-
-Block a specific pipeline within a cluster (A5) until a subblock signals an event. Only the named pipeline stalls; other pipelines on the same core continue executing.
-
-## Mechanism
-
-`pto.wait_intra_core` blocks a specific pipeline on the calling core until the named event is signaled by a remote subblock.
-
-**Per-pipeline blocking (A5):**
-
-- Unlike A2A3's `wait_flag_dev`, which stalls the **entire SU** (all pipelines), `wait_intra_core` stalls only the **named pipeline**.
-- Other pipelines on the same core continue executing while one pipeline is blocked.
-- The semaphore pool uses 16 physical IDs with a 32-ID address space: IDs 0–15 target AIV0; IDs 16–31 target AIV1.
-
-**Key advantage over A2A3**: A5's `wait_intra_core` enables finer-grained parallelism where multiple pipelines can be in different synchronization states simultaneously.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-wait_intra_core %pipe, %sem_id : i64, i64
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-pto.wait_intra_core %pipe, %sem_id : i64, i64
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void WAIT_INTRA_CORE(pipe_t wait_pipe, int64_t sem_id);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%pipe` | `pipe_t` | The pipeline that should wait (only this pipeline stalls) |
-| `%sem_id` | `i64` | Semaphore ID: 0–15 for AIV0, 16–31 (base + 16) for AIV1 |
-
-## Expected Outputs
-
-None. This form is defined by its side effect (blocking) on the named pipeline.
-
-## Side Effects
-
-- Blocks only the named pipeline. Other pipelines on the same core continue.
-- Decrements the semaphore counter when the event is signaled.
-
-## Constraints
-
-- **A5 only**: `wait_intra_core` is only available on the A5 profile.
-- **Per-pipeline blocking**: Only the named pipeline is blocked. All other pipelines continue. This differs fundamentally from A2A3's SU-level blocking.
-- **Semaphore ID mapping**: IDs 0–15 target AIV0; IDs 16–31 target AIV1.
-- **Event must be set**: Waiting on an event that was never set is **illegal**.
-- **Semaphore pool**: 16 physical IDs, 32-ID address space. IDs outside 0–31 are **illegal**.
-
-## Exceptions
-
-- Illegal on non-A5 profiles.
-- Illegal if `%sem_id` is outside the range 0–31.
-- Illegal if the event was never set by a remote subblock.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| `wait_intra_core` | Not available | Use `wait_flag_dev` | Supported |
-| Blocking scope | Not applicable | Entire SU blocked | Only named pipe blocked |
-| Other pipes during wait | Not applicable | All stalled | Continue executing |
-| Semaphore pool | Not applicable | 16 IDs | 16 IDs, 32-ID address space |
-
-## Examples
-
-### A5: per-pipeline blocking vs. A2A3 SU-level blocking
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// A2A3 (wait_flag_dev): entire core stalls
-// WAIT_FLAG_DEV(/* event_id */ 0);  // ALL pipelines blocked
-
-// A5 (wait_intra_core): only PIPE_V stalls
-WAIT_INTRA_CORE(PIPE_V, /* sem_id */ 0);  // Only Vector pipe stalls
-// PIPE_MTE2 and PIPE_MTE3 continue executing
-```
-
-### SSA form — A5 C→V signaling with per-pipeline waits
-
-```mlir
-// === AIC signals both AIV subblocks (no broadcast — separate calls) ===
-pto.set_intra_block "PIPE_MTE2", %c0_i64 : i64, i64   // → AIV0
-pto.set_intra_block "PIPE_MTE2", %c16_i64 : i64, i64  // → AIV1
-
-// === AIV0: Vector pipe waits (only Vector stalls; MTE2/MTE3 continue) ===
-pto.wait_intra_core "PIPE_V", %c0_i64 : i64, i64
-// [AIV0 Vector processes data while AIV0 MTE2/MTE3 continue]
-
-// === AIV1: Vector pipe waits ===
-pto.wait_intra_core "PIPE_V", %c16_i64 : i64, i64
-
-// === AIV0 signals back to AIC ===
-pto.set_intra_block "PIPE_V", %c1_i64 : i64, i64
-
-// === AIV1 signals back to AIC ===
-pto.set_intra_block "PIPE_V", %c1_i64 : i64, i64
-
-// === AIC: waits for both AIV subblocks (separate waits) ===
-pto.wait_intra_core "PIPE_MTE2", %c1_i64 : i64, i64    // wait for AIV0
-pto.wait_intra_core "PIPE_MTE2", %c17_i64 : i64, i64   // wait for AIV1
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Pipeline Sync](../../pipeline-sync.md)
-- Previous op in family: [pto.set_intra_block](./set-intra-block.md)
-- Next op in family: (none — last in family)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-intra-core_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-intra-core_zh.md
deleted file mode 100644
index 481396c8..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/pipeline-sync/wait-intra-core_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.wait_intra_core
-
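-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on Chinese paths.
-
-## Current status
-
-- [Corresponding English page](wait-intra-core.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.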
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pand.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pand.md
deleted file mode 100644
index c67ed119..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pand.md
+++ /dev/null
@@ -1,119 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pand.md` -->
-
-# pto.pand
-
-Standalone reference page for `pto.pand`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Bitwise AND of two predicates.
-
-## Mechanism
-
-`pto.pand` computes the bitwise AND of two predicate registers, producing a new predicate where lane `i` is active iff both source lanes `i` are active.
-
-$$ \mathrm{dst}_i = \mathrm{src0}_i \land \mathrm{src1}_i $$
-
-The third operand (`%mask`) in the syntax is an optional masking predicate for the scalar/control surface; the core boolean operation is `src0 AND src1`.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pand %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%dst = pto.pand %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pand ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PAND(RegBuf<predicate_t>& dst,
-                    const RegBuf<predicate_t>& src0,
-                    const RegBuf<predicate_t>& src1,
-                    const RegBuf<predicate_t>& mask);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%src0` | `!pto.mask` | First source predicate |
-| `%src1` | `!pto.mask` | Second source predicate |
-| `%mask` | `!pto.mask` | Optional masking predicate (scalar/control surface context) |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.mask` | Bitwise AND of src0 and src1 |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Operand widths**: All predicate operands MUST have the same width. Mixing predicates of different widths without explicit pack/unpack is **illegal**.
-- **No implicit masking**: The `mask` operand is for scalar/control surface use; it does not affect the boolean AND operation itself.
-
-## Exceptions
-
-- Illegal if predicate operand widths are not consistent.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Bitwise AND | Simulated | Supported | Supported |
-
-## Examples
-
-### Combine comparison mask with tail mask
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void combine_masks(RegBuf<predicate_t>& dst,
-                   const RegBuf<predicate_t>& cmp_mask,
-                   const RegBuf<predicate_t>& tail_mask) {
-    PAND(dst, cmp_mask, tail_mask, cmp_mask);
-}
-```
-
-### SSA form — intersection of two predicates
-
-```mlir
-// %cmp_mask: lanes where a[i] < b[i]
-%cmp = pto.vcmp %va, %vb, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-
-// %tail_mask: lanes in the remainder region
-%tail = pto.pge_b32 %rem : i32 -> !pto.mask
-
-// Intersection: only process remainder lanes where comparison is true
-%active = pto.pand %cmp, %tail, %cmp : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-
-// Use in predicated operation
-%result = pto.vsel %v_true, %v_false, %active : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.punpack](./punpack.md)
-- Next op in family: [pto.por](./por.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pand_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pand_zh.md
deleted file mode 100644
index 95885c40..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pand_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pand
-
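-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on Chinese paths.
-
-## Current status
-
-- [Corresponding English page](pand.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.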
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md
deleted file mode 100644
index be61d1e3..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md
+++ /dev/null
@@ -1,120 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md` -->
-
-# pto.pdintlv_b8
-
-Standalone reference page for `pto.pdintlv_b8`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Predicate deinterleave: split one 16-bit predicate register into two 8-bit predicate registers by separating alternating bits.
-
-## Mechanism
-
-`pto.pdintlv_b8` deinterleaves a 16-bit predicate register into two 8-bit predicates by distributing alternating bits. Lane `i` from the lower half goes to the first output; lane `i` from the upper half goes to the second output.
-
-For a 16-bit predicate `src` and 0 ≤ i < 8:
-
-$$ \mathrm{dst0}_i = \mathrm{src}_i $$
-$$ \mathrm{dst1}_i = \mathrm{src}_{i+8} $$
-
-This operation is used when processing 16-bit-wide data with two independent 8-bit predicate contexts, or when separating even/odd lane groups for multi-step processing.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pdintlv_b8 %dst0, %dst1, %src : !pto.mask, !pto.mask, !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%dst0, %dst1 = pto.pdintlv_b8 %src : !pto.mask -> !pto.mask, !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pdintlv_b8 ins(%src : !pto.mask) outs(%dst0, %dst1 : !pto.mask, !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PDINTLV_B8(RegBuf<predicate_t>& dst0,
-                         RegBuf<predicate_t>& dst1,
-                         const RegBuf<predicate_t>& src);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%src` | `!pto.mask` | 16-bit source predicate register |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst0` | `!pto.mask` | Lower 8 bits: `src[0..7]` |
-| `%dst1` | `!pto.mask` | Upper 8 bits: `src[8..15]` |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Source width**: The source predicate MUST be 16 bits. Sources of other widths are **illegal**.
-- **Destination width**: Both destination predicates are 8 bits.
-- **Relationship**: `pdintlv_b8` is the inverse of `pintlv_b16`.
-
-## Exceptions
-
-- Illegal if the source predicate width is not 16 bits.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Predicate deinterleave (b8) | Simulated | Supported | Supported |
-
-## Examples
-
-### Separate two 8-bit predicate contexts
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void split_predicate(RegBuf<predicate_t>& dst0,
-                     RegBuf<predicate_t>& dst1,
-                     const RegBuf<predicate_t>& src16) {
-    PDINTLV_B8(dst0, dst1, src16);
-}
-```
-
-### SSA form — two-phase predicate processing
-
-```mlir
-// %src16: 16-bit predicate from some comparison
-
-// Phase 1: process lower 8 lanes
-%lo, %hi = pto.pdintlv_b8 %src16 : !pto.mask -> !pto.mask, !pto.mask
-
-// Use %lo for first phase of vector computation
-%result_lo = pto.vsel %v_a_lo, %v_b_lo, %lo : !pto.vreg<8xf32>, !pto.vreg<8xf32>, !pto.mask -> !pto.vreg<8xf32>
-
-// Use %hi for second phase of vector computation
-%result_hi = pto.vsel %v_a_hi, %v_b_hi, %hi : !pto.vreg<8xf32>, !pto.vreg<8xf32>, !pto.mask -> !pto.vreg<8xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.psel](./psel.md)
-- Next op in family: [pto.pintlv_b16](./pintlv-b16.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8_zh.md
deleted file mode 100644
index 4ebc44ca..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pdintlv_b8
-
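-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](pdintlv-b8.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.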
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md
deleted file mode 100644
index 6271b12d..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md
+++ /dev/null
@@ -1,112 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md` -->
-
-# pto.pge_b16
-
-Standalone reference page for `pto.pge_b16`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Generate a dynamic 16-bit predicate: lanes where the lane index is greater-than-or-equal to a scalar threshold.
-
-## Mechanism
-
-`pto.pge_b16` compares the lane index against a runtime scalar value and produces a predicate where active lanes satisfy `i ≥ scalar`.
-
-For lane index `i` (0 ≤ i < 16) and scalar threshold `s`:
-
-$$ \mathrm{mask}_i = \begin{cases} 1 & \text{if } i \geq s \\ 0 & \text{if } i < s \end{cases} $$
-
-This operation is the scalar complement of `plt_b16`. It is used for tail-mask generation when the vector width is 16 (f16/bf16 with predicate packing context).
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pge_b16 %dst, %scalar : !pto.mask, i16
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask = pto.pge_b16 %scalar : i16 -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pge_b16 ins(%scalar : i16) outs(%mask : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PGE_B16(RegBuf<predicate_t>& dst, int16_t scalar);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%scalar` | `i16` | Lane-index threshold; lanes i ≥ scalar are active |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | 16-bit predicate; lanes at or above the threshold are active |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Scalar range**: `scalar` MUST be in the range `[0, 16]`. Values outside this range produce all-1 (if scalar ≤ 0) or all-0 (if scalar ≥ 16) predicates.
-- **Predicate width**: The produced predicate is 16 bits wide. Programs that need wider predicates MUST use `ppack` to combine multiple `_b16` results.
-- **No side effect on scalar**: Unlike `plt_b16`, this operation does NOT modify the scalar operand.
-
-## Exceptions
-
-- Illegal if `scalar` is outside the range `[0, 16]` for the target profile.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Dynamic predicate generation | Simulated | Supported | Supported |
-| 16-bit predicate width | Supported | Supported | Supported |
-| Scalar range enforcement | Enforced | Enforced | Enforced |
-
-## Examples
-
-### Tail mask for remainder loop (f16/bf16)
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void generate_tail_mask(RegBuf<predicate_t>& dst, int16_t remainder) {
-    // remainder lanes active (i >= (16 - remainder))
-    PGE_B16(dst, 16 - remainder);
-}
-```
-
-### SSA form
-
-```mlir
-// %rem holds the lane-index threshold (16 - remainder count)
-%tail = pto.pge_b16 %rem : i16 -> !pto.mask
-
-// Use in predicated vector operation on f16 (128 lanes = 8 × 16-bit predicates)
-%result = pto.vsel %v_true, %v_false, %tail : !pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask -> !pto.vreg<128xf16>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pge_b8](./pge-b8.md)
-- Next op in family: [pto.pge_b32](./pge-b32.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16_zh.md
deleted file mode 100644
index 5828e84d..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pge_b16
-
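-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](pge-b16.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.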
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md
deleted file mode 100644
index 0c12f3d2..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md
+++ /dev/null
@@ -1,121 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md` -->
-
-# pto.pge_b32
-
-Standalone reference page for `pto.pge_b32`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Generate a dynamic 32-bit predicate: lanes where the lane index is greater-than-or-equal to a scalar threshold.
-
-## Mechanism
-
-`pto.pge_b32` compares the lane index against a runtime scalar value and produces a predicate where active lanes satisfy `i ≥ scalar`.
-
-For lane index `i` (0 ≤ i < 32) and scalar threshold `s`:
-
-$$ \mathrm{mask}_i = \begin{cases} 1 & \text{if } i \geq s \\ 0 & \text{if } i < s \end{cases} $$
-
-The `_b32` variant is the widest directly-generable predicate segment. For f32 (N=64), two `_b32` predicates can be combined with `ppack` to form a full-width mask.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pge_b32 %dst, %scalar : !pto.mask, i32
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask = pto.pge_b32 %scalar : i32 -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pge_b32 ins(%scalar : i32) outs(%mask : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PGE_B32(RegBuf<predicate_t>& dst, int32_t scalar);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%scalar` | `i32` | Lane-index threshold; lanes i ≥ scalar are active |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | 32-bit predicate; lanes at or above the threshold are active |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Scalar range**: `scalar` MUST be in the range `[0, 32]`. Values outside this range produce all-1 (if scalar ≤ 0) or all-0 (if scalar ≥ 32) predicates.
-- **Predicate width**: The produced predicate is 32 bits wide. For wider predicates, use `ppack` to combine multiple `_b32` results.
-- **No side effect on scalar**: Unlike `plt_b32`, this operation does NOT modify the scalar operand.
-
-## Exceptions
-
-- Illegal if `scalar` is outside the range `[0, 32]` for the target profile.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Dynamic predicate generation | Simulated | Supported | Supported |
-| 32-bit predicate width | Supported | Supported | Supported |
-| Scalar range enforcement | Enforced | Enforced | Enforced |
-
-## Examples
-
-### Tail mask for remainder loop (f32, 64 lanes)
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void generate_tail_mask_hi(RegBuf<predicate_t>& dst, int32_t remainder) {
-    // upper half: lanes 32..63 that are active
-    // remainder is already subtracted from the lower half
-    PGE_B32(dst, 32 - remainder);
-}
-```
-
-### SSA form
-
-```mlir
-// %rem holds the remainder count (0..63)
-// Generate lower-half tail: lanes 0..31
-%lo = pto.pge_b32 %rem : i32 -> !pto.mask
-
-// Generate upper-half tail: lanes 32..63
-%hi_rem = arith.subi %rem, 32 : i32
-%hi = pto.pge_b32 %hi_rem : i32 -> !pto.mask
-
-// Combine for full 64-lane predicate
-%tail = pto.ppack %lo, %hi : !pto.mask, !pto.mask -> !pto.mask
-
-// Use in predicated vector operation
-%result = pto.vsel %v_true, %v_false, %tail : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pge_b16](./pge-b16.md)
-- Next op in family: [pto.plt_b8](./plt-b8.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32_zh.md
deleted file mode 100644
index 15731ee4..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pge_b32
-
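-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](pge-b32.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.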
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md
deleted file mode 100644
index d13ae9ec..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md
+++ /dev/null
@@ -1,124 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md` -->
-
-# pto.pge_b8
-
-Standalone reference page for `pto.pge_b8`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Generate a dynamic 8-bit predicate: lanes where the lane index is greater-than-or-equal to a scalar threshold.
-
-## Mechanism
-
-`pto.pge_b8` compares the lane index against a runtime scalar value and produces a predicate where active lanes satisfy `i ≥ scalar`. This is the scalar complement of `plt_b8`, commonly used for tail-mask generation in remainder loops.
-
-For lane index `i` (0 ≤ i < 8) and scalar threshold `s`:
-
-$$ \mathrm{mask}_i = \begin{cases} 1 & \text{if } i \geq s \\ 0 & \text{if } i < s \end{cases} $$
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pge_b8 %dst, %scalar : !pto.mask, i8
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask = pto.pge_b8 %scalar : i8 -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pge_b8 ins(%scalar : i8) outs(%mask : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PGE_B8(RegBuf<predicate_t>& dst, int8_t scalar);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%scalar` | `i8` | Lane-index threshold; lanes i ≥ scalar are active |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | 8-bit predicate; lanes at or above the threshold are active |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Scalar range**: `scalar` MUST be in the range `[0, 8]`. Values outside this range produce all-1 (if scalar ≤ 0) or all-0 (if scalar ≥ 8) predicates. The exact behavior depends on the target profile.
-- **Predicate width**: The produced predicate is 8 bits wide. Programs that need wider predicates MUST use `ppack` to combine multiple `_b8` results.
-- **No side effect on scalar**: Unlike `plt_b8`, this operation does NOT modify the scalar operand.
-
-## Exceptions
-
-- Illegal if `scalar` is outside the representable range for the target profile (typically `[0, 8]`).
-- Illegal if the operation is used in a context requiring a predicate width other than 8 bits.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Dynamic predicate generation | Simulated | Supported | Supported |
-| 8-bit predicate width | Supported | Supported | Supported |
-| Scalar range enforcement | Enforced | Enforced | Enforced |
-
-## Examples
-
-### Tail mask for remainder loop
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void generate_tail_mask(RegBuf<predicate_t>& dst, int8_t remainder) {
-    // remainder lanes active (i >= (8 - remainder))
-    PGE_B8(dst, 8 - remainder);
-}
-```
-
-### SSA form
-
-```mlir
-// %c0 holds the lane-index threshold (8 - remainder count)
-%tail = pto.pge_b8 %c0 : i8 -> !pto.mask
-
-// Use in predicated vector operation
-%result = pto.vsel %v_true, %v_false, %tail : !pto.vreg<8xf32>, !pto.vreg<8xf32>, !pto.mask -> !pto.vreg<8xf32>
-```
-
-### Comparison with plt_b8
-
-```mlir
-// pge_b8: lane i is active iff i >= scalar
-//   input: %rem = 3
-//   result: [0,0,0,1,1,1,1,1] (lanes 3..7 active)
-
-// plt_b8: lane i is active iff i < scalar; also decrements scalar by 8
-//   input: %rem = 3
-//   result: [1,1,1,0,0,0,0,0] (lanes 0,1,2 active)
-//   output: %scalar_out = -5 (3 - 8)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pset_b32](./pset-b32.md)
-- Next op in family: [pto.pge_b16](./pge-b16.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8_zh.md
deleted file mode 100644
index 08596df1..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pge_b8
-
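-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](pge-b8.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.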
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md
deleted file mode 100644
index 4e15492c..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md
+++ /dev/null
@@ -1,134 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md` -->
-
-# pto.pintlv_b16
-
-Standalone reference page for `pto.pintlv_b16`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Predicate interleave: merge two 16-bit predicate registers into one 32-bit predicate register, placing one source in each half.
-
-## Mechanism
-
-`pto.pintlv_b16` combines two 16-bit predicate registers into one 32-bit predicate. As the equations below define it, `src0` fills the lower 16 bits and `src1` fills the upper 16 bits. It is the inverse of `pdintlv_b16`, which splits a 32-bit predicate into two 16-bit halves.
-
-For two 16-bit predicates `src0`, `src1` and 0 ≤ i < 16:
-
-$$ \mathrm{dst}_i = \mathrm{src0}_i $$
-$$ \mathrm{dst}_{i+16} = \mathrm{src1}_i $$
-
-This operation concatenates two 16-bit predicates into a 32-bit predicate register, preserving the lane-to-bit correspondence.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pintlv_b16 %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%dst = pto.pintlv_b16 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pintlv_b16 ins(%src0, %src1 : !pto.mask, !pto.mask) outs(%dst : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PINTLV_B16(RegBuf<predicate_t>& dst,
-                          const RegBuf<predicate_t>& src0,
-                          const RegBuf<predicate_t>& src1);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%src0` | `!pto.mask` | Lower 16-bit source predicate |
-| `%src1` | `!pto.mask` | Upper 16-bit source predicate |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.mask` | 32-bit concatenated predicate |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Source width**: Both source predicates MUST be 16 bits. Other widths are **illegal**.
-- **Destination width**: The destination predicate is 32 bits.
-- **Bit mapping**: `dst[0..15] = src0[0..15]`, `dst[16..31] = src1[0..15]`.
-
-## Exceptions
-
-- Illegal if source predicate widths are not 16 bits.
-- Illegal if destination context does not expect a 32-bit predicate.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Predicate interleave (b16) | Simulated | Supported | Supported |
-
-## Examples
-
-### Concatenate two 16-bit predicates
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void concat_predicates(RegBuf<predicate_t>& dst,
-                      const RegBuf<predicate_t>& lo,
-                      const RegBuf<predicate_t>& hi) {
-    PINTLV_B16(dst, lo, hi);
-}
-```
-
-### SSA form — combine two halves for full 32-bit predicate
-
-```mlir
-// %cmp_lo: comparison result for lanes 0-15
-// %cmp_hi: comparison result for lanes 16-31
-
-// Combine into full 32-bit predicate
-%full = pto.pintlv_b16 %cmp_lo, %cmp_hi : !pto.mask, !pto.mask -> !pto.mask
-
-// Use for 32-lane predicated vector operation
-%result = pto.vsel %v_true, %v_false, %full : !pto.vreg<32xf32>, !pto.vreg<32xf32>, !pto.mask -> !pto.vreg<32xf32>
-```
-
-### Round-trip deinterleave then interleave
-
-```mlir
-// %src32: 32-bit predicate
-
-// Split into two 16-bit halves
-%lo16, %hi16 = pto.pdintlv_b32 %src32 : !pto.mask -> !pto.mask, !pto.mask
-
-// Modify %lo16 or %hi16 independently
-%lo16_mod = pto.pnot %lo16, %lo16 : !pto.mask, !pto.mask -> !pto.mask
-
-// Re-concatenate
-%dst = pto.pintlv_b16 %lo16_mod, %hi16 : !pto.mask, !pto.mask -> !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pdintlv_b8](./pdintlv-b8.md)
-- Next op in family: (none — last in family)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16_zh.md
deleted file mode 100644
index bbaaf551..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pintlv_b16
-
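-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](pintlv-b16.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.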
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md
deleted file mode 100644
index 1ed2a983..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md
+++ /dev/null
@@ -1,118 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md` -->
-
-# pto.plt_b16
-
-Standalone reference page for `pto.plt_b16`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Generate a dynamic 16-bit predicate with lane index less-than comparison, and atomically decrement the scalar operand.
-
-## Mechanism
-
-`pto.plt_b16` compares the lane index against a runtime scalar value and produces a predicate where active lanes satisfy `i < scalar`, then decrements the scalar by 16.
-
-For lane index `i` (0 ≤ i < 16) and scalar threshold `s`:
-
-$$ \mathrm{mask}_i = \begin{cases} 1 & \text{if } i < s \\ 0 & \text{if } i \geq s \end{cases} $$
-$$ s_{\mathrm{out}} = s - 16 $$
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-plt_b16 %dst, %scalar_in : !pto.mask, i16 -> !pto.mask, i16
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask, %scalar_out = pto.plt_b16 %scalar_in : i16 -> !pto.mask, i16
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.plt_b16 ins(%scalar_in : i16) outs(%mask, %scalar_out : !pto.mask, i16)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PLT_B16(RegBuf<predicate_t>& dst,
-                      int16_t& scalar_inout);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%scalar_in` | `i16` | Lane-index threshold; lanes i < scalar_in are active |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | 16-bit predicate with active lanes below threshold |
-| `%scalar_out` | `i16` | Decremented scalar: `scalar_in - 16` |
-
-## Side Effects
-
-- The scalar operand is **modified in place**: `scalar_out = scalar_in - 16`.
-
-## Constraints
-
-- **Scalar range**: `scalar_in` MUST be in the range `[0, 16]`. After subtraction, `scalar_out` may be negative.
-- **Chain requirement**: Programs MUST use `scalar_out` from one iteration as `scalar_in` of the next. Breaking the chain produces **implementation-defined** predicates.
-- **Predicate width**: The produced predicate is 16 bits wide. For wider predicates, use `ppack`.
-
-## Exceptions
-
-- Illegal if `scalar_in` is outside the range `[0, 16]` for the target profile.
-- Illegal if the scalar chain is broken.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Dynamic predicate generation | Simulated | Supported | Supported |
-| Scalar decrement | Simulated | Supported | Supported |
-| 16-bit predicate width | Supported | Supported | Supported |
-
-## Examples
-
-### Software-pipelined remainder loop (f16/bf16)
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void process_remainder(int16_t& rem, RegBuf<predicate_t>& mask) {
-    // rem = remainder count
-    // predicate: lanes 0..(rem-1) active
-    // rem = rem - 16
-    PLT_B16(mask, rem);
-}
-```
-
-### Chained remainder loop
-
-```mlir
-// Iteration 1: rem = 28
-//   mask: 16 lanes active, rem_out = 12
-%mask1, %rem1 = pto.plt_b16 %rem0 : i16 -> !pto.mask, i16
-
-// Iteration 2: rem = 12
-//   mask: 12 lanes active, rem_out = -4
-%mask2, %rem2 = pto.plt_b16 %rem1 : i16 -> !pto.mask, i16
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.plt_b8](./plt-b8.md)
-- Next op in family: [pto.plt_b32](./plt-b32.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16_zh.md
deleted file mode 100644
index 942e7580..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.plt_b16
-
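-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](plt-b16.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.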
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md
deleted file mode 100644
index 2a148443..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md
+++ /dev/null
@@ -1,125 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md` -->
-
-# pto.plt_b32
-
-Standalone reference page for `pto.plt_b32`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Generate a dynamic 32-bit predicate with lane index less-than comparison, and atomically decrement the scalar operand.
-
-## Mechanism
-
-`pto.plt_b32` compares the lane index against a runtime scalar value and produces a predicate where active lanes satisfy `i < scalar`, then decrements the scalar by 32.
-
-For lane index `i` (0 ≤ i < 32) and scalar threshold `s`:
-
-$$ \mathrm{mask}_i = \begin{cases} 1 & \text{if } i < s \\ 0 & \text{if } i \geq s \end{cases} $$
-$$ s_{\mathrm{out}} = s - 32 $$
-
-The `_b32` variant is the widest directly-generable predicate segment. For f32 (N=64), two `_b32` predicates from `plt_b32` can be combined with `ppack` to form a full-width mask.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-plt_b32 %dst, %scalar_in : !pto.mask, i32 -> !pto.mask, i32
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask, %scalar_out = pto.plt_b32 %scalar_in : i32 -> !pto.mask, i32
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.plt_b32 ins(%scalar_in : i32) outs(%mask, %scalar_out : !pto.mask, i32)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PLT_B32(RegBuf<predicate_t>& dst,
-                      int32_t& scalar_inout);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%scalar_in` | `i32` | Lane-index threshold; lanes i < scalar_in are active |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | 32-bit predicate with active lanes below threshold |
-| `%scalar_out` | `i32` | Decremented scalar: `scalar_in - 32` |
-
-## Side Effects
-
-- The scalar operand is **modified in place**: `scalar_out = scalar_in - 32`.
-
-## Constraints
-
-- **Scalar range**: `scalar_in` MUST be in the range `[0, 32]`. After subtraction, `scalar_out` may be negative.
-- **Chain requirement**: Programs MUST use `scalar_out` from one iteration as `scalar_in` of the next. Breaking the chain produces **implementation-defined** predicates.
-- **Predicate width**: The produced predicate is 32 bits wide. For f32 (N=64), two `_b32` results can be combined with `ppack`.
-
-## Exceptions
-
-- Illegal if `scalar_in` is outside the range `[0, 32]` for the target profile.
-- Illegal if the scalar chain is broken.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Dynamic predicate generation | Simulated | Supported | Supported |
-| Scalar decrement | Simulated | Supported | Supported |
-| 32-bit predicate width | Supported | Supported | Supported |
-
-## Examples
-
-### Software-pipelined remainder loop (f32)
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void process_remainder(int32_t& rem, RegBuf<predicate_t>& mask) {
-    // rem = remainder count
-    // predicate: lanes 0..(rem-1) active
-    // rem = rem - 32
-    PLT_B32(mask, rem);
-}
-```
-
-### Chained remainder loop with pack for f32
-
-```mlir
-// rem = 47: two iterations needed for 64 lanes
-
-// Iteration 1: rem = 47
-//   lo_mask: 32 lanes active, rem_out = 15
-%lo, %rem1 = pto.plt_b32 %rem0 : i32 -> !pto.mask, i32
-
-// Iteration 2: rem = 15
-//   hi_mask: 15 lanes active, rem_out = -17
-%hi, %rem2 = pto.plt_b32 %rem1 : i32 -> !pto.mask, i32
-
-// Combine two b32 predicates into one 64-bit predicate
-%full_tail = pto.ppack %lo, %hi : !pto.mask, !pto.mask -> !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.plt_b16](./plt-b16.md)
-- Next op in family: [pto.ppack](./ppack.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32_zh.md
deleted file mode 100644
index a4010924..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.plt_b32
-
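-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](plt-b32.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.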
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md
deleted file mode 100644
index 426162b4..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md` -->
-
-# pto.plt_b8
-
-Standalone reference page for `pto.plt_b8`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Generate a dynamic 8-bit predicate with lane index less-than comparison, and atomically decrement the scalar operand.
-
-## Mechanism
-
-`pto.plt_b8` compares the lane index against a runtime scalar value and produces a predicate where active lanes satisfy `i < scalar`. Unlike `pge_b8`, this operation **also** decrements the scalar operand by the predicate width, enabling chained remainder-loop generation.
-
-For lane index `i` (0 ≤ i < 8) and scalar threshold `s`:
-
-$$ \mathrm{mask}_i = \begin{cases} 1 & \text{if } i < s \\ 0 & \text{if } i \geq s \end{cases} $$
-$$ s_{\mathrm{out}} = s - 8 $$
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-plt_b8 %dst, %scalar_in : !pto.mask, i8 -> !pto.mask, i8
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask, %scalar_out = pto.plt_b8 %scalar_in : i8 -> !pto.mask, i8
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.plt_b8 ins(%scalar_in : i8) outs(%mask, %scalar_out : !pto.mask, i8)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PLT_B8(RegBuf<predicate_t>& dst,
-                     int8_t& scalar_inout);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%scalar_in` | `i8` | Lane-index threshold; lanes i < scalar_in are active |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | 8-bit predicate with active lanes below threshold |
-| `%scalar_out` | `i8` | Decremented scalar: `scalar_in - 8` |
-
-## Side Effects
-
-- The scalar operand is **modified in place**: `scalar_out = scalar_in - 8`.
-
-## Constraints
-
-- **Scalar range**: `scalar_in` MUST be in the range `[0, 8]`. After subtraction, `scalar_out` may be negative.
-- **Chain requirement**: Programs MUST use `scalar_out` from one iteration as `scalar_in` of the next. Breaking the chain without re-initializing the scalar produces **implementation-defined** predicates.
-- **Predicate width**: The produced predicate is 8 bits wide. For wider predicates, use `ppack` to combine multiple `_b8` results.
-
-## Exceptions
-
-- Illegal if `scalar_in` is outside the range `[0, 8]` for the target profile.
-- Illegal if the scalar chain is broken (use of uninitialized or stale scalar values).
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Dynamic predicate generation | Simulated | Supported | Supported |
-| Scalar decrement | Simulated | Supported | Supported |
-| 8-bit predicate width | Supported | Supported | Supported |
-
-## Examples
-
-### Software-pipelined remainder loop
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void process_remainder(int8_t& rem, RegBuf<predicate_t>& mask) {
-    // On entry: rem holds the remaining element count.
-    // On exit:  mask has every lane i with i < rem active,
-    //           and rem has been decremented by 8 in place.
-    PLT_B8(mask, rem);
-}
-```
-
-### Chained remainder loop in SSA form
-
-```mlir
-// Iteration 1: rem = 23
-//   mask = [1,1,1,1,1,1,1,1] (8 lanes), rem_out = 15
-%mask1, %rem1 = pto.plt_b8 %rem0 : i8 -> !pto.mask, i8
-
-// Iteration 2: rem = 15
-//   mask = [1,1,1,1,1,1,1,1] (8 lanes), rem_out = 7
-%mask2, %rem2 = pto.plt_b8 %rem1 : i8 -> !pto.mask, i8
-
-// Iteration 3: rem = 7
-//   mask = [1,1,1,1,1,1,1,0] (7 lanes), rem_out = -1
-%mask3, %rem3 = pto.plt_b8 %rem2 : i8 -> !pto.mask, i8
-```
-
-### Compare with pge_b8
-
-```mlir
-// pge_b8: lane i is active iff i >= scalar (tail mask)
-//   input: %rem = 3
-//   result: [0,0,0,1,1,1,1,1] (lanes 3-7 active)
-
-// plt_b8: lane i is active iff i < scalar; decrements scalar
-//   input: %rem = 3
-//   result: [1,1,1,0,0,0,0,0] (lanes 0,1,2 active)
-//   output: %scalar_out = -5 (3 - 8)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pge_b32](./pge-b32.md)
-- Next op in family: [pto.plt_b16](./plt-b16.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8_zh.md
deleted file mode 100644
index b48bec0e..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.plt_b8
-
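-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current status
-
-- [Corresponding English page](plt-b8.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.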
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot.md
deleted file mode 100644
index 8a826fbd..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pnot.md` -->
-
-# pto.pnot
-
-Standalone reference page for `pto.pnot`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Bitwise NOT of a predicate.
-
-## Mechanism
-
-`pto.pnot` computes the bitwise NOT of a predicate register, producing a new predicate where lane `i` is active iff the source lane `i` is inactive.
-
-$$ \mathrm{dst}_i = \neg \mathrm{src}_i $$
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pnot %dst, %src : !pto.mask, !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%dst = pto.pnot %src, %mask : !pto.mask, !pto.mask -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pnot ins(%src, %mask : !pto.mask, !pto.mask) outs(%dst : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PNOT(RegBuf<predicate_t>& dst,
-                   const RegBuf<predicate_t>& src,
-                   const RegBuf<predicate_t>& mask);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%src` | `!pto.mask` | Source predicate to invert |
-| `%mask` | `!pto.mask` | Optional masking predicate |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.mask` | Bitwise NOT of src |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Operand widths**: Both predicates MUST have the same width.
-- **No implicit extension**: `pnot` operates on the full predicate width. For predicates of mixed widths, explicit pack/unpack must be used.
-
-## Exceptions
-
-- Illegal if predicate operand widths are not consistent.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Bitwise NOT | Simulated | Supported | Supported |
-
-## Examples
-
-### Invert a predicate
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void invert_mask(RegBuf<predicate_t>& dst,
-                 const RegBuf<predicate_t>& src) {
-    PNOT(dst, src, src);
-}
-```
-
-### SSA form — complement of comparison result
-
-```mlir
-// %cmp: lanes where a[i] < b[i]
-%cmp = pto.vcmp %va, %vb, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-
-// %tail: lanes in remainder region
-%tail = pto.pge_b32 %rem : i32 -> !pto.mask
-
-// Complement: lanes where the comparison is false
-%not_cmp = pto.pnot %cmp, %tail : !pto.mask, !pto.mask -> !pto.mask
-
-// Combine: lanes in remainder region AND NOT in comparison result
-%active = pto.pand %tail, %not_cmp, %tail : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pxor](./pxor.md)
-- Next op in family: [pto.psel](./psel.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot_zh.md
deleted file mode 100644
index f3c4b853..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pnot
-
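-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current status
-
-- [Corresponding English page](pnot.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.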
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/por.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/por.md
deleted file mode 100644
index 9ae5d424..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/por.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/por.md` -->
-
-# pto.por
-
-Standalone reference page for `pto.por`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Bitwise OR of two predicates.
-
-## Mechanism
-
-`pto.por` computes the bitwise OR of two predicate registers, producing a new predicate where lane `i` is active iff at least one source lane `i` is active.
-
-$$ \mathrm{dst}_i = \mathrm{src0}_i \lor \mathrm{src1}_i $$
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-por %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%dst = pto.por %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.por ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void POR(RegBuf<predicate_t>& dst,
-                  const RegBuf<predicate_t>& src0,
-                  const RegBuf<predicate_t>& src1,
-                  const RegBuf<predicate_t>& mask);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%src0` | `!pto.mask` | First source predicate |
-| `%src1` | `!pto.mask` | Second source predicate |
-| `%mask` | `!pto.mask` | Optional masking predicate |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.mask` | Bitwise OR of src0 and src1 |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Operand widths**: All predicate operands MUST have the same width.
-
-## Exceptions
-
-- Illegal if predicate operand widths are not consistent.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Bitwise OR | Simulated | Supported | Supported |
-
-## Examples
-
-### Combine predicates from two conditions
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void union_masks(RegBuf<predicate_t>& dst,
-                 const RegBuf<predicate_t>& mask_a,
-                 const RegBuf<predicate_t>& mask_b) {
-    POR(dst, mask_a, mask_b, mask_a);
-}
-```
-
-### SSA form — union of two predicates
-
-```mlir
-// %mask_a: lanes where a[i] < threshold_a
-// %mask_b: lanes where b[i] > threshold_b
-
-// Union: lanes satisfying either condition
-%combined = pto.por %mask_a, %mask_b, %mask_a : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-
-// Apply the same union independently to the low and high halves
-%lo_combined = pto.por %mask_a_lo, %mask_b_lo, %mask_a_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-%hi_combined = pto.por %mask_a_hi, %mask_b_hi, %mask_a_hi : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pand](./pand.md)
-- Next op in family: [pto.pxor](./pxor.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/por_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/por_zh.md
deleted file mode 100644
index 92f1f7a8..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/por_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.por
-
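-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current status
-
-- [Corresponding English page](por.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.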
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack.md
deleted file mode 100644
index 4619ec40..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/ppack.md` -->
-
-# pto.ppack
-
-Standalone reference page for `pto.ppack`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Place an N-bit predicate segment into one half of a 2N-bit predicate register, selected by a partition token, and zero-fill the other half.
-
-## Mechanism
-
-`pto.ppack` takes a source predicate register and a partition token, and writes a 2N-bit predicate register by filling the selected half with the source bits and zero-filling the other half. It is the inverse of `punpack`.
-
-For source predicate `src` with N bits and partition token `P`:
-
-$$ \mathrm{dst}_{2N} = \begin{cases} \mathrm{ZERO}(N) \Vert \mathrm{src}_N & \text{if } P = \text{LOWER} \\ \mathrm{src}_N \Vert \mathrm{ZERO}(N) & \text{if } P = \text{HIGHER} \end{cases} $$
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-ppack %dst, %src, "PART" : !pto.mask, !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%dst = pto.ppack %src, "PART" : !pto.mask -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.ppack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PPACK(RegBuf<predicate_t>& dst,
-                   const RegBuf<predicate_t>& src,
-                   const char* partition);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%src` | `!pto.mask` | Source N-bit predicate |
-| `"PART"` | string attribute | Partition token: `"LOWER"` or `"HIGHER"` |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.mask` | 2N-bit predicate with the source in the selected half |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Partition token**: MUST be `"LOWER"` or `"HIGHER"`. Other tokens are **illegal**.
-- **Destination width**: The destination predicate is always 2N bits. Programs MUST ensure the destination context expects a 2N-bit predicate. Attempting to use a 2N-bit result in an N-bit context without explicit extraction via `punpack` is **illegal**.
-- **Source width**: The source predicate MUST be N bits (half the destination width). Mismatched widths are **illegal**.
-- **Zero-fill behavior**: The non-selected half of the destination is always zero-filled, not sign-extended or replicated.
-
-## Exceptions
-
-- Illegal if the partition token is not `"LOWER"` or `"HIGHER"`.
-- Illegal if source and destination predicate widths are not in a 1:2 ratio.
-- Illegal if the operation is used in a context that does not expect a 2N-bit result.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Pack operation | Simulated | Supported | Supported |
-| `LOWER` / `HIGHER` tokens | Supported | Supported | Supported |
-
-## Examples
-
-### Pack a b32 predicate into the lower half for f32 (64 lanes)
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void pack_for_f32(RegBuf<predicate_t>& dst,
-                  const RegBuf<predicate_t>& lo,
-                  const RegBuf<predicate_t>& hi) {
-    // dst = [ZERO(32) | lo]: lo fills the lower half, the upper half is zero (hi is not used here)
-    PPACK(dst, lo, "LOWER");
-}
-```
-
-### SSA form
-
-```mlir
-// %rem = 47
-// %lo: lanes 0-31 active (from plt_b32 iteration 1)
-// %hi: lanes 0-14 active (from plt_b32 iteration 2, rem = 15)
-
-// Pack %lo into lower half of 64-bit predicate
-%full_lo = pto.ppack %lo, "LOWER" : !pto.mask -> !pto.mask
-
-// Pack %hi into upper half of 64-bit predicate
-%full_hi = pto.ppack %hi, "HIGHER" : !pto.mask -> !pto.mask
-
-// OR them together to get full 64-lane tail mask
-%tail = pto.por %full_lo, %full_hi, %full_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-### Construct a full-width mask from two half-width masks
-
-```mlir
-// Pack lower half
-%dst_lower = pto.ppack %src_lower, "LOWER" : !pto.mask -> !pto.mask
-
-// Pack upper half
-%dst_upper = pto.ppack %src_upper, "HIGHER" : !pto.mask -> !pto.mask
-
-// Combine with OR to get full-width predicate
-%combined = pto.por %dst_lower, %dst_upper, %dst_lower : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.plt_b32](./plt-b32.md)
-- Next op in family: [pto.punpack](./punpack.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack_zh.md
deleted file mode 100644
index d618d6af..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.ppack
-
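-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current status
-
-- [Corresponding English page](ppack.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.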
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/psel.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/psel.md
deleted file mode 100644
index 82f9b41b..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/psel.md
+++ /dev/null
@@ -1,133 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/psel.md` -->
-
-# pto.psel
-
-Standalone reference page for `pto.psel`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Predicate mux: select between two predicate sources based on a third predicate.
-
-## Mechanism
-
-`pto.psel` selects predicate bits from one of two source predicates based on a third predicate. For each lane `i`:
-
-$$ \mathrm{dst}_i = \begin{cases} \mathrm{src0}_i & \text{if } \mathrm{sel}_i = 1 \\ \mathrm{src1}_i & \text{if } \mathrm{sel}_i = 0 \end{cases} $$
-
-This is a predicate-level ternary select, analogous to vector `vsel` but operating on predicate values directly.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-psel %dst, %src0, %src1, %sel : !pto.mask, !pto.mask, !pto.mask, !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%dst = pto.psel %src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.psel ins(%src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PSEL(RegBuf<predicate_t>& dst,
-                   const RegBuf<predicate_t>& src0,
-                   const RegBuf<predicate_t>& src1,
-                   const RegBuf<predicate_t>& sel,
-                   const RegBuf<predicate_t>& mask);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%src0` | `!pto.mask` | Predicate selected when corresponding sel bit is 1 |
-| `%src1` | `!pto.mask` | Predicate selected when corresponding sel bit is 0 |
-| `%sel` | `!pto.mask` | Per-lane selection predicate |
-| `%mask` | `!pto.mask` | Optional masking predicate |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.mask` | Per-lane selection between src0 and src1 |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Operand widths**: All four predicate operands MUST have the same width.
-- **Select semantic**: `sel_i = 1` → select `src0_i`; `sel_i = 0` → select `src1_i`.
-
-## Exceptions
-
-- Illegal if predicate operand widths are not consistent.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Predicate select | Simulated | Supported | Supported |
-
-## Examples
-
-### Predicate mux
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void select_predicate(RegBuf<predicate_t>& dst,
-                     const RegBuf<predicate_t>& src0,
-                     const RegBuf<predicate_t>& src1,
-                     const RegBuf<predicate_t>& sel) {
-    PSEL(dst, src0, src1, sel, sel);
-}
-```
-
-### SSA form — dynamic predicate routing
-
-```mlir
-// %active_a: predicate from comparison A
-// %active_b: predicate from comparison B
-// %condition: runtime condition determining which set to use
-
-// If condition is true, use set A; otherwise use set B
-%active = pto.psel %active_a, %active_b, %condition, %condition
-    : !pto.mask, !pto.mask, !pto.mask, !pto.mask
-    -> !pto.mask
-```
-
-### Equivalent to boolean expression
-
-The `psel` operation is equivalent to the following boolean expression:
-
-```mlir
-// psel %dst, %src0, %src1, %sel
-// = (src0 AND sel) OR (src1 AND NOT sel)
-
-%sel_inv = pto.pnot %sel, %sel : !pto.mask, !pto.mask -> !pto.mask
-%and0 = pto.pand %src0, %sel, %sel : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-%and1 = pto.pand %src1, %sel_inv, %sel : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-%dst = pto.por %and0, %and1, %and0 : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pnot](./pnot.md)
-- Next op in family: [pto.pdintlv_b8](./pdintlv-b8.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/psel_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/psel_zh.md
deleted file mode 100644
index be0e05cc..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/psel_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.psel
-
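-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current status
-
-- [Corresponding English page](psel.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.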
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md
deleted file mode 100644
index ab7f8c54..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md
+++ /dev/null
@@ -1,126 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md` -->
-
-# pto.pset_b16
-
-Standalone reference page for `pto.pset_b16`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Construct a 16-bit predicate mask from a compile-time pattern token.
-
-## Mechanism
-
-`pto.pset_b16` sets the predicate register to a static pattern encoded by the pattern token. No runtime data is consumed; the entire result is determined at assembly time.
-
-For a predicate register of width 16 bits:
-
-$$ \mathrm{mask}_i = \begin{cases} 1 & \text{if lane } i \text{ matches pattern} \\ 0 & \text{otherwise} \end{cases} $$
-
-The pattern token fully determines which bits are set.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pset_b16 %dst, "PATTERN" : !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask = pto.pset_b16 "PATTERN" : !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pset_b16 "PATTERN" outs(%mask : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PSET_B16(RegBuf<predicate_t>& dst, const char* pattern);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `"PATTERN"` | string attribute | Compile-time pattern token |
-
-### Supported Pattern Tokens
-
-| Pattern | Predicate Width | Meaning |
-|---------|:--------------:|---------|
-| `PAT_ALL` | 16 | All 16 bits set to 1 |
-| `PAT_ALLF` | 16 | All 16 bits set to 0 |
-| `PAT_VL1` … `PAT_VL16` | 16 | First N bits set to 1 |
-| `PAT_H` | 16 | Bits 8–15 set to 1 (high half), bits 0–7 set to 0 |
-| `PAT_Q` | 16 | Bits 12–15 set to 1 (upper quarter), bits 0–11 set to 0 |
-| `PAT_M3` | 16 | Modular: repeat 0-0-0-1 pattern (lanes 3, 7, 11, 15 active) |
-| `PAT_M4` | 16 | Modular: repeat 1-1-1-1-0-0-0-0 pattern (lanes 0–3, 8–11 active) |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | Constructed 16-bit predicate |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Pattern token validity**: The pattern token MUST be valid for a 16-bit predicate width. Using a `PAT_VL*` token with N > 16 is **illegal**.
-- **Predicate context**: This operation produces a fixed-width predicate. Programs that use it in a wider context MUST use pack/unpack to adapt.
-
-## Exceptions
-
-- Illegal if the pattern token is not valid for the `_b16` (16-bit) variant.
-- Illegal if the pattern token is not supported by the target profile.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| All pattern tokens | Simulated | Supported | Supported |
-| 16-bit predicate width | Supported | Supported | Supported |
-
-## Examples
-
-### Construct all-active mask
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void set_all_active(RegBuf<predicate_t>& dst) {
-    PSET_B16(dst, "PAT_ALL");
-}
-```
-
-### Construct modular pattern
-
-```mlir
-// Modular 3 pattern: lanes 3, 7, 11, 15 active
-%mod3 = pto.pset_b16 "PAT_M3" : !pto.mask
-```
-
-### Construct high-half-active mask
-
-```mlir
-// High half: bits 8–15 active, bits 0–7 inactive
-%high = pto.pset_b16 "PAT_H" : !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pset_b8](./pset-b8.md)
-- Next op in family: [pto.pset_b32](./pset-b32.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16_zh.md
deleted file mode 100644
index fe98cc6b..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pset_b16
-
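-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current status
-
-- [Corresponding English page](pset-b16.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.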
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md
deleted file mode 100644
index 130306a7..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md
+++ /dev/null
@@ -1,129 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md` -->
-
-# pto.pset_b32
-
-Standalone reference page for `pto.pset_b32`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Construct a 32-bit predicate mask from a compile-time pattern token.
-
-## Mechanism
-
-`pto.pset_b32` sets the predicate register to a static pattern encoded by the pattern token. No runtime data is consumed; the entire result is determined at assembly time.
-
-For a predicate register of width 32 bits:
-
-$$ \mathrm{mask}_i = \begin{cases} 1 & \text{if lane } i \text{ matches pattern} \\ 0 & \text{otherwise} \end{cases} $$
-
-The `_b32` variant is the widest directly constructible predicate segment. For wider predicates, use `ppack` to place two `_b32` predicates into the lower and upper halves and combine them with `por`.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pset_b32 %dst, "PATTERN" : !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask = pto.pset_b32 "PATTERN" : !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pset_b32 "PATTERN" outs(%mask : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PSET_B32(RegBuf<predicate_t>& dst, const char* pattern);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `"PATTERN"` | string attribute | Compile-time pattern token |
-
-### Supported Pattern Tokens
-
-| Pattern | Predicate Width | Meaning |
-|---------|:--------------:|---------|
-| `PAT_ALL` | 32 | All 32 bits set to 1 |
-| `PAT_ALLF` | 32 | All 32 bits set to 0 |
-| `PAT_VL1` … `PAT_VL32` | 32 | First N bits set to 1 |
-| `PAT_H` | 32 | Bits 16–31 set to 1 (high half), bits 0–15 set to 0 |
-| `PAT_Q` | 32 | Bits 24–31 set to 1 (upper quarter), bits 0–23 set to 0 |
-| `PAT_M3` | 32 | Modular 3 pattern |
-| `PAT_M4` | 32 | Modular 4 pattern |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | Constructed 32-bit predicate |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Pattern token validity**: The pattern token MUST be valid for a 32-bit predicate width. Using a `PAT_VL*` token with N > 32 is **illegal**.
-- **Predicate context**: The `_b32` predicate can be combined with another `_b32` using `ppack` to form a 64-bit predicate for f32 vector width (N=64).
-
-## Exceptions
-
-- Illegal if the pattern token is not valid for the `_b32` (32-bit) variant.
-- Illegal if the pattern token is not supported by the target profile.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| All pattern tokens | Simulated | Supported | Supported |
-| 32-bit predicate width | Supported | Supported | Supported |
-
-## Examples
-
-### Construct all-active mask
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void set_all_active(RegBuf<predicate_t>& dst) {
-    PSET_B32(dst, "PAT_ALL");
-}
-```
-
-### Use as active-lane mask for f32 vector operations
-
-```mlir
-// All lanes active for f32 (64-bit predicate = pack one b32 into each half, then OR)
-%all32 = pto.pset_b32 "PAT_ALL" : !pto.mask
-%all64_lo = pto.ppack %all32, "LOWER" : !pto.mask -> !pto.mask
-%all64_hi = pto.ppack %all32, "HIGHER" : !pto.mask -> !pto.mask
-%all64 = pto.por %all64_lo, %all64_hi, %all64_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-### Construct remainder mask
-
-```mlir
-// First 12 lanes active (remainder loop)
-%remainder = pto.pset_b32 "PAT_VL12" : !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.pset_b16](./pset-b16.md)
-- Next op in family: [pto.pge_b8](./pge-b8.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32_zh.md
deleted file mode 100644
index b0f78eff..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pset_b32
-
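-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current status
-
-- [Corresponding English page](pset-b32.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.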
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md
deleted file mode 100644
index 83205264..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md
+++ /dev/null
@@ -1,124 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md` -->
-
-# pto.pset_b8
-
-Standalone reference page for `pto.pset_b8`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Construct an 8-bit predicate mask from a compile-time pattern token.
-
-## Mechanism
-
-`pto.pset_b8` sets the predicate register to a static pattern encoded by the pattern token. No runtime data is consumed; the entire result is determined at assembly time.
-
-For a predicate register of width 8 bits:
-
-$$ \mathrm{mask}_i = \begin{cases} 1 & \text{if lane } i \text{ matches pattern} \\ 0 & \text{otherwise} \end{cases} $$
-
-The pattern token fully determines which bits are set. The operation is purely combinational — no pipeline resources are consumed beyond the scalar unit.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pset_b8 %dst, "PATTERN" : !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask = pto.pset_b8 "PATTERN" : !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pset_b8 "PATTERN" outs(%mask : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PSET_B8(RegBuf<predicate_t>& dst, const char* pattern);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `"PATTERN"` | string attribute | Compile-time pattern token |
-
-### Supported Pattern Tokens
-
-| Pattern | Predicate Width | Meaning |
-|---------|:--------------:|---------|
-| `PAT_ALL` | 8 | All 8 bits set to 1 |
-| `PAT_ALLF` | 8 | All 8 bits set to 0 |
-| `PAT_VL1` | 8 | Bit 0 set to 1, bits 1–7 set to 0 |
-| `PAT_VL2` | 8 | Bits 0–1 set to 1, bits 2–7 set to 0 |
-| `PAT_H` | 8 | Bits 4–7 set to 1 (high half), bits 0–3 set to 0 |
-| `PAT_Q` | 8 | Bits 6–7 set to 1 (upper quarter), bits 0–5 set to 0 |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | Constructed 8-bit predicate |
-
-## Side Effects
-
-None. This operation does not modify architectural state other than the destination predicate register.
-
-## Constraints
-
-- **Pattern token validity**: The pattern token MUST be valid for an 8-bit predicate width. Using a `PAT_VL*` token with N > 8 is **illegal**.
-- **Predicate context**: This operation produces a fixed-width predicate. Programs that use it in a wider predicate context MUST ensure width compatibility or use pack/unpack operations to adapt.
-- **No dynamic component**: There are no runtime operands; the result is fully determined by the pattern token.
-
-## Exceptions
-
-- Illegal if the pattern token is not valid for the `_b8` (8-bit) variant.
-- Illegal if the pattern token is not supported by the target profile.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| All pattern tokens | Simulated | Supported | Supported |
-| 8-bit predicate width | Supported | Supported | Supported |
-
-## Examples
-
-### Construct all-active mask
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void set_all_active(RegBuf<predicate_t>& dst) {
-    PSET_B8(dst, "PAT_ALL");
-}
-```
-
-### Construct all-inactive mask
-
-```mlir
-%none = pto.pset_b8 "PAT_ALLF" : !pto.mask
-```
-
-### Construct first-3-lanes-active mask
-
-```mlir
-%first3 = pto.pset_b8 "PAT_VL3" : !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Next op in family: [pto.pset_b16](./pset-b16.md)
-- Previous op in family: (none — first in pattern group)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8_zh.md
deleted file mode 100644
index 3fd8ba40..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pset_b8
-
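-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](pset-b8.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.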
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack.md
deleted file mode 100644
index 30721068..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack.md
+++ /dev/null
@@ -1,124 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/punpack.md` -->
-
-# pto.punpack
-
-Standalone reference page for `pto.punpack`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Widening unpack: extract one N-bit segment from a 2N-bit predicate register, zero-filling the non-selected half of the source.
-
-## Mechanism
-
-`pto.punpack` takes a 2N-bit predicate register and a partition token, and produces an N-bit predicate by selecting one half and zero-filling the other. It is the inverse of `ppack`.
-
-For source predicate `src` with 2N bits and partition token `P`:
-
-$$ \mathrm{dst}_N = \begin{cases} \mathrm{LOWER}(\mathrm{src}_{2N}) & \text{if } P = \text{LOWER} \\ \mathrm{UPPER}(\mathrm{src}_{2N}) & \text{if } P = \text{HIGHER} \end{cases} $$
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-punpack %dst, %src, "PART" : !pto.mask, !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%dst = pto.punpack %src, "PART" : !pto.mask -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.punpack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PUNPACK(RegBuf<predicate_t>& dst,
-                      const RegBuf<predicate_t>& src,
-                      const char* partition);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%src` | `!pto.mask` | Source 2N-bit predicate |
-| `"PART"` | string attribute | Partition token: `"LOWER"` or `"HIGHER"` |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.mask` | N-bit predicate extracted from the selected half |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Partition token**: MUST be `"LOWER"` or `"HIGHER"`. Other tokens are **illegal**.
-- **Source width**: The source predicate MUST be 2N bits. Programs MUST ensure the source context provides a 2N-bit predicate.
-- **Destination width**: The destination predicate is always N bits. Programs that need a 2N-bit result after extraction MUST use `ppack` to reconstruct it.
-- **Zero-fill behavior**: The non-selected half of the source is ignored (zero-filled); the destination does NOT contain a concatenation or merge of both halves.
-
-## Exceptions
-
-- Illegal if the partition token is not `"LOWER"` or `"HIGHER"`.
-- Illegal if source and destination predicate widths are not in a 2:1 ratio.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Unpack operation | Simulated | Supported | Supported |
-| `LOWER` / `HIGHER` tokens | Supported | Supported | Supported |
-
-## Examples
-
-### Extract upper half of a 64-bit predicate
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void extract_upper(RegBuf<predicate_t>& dst,
-                   const RegBuf<predicate_t>& src_64) {
-    PUNPACK(dst, src_64, "HIGHER");
-}
-```
-
-### Extract and re-pack with modification
-
-```mlir
-// %full_64: 64-bit predicate from a comparison
-
-// Extract lower half
-%lo = pto.punpack %full_64, "LOWER" : !pto.mask -> !pto.mask
-
-// Extract upper half
-%hi = pto.punpack %full_64, "HIGHER" : !pto.mask -> !pto.mask
-
-// Modify lower half (e.g., invert)
-%lo_inv = pto.pnot %lo, %lo : !pto.mask, !pto.mask -> !pto.mask
-
-// Re-pack into 64-bit predicate
-%new_lo = pto.ppack %lo_inv, "LOWER" : !pto.mask -> !pto.mask
-%new_hi = pto.ppack %hi, "HIGHER" : !pto.mask -> !pto.mask
-%new_full = pto.por %new_lo, %new_hi, %new_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.ppack](./ppack.md)
-- Next op in family: [pto.pand](./pand.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack_zh.md
deleted file mode 100644
index f6abca43..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.punpack
-
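-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](punpack.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.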
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor.md
deleted file mode 100644
index 197c45ca..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor.md
+++ /dev/null
@@ -1,117 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-generation-and-algebra/pxor.md` -->
-
-# pto.pxor
-
-Standalone reference page for `pto.pxor`. This page belongs to the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) family in the PTO ISA manual.
-
-## Summary
-
-Bitwise XOR of two predicates.
-
-## Mechanism
-
-`pto.pxor` computes the bitwise XOR of two predicate registers, producing a new predicate where lane `i` is active iff exactly one of `src0` and `src1` is active at lane `i`.
-
-$$ \mathrm{dst}_i = \mathrm{src0}_i \oplus \mathrm{src1}_i $$
-
-XOR is commonly used to invert one predicate within a mask context: `pxor %p, %inv, %mask` produces `mask XOR inv`, effectively inverting `inv` where `mask` is 1.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pxor %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%dst = pto.pxor %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pxor ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PXOR(RegBuf<predicate_t>& dst,
-                   const RegBuf<predicate_t>& src0,
-                   const RegBuf<predicate_t>& src1,
-                   const RegBuf<predicate_t>& mask);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%src0` | `!pto.mask` | First source predicate |
-| `%src1` | `!pto.mask` | Second source predicate |
-| `%mask` | `!pto.mask` | Optional masking predicate |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.mask` | Bitwise XOR of src0 and src1 |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **Operand widths**: All predicate operands MUST have the same width.
-
-## Exceptions
-
-- Illegal if predicate operand widths are not consistent.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Bitwise XOR | Simulated | Supported | Supported |
-
-## Examples
-
-### Conditional inversion via XOR
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void invert_with_mask(RegBuf<predicate_t>& dst,
-                      const RegBuf<predicate_t>& to_invert,
-                      const RegBuf<predicate_t>& mask) {
-    // dst = mask XOR to_invert (inverts to_invert where mask is 1)
-    PXOR(dst, mask, to_invert, mask);
-}
-```
-
-### SSA form — XOR for predicate difference
-
-```mlir
-// %mask_a: lanes active in set A
-// %mask_b: lanes active in set B
-
-// Symmetric difference: lanes active in exactly one set
-%diff = pto.pxor %mask_a, %mask_b, %mask_a : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-
-// NOT(A XOR B) is the XNOR (lanes where A and B agree), not the intersection; A AND B is computed directly with pand
-%inv = pto.pnot %diff, %diff : !pto.mask, !pto.mask -> !pto.mask
-%intersection = pto.pand %mask_a, %mask_b, %mask_a : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md)
-- Previous op in family: [pto.por](./por.md)
-- Next op in family: [pto.pnot](./pnot.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor_zh.md
deleted file mode 100644
index 289136f4..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pxor
-
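-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](pxor.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.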
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pld.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pld.md
deleted file mode 100644
index 57696284..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pld.md
+++ /dev/null
@@ -1,124 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-load-store/pld.md` -->
-
-# pto.pld
-
-Standalone reference page for `pto.pld`. This page belongs to the [Predicate Load Store](../../predicate-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Load the full predicate register from a UB location with a register-relative address offset.
-
-## Mechanism
-
-`pto.pld` reads a predicate word from a UB address computed as `base + areg * sizeof(predicate)`, then materializes it as `!pto.mask`. The offset is sourced from a scalar register, making the effective address data-dependent.
-
-For predicate width `Pw`, UB base `base`, and offset register `areg`:
-
-$$ \mathrm{addr} = base + areg \times 8 $$
-$$ \mathrm{mask} = \mathrm{READ\_UB}_{64}(\mathrm{addr}) $$
-
-The offset register value is interpreted as a displacement in units of 8 bytes (64 bits), so the byte displacement is `areg * 8`. The resulting effective address must be 64-bit aligned.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pld %mask, %ub_ptr[%areg], "DIST" : !pto.mask, !pto.ptr<i64, ub>, i32
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask = pto.pld %ub_ptr, %areg, "DIST" : !pto.ptr<i64, ub>, i32 -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pld ins(%ub_ptr, %areg, "DIST" : !pto.ptr<i64, ub>, i32) outs(%mask : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PLD(RegBuf<predicate_t>& dst,
-                  const Ptr<ub_space_t, ub_t>& base,
-                  int32_t areg,
-                  const char* dist = "NORM");
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%ub_ptr` | `!pto.ptr<i64, ub>` | UB base address |
-| `%areg` | `i32` | Scalar register holding the byte offset in 8-byte units |
-| `"DIST"` | string attribute | Distribution mode: `"NORM"`, `"US"`, or `"DS"` |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | Loaded predicate register |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **UB address space**: `%ub_ptr` MUST have address space `ub`.
-- **Offset alignment**: The offset register MUST be set such that `base + areg * 8` is 64-bit aligned. Misaligned effective addresses are **illegal**.
-- **Distribution mode**: The `dist` attribute MUST be one of `"NORM"`, `"US"`, or `"DS"`. Other modes are **illegal** for this form.
-- **Predicate width**: The load transfers exactly 64 bits, which MUST match the active element type context.
-- **Single active predicate**: Loading a new predicate does not implicitly save a prior predicate. Programs that need to preserve predicate state MUST save it first.
-
-## Exceptions
-
-- Illegal if `%ub_ptr` is not a UB-space pointer.
-- Illegal if the effective address (base + areg * 8) is not 64-bit aligned.
-- Illegal if `dist` attribute is not a supported distribution mode.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Register-offset predicate load | Simulated | Supported | Supported |
-| `"NORM"` distribution mode | Supported | Supported | Supported |
-| `"US"` / `"DS"` distribution modes | Simulated | Supported | Supported |
-
-## Examples
-
-### Load predicate with register offset
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void load_with_offset(RegBuf<predicate_t>& dst,
-                      Ptr<ub_space_t, ub_t> base,
-                      int32_t slot) {
-    // slot is in units of 8 bytes (one predicate word per slot)
-    PLD(dst, base, slot, "NORM");
-}
-```
-
-### SSA form
-
-```mlir
-// UB base at %ub_base; %c1 holds slot index (in 8-byte units)
-%mask = pto.pld %ub_base, %c1, "NORM" : !pto.ptr<i64, ub>, i32 -> !pto.mask
-
-// Use predicate in predicated vector operation
-%result = pto.vsel %v_a, %v_b, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Load Store](../../predicate-load-store.md)
-- Previous op in family: [pto.plds](./plds.md)
-- Next op in family: [pto.pldi](./pldi.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pld_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pld_zh.md
deleted file mode 100644
index a8446a28..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pld_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pld
-
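-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](pld.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.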
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pldi.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pldi.md
deleted file mode 100644
index e20fbc3f..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pldi.md
+++ /dev/null
@@ -1,125 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-load-store/pldi.md` -->
-
-# pto.pldi
-
-Standalone reference page for `pto.pldi`. This page belongs to the [Predicate Load Store](../../predicate-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Load the full predicate register from a UB location with an immediate (compile-time constant) byte offset.
-
-## Mechanism
-
-`pto.pldi` reads a predicate word from a UB address computed as `base + imm * 8`, then materializes it as `!pto.mask`. The offset is a compile-time immediate, enabling address resolution at assembly time.
-
-For predicate width `Pw`, UB base `base`, and immediate offset `imm`:
-
-$$ \mathrm{addr} = base + imm \times 8 $$
-$$ \mathrm{mask} = \mathrm{READ\_UB}_{64}(\mathrm{addr}) $$
-
-The immediate offset is encoded directly in the instruction word, in units of 8 bytes (64 bits).
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pldi %mask, %ub_ptr[%imm], "DIST" : !pto.mask, !pto.ptr<i64, ub>, i32
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask = pto.pldi %ub_ptr, %imm, "DIST" : !pto.ptr<i64, ub>, i32 -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pldi ins(%ub_ptr, %imm, "DIST" : !pto.ptr<i64, ub>, i32) outs(%mask : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PLDI(RegBuf<predicate_t>& dst,
-                   const Ptr<ub_space_t, ub_t>& base,
-                   int32_t imm,
-                   const char* dist = "NORM");
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%ub_ptr` | `!pto.ptr<i64, ub>` | UB base address |
-| `%imm` | `i32` | Immediate byte offset in 8-byte units (compile-time constant) |
-| `"DIST"` | string attribute | Distribution mode: `"NORM"`, `"US"`, or `"DS"` |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | Loaded predicate register |
-
-## Side Effects
-
-None.
-
-## Constraints
-
-- **UB address space**: `%ub_ptr` MUST have address space `ub`.
-- **Offset alignment**: The effective address MUST be 64-bit aligned. Because `imm * 8` is always a multiple of 8, this requires `base` itself to be 64-bit aligned. Misaligned effective addresses are **illegal**.
-- **Immediate range**: The offset immediate MUST fit in the instruction encoding. The exact immediate range is defined by the target profile; out-of-range values are **illegal**.
-- **Distribution mode**: The `dist` attribute MUST be one of `"NORM"`, `"US"`, or `"DS"`.
-- **Predicate width**: The load transfers exactly 64 bits, which MUST match the active element type context.
-
-## Exceptions
-
-- Illegal if `%ub_ptr` is not a UB-space pointer.
-- Illegal if the effective address is not 64-bit aligned.
-- Illegal if the immediate offset is out of range for the target profile.
-- Illegal if `dist` attribute is not a supported distribution mode.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Immediate-offset predicate load | Simulated | Supported | Supported |
-| `"NORM"` distribution mode | Supported | Supported | Supported |
-| `"US"` / `"DS"` distribution modes | Simulated | Supported | Supported |
-| Immediate offset range | Implementation-defined | 0–255 (8-byte units) | 0–1023 (8-byte units) |
-
-## Examples
-
-### Load predicate with immediate offset
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void load_immediate(RegBuf<predicate_t>& dst,
-                    Ptr<ub_space_t, ub_t> base) {
-    // Load predicate from base + 3 * 8 = base + 24 bytes
-    PLDI(dst, base, 3, "NORM");
-}
-```
-
-### SSA form
-
-```mlir
-// Load predicate from slot 2 (2 * 8 = 16 bytes offset)
-%mask = pto.pldi %ub_base, 2, "NORM" : !pto.ptr<i64, ub>, i32 -> !pto.mask
-
-// Use in predicated vector select
-%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Load Store](../../predicate-load-store.md)
-- Previous op in family: [pto.pld](./pld.md)
-- Next op in family: [pto.psts](./psts.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pldi_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pldi_zh.md
deleted file mode 100644
index fde65bac..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pldi_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pldi
-
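-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](pldi.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.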
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/plds.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/plds.md
deleted file mode 100644
index 2c6b3500..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/plds.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-load-store/plds.md` -->
-
-# pto.plds
-
-Standalone reference page for `pto.plds`. This page belongs to the [Predicate Load Store](../../predicate-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Load the full predicate register from a contiguous UB location.
-
-## Mechanism
-
-`pto.plds` reads a predicate word from a UB address and materializes it as `!pto.mask`. The operation covers the full predicate width for the active element type (64 bits for f32, 128 bits for f16/bf16, 256 bits for i8/u8).
-
-For predicate width `Pw` and UB address `base`:
-
-$$ \mathrm{mask} = \mathrm{READ\_UB}_{64}(base) $$
-
-The predicate register is updated atomically. All bits are meaningful only within the current element-type context; unused upper bits for narrower types are **implementation-defined**.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-plds %mask, %ub_ptr : !pto.mask, !pto.ptr<i64, ub>
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%mask = pto.plds %ub_ptr : !pto.ptr<i64, ub> -> !pto.mask
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.plds ins(%ub_ptr : !pto.ptr<i64, ub>) outs(%mask : !pto.mask)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PLDS(RegBuf<predicate_t>& dst, const Ptr<ub_space_t, ub_t>& src);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%ub_ptr` | `!pto.ptr<i64, ub>` | UB base address (must be 64-bit aligned) |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%mask` | `!pto.mask` | Loaded predicate register |
-
-## Side Effects
-
-None. Does not implicitly fence or synchronize with any pipeline.
-
-## Constraints
-
-- **UB address space**: `%ub_ptr` MUST have address space `ub`. Pointers to other spaces are illegal.
-- **Alignment**: The effective address MUST be 64-bit aligned. Misaligned addresses are **illegal**.
-- **Predicate width**: The load transfers exactly 64 bits. The caller MUST ensure this matches the active element type context.
-- **Single active predicate**: Loading a new predicate does not implicitly clear or save a prior predicate. Programs that need to preserve predicate state MUST save it to UB before loading.
-
-## Exceptions
-
-- Illegal if `%ub_ptr` is not a UB-space pointer.
-- Illegal if the effective address is not 64-bit aligned.
-- Illegal if predicate width does not match the active element type context.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Contiguous load | Simulated | Supported | Supported |
-| 64-bit alignment requirement | Enforced | Enforced | Enforced |
-| Predicate width (f32 / f16,bf16 / i8) | N=64/128/256 | N=64/128/256 | N=64/128/256 |
-
-## Examples
-
-### Load predicate from UB
-
-```c
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void load_saved_mask(RegBuf<predicate_t>& dst, Ptr<ub_space_t, ub_t> src) {
-    PLDS(dst, src);
-}
-```
-
-### SSA form
-
-```mlir
-// Load predicate from UB slot 0
-%mask = pto.plds %ub_mask_slot0 : !pto.ptr<i64, ub> -> !pto.mask
-
-// Use predicate in vector select
-%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Load Store](../../predicate-load-store.md)
-- Next op in family: [pto.pld](./pld.md)
-- Previous op in family: (none — first in family)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/plds_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/plds_zh.md
deleted file mode 100644
index a4e96e72..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/plds_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.plds
-
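-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](plds.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.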
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pst.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pst.md
deleted file mode 100644
index d28d8654..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pst.md
+++ /dev/null
@@ -1,124 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-load-store/pst.md` -->
-
-# pto.pst
-
-Standalone reference page for `pto.pst`. This page belongs to the [Predicate Load Store](../../predicate-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Store the full predicate register to a UB location with a register-relative address offset.
-
-## Mechanism
-
-`pto.pst` writes a predicate word from `!pto.mask` to a UB address computed as `base + areg * 8`. The offset is sourced from a scalar register, enabling data-dependent addressing.
-
-For predicate `mask`, UB base `base`, and offset register `areg`:
-
-$$ \mathrm{addr} = base + areg \times 8 $$
-$$ \mathrm{WRITE\_UB}_{64}(\mathrm{addr}, mask) $$
-
-The predicate register is read atomically. Only bits within the current element-type predicate width are transferred.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pst %mask, %ub_ptr[%areg], "DIST" : !pto.mask, !pto.ptr<i64, ub>, i32
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-pto.pst %mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr<i64, ub>, i32
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pst ins(%mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr<i64, ub>, i32)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PST(RegBuf<predicate_t>& src,
-                  const Ptr<ub_space_t, ub_t>& base,
-                  int32_t areg,
-                  const char* dist = "PK");
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%mask` | `!pto.mask` | Predicate register to store |
-| `%ub_ptr` | `!pto.ptr<i64, ub>` | UB base address |
-| `%areg` | `i32` | Scalar register holding the byte offset in 8-byte units |
-| `"DIST"` | string attribute | Distribution mode: `"NORM"` or `"PK"` |
-
-## Expected Outputs
-
-None. This form is defined by its side effect on UB memory.
-
-## Side Effects
-
-- Writes the predicate register value to the UB location.
-- UB memory at the target address is modified.
-
-## Constraints
-
-- **UB address space**: `%ub_ptr` MUST have address space `ub`.
-- **Offset alignment**: The effective address MUST be 64-bit aligned. Misaligned effective addresses are **illegal**.
-- **Distribution mode**: The `dist` attribute MUST be `"NORM"` or `"PK"`. The `"PK"` mode packs two 32-bit predicate segments into one 64-bit word for stores.
-- **Predicate width**: The store transfers exactly 64 bits, which MUST match the active element type context.
-- **Write atomicity**: The 64-bit predicate word is written atomically.
-
-## Exceptions
-
-- Illegal if `%ub_ptr` is not a UB-space pointer.
-- Illegal if the effective address is not 64-bit aligned.
-- Illegal if `dist` attribute is not `"NORM"` or `"PK"`.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Register-offset predicate store | Simulated | Supported | Supported |
-| `"NORM"` distribution mode | Supported | Supported | Supported |
-| `"PK"` (packed) distribution mode | Not supported | Supported | Supported |
-
-## Examples
-
-### Store predicate with register offset
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void store_with_offset(RegBuf<predicate_t>& src,
-                       Ptr<ub_space_t, ub_t> base,
-                       int32_t slot) {
-    // slot is in units of 8 bytes (one predicate word per slot)
-    PST(src, base, slot, "NORM");
-}
-```
-
-### SSA form
-
-```mlir
-// Generate predicate from comparison
-%mask = pto.vcmp %v0, %v1, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-
-// Store predicate to UB at base + slot * 8
-pto.pst %mask, %ub_base, %slot, "NORM" : !pto.mask, !pto.ptr<i64, ub>, i32
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Load Store](../../predicate-load-store.md)
-- Previous op in family: [pto.psts](./psts.md)
-- Next op in family: [pto.psti](./psti.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pst_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pst_zh.md
deleted file mode 100644
index 7a7056eb..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pst_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pst
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](pst.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psti.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psti.md
deleted file mode 100644
index 6bfb1203..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psti.md
+++ /dev/null
@@ -1,126 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-load-store/psti.md` -->
-
-# pto.psti
-
-Standalone reference page for `pto.psti`. This page belongs to the [Predicate Load Store](../../predicate-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Store the full predicate register to a UB location with an immediate (compile-time constant) offset, expressed in 8-byte units.
-
-## Mechanism
-
-`pto.psti` writes a predicate word from `!pto.mask` to a UB address computed as `base + imm * 8`. The offset is a compile-time immediate, enabling address resolution at assembly time.
-
-For predicate `mask`, UB base `base`, and immediate offset `imm`:
-
-$$ \mathrm{addr} = base + imm \times 8 $$
-$$ \mathrm{WRITE\_UB}_{64}(\mathrm{addr}, mask) $$
-
-The immediate offset is encoded directly in the instruction word, in units of 8 bytes (64 bits).
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-psti %mask, %ub_ptr[%imm], "DIST" : !pto.mask, !pto.ptr<i64, ub>, i32
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-pto.psti %mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr<i64, ub>, i32
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.psti ins(%mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr<i64, ub>, i32)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PSTI(RegBuf<predicate_t>& src,
-                    const Ptr<ub_space_t, ub_t>& base,
-                    int32_t imm,
-                    const char* dist = "PK");
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%mask` | `!pto.mask` | Predicate register to store |
-| `%ub_ptr` | `!pto.ptr<i64, ub>` | UB base address |
-| `%imm` | `i32` | Immediate offset in 8-byte units (compile-time constant) |
-| `"DIST"` | string attribute | Distribution mode: `"NORM"` or `"PK"` |
-
-## Expected Outputs
-
-None. This form is defined by its side effect on UB memory.
-
-## Side Effects
-
-- Writes the predicate register value to the UB location.
-- UB memory at the target address is modified.
-
-## Constraints
-
-- **UB address space**: `%ub_ptr` MUST have address space `ub`.
-- **Offset alignment**: The effective address MUST be 64-bit aligned. Because `imm * 8` is always a multiple of 8, this requires `%ub_ptr` itself to be 8-byte aligned. Misaligned effective addresses are **illegal**.
-- **Immediate range**: The offset immediate MUST fit in the instruction encoding. The exact immediate range is defined by the target profile; out-of-range values are **illegal**.
-- **Distribution mode**: The `dist` attribute MUST be `"NORM"` or `"PK"`.
-- **Predicate width**: The store transfers exactly 64 bits, which MUST match the active element type context.
-- **Write atomicity**: The 64-bit predicate word is written atomically.
-
-## Exceptions
-
-- Illegal if `%ub_ptr` is not a UB-space pointer.
-- Illegal if the effective address is not 64-bit aligned.
-- Illegal if the immediate offset is out of range for the target profile.
-- Illegal if `dist` attribute is not `"NORM"` or `"PK"`.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Immediate-offset predicate store | Simulated | Supported | Supported |
-| `"NORM"` distribution mode | Supported | Supported | Supported |
-| `"PK"` (packed) distribution mode | Not supported | Supported | Supported |
-| Immediate offset range | Implementation-defined | 0–255 (8-byte units) | 0–1023 (8-byte units) |
-
-## Examples
-
-### Store predicate with immediate offset
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void store_immediate(RegBuf<predicate_t>& src,
-                     Ptr<ub_space_t, ub_t> base) {
-    // Store predicate to base + 2 * 8 = base + 16 bytes
-    PSTI(src, base, 2, "NORM");
-}
-```
-
-### SSA form
-
-```mlir
-// Generate predicate from comparison
-%mask = pto.vcmp %v0, %v1, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-
-// Store predicate to UB at base + 4 * 8 = base + 32 bytes
-pto.psti %mask, %ub_base, 4, "NORM" : !pto.mask, !pto.ptr<i64, ub>, i32
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Load Store](../../predicate-load-store.md)
-- Previous op in family: [pto.pst](./pst.md)
-- Next op in family: [pto.pstu](./pstu.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psti_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psti_zh.md
deleted file mode 100644
index 549f0d09..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psti_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.psti
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](psti.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psts.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psts.md
deleted file mode 100644
index f5bc4627..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psts.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-load-store/psts.md` -->
-
-# pto.psts
-
-Standalone reference page for `pto.psts`. This page belongs to the [Predicate Load Store](../../predicate-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Store the full predicate register to a contiguous UB location.
-
-## Mechanism
-
-`pto.psts` writes a predicate word from `!pto.mask` to a UB address. The operation covers the full predicate width for the active element type (64 bits for f32, 128 bits for f16/bf16, 256 bits for i8/u8).
-
-For predicate `mask` and UB address `base`:
-
-$$ \mathrm{WRITE\_UB}_{64}(base, mask) $$
-
-The predicate register is read atomically. Only bits within the current element-type predicate width are transferred; bits outside are **implementation-defined**.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-psts %mask, %ub_ptr : !pto.mask, !pto.ptr<i64, ub>
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-pto.psts %mask, %ub_ptr : !pto.mask, !pto.ptr<i64, ub>
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.psts ins(%mask, %ub_ptr : !pto.mask, !pto.ptr<i64, ub>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PSTS(RegBuf<predicate_t>& src, const Ptr<ub_space_t, ub_t>& dst);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%mask` | `!pto.mask` | Predicate register to store |
-| `%ub_ptr` | `!pto.ptr<i64, ub>` | UB destination address (must be 64-bit aligned) |
-
-## Expected Outputs
-
-None. This form is defined by its side effect on UB memory.
-
-## Side Effects
-
-- Writes the predicate register value to the UB location.
-- UB memory at the target address is modified.
-
-## Constraints
-
-- **UB address space**: `%ub_ptr` MUST have address space `ub`. Pointers to other spaces are illegal.
-- **Alignment**: The effective address MUST be 64-bit aligned. Misaligned addresses are **illegal**.
-- **Predicate width**: The store transfers exactly 64 bits. The caller MUST ensure this matches the active element type context.
-- **Write atomicity**: The 64-bit predicate word is written atomically.
-
-## Exceptions
-
-- Illegal if `%ub_ptr` is not a UB-space pointer.
-- Illegal if the effective address is not 64-bit aligned.
-- Illegal if predicate width does not match the active element type context.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Contiguous store | Simulated | Supported | Supported |
-| 64-bit alignment requirement | Enforced | Enforced | Enforced |
-| Predicate width (f32 / f16,bf16 / i8) | N=64/128/256 | N=64/128/256 | N=64/128/256 |
-
-## Examples
-
-### Store predicate to UB
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void save_mask(RegBuf<predicate_t>& src, Ptr<ub_space_t, ub_t> dst) {
-    PSTS(src, dst);
-}
-```
-
-### SSA form
-
-```mlir
-// Generate comparison mask
-%mask = pto.vcmp %v0, %v1, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-
-// Store predicate to UB for later reuse
-pto.psts %mask, %ub_mask_slot0 : !pto.mask, !pto.ptr<i64, ub>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Load Store](../../predicate-load-store.md)
-- Previous op in family: [pto.pldi](./pldi.md)
-- Next op in family: [pto.pst](./pst.md)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psts_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psts_zh.md
deleted file mode 100644
index a9244553..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/psts_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.psts
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](psts.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pstu.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pstu.md
deleted file mode 100644
index cf7694a7..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pstu.md
+++ /dev/null
@@ -1,138 +0,0 @@
-<!-- Generated from `docs/isa/scalar/ops/predicate-load-store/pstu.md` -->
-
-# pto.pstu
-
-Standalone reference page for `pto.pstu`. This page belongs to the [Predicate Load Store](../../predicate-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Stream predicate register to UB with alignment state tracking. High-throughput variant that relaxes alignment requirements at the cost of weaker write atomicity guarantees.
-
-## Mechanism
-
-`pto.pstu` writes a predicate word from `!pto.mask` to a UB address while tracking and updating alignment state. Unlike `psts`, this operation does not require 64-bit alignment and may batch multiple predicate writes into a single DMA transaction.
-
-For alignment state `align_in`, predicate `mask`, and base address `base`:
-
-$$ align\_out = align\_in \oplus mask $$
-$$ base\_out = base + \mathrm{sizeof}(predicate) $$
-
-The `%align_out` state carries forward into the next `pstu` call, enabling streaming writes without per-op synchronization.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr<T, ub> -> !pto.align, !pto.ptr<T, ub>
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%align_out, %base_out = pto.pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr<T, ub> -> !pto.align, !pto.ptr<T, ub>
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.pstu ins(%align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr<T, ub>)
-       outs(%align_out, %base_out : !pto.align, !pto.ptr<T, ub>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST void PSTU(PredicateReg& dst,
-                    Ptr<ub_space_t, ub_t> ub_ptr,
-                    predicate_t align_in,
-                    predicate_t align_out);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%align_in` | `!pto.align` | Alignment state from previous `pstu` or `pld`-family operation |
-| `%mask` | `!pto.mask` | Predicate register to stream-store |
-| `%base_in` | `!pto.ptr<T, ub>` | UB base address (no alignment requirement) |
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%align_out` | `!pto.align` | Updated alignment state for next `pstu` call |
-| `%base_out` | `!pto.ptr<T, ub>` | Incremented base address (base + predicate width in bytes) |
-
-## Side Effects
-
-- Writes the predicate register value to UB memory at the target address.
-- Updates alignment state for use by subsequent `pstu` calls.
-- UB memory at the target address is modified; write atomicity per 64-bit word is **not guaranteed**.
-
-## Constraints
-
-- **Alignment state**: `%align_in` MUST be the alignment state from the previous `pstu` call, or from a `pld`-family operation. Using an uninitialized alignment state is **illegal**.
-- **Alignment state chaining**: Programs MUST pass `%align_out` from one `pstu` to the `%align_in` of the next. Breaking the chain without re-initializing the alignment state is **illegal**.
-- **Write atomicity**: Unlike `psts`, the 64-bit predicate word is NOT guaranteed to be atomically written. Programs that require exact predicate state restoration MUST use `psts`, not `pstu`.
-- **UB address space**: `%base_in` MUST have address space `ub`.
-
-## Exceptions
-
-- Illegal if `%align_in` is not initialized from a prior `pstu` or `pld` operation.
-- Illegal if alignment state chain is broken.
-- Illegal if `%base_in` is not a UB-space pointer.
-- `pstu` MUST NOT be used when exact predicate save/restore is required.
-
-## Target-Profile Restrictions
-
-| Aspect | CPU Sim | A2/A3 | A5 |
-|--------|:-------:|:------:|:--:|
-| Stream predicate store | Not supported | Supported | Supported |
-| Alignment state tracking | Not applicable | Supported | Supported |
-| Write atomicity guarantee | Not applicable | Not guaranteed | Not guaranteed |
-
-The CPU simulator does not implement `pstu`. Portable programs MUST use `psts` for exact predicate persistence or provide a CPU-sim fallback.
-
-## Examples
-
-### Streaming predicate writes
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void stream_masks(Ptr<ub_space_t, ub_t> dst_base,
-                  predicate_t* masks,
-                  int count) {
-    predicate_t align_state = 0;
-    for (int i = 0; i < count; ++i) {
-        PSTU(masks[i], dst_base, align_state, align_state);
-        dst_base = dst_base + (predicate_width_bytes);
-    }
-}
-```
-
-### SSA form — chaining stream stores
-
-```mlir
-// Initialize alignment state (e.g., from a dummy load or zero)
-%align0 = pto.plds %ub_dummy : !pto.ptr<i64, ub> -> !pto.mask
-
-// Stream store first predicate; align_out carries forward
-%align1, %base1 = pto.pstu %align0, %mask0, %base0 : !pto.align, !pto.mask, !pto.ptr<i64, ub> -> !pto.align, !pto.ptr<i64, ub>
-
-// Stream store second predicate using updated alignment state
-%align2, %base2 = pto.pstu %align1, %mask1, %base1 : !pto.align, !pto.mask, !pto.ptr<i64, ub> -> !pto.align, !pto.ptr<i64, ub>
-```
-
-> **Note**: For exact predicate save/restore across kernel boundaries, use `psts` instead. `pstu` is intended for high-throughput streaming scenarios where some loss of per-word atomicity is acceptable.
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate Load Store](../../predicate-load-store.md)
-- Previous op in family: [pto.psti](./psti.md)
-- Next op in family: (none — last in family)
-- Control-shell overview: [Control and configuration](../../control-and-configuration.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pstu_zh.md b/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pstu_zh.md
deleted file mode 100644
index 8c95f577..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/ops/predicate-load-store/pstu_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.pstu
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](pstu.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/scalar/pipeline-sync.md b/docs/mkdocs/src/docs/isa/scalar/pipeline-sync.md
deleted file mode 100644
index 8e190038..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/pipeline-sync.md
+++ /dev/null
@@ -1,52 +0,0 @@
-<!-- Generated from `docs/isa/scalar/pipeline-sync.md` -->
-
-# Pipeline Sync
-
-These `pto.*` forms establish explicit producer-consumer ordering across PTO execution stages. They belong to the scalar/control surface even when they coordinate vector-facing pipelines, because what they expose architecturally is dependency state rather than vector payload math.
-
-## Synchronization Hierarchy
-
-The four synchronization modes form a containment hierarchy:
-
-```
-Event-based synchronization  (set_flag / wait_flag)
-        ↑
-Buffer-token protocol  (get_buf / rls_buf)  — requires event-based under the hood
-        ↑
-Memory barrier  (mem_bar)  — may be used inside vector-visible scope
-        ↑
-Inter-core coordination  (set_cross_core / wait_flag_dev / set_intra_block / wait_intra_core)
-```
-
-- **Event-based** (`set_flag` / `wait_flag`): The foundational mode. Sets or waits on a named event signal between producer and consumer pipes. Used as the primitive for all higher-level modes.
-- **Buffer-token** (`get_buf` / `rls_buf`): A protocol built on top of event-based synchronization for double-buffered execution. `get_buf` acquires a buffer token and implicitly sets an event; `rls_buf` releases the token and implicitly sets a dependent event.
-- **Memory barrier** (`mem_bar`): Enforces visibility of memory operations within a vector-visible execution scope. Does not establish cross-stage ordering on its own.
-- **Inter-core** (`set_cross_core` / `wait_flag_dev` / `set_intra_block` / `wait_intra_core`): Coordinate between execution units or cores. These forms are profile-restricted and may be unavailable on some targets.
-
-Programs **MUST NOT** assume that a higher-level mode (e.g., buffer-token) replaces the need for event-based ordering; the protocol requires event-based synchronization underneath.
-
-## What This Family Covers
-
-- event-based synchronization between producer and consumer pipes
-- buffer-token protocols for double-buffered execution
-- explicit memory barriers inside vector-visible execution scope
-- target-profile inter-core coordination forms
-
-## Per-Op Pages
-
-- [pto.set_flag](./ops/pipeline-sync/set-flag.md)
-- [pto.wait_flag](./ops/pipeline-sync/wait-flag.md)
-- [pto.pipe_barrier](./ops/pipeline-sync/pipe-barrier.md)
-- [pto.get_buf](./ops/pipeline-sync/get-buf.md)
-- [pto.rls_buf](./ops/pipeline-sync/rls-buf.md)
-- [pto.mem_bar](./ops/pipeline-sync/mem-bar.md)
-- [pto.set_cross_core](./ops/pipeline-sync/set-cross-core.md)
-- [pto.wait_flag_dev](./ops/pipeline-sync/wait-flag-dev.md)
-- [pto.set_intra_block](./ops/pipeline-sync/set-intra-block.md)
-- [pto.wait_intra_core](./ops/pipeline-sync/wait-intra-core.md)
-
-## Related Material
-
-- [Control and configuration](./control-and-configuration.md)
-- [Vector Families: Pipeline Sync](../vector/pipeline-sync.md)
-- [Machine Model: Ordering And Synchronization](../machine-model/ordering-and-synchronization.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/pipeline-sync_zh.md b/docs/mkdocs/src/docs/isa/scalar/pipeline-sync_zh.md
deleted file mode 100644
index 3ced8d88..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/pipeline-sync_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Pipeline Sync
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](pipeline-sync.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/scalar/predicate-generation-and-algebra.md b/docs/mkdocs/src/docs/isa/scalar/predicate-generation-and-algebra.md
deleted file mode 100644
index 6386ade4..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/predicate-generation-and-algebra.md
+++ /dev/null
@@ -1,110 +0,0 @@
-<!-- Generated from `docs/isa/scalar/predicate-generation-and-algebra.md` -->
-
-# Predicate Generation And Algebra
-
-Predicate generation and algebra operations create, combine, pack, unpack, and interleave `!pto.mask` values on the scalar/control surface. The `!pto.mask` type is the lane-masking mechanism that `pto.v*` vector operations consume.
-
-## The `!pto.mask` Type
-
-`!pto.mask` is a predicate mask type whose width is tied to the active element type rather than being a fixed number of bits:
-
-| Element Type | Vector Width N | Predicate Width |
-|-------------|:-------------:|:--------------:|
-| f32 | 64 | 64 bits |
-| f16 / bf16 | 128 | 128 bits |
-| i8 / u8 | 256 | 256 bits |
-
-A predicate mask with bit value `1` at position `i` means lane `i` is **active**; bit value `0` means lane `i` is **inactive**. Vector operations execute on active lanes only; inactive lanes produce implementation-defined results.
-
-## Sub-category Overview
-
-| Sub-category | Operations | Description | Static / Dynamic |
-|--------------|-----------|-------------|-----------------|
-| Pattern-based construction | `pset_b8`, `pset_b16`, `pset_b32` | Build mask from named pattern | Static (compile-time pattern) |
-| Comparison generation (≥) | `pge_b8`, `pge_b16`, `pge_b32` | Generate mask: `i < scalar` | Dynamic (runtime scalar) |
-| Comparison generation (<) | `plt_b8`, `plt_b16`, `plt_b32` | Generate mask: `i < scalar`; also updates scalar (`scalar_out = scalar - N`) | Dynamic (runtime scalar) |
-| Predicate pack | `ppack` | Narrow: pack two N-bit masks into one 2N-bit mask | Static (partition token) |
-| Predicate unpack | `punpack` | Widen: extract half from a 2N-bit mask | Static (partition token) |
-| Boolean algebra | `pand`, `por`, `pxor`, `pnot` | AND / OR / XOR / NOT | Dynamic (runtime operands) |
-| Predicate select | `psel` | `mask0 ? mask1 : mask2` | Dynamic (runtime operands) |
-| Deinterleave | `pdintlv_b8` | Split one 2N-bit mask into two N-bit masks | Static |
-| Interleave | `pintlv_b16` | Combine two N-bit masks into one 2N-bit mask | Static |
-
-## Pattern Tokens
-
-`pset_*` operations accept pattern tokens that encode compile-time-known mask shapes:
-
-| Pattern | Predicate Width | Meaning |
-|---------|:--------------:|---------|
-| `PAT_ALL` | All N | All lanes active |
-| `PAT_ALLF` | All N | All lanes inactive |
-| `PAT_H` | N/2 | High half active (upper N/2 lanes) |
-| `PAT_Q` | N/4 | Upper quarter active |
-| `PAT_VL1` … `PAT_VL128` | N | First `k` lanes active, where `k` is the token's numeric suffix |
-| `PAT_M3` | N | Modular pattern: repeat every 3 lanes |
-| `PAT_M4` | N | Modular pattern: repeat every 4 lanes |
-
-## Partition Tokens
-
-`ppack` and `punpack` use partition tokens to specify which half of the predicate register is accessed:
-
-| Token | Meaning |
-|-------|---------|
-| `LOWER` | Lower N bits of the 2N-bit predicate register |
-| `HIGHER` | Upper N bits of the 2N-bit predicate register |
-
-## Shared Constraints
-
-All predicate generation and algebra operations MUST satisfy:
-
-1. **Operand type**: All predicate operands MUST be `!pto.mask`. Mixing predicate operands with scalar or vector register operands is **illegal**.
-2. **Predicate width consistency**: All operands in a single operation MUST share the same predicate width. Operations that mix N-bit and 2N-bit predicates MUST use explicit pack/unpack.
-3. **Pattern token validity**: Pattern tokens MUST be supported by the target profile. Using a pattern token outside its supported width context is **illegal**.
-4. **Scalar operand type**: For `pge_*` and `plt_*` operations, the scalar operand type MUST match the variant suffix (`_b8` → i8, `_b16` → i16, `_b32` → i32).
-5. **Side effect**: No predicate generation or algebra operation writes to UB or modifies architectural state beyond producing a predicate result.
-
-## Relationship Between pset, pge, and plt
-
-- `pset_*` → **static** mask, fully determined at compile time from the pattern token
-- `pge_*` → **dynamic** mask, depends on a runtime scalar value; predicate lane `i` is active iff `i < scalar`
-- `plt_*` → **dynamic** mask AND scalar update; predicate lane `i` is active iff `i < scalar`, and `scalar_out = scalar - N`
-
-`plt_*` operations are designed for software-pipelined remainder loops where the scalar counter is decremented by the vector length each iteration.
-
-## Per-Op Pages
-
-### Pattern-based Construction
-- [pto.pset_b8](./ops/predicate-generation-and-algebra/pset-b8.md)
-- [pto.pset_b16](./ops/predicate-generation-and-algebra/pset-b16.md)
-- [pto.pset_b32](./ops/predicate-generation-and-algebra/pset-b32.md)
-
-### Comparison Generation (Greater-or-Equal)
-- [pto.pge_b8](./ops/predicate-generation-and-algebra/pge-b8.md)
-- [pto.pge_b16](./ops/predicate-generation-and-algebra/pge-b16.md)
-- [pto.pge_b32](./ops/predicate-generation-and-algebra/pge-b32.md)
-
-### Comparison Generation (Less-Than)
-- [pto.plt_b8](./ops/predicate-generation-and-algebra/plt-b8.md)
-- [pto.plt_b16](./ops/predicate-generation-and-algebra/plt-b16.md)
-- [pto.plt_b32](./ops/predicate-generation-and-algebra/plt-b32.md)
-
-### Pack / Unpack
-- [pto.ppack](./ops/predicate-generation-and-algebra/ppack.md)
-- [pto.punpack](./ops/predicate-generation-and-algebra/punpack.md)
-
-### Boolean Algebra
-- [pto.pand](./ops/predicate-generation-and-algebra/pand.md)
-- [pto.por](./ops/predicate-generation-and-algebra/por.md)
-- [pto.pxor](./ops/predicate-generation-and-algebra/pxor.md)
-- [pto.pnot](./ops/predicate-generation-and-algebra/pnot.md)
-- [pto.psel](./ops/predicate-generation-and-algebra/psel.md)
-
-### Interleave / Deinterleave
-- [pto.pdintlv_b8](./ops/predicate-generation-and-algebra/pdintlv-b8.md)
-- [pto.pintlv_b16](./ops/predicate-generation-and-algebra/pintlv-b16.md)
-
-## Related Material
-
-- [Control and configuration](./control-and-configuration.md)
-- [Vector Families: Predicate And Materialization](../vector/predicate-and-materialization.md)
-- [Predicate Load Store](./predicate-load-store.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/predicate-generation-and-algebra_zh.md b/docs/mkdocs/src/docs/isa/scalar/predicate-generation-and-algebra_zh.md
deleted file mode 100644
index d4805af4..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/predicate-generation-and-algebra_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Predicate Generation And Algebra
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](predicate-generation-and-algebra.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/scalar/predicate-load-store.md b/docs/mkdocs/src/docs/isa/scalar/predicate-load-store.md
deleted file mode 100644
index 344dd4ae..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/predicate-load-store.md
+++ /dev/null
@@ -1,111 +0,0 @@
-<!-- Generated from `docs/isa/scalar/predicate-load-store.md` -->
-
-# Predicate Load Store
-
-The predicate load/store family moves predicate-register state (`!pto.mask`) between UB-visible storage and the architectural predicate surface. Predicates are the lane-masking mechanism that `pto.v*` vector operations consume.
-
-## Mechanism
-
-Predicate state lives on the scalar/control surface. `pld*/pst*` operations transfer predicate bits to or from UB memory locations, enabling predicates to persist across kernel boundaries or to be shared with scalar address calculations.
-
-### Data Flow
-
-```
-UB location (64-bit aligned) ──(plds/pld/pldi)──► Predicate Register File
-Predicate Register File ──(psts/pst/psti/pstu)──► UB location
-```
-
-### Predicate Width
-
-| Element Type | Vector Width N | Predicate Width |
-|-------------|:-------------:|:--------------:|
-| f32 | 64 | 64 bits |
-| f16 / bf16 | 128 | 128 bits (2 × 64-bit transfers) |
-| i8 / u8 | 256 | 256 bits (4 × 64-bit transfers) |
-
-A single predicate load/store operation covers the full predicate width for the element type in use. Partial predicate loads are **not supported**.
-
-### Alignment Requirements
-
-| Operation | Alignment Requirement | Consequence of Violation |
-|-----------|----------------------|--------------------------|
-| `plds` / `psts` | 64-bit (8 bytes) at UB address | Illegal if address not 8-byte aligned |
-| `pld` / `pst` (areg offset) | 64-bit; offset must be register-aligned | Illegal if address or offset violates alignment |
-| `pldi` / `psti` (immediate offset) | 64-bit; offset must be compile-time constant | Illegal if immediate violates alignment |
-| `pstu` (stream form) | None; tracks alignment state internally | Alignment state is implementation-defined on first use |
-
-## Distribution Modes
-
-Distribution modes (`dist` attribute) control how predicate bits are packed into UB storage. All load/store forms accept a `dist` attribute:
-
-| Mode | Description | Load Behavior | Store Behavior |
-|------|-------------|---------------|----------------|
-| `NORM` | Normal packing | Read 64-bit predicate word directly | Write 64-bit predicate word directly |
-| `PK` | Packed (store only) | Not applicable | Pack two 32-bit predicate segments into one 64-bit word |
-| `US` | Unsigned streaming | UB bits as-is | UB bits as-is |
-| `DS` | Signed streaming | UB bits as-is, sign-extend | UB bits as-is |
-
-## Shared Constraints
-
-All predicate load/store operations MUST satisfy:
-
-1. **UB address space**: The pointer operand MUST have type `!pto.ptr<T, ub>`. Predicates cannot be transferred directly to/from GM.
-2. **Alignment**: The effective UB address (base + offset) MUST be 64-bit aligned. The stream form (`pstu`) relaxes this but imposes its own ordering requirements.
-3. **Predicate width match**: The transfer covers the full predicate width for the active element type. Partial transfers are not permitted.
-4. **Event ordering**: When used in a producer-consumer chain with DMA, the program MUST use `set_flag`/`wait_flag` to order the predicate transfer before or after the dependent operation.
-5. **Single active predicate**: At any point in program order, at most one predicate register is architecturally active. Concurrent predicate transfers that would overwrite an in-flight predicate are **illegal**.
-
-## Stream Form (`pstu`)
-
-`pto.pstu` is the high-throughput stream variant of predicate store. It differs from `psts` in the following ways:
-
-| Aspect | `psts` | `pstu` |
-|--------|--------|--------|
-| Alignment | 64-bit required | None required |
-| Write atomicity | Single predicate word is atomic | Writes may be batched; individual 64-bit words are **not** guaranteed atomic |
-| Alignment state | Not updated | Updates `%align_out` with new alignment base |
-| Use case | Exact predicate save/restore | Streaming predicate writes with internal buffering |
-
-Programs that require exact predicate state restoration (e.g., saving and restoring a mask for later reuse) MUST use `psts`. Programs that stream predicates as part of a larger pipeline SHOULD use `pstu`.
-
-## Predicate Lifecycle
-
-A typical predicate load/store lifecycle:
-
-```mlir
-// Kernel entry: load saved predicate
-%mask = pto.plds %ub_saved : !pto.ptr<i64, ub> -> !pto.mask
-
-// Use predicate for vector computation
-%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// At kernel exit: save predicate for next kernel
-pto.psts %mask, %ub_saved : !pto.mask, !pto.ptr<i64, ub>
-```
-
-## Target-Profile Restrictions
-
-| Feature | CPU Simulator | A2/A3 | A5 |
-|---------|:------------:|:------:|:--:|
-| `plds` / `psts` | Simulated | Supported | Supported |
-| `pld` / `pst` (areg) | Simulated | Supported | Supported |
-| `pldi` / `psti` (immediate) | Simulated | Supported | Supported |
-| `pstu` stream form | Not supported | Supported | Supported |
-| `PK` distribution mode | Not supported | Supported | Supported |
-| Alignment relaxation (`pstu`) | Not applicable | Supported | Supported |
-
-## Per-Op Pages
-
-- [pto.plds](./ops/predicate-load-store/plds.md) — Contiguous predicate load
-- [pto.pld](./ops/predicate-load-store/pld.md) — Predicate load with areg offset
-- [pto.pldi](./ops/predicate-load-store/pldi.md) — Predicate load with immediate offset
-- [pto.psts](./ops/predicate-load-store/psts.md) — Contiguous predicate store
-- [pto.pst](./ops/predicate-load-store/pst.md) — Predicate store with areg offset
-- [pto.psti](./ops/predicate-load-store/psti.md) — Predicate store with immediate offset
-- [pto.pstu](./ops/predicate-load-store/pstu.md) — Predicate unaligned stream store
-
-## Related Material
-
-- [Control and configuration](./control-and-configuration.md)
-- [Vector Families: Predicate And Materialization](../vector/predicate-and-materialization.md)
-- [Predicate Generation And Algebra](./predicate-generation-and-algebra.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/predicate-load-store_zh.md b/docs/mkdocs/src/docs/isa/scalar/predicate-load-store_zh.md
deleted file mode 100644
index ff206213..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/predicate-load-store_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Predicate Load Store
-
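-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](predicate-load-store.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA is already in place, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.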
diff --git a/docs/mkdocs/src/docs/isa/scalar/shared-arith.md b/docs/mkdocs/src/docs/isa/scalar/shared-arith.md
deleted file mode 100644
index 4db581cc..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/shared-arith.md
+++ /dev/null
@@ -1,84 +0,0 @@
-<!-- Generated from `docs/isa/scalar/shared-arith.md` -->
-
-# Scalar And Control Families: Shared Scalar Arithmetic
-
-PTO source programs use the shared MLIR `arith` surface for scalar math around tile and vector regions. These ops are part of the documented PTO source surface, but they are not PTO mnemonics themselves.
-
-## Summary
-
-Shared scalar arithmetic provides constants, scalar math, comparisons, casts, and selects that feed PTO payload regions. It exists so PTO does not need to invent a separate scalar arithmetic ISA for bookkeeping that MLIR already models well.
-
-## Mechanism
-
-`arith` values stay in ordinary scalar SSA form. They are used to:
-
-- materialize constants and loop bounds
-- compute offsets, dynamic shapes, and tail counts
-- build predicates for `scf.if` or `scf.while`
-- adapt scalar widths and types around PTO boundaries
-
-When the program needs tile or vector payload math, it must switch back to the PTO instruction surfaces. `arith` is the scalar shell, not a substitute for `pto.t*` or `pto.v*`.
-
-## Inputs
-
-Shared scalar arithmetic consumes scalar values of these broad kinds:
-
-- `index`
-- integer values
-- floating-point values
-- boolean-like predicates produced by comparison operations
-
-## Expected Outputs
-
-It produces scalar SSA values that are later consumed by:
-
-- loop bounds and control decisions
-- tile-valid-region calculations
-- pointer or offset calculations
-- scalar operands to PTO instructions
-
-## Side Effects
-
-`arith` operations are value-producing only. They do not allocate buffers, trigger DMA, change vector masks, or establish synchronization by themselves.
-
-## Constraints
-
-- Shared scalar arithmetic **MUST** remain scalar. It does not define vector-register or tile-payload behavior.
-- PTO pages **MUST** document `arith` as part of the supported source surface when a kernel author needs scalar setup around PTO regions.
-- Type conversions or comparisons that affect later PTO legality **MUST** be stated explicitly rather than implied.
-
-## Exceptions
-
-The following are **ILLEGAL**:
-
-- using `arith` to stand in for payload vector math
-- leaving signedness, width change, or `index` conversion behavior implicit at a PTO boundary
-- assuming backend-specific scalar widths beyond what the program spells explicitly
-
-## Target-Profile Restrictions
-
-The `arith` contract is largely target-neutral. Backend restrictions appear only when an `arith` result is later consumed by a target-restricted PTO instruction or by a target-specific lowering path.
-
-## Examples
-
-### Scalar Setup Around A PTO Region
-
-```mlir
-%c0 = arith.constant 0 : index
-%c64 = arith.constant 64 : index
-%tile_offset = arith.muli %tile_idx, %c64 : index
-%is_tail = arith.cmpi slt, %remaining, %c64 : index
-```
-
-### Branch Predicate For Structured Control
-
-```mlir
-%needs_tail = arith.cmpi slt, %valid_cols, %tile_cols : index
-%active_cols = arith.select %needs_tail, %valid_cols, %tile_cols : index
-```
-
-## Related Ops And Family Links
-
-- [Shared Structured Control Flow](./shared-scf.md)
-- [Scalar And Control Families: Control And Configuration](./control-and-configuration.md)
-- [Programming Model: Tiles And Valid Regions](../programming-model/tiles-and-valid-regions.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/shared-arith_zh.md b/docs/mkdocs/src/docs/isa/scalar/shared-arith_zh.md
deleted file mode 100644
index f5db94e7..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/shared-arith_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Scalar And Control Families: Shared Scalar Arithmetic
-
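-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](shared-arith.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA is already in place, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.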
diff --git a/docs/mkdocs/src/docs/isa/scalar/shared-scf.md b/docs/mkdocs/src/docs/isa/scalar/shared-scf.md
deleted file mode 100644
index 7619bf6e..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/shared-scf.md
+++ /dev/null
@@ -1,90 +0,0 @@
-<!-- Generated from `docs/isa/scalar/shared-scf.md` -->
-
-# Scalar And Control Families: Shared Structured Control Flow
-
-PTO source programs use shared MLIR `scf` operations to express loops, branches, and loop-carried state around PTO regions. These are part of the documented source surface, but they are not PTO mnemonic families.
-
-## Summary
-
-Shared structured control flow gives PTO a control shell that stays analyzable and explicit. It avoids inventing custom PTO branch syntax for logic that is already represented clearly by `scf.for`, `scf.if`, `scf.while`, `scf.condition`, and `scf.yield`.
-
-## Mechanism
-
-`scf` surrounds PTO regions rather than replacing them. It is used to:
-
-- express counted loops around repeated tile or vector work
-- carry scalar or tile state across iterations
-- model structured conditional execution
-- keep control flow visible to analyses and lowerings
-
-This matters especially for the vector surface, where `__VEC_SCOPE__` is modeled using structured control rather than an opaque launch node.
-
-## Inputs
-
-Shared structured control flow consumes:
-
-- scalar predicates
-- loop bounds and step values
-- region-carried SSA state
-- yielded values from nested branches or loops
-
-## Expected Outputs
-
-It produces:
-
-- well-structured control regions
-- explicit loop-carried values
-- branch-selected scalar or tile state
-
-## Side Effects
-
-`scf` itself does not create DMA, synchronization, or payload effects. Those effects come from the PTO instructions inside the structured regions.
-
-## Constraints
-
-- PTO control flow **SHOULD** stay in structured `scf` form unless a more specific architecture-visible mechanism is required.
-- Region-carried values and branch results **MUST** be explicit through `scf.yield`.
-- Predicate construction for `scf` control **SHOULD** come from the shared scalar surface, not from undocumented control side channels.
-
-## Exceptions
-
-The following are **ILLEGAL**:
-
-- pretending `scf` is a PTO mnemonic family
-- hiding loop-carried state that later affects PTO legality
-- collapsing structured control into vague prose instead of documenting the carried values and branch conditions
-
-## Target-Profile Restrictions
-
-The `scf` surface is largely target-neutral. Restrictions appear when a region contains target-profile-specific PTO instructions or when a backend imposes extra structure on a vector-execution scope.
-
-## Examples
-
-### Counted Loop Around Vector Work
-
-```mlir
-scf.for %i = %c0 to %tile_count step %c1 {
-  %offset = arith.muli %i, %tile_stride : index
-  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-  %v = pto.vlds %ub[%offset] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-  pto.vsts %abs, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-}
-```
-
-### Structured Conditional Around Tile Update
-
-```mlir
-%need_tail = arith.cmpi slt, %valid_cols, %tile_cols : index
-scf.if %need_tail {
-  pto.tsubs ins(%tile, %bias : !pto.tile_buf<...>, f32) outs(%tile : !pto.tile_buf<...>)
-} else {
-  pto.tadds ins(%tile, %bias : !pto.tile_buf<...>, f32) outs(%tile : !pto.tile_buf<...>)
-}
-```
-
-## Related Ops And Family Links
-
-- [Shared Scalar Arithmetic](./shared-arith.md)
-- [Scalar And Control Families: Control And Configuration](./control-and-configuration.md)
-- [Machine Model: Ordering And Synchronization](../machine-model/ordering-and-synchronization.md)
diff --git a/docs/mkdocs/src/docs/isa/scalar/shared-scf_zh.md b/docs/mkdocs/src/docs/isa/scalar/shared-scf_zh.md
deleted file mode 100644
index baddb5ee..00000000
--- a/docs/mkdocs/src/docs/isa/scalar/shared-scf_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Scalar And Control Families: Shared Structured Control Flow
-
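-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](shared-scf.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA is already in place, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.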
diff --git a/docs/mkdocs/src/docs/isa/state-and-types/README_zh.md b/docs/mkdocs/src/docs/isa/state-and-types/README_zh.md
deleted file mode 100644
index ce7b8f3e..00000000
--- a/docs/mkdocs/src/docs/isa/state-and-types/README_zh.md
+++ /dev/null
@@ -1,21 +0,0 @@
-<!-- Generated from `docs/isa/state-and-types/README_zh.md` -->
-
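-# State, Types, and Location
-
-This chapter describes PTO's type system and location intent: data types, Tile roles, location intent, and the legality constraints on operations.
-
-## In This Chapter
-
-- [Type System](state-and-types/type-system.md): complete data type table (FP8/F16/BF16/F32 plus integer types), vector width table, NaN/Inf behavior, type conversion rules
-- [Location Intent and Legality](state-and-types/location-intent-and-legality.md): Location Intent categories (Vec / Mat / Acc / Left / Right / Scalar) and the four-stage legality check flow (Type Check → Shape Check → Layout Check → Target Profile Check)
-
-## Reading Guide
-
-Read in the following order:
-
-1. Start with [Type System](state-and-types/type-system.md) to understand the element types and vector widths that PTO supports.
-2. Then read [Location Intent and Legality](state-and-types/location-intent-and-legality.md) to understand how Tile Type, Layout, and Target Profile affect operation legality.
-
-## Chapter Placement
-
-This is Chapter 5 of the manual. Understand the type system before moving on to the instruction-set chapters, because the operands of every instruction carry strict type constraints. Location Intent and the legality check flow are key to understanding the restrictions on Tile Type combinations (Left/Right/Acc/Mat/Vec).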
diff --git a/docs/mkdocs/src/docs/isa/state-and-types/data-format.md b/docs/mkdocs/src/docs/isa/state-and-types/data-format.md
deleted file mode 100644
index 28c16b7a..00000000
--- a/docs/mkdocs/src/docs/isa/state-and-types/data-format.md
+++ /dev/null
@@ -1,176 +0,0 @@
-<!-- Generated from `docs/isa/state-and-types/data-format.md` -->
-
-# Data Format Reference
-
-This page describes the **physical data format** — how tiles, vectors, and scalars are represented in memory and in hardware registers. It covers memory spaces, element packing, address alignment, VLane architecture, and the relationship between the PTO logical view and the underlying storage.
-
-## Memory Spaces
-
-PTO distinguishes three memory spaces, each with different access semantics and bandwidth characteristics:
-
-| Memory Space | Location | Access Unit | Bandwidth | Access Pattern |
-|-------------|----------|-------------|-----------|----------------|
-| **GM** (Global Memory) | Off-chip device DRAM | Byte-granular | Low | Random access via UB DMA |
-| **UB** (Unified Buffer) | On-chip SRAM | 32-byte block | High | Bulk DMA transfer; no direct scalar access |
-| **Tile Register File** (TRF) | On-chip tile buffer | Element-granular | Highest | Direct compute access; not directly addressable by scalar code |
-
-Data movement between GM and UB is performed by the **DMA engine** (MTE1/MTE2/MTE3 pipelines). Data movement between UB and TRF is performed by **load/store operations** (`TLOAD`, `TSTORE`, `VLDS`, `VSTS`). Data inside the TRF is accessed by tile/vector compute pipelines directly without going through the UB.
-
-## Tile Buffer Format
-
-A tile occupies a contiguous region in either the TRF or UB. Its logical shape `(Rows, Cols)` is independent of its physical storage format.
-
-### In-Memory Format (UB)
-
-In the UB, tiles are stored in their `BLayout` order — either `RowMajor` or `ColMajor`. Each element occupies `sizeof(DType)` bytes.
-
-For `BLayout = RowMajor`, shape `(R, C)`:
-
-$$ \text{addr}(r, c) = (r \times C + c) \times \mathrm{sizeof(DType)} $$
-
-For `BLayout = ColMajor`, shape `(R, C)`:
-
-$$ \text{addr}(r, c) = (c \times R + r) \times \mathrm{sizeof(DType)} $$
-
-### In-Register Format (TRF)
-
-The TRF (Tile Register File) holds tiles in their native `BLayout`. The TRF is not byte-addressable — tile data is moved in and out via explicit `TLOAD`/`TSTORE` operations. Compute pipelines (Vector, Matrix) access tile data directly from the TRF without going through the UB.
-
-### Address Alignment
-
-| Access Type | Required Alignment |
-|-------------|-------------------|
-| GM read/write | Element-size aligned (2 bytes for f16/i16, 4 bytes for f32) |
-| UB DMA transfer | 32-byte block aligned (DMA engine unit) |
-| TRF load/store | Element-size aligned |
-
-The DMA engine operates on 32-byte blocks (`BLOCK_BYTE_SIZE = 32`). Misaligned GM addresses result in implementation-defined behavior.
-
-## Element Type Encoding
-
-### Standard Types
-
-| Type | C++ Type | SSA Name | Size (bytes) | Register Width |
-|------|----------|----------|:------------:|:-------------:|
-| IEEE FP16 | `half` | `f16` | 2 | 128 lanes |
-| Brain FP16 | `bfloat16_t` | `bf16` | 2 | 128 lanes |
-| IEEE FP32 | `float` | `f32` | 4 | 64 lanes |
-| Signed int8 | `int8_t` | `i8` | 1 | 256 lanes |
-| Unsigned int8 | `uint8_t` | `u8` | 1 | 256 lanes |
-| Signed int16 | `int16_t` | `i16` | 2 | 128 lanes |
-| Unsigned int16 | `uint16_t` | `u16` | 2 | 128 lanes |
-| Signed int32 | `int32_t` | `i32` | 4 | 64 lanes |
-| Unsigned int32 | `uint32_t` | `u32` | 4 | 64 lanes |
-
-### A5-Only Types
-
-| Type | C++ Type | SSA Name | Size (bytes) | Notes |
-|------|----------|----------|:------------:|-------|
-| FP8 E4M3 | `float8_e4m3_t` | `f8e4m3` | 1 | 256 lanes |
-| FP8 E5M2 | `float8_e5m2_t` | `f8e5m2` | 1 | 256 lanes |
-| HI Float8 | `hifloat8_t` | `hifloat8` | 1 | 256 lanes |
-| Float4 E1M2x2 | `float4_e1m2x2_t` | `float4_e1m2x2` | 1 | 256 lanes (packed 2×2) |
-| Float4 E2M1x2 | `float4_e2m1x2_t` | `float4_e2m1x2` | 1 | 256 lanes (packed 2×2) |
-
-## Vector Register Format (VLane Architecture)
-
-On A5 (Ascend 9xx-class), the vector register is organized as **8 VLanes** of 32 bytes each. A VLane is the atomic unit for group reduction operations. This organization is architecturally visible in PTO.
-
-```
-vreg (256 bytes total):
-┌─────────┬─────────┬─────────┬─────┬─────────┬─────────┐
-│ VLane 0 │ VLane 1 │ VLane 2 │ ... │ VLane 6 │ VLane 7 │
-│   32B   │   32B   │   32B   │     │   32B   │   32B   │
-└─────────┴─────────┴─────────┴─────┴─────────┴─────────┘
-```
-
-Vector registers hold `N` elements of type `DType` packed contiguously with no padding. The register width is always 256 bytes (2048 bits):
-
-| Element Type | Lane Count N | Bytes/Lane | Total |
-|-------------|:-----------:|:----------:|:-----:|
-| `f32` | 64 | 4 | 256 B |
-| `f16` / `bf16` / `i16` / `u16` | 128 | 2 | 256 B |
-| `i8` / `u8` / FP8 / HI-FP8 | 256 | 1 | 256 B |
-| `float4_*` (packed) | 256 (effective) | 1 | 256 B |
-
-### Group Reduction and VLanes
-
-Group reduction operations (`vcgadd`, `vcgmax`, `vcgmin`) reduce within each VLane independently. The reduction produces one result per VLane (one value per 32-byte lane), which is then broadcast or stored:
-
-```c
-// Per-VLane group reduction: each VLane independently reduces its K elements
-int K = N / 8;  // elements per VLane (e.g., 8 for f32, 16 for f16)
-for (int g = 0; g < 8; g++) {
-    T sum = 0;
-    for (int i = 0; i < K; i++)
-        sum += src[g*K + i];
-    dst[g*K] = sum;           // write result to first position of each VLane
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;    // zero-fill remaining positions
-}
-```
-
-This is architecturally visible: the result is not a single scalar but one value per VLane.
-
-## Pad Value Encoding
-
-The `Pad` parameter in `Tile<DType, ..., Pad>` specifies the value of out-of-valid-region elements. Declared in `include/pto/common/constants.hpp`.
-
-### Standard Pad Values
-
-| Pad Value | Meaning | `float` Encoding | `half`/`bf16` Encoding | `i8`/`u8` Encoding |
-|-----------|---------|-------------------|-------------------------|---------------------|
-| `Zero` | Initialize to zero | `0x00000000` | `0x0000` | `0x00` |
-| `Null` | Undefined; must not be read | `0x00000000` | `0x0000` | `0x00` |
-| `Min` | Fill with type minimum | `0xff800000` (−Inf) | `0xfc00` | `0xff` |
-| `Max` | Fill with type maximum | `0x7f800000` (+Inf) | `0x7c00` | `0x7f` |
-
-### Custom Pad Values (A5)
-
-The `PadValueCustom(value)` helper allows compile-time-specified float patterns as pad values. This is useful for operations that need a specific fill value (e.g., `-1.0f` for softmax):
-
-```cpp
-// Custom pad value: all out-of-valid-region elements become -1.0f
-using TilePadNeg1 = Tile<TileType::Vec, float, 16, 16, RowMajor, NoneBox, None, PadValueCustom(-1.0f)>;
-```
-
-Custom pad values encode the float bit pattern in the upper bits of the 64-bit `PadValue` enum. They are processed by `PadValueMap` and applied via `GetPadValue()` at load time.
-
-## Fractal Layout Encoding
-
-The `TileLayoutCustom` enum in `include/pto/common/constants.hpp` encodes the concrete layout used at runtime:
-
-| `TileLayoutCustom` | BLayout | SLayout | Fractal | Block Size | Typical Use |
-|--------------------|---------|---------|---------|:---------:|-------------|
-| `ND` | RowMajor | NoneBox | — | — | Standard tile; most ops |
-| `DN` | ColMajor | NoneBox | — | — | Fortran-order tile |
-| `NZ` | ColMajor | RowMajor | NZ | 512 B | LHS matmul on A5 |
-| `ZN` | RowMajor | ColMajor | ZN | 512 B | Symmetric NZ variant |
-| `ZZ` | RowMajor | RowMajor | ZZ | 512 B | CUBE-specific pattern |
-
-The constants `BLOCK_BYTE_SIZE = 32`, `FRACTAL_NZ_ROW = 16`, and `CUBE_BLOCK_SIZE = 512` give the fractal block dimensions used in address generation.
-
-## Constants Reference
-
-| Constant | Value | Units | Use |
-|----------|-------|-------|-----|
-| `BLOCK_BYTE_SIZE` | 32 | bytes | DMA block transfer unit |
-| `FIXP_BURST_UNIT_LEN` | 64 | half-words | DMA burst length |
-| `FRACTAL_NZ_ROW` | 16 | elements | Fractal row dimension for NZ/ZN |
-| `CUBE_BLOCK_SIZE` | 512 | bytes | CUBE fractal block |
-| `C0_SIZE_BYTE` | 32 | bytes | Cube C0 dimension (in bytes) |
-| `MX_COL_LEN` | 2 | elements | MX matmul column block |
-| `MX_ROW_LEN` | 16 | elements | MX matmul row block |
-| `MX_BLOCK_SIZE` | 32 | elements | MX matmul block |
-| `TMP_UB_SIZE` | 8 × 1024 | bytes | Temporary UB buffer size |
-| `TMP_UB_OFFSET` | 184 × 1024 | bytes | Temporary UB offset |
-| `MASK_LEN` | 64 | bits | Predicate mask width |
-| `BLOCK_LEN` | 16 | elements | Standard block length |
-| `VLane_COUNT` | 8 | lanes | VLanes per vector register (A5) |
-
-## See Also
-
-- [Type System](./type-system.md) — Element type inventory, NaN/Inf rules, conversion rules
-- [Layout Reference](./layout.md) — BLayout, SLayout, Fractal, TileType–Layout compatibility
-- [Tiles and Valid Regions](../programming-model/tiles-and-valid-regions.md) — Valid-region semantics and programming model
-- [Memory Model](../memory-model/consistency-baseline.md) — GM, UB, TRF hierarchy and ordering guarantees
diff --git a/docs/mkdocs/src/docs/isa/state-and-types/layout.md b/docs/mkdocs/src/docs/isa/state-and-types/layout.md
deleted file mode 100644
index b5a07c53..00000000
--- a/docs/mkdocs/src/docs/isa/state-and-types/layout.md
+++ /dev/null
@@ -1,197 +0,0 @@
-<!-- Generated from `docs/isa/state-and-types/layout.md` -->
-
-# Layout Reference
-
-This page is the canonical reference for **BLayout**, **SLayout**, **Fractal Layout**, **GlobalTensor Layout**, and **Compact Mode** in PTO. For the programming model context and valid-region semantics, see [Tiles and Valid Regions](../programming-model/tiles-and-valid-regions.md).
-
-## Two Layout Dimensions
-
-PTO layouts operate at two levels:
-
-1. **GlobalTensor Layout** — how the `GlobalTensor` (GM view) is laid out in off-chip memory. This is the `Layout::ND` / `Layout::DN` / `Layout::NZ` template parameter on `GlobalTensor`.
-2. **Tile Layout** — how the tile buffer (UB or TRF) is organized internally. This is the combination of `BLayout`, `SLayout`, and `Fractal` on a `Tile<...>`.
-
-These two levels must be compatible when a `GlobalTensor` is loaded into a tile via `TLOAD`, and when a tile is stored to a `GlobalTensor` via `TSTORE`.
-
-## GlobalTensor Layout (GM View)
-
-The `GlobalTensor` is a view over off-chip GM. Its layout parameter determines the stride pattern in GM:
-
-| Layout | Stride Pattern | Description | Use Case |
-|--------|---------------|-------------|----------|
-| `Layout::ND` | Row-major, C-contiguous | `stride[R] = Cols, stride[W] = Cols*Width, ...` | Standard row-major tensors |
-| `Layout::DN` | Column-major, Fortran-contiguous | `stride[C] = Rows, stride[R] = Rows*Col, ...` | Column-major tensors |
-| `Layout::NZ` | Row-major fractal (Z-order) | GM data is stored in Z-order for fractal tile compatibility | A5 matmul LHS with NZ layout |
-
-The GM layout must be compatible with the tile's internal layout during `TLOAD`/`TSTORE`. The compatibility rules are documented on the [TLOAD](../tile/ops/memory-and-data-movement/tload.md) and [TSTORE](../tile/ops/memory-and-data-movement/tstore.md) instruction pages.
-
-## Block Layout (BLayout)
-
-`BLayout` describes the in-memory stride between adjacent elements along the row and column axes within a tile buffer. It is the tile's **storage order** — the order in which element data is laid out in the Tile Register File (TRF) or UB buffer.
-
-### Values
-
-| BLayout | Row-Direction Stride | Col-Direction Stride | Mental Model |
-|---------|---------------------|---------------------|--------------|
-| `RowMajor` | `Cols` (elements per row) | `1` (contiguous) | C/C++/PyTorch convention |
-| `ColMajor` | `1` (strided) | `Rows` (elements per column) | Fortran/Julia convention |
-
-For a `RowMajor` tile of shape `(R, C)`, element `(r, c)` is at byte offset:
-
-$$ \mathrm{offset}(r, c) = (r \times C + c) \times \mathrm{sizeof(DType)} $$
-
-For a `ColMajor` tile of shape `(R, C)`:
-
-$$ \mathrm{offset}(r, c) = (c \times R + r) \times \mathrm{sizeof(DType)} $$
-
-### Usage
-
-`RowMajor` is the default for most operations. `ColMajor` is accepted by a subset of operations on A5; consult per-op Target-Profile Restrictions.
-
-## Stripe Layout (SLayout)
-
-`SLayout` describes whether the tile's sub-elements use a **uniform rectangular layout** or a **fractal/strided layout**. It controls whether the tile is addressed as a flat 2D rectangle or with a strided access pattern.
-
-### Values
-
-| SLayout | Description | Requires |
-|---------|-------------|----------|
-| `NoneBox` | Uniform rectangular tile: all `(Rows, Cols)` elements are equally spaced | Default for most operations |
-| `RowMajor` | Strided row layout: addresses elements with a row-major stride pattern | `Fractal ∈ {NZ, FR}` |
-| `ColMajor` | Strided column layout: addresses elements with a column-major stride pattern | `Fractal ∈ {ZN, RN}` |
-
-When `SLayout = NoneBox`, the tile behaves as a standard rectangular buffer. When `SLayout ∈ {RowMajor, ColMajor}`, the `Fractal` parameter further specifies the stride formula.
-
-## Fractal Layout
-
-When `SLayout ≠ NoneBox`, the `Fractal` parameter encodes the precise striding pattern for matrix multiplication or other strided-access patterns. Fractal layouts are designed to match the CUBE engine's internal dataflow for high-performance matmul.
-
-### Fractal Address Formula
-
-Fractal layouts use a **Z-order (Morton code)** stride pattern. Elements are not stored in simple row-major or column-major order; instead, they follow a space-filling curve that improves data reuse in the CUBE engine.
-
-For `Fractal = NZ` with `SLayout = RowMajor`:
-
-$$ \mathrm{offset}(r, c) = \bigl(\mathrm{zigzag\_index}(r, c)\bigr) \times \mathrm{sizeof(DType)} $$
-
-The zigzag index maps 2D coordinates to a 1D Z-order sequence. The mapping is hardware-defined; PTO authors should not compute fractal offsets manually — rely on the frontend to handle address generation via `TASSIGN`.
-
-### Fractal Layout Values
-
-| Fractal | SLayout | BLayout | Stride Pattern | Typical Use |
-|---------|---------|---------|----------------|-------------|
-| `None` | `NoneBox` | Any | Standard rectangular | Elementwise ops, general compute |
-| `NZ` | `RowMajor` | `ColMajor` | Z-order row-major fractal | LHS matmul operand on A5 |
-| `ZN` | `ColMajor` | `RowMajor` | Z-order col-major fractal | Symmetric variant of `NZ` |
-| `FR` | `RowMajor` | `ColMajor` | Row-fractal (fixed-stride variant) | CUBE-specific pattern |
-| `RN` | `ColMajor` | `RowMajor` | Row-N-fractal | CUBE-specific pattern |
-
-> **Note for A5/A2/A3:** The exact fractal block dimensions are `FRACTAL_NZ_ROW = 16` (elements per fractal row) and `CUBE_BLOCK_SIZE = 512` (bytes per fractal block). These affect address generation in hardware but are not part of the ISA contract for authors.
-
-## Compact Mode
-
-Compact mode (also called **tail/part mode**) handles edge tiles where the physical tile dimensions are larger than the valid region. When a matrix dimension is not evenly divisible by the tile size, padding is added to fill the physical tile, and compact mode determines how that padding is managed.
-
-### Why Compact Mode Matters
-
-In matmul, when `M % tile_M ≠ 0` or `N % tile_N ≠ 0`, the last tile in each row/column has fewer valid elements. Compact mode controls:
-
-1. Whether padding elements are included in the matmul computation
-2. Whether the fractal layout addresses only valid elements or skips over padding
-3. How `TEXTRACT` and `TINSERT` handle partial tiles
-
-### Compact Mode in TEXTRACT
-
-`TEXTRACT` supports four compact modes that control how the extracted data is arranged:
-
-| Mode | Description | Behavior |
-|------|-------------|----------|
-| `ND2NZ` | Normal → NZ fractal | Extract from normal row-major tile into NZ fractal tile. Padding rows/cols are skipped; valid data is packed contiguously in Z-order. |
-| `NZ2ND` | NZ fractal → Normal | Extract from NZ fractal tile back to normal row-major tile. Valid data is unpacked from Z-order to row-major. |
-| `ND` | Normal → Normal | Straight copy, no layout transformation. |
-| `ND2NZ2` | Normal → NZ (row-major group) | Like `ND2NZ` but groups rows in blocks of 2 for specific CUBE access patterns. |
-
-The A2/A3 compact test (`textract_compact`) validates all four modes with edge-case tile dimensions (where `baseM`, `baseN`, or `baseK` are non-zero).
-
-### Compact Mode in TMATMUL_MX
-
-For MX-format matmul (`TMATMUL_MX`), the Left tile (`TileType::Left`) with `NZ` fractal layout uses compact addressing. When the LHS matrix has fewer rows than the tile's physical height, the fractal address generator only produces addresses for valid rows. Padding rows are excluded from both address computation and CUBE processing.
-
-### Compact Addressing in A5 TMov
-
-The A5 `TMovmx` operation with `ZZ` layout (`NZZN` / `ZZNN` variants) uses compact addressing when tile dimensions exceed the valid region. The test `tmov_mx` with `base_m != 0` validates this behavior.
-
-### When to Use Compact Mode
-
-- Use `ND2NZ` / `NZ2ND` when transferring data between normal and fractal layouts across non-multiple tile boundaries
-- Use `ND` for same-layout transfers (no transformation overhead)
-- Use `ND2NZ2` when the CUBE requires row-grouped data alignment
-- Compact mode is **automatic** in matmul when valid region < physical tile size; the fractal address generator handles it transparently
-
-## TileType–Layout Compatibility Matrix
-
-The combination of `TileType`, `BLayout`, `SLayout`, and `Fractal` is **jointly constrained**. Not every combination of these four parameters is legal.
-
-| TileType | Supported BLayout | Supported SLayout | Supported Fractal | Typical Ops |
-|----------|------------------|-------------------|-------------------|-------------|
-| `Vec` | `RowMajor`, `ColMajor` | `NoneBox` | `None` | `TADD`, `TMUL`, `TCVT`, `TLOAD/TSTORE` |
-| `Mat` | `RowMajor`, `ColMajor` | `NoneBox` | `None` | `TGEMV`, `TGEMV_ACC`, `TGEMV_BIAS` |
-| `Acc` | `RowMajor`, `ColMajor` | `NoneBox` | `None` | `TMATMUL`, `TMATMUL_ACC` output |
-| `Left` | `RowMajor` | `RowMajor` | `NZ` | LHS of `TMATMUL_MX` |
-| `Right` | `RowMajor` | `NoneBox` | `NN` (implicit) | RHS of `TMATMUL_MX` |
-| `Scalar` | `RowMajor` | `NoneBox` | `None` | Single-element scalar tiles |
-
-Using a combination not listed in this table is an **illegal PTO program**. The verifier or backend will reject it.
-
-## Padding
-
-Elements outside the valid region may be initialized with a padding value. The `Pad` parameter controls this:
-
-| Pad Value | Meaning |
-|-----------|---------|
-| `Zero` | Out-of-valid-region elements are initialized to zero |
-| `Null` | Out-of-valid-region elements are undefined; must not be read |
-| `Invalid` | Elements are marked invalid; reading is undefined behavior |
-
-Custom pad values on A5: `PadValueCustom(value)` allows compile-time-specified float patterns as pad values (e.g., `-1.0f` for softmax masking).
-
-## Layout Conversion Patterns
-
-### Normal → Fractal (TEXTRACT with ND2NZ)
-
-```cpp
-// Extract from normal Vec tile to Left tile with NZ fractal layout
-using SrcTile = Tile<TileType::Vec, int8_t, 16, 16, RowMajor, NoneBox, None, Null>;
-using DstTile = Tile<TileType::Left, int8_t, 16, 16, RowMajor, RowMajor, NZ, Null>;
-TEXTRACT(dstLeft, srcVec, ExtractMode::ND2NZ);
-```
-
-### Fractal → Normal (TINSERT with NZ2ND)
-
-```cpp
-// Insert from Left tile with NZ fractal back to normal Vec tile
-using SrcTile = Tile<TileType::Left, int8_t, 16, 16, RowMajor, RowMajor, NZ, Null>;
-using DstTile = Tile<TileType::Vec, int8_t, 16, 16, RowMajor, NoneBox, None, Null>;
-TINSERT(dstVec, srcLeft, InsertMode::NZ2ND);
-```
-
-## Constants Reference
-
-| Constant | Value | Units | Use |
-|----------|-------|-------|-----|
-| `BLOCK_BYTE_SIZE` | 32 | bytes | DMA block transfer unit |
-| `FIXP_BURST_UNIT_LEN` | 64 | half-words | DMA burst length |
-| `FRACTAL_NZ_ROW` | 16 | elements | Fractal row dimension for NZ/ZN |
-| `CUBE_BLOCK_SIZE` | 512 | bytes | CUBE fractal block |
-| `MX_COL_LEN` | 2 | elements | MX matmul column block |
-| `MX_ROW_LEN` | 16 | elements | MX matmul row block |
-| `MX_BLOCK_SIZE` | 32 | elements | MX matmul block |
-
-## See Also
-
-- [Tiles and Valid Regions](../programming-model/tiles-and-valid-regions.md) — Programming model context, valid-region semantics
-- [Element Types and SSA Names](./type-system.md) — Complete element type inventory
-- [Tile Buffer SSA Type](./type-system.md#tile-buffer-types) — `!pto.tile<...>` vs `!pto.tile_buf<...>`
-- [TEXTRACT](../tile/ops/layout-and-rearrangement/textract.md) — Layout conversion with compact mode
-- [TINSERT](../tile/ops/layout-and-rearrangement/tinsert.md) — Layout conversion with compact mode
-- [Tile Instruction Surface](../instruction-surfaces/tile-instructions.md) — How layouts interact with tile operations
diff --git a/docs/mkdocs/src/docs/isa/state-and-types/location-intent-and-legality.md b/docs/mkdocs/src/docs/isa/state-and-types/location-intent-and-legality.md
deleted file mode 100644
index ce721477..00000000
--- a/docs/mkdocs/src/docs/isa/state-and-types/location-intent-and-legality.md
+++ /dev/null
@@ -1,184 +0,0 @@
-<!-- Generated from `docs/isa/state-and-types/location-intent-and-legality.md` -->
-
-# Location Intent And Legality
-
-PTO legality depends on more than element type and shape. Many operations also care about where a value is intended to live or what role it plays in the selected surface. This page defines the location intent taxonomy and the legality checking pipeline.
-
-## Location Intent Taxonomy
-
-Every tile operand in PTO carries a **location intent** — a declared role that determines which execution pipeline processes it and what operations are legal on it. The location intent is encoded in the `loc=` field of the tile type.
-
-### Location Intent Values
-
-| Location Intent | Pipeline | Description | Typical Use |
-|----------------|----------|-------------|-------------|
-| `loc=vec` | Vector Pipeline (V) | General-purpose vector tile | Elementwise ops, `TADD`, `TMUL`, `TCVT`, `TLOAD/TSTORE` |
-| `loc=mat` | Matrix Multiply (M/CUBE) | Matrix multiply operand (A or B) | `TGEMV`, `TGEMV_ACC`, `TGEMV_BIAS` |
-| `loc=acc` | Matrix Multiply (M/CUBE) | Accumulator / output tile | `TMATMUL`, `TMATMUL_ACC`, `TMATMUL_BIAS` output |
-| `loc=left` | Matrix Multiply (M/CUBE) | Left-hand operand of MX-format matmul | `TMATMUL_MX` LHS (NZ layout, `SLayout::RowMajor`) |
-| `loc=right` | Matrix Multiply (M/CUBE) | Right-hand operand of MX-format matmul | `TMATMUL_MX` RHS (`SLayout::NoneBox`, `NN` fractal) |
-| `loc=scalar` | Scalar Unit | Scalar tile (1×1) | Scalar operations on tile surface |
-
-### Location Intent in Tile Type
-
-In SSA/IR form, location intent is part of the tile type:
-
-```
-!pto.tile<loc=vec, f32, 16, 16, RowMajor, NoneBox, None, Zero>
-!pto.tile_buf<loc=left, int8, 16, 16, RowMajor, RowMajor, NZ, Null>
-!pto.tile_buf<loc=acc, int32, 16, 16, RowMajor, NoneBox, None, Zero>
-```
-
-In C++ API, location intent is expressed via the `TileType` template parameter:
-
-```cpp
-using VecTile = Tile<TileType::Vec, float, 16, 16>;
-using AccTile = Tile<TileType::Acc, float, 16, 16>;
-using LeftTile = Tile<TileType::Left, int8_t, 16, 16, RowMajor, RowMajor, NZ, Null>;
-```
-
-## Legality Checking Pipeline
-
-PTO performs legality checking in four sequential stages. A program is legal only if it passes all four stages:
-
-```
-┌─────────────────────────────────────────┐
-│  Stage 1: TYPE CHECK                    │
-│  Element types match? Sizes compatible?  │
-│  → If fail: type error (diagnostic)     │
-└─────────────────┬───────────────────────┘
-                  │ PASS
-                  ▼
-┌─────────────────────────────────────────┐
-│  Stage 2: SHAPE CHECK                   │
-│  Physical shape (Rows, Cols) legal?     │
-│  Valid region (Rv, Cv) within bounds?   │
-│  → If fail: shape error (diagnostic)    │
-└─────────────────┬───────────────────────┘
-                  │ PASS
-                  ▼
-┌─────────────────────────────────────────┐
-│  Stage 3: LAYOUT CHECK                  │
-│  BLayout+SLayout+Fractal combo legal    │
-│  for this TileType and instruction?     │
-│  → If fail: layout error (diagnostic)   │
-└─────────────────┬───────────────────────┘
-                  │ PASS
-                  ▼
-┌─────────────────────────────────────────┐
-│  Stage 4: TARGET PROFILE CHECK          │
-│  TileType + dtype supported on target?  │
-│  MX format, FP8, fractal legal on A5?   │
-│  → If fail: profile error (diagnostic)  │
-└─────────────────┬───────────────────────┘
-                  │ PASS
-                  ▼
-              LEGAL PROGRAM
-```
-
-### Stage 1: Type Check
-
-**Rule**: The element type of all operands MUST be compatible with the operation.
-
-For binary tile operations (`TADD`, `TMUL`, etc.):
-```
-dtype(src0) == dtype(src1) == dtype(dst)
-```
-
-For type-converting operations (`TCVT`):
-```
-dtype(src) and dtype(dst) must be in the same conversion group (see Type System page)
-sizeof(dtype(src)) may != sizeof(dtype(dst)) for converting ops
-```
-
-**Diagnostic**: `type mismatch: expected f32 but found f16 in operand 1`
-
-### Stage 2: Shape Check
-
-**Rule**: The physical shape of all operands MUST be within the legal bounds for the instruction and target profile.
-
-```
-1 <= Rows <= MAX_ROWS(profile)    -- e.g., 65535 on A5, 8192 on A2/A3
-1 <= Cols <= MAX_COLS(profile)    -- e.g., 4095 on all profiles
-0 <= Rv <= Rows                   -- valid region within physical bounds
-0 <= Cv <= Cols
-```
-
-**Diagnostic**: `shape out of range: Cols=8192 exceeds maximum of 4095 for TDIV on A2/A3`
-
-### Stage 3: Layout Check
-
-**Rule**: The combination of `BLayout`, `SLayout`, and `Fractal` MUST be a supported combination for the operand's `TileType` and the instruction.
-
-See the Layout Combinations table in the [Tiles And Valid Regions](../programming-model/tiles-and-valid-regions.md) page for the complete list of supported combinations.
-
-**Examples**:
-```
-Vec tile with NZ layout:          ILLEGAL (Vec tiles do not support fractal layouts)
-Left tile with ColMajor layout:   ILLEGAL (Left tiles must be RowMajor)
-Mat tile with ColMajor NZ fractal: ILLEGAL (Mat tiles must use standard layouts)
-```
-
-**Diagnostic**: `layout mismatch: Vec tile with fractal layout not supported by TADD`
-
-### Stage 4: Target Profile Check
-
-**Rule**: The operand's `TileType`, element type, and layout MUST be supported on the selected target profile.
-
-Examples:
-```
-FP8 e4m3 type on A2/A3:          ILLEGAL (FP8 not supported on A2/A3)
-vstu (unaligned vector store):    ILLEGAL on CPU and A2/A3 (A5 only)
-Left/Right MX format tiles:       ILLEGAL on CPU and A2/A3 (A5 only)
-```
-
-**Diagnostic**: `profile restriction: FP8 types require A5 profile`
-
-## Legality by Instruction Family
-
-Different instruction families have different legality rules beyond the four-stage pipeline:
-
-### Elementwise Tile-Tile (TADD, TMUL, etc.)
-
-- All operands MUST be `loc=vec`.
-- `BLayout`, `SLayout`, `Fractal` MUST be compatible with `Vec`.
-- `dtype` MUST be in the elementwise family type list (varies by profile).
-
-### Matmul (TMATMUL, TGEMV, etc.)
-
-- Left operand: `TileType::Left` (for MX format) or `TileType::Mat`
-- Right operand: `TileType::Right` (for MX format) or `TileType::Mat`
-- Accumulator: `TileType::Acc`
-- Shape constraints: `Rows_A == Rows_C`, `Cols_A == Rows_B`, `Cols_B == Cols_C`
-
-### Vector Compute (vadd, vmul, etc.)
-
-- Operands MUST be `!pto.vreg<NxDTYPE>`.
-- Mask operand MUST be `!pto.mask` with matching width.
-- `dtype` MUST be in the vector family type list (varies by profile).
-
-## GM-Facing Operands (GlobalTensor)
-
-GlobalTensor operands follow a separate legality path:
-
-| Check | Rule |
-|-------|------|
-| Dtype size | `sizeof(tile.dtype) == sizeof(gtensor.dtype)` |
-| Layout compatibility | `gtensor.Layout` (ND/DN/NZ) must be compatible with `tile.SLayout` |
-| Shape positive | All shape dimensions > 0 |
-| Valid region | `Rv > 0` and `Cv > 0` |
-
-## Cases That Are Not Allowed
-
-- Using vector-buffer assumptions on a tile-surface operand without an explicit bridge.
-- Documenting location-sensitive families as though any local storage role were equivalent.
-- Hiding target-profile narrowing inside generic "implementation-defined" wording.
-- Relying on the CPU simulator's permissive legality checking as evidence of A5 legality.
-
-## See Also
-
-- [Type System](./type-system.md)
-- [Tiles And Valid Regions](../programming-model/tiles-and-valid-regions.md)
-- [Tile Instruction Surface](../instruction-surfaces/tile-instructions.md)
-- [Vector Instruction Surface](../instruction-surfaces/vector-instructions.md)
-- [Portability And Target Profiles](../reference/portability-and-target-profiles.md)
diff --git a/docs/mkdocs/src/docs/isa/state-and-types/location-intent-and-legality_zh.md b/docs/mkdocs/src/docs/isa/state-and-types/location-intent-and-legality_zh.md
deleted file mode 100644
index 927b2197..00000000
--- a/docs/mkdocs/src/docs/isa/state-and-types/location-intent-and-legality_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Location Intent And Legality
-
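-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](location-intent-and-legality.md)
-- [Chinese manual entry point](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: State and Types](../../../manual/03-state-and-types_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA is in place, but the corresponding Chinese text has not yet been fully written to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.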
diff --git a/docs/mkdocs/src/docs/isa/state-and-types/type-system.md b/docs/mkdocs/src/docs/isa/state-and-types/type-system.md
deleted file mode 100644
index c1fff6fc..00000000
--- a/docs/mkdocs/src/docs/isa/state-and-types/type-system.md
+++ /dev/null
@@ -1,144 +0,0 @@
-<!-- Generated from `docs/isa/state-and-types/type-system.md` -->
-
-# Type System
-
-PTO uses a compact visible type system, but legality does not stop at raw type names. Type classes tell you what kind of architectural object you are dealing with. Other legality dimensions such as layout, location, valid region, and target profile determine whether a use is actually allowed.
-
-## Element Types
-
-PTO supports a rich set of element types across floating-point, integer, and specialized categories.
-
-### Floating-Point Types
-
-| Type | SSA Name | Bits | Description | A2/A3 | A5 |
-|------|----------|------|-------------|:------:|:--:|
-| IEEE FP16 | `f16` / `half` | 16 | IEEE 754 half-precision | Yes | Yes |
-| BF16 | `bf16` / `bfloat16_t` | 16 | Brain float 16 (8-bit exponent) | Yes | Yes |
-| IEEE FP32 | `f32` | 32 | IEEE 754 single-precision | Yes | Yes |
-| FP8 E4M3 | `f8e4m3` / `float8_e4m3_t` | 8 | 4-bit exponent, 3-bit mantissa | No | Yes |
-| FP8 E5M2 | `f8e5m2` / `float8_e5m2_t` | 8 | 5-bit exponent, 2-bit mantissa | No | Yes |
-| HI Float8 | `hifloat8_t` | 8 | High-precision float8 | No | Yes |
-| Float4 E1M2x2 | `float4_e1m2x2_t` | 4 | 4-bit float4, packed 2x2 | No | Yes |
-| Float4 E2M1x2 | `float4_e2m1x2_t` | 4 | 4-bit float4, packed 2x2 | No | Yes |
-
-### Integer Types
-
-| Type | SSA Name | Bits | Signedness | A2/A3 | A5 |
-|------|----------|------|------------|:------:|:--:|
-| int8 | `i8` | 8 | Signed | Yes | Yes |
-| uint8 | `u8` | 8 | Unsigned | Yes | Yes |
-| int16 | `i16` | 16 | Signed | Yes | Yes |
-| uint16 | `u16` | 16 | Unsigned | Yes | Yes |
-| int32 | `i32` | 32 | Signed | Yes | Yes |
-| uint32 | `u32` | 32 | Unsigned | Yes | Yes |
-| int64 | `i64` | 64 | Signed | Yes | Yes |
-| uint64 | `u64` | 64 | Unsigned | Yes | Yes |
-
-## Vector Width
-
-The vector register width `N` (the number of lanes) is determined by the element type and the target profile:
-
-| Element Type | Vector Width N | Bytes/Register | Notes |
-|-------------|:-------------:|:-------------:|-------|
-| f32 | 64 | 256 B | 64 × 32-bit |
-| f16, bf16 | 128 | 256 B | 128 × 16-bit |
-| i16, u16 | 128 | 256 B | 128 × 16-bit |
-| i8, u8 | 256 | 256 B | 256 × 8-bit |
-| f8e4m3, f8e5m2 | 256 | 256 B | 256 × 8-bit |
-
-Vector width is **portable** across all profiles: CPU simulation, A2/A3, and A5 all present the same `N` value for each element type. The difference is that A5 executes vector operations natively on hardware, while CPU/A2/A3 emulate them.
-
-## Vector Register Types
-
-Vector register SSA type: `!pto.vreg<NxDTYPE>`
-
-```
-!pto.vreg<64xf32>   -- 64 lanes of f32
-!pto.vreg<128xf16>  -- 128 lanes of f16
-!pto.vreg<256xi8>   -- 256 lanes of i8
-```
-
-## Tile Buffer Types
-
-Tile buffer SSA type (see [Tiles And Valid Regions](../programming-model/tiles-and-valid-regions.md) for full parameter list):
-
-```
-!pto.tile<loc=vec, f32, 16, 16, RowMajor, NoneBox, None, Zero>
-!pto.tile_buf<loc=mat, bf16, 16, 16, RowMajor, NoneBox, None, Null>
-!pto.tile_buf<loc=left, int8, 16, 16, RowMajor, RowMajor, NZ, Null>
-```
-
-## NaN and Inf Behavior
-
-For floating-point types, PTO follows IEEE 754 semantics with the following implementation-defined variation points:
-
-| Behavior | Rule |
-|----------|------|
-| Quiet NaN propagation | Quiet NaN in → quiet NaN out (preserves signaling bit) |
-| Signaling NaN | Signaling NaN may be quieted by hardware before use |
-| Inf arithmetic | Inf is produced and propagated as IEEE 754 requires |
-| Denormalized numbers | Hardware may flush denormals to zero (FTZ behavior) |
-| Rounding | Controlled by `rnd` attribute: `rne` (default), `rz`, `rp`, `rm` |
-
-The FTZ (flush-to-zero) behavior for denormals is **implementation-defined**: the manual does not mandate a specific choice. The `rnd` attribute controls the rounding direction for operations that reduce precision (e.g., `vcvt` from f32 to f16).
-
-## Type Conversion Rules
-
-### Between Floating-Point Types
-
-| Conversion | Kind | Behavior |
-|--------|------|----------|
-| f16 → bf16 | Conversion | Reinterpret f16 bits as bf16 (no numerical conversion) |
-| bf16 → f16 | Conversion | Reinterpret bf16 bits as f16 (no numerical conversion) |
-| f16/bf16 → f32 | Promotion | Extend to f32; exact representable values are preserved |
-| f32 → f16/bf16 | Narrowing | Round according to `rnd` attribute; NaN/Inf handled per IEEE 754 |
-| f8 → f16/f32 | Promotion | Extend; exact representable values are preserved |
-| f16/f32 → f8 | Narrowing | Round according to `rnd` attribute; may overflow to Inf |
-
-### Between Integer Types
-
-| Conversion | Kind | Behavior |
-|--------|------|----------|
-| Widening (e.g., i8 → i16) | Zero/sign extend | Zero-extend for unsigned; sign-extend for signed |
-| Narrowing (e.g., i16 → i8) | Truncation | Truncate high bits; may lose significant bits |
-| i32 → f32 | Conversion | Exact for values in [-2^24, 2^24]; may lose precision outside |
-| f32 → i32 | Conversion | Truncates toward zero; may overflow (implementation-defined) |
-
-### Between Float and Integer
-
-| Conversion | Kind | Behavior |
-|--------|------|----------|
-| f32 → i8/u8/i16/u16 | Narrowing | Truncate; may overflow |
-| f32 → i32/u32 | Narrowing | Truncate; may overflow |
-| i8/u8 → f32 | Promotion | Exact; every 8-bit integer value is representable in f32 |
-
-### Type Conversion Operations
-
-| Operation | Surface | Description |
-|-----------|---------|-------------|
-| `pto.tcvt` | Tile | Elementwise type conversion on tile buffers |
-| `pto.vcvt` | Vector | Vector register type conversion |
-| `pto.vtrc` | Vector | Vector truncate/round (e.g., f32 → f16) |
-| `pto.vci` | Vector | Compress to integer (vector → integer result) |
-
-## Constraints
-
-- Family pages must define accepted operand/result classes.
-- Type errors must stay distinguishable from deeper legality failures (shape, layout, location intent, target profile).
-- Vector surface docs must make vector-register, mask, pointer, and alignment state explicit.
-- Tile surface docs must make tile role, shape, and valid-region interactions explicit.
-- No implicit type promotion: `tadd(t, i8_tile, f32_immediate)` is illegal unless an explicit `tcvt` converts one operand first.
-
-## Cases That Are Not Allowed
-
-- Treating type class checks as though they cover every backend legality fact.
-- Conflating scalar state with tile or vector payload state.
-- Documenting vector and tile payload classes as if they were interchangeable.
-- Relying on implicit type conversion without an explicit `tcvt`/`vcvt`.
-
-## See Also
-
-- [Location Intent And Legality](./location-intent-and-legality.md)
-- [Instruction Surfaces](../instruction-surfaces/README.md)
-- [Source Of Truth](../reference/source-of-truth.md)
-- [Tiles And Valid Regions](../programming-model/tiles-and-valid-regions.md)
diff --git a/docs/mkdocs/src/docs/isa/state-and-types/type-system_zh.md b/docs/mkdocs/src/docs/isa/state-and-types/type-system_zh.md
deleted file mode 100644
index a5921804..00000000
--- a/docs/mkdocs/src/docs/isa/state-and-types/type-system_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Type System
-
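-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](type-system.md)
-- [Chinese manual entry point](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: State and Types](../../../manual/03-state-and-types_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA is in place, but the corresponding Chinese text has not yet been fully written to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.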
diff --git a/docs/mkdocs/src/docs/isa/syntax-and-operands/README_zh.md b/docs/mkdocs/src/docs/isa/syntax-and-operands/README_zh.md
deleted file mode 100644
index 685a101c..00000000
--- a/docs/mkdocs/src/docs/isa/syntax-and-operands/README_zh.md
+++ /dev/null
@@ -1,18 +0,0 @@
-<!-- Generated from `docs/isa/syntax-and-operands/README_zh.md` -->
-
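-# Syntax and Operands
-
-This chapter describes the textual spelling, operand shapes, attributes, and shared naming conventions of PTO ISA. It is a prerequisite for understanding the instruction syntax format.
-
-## In This Chapter
-
-- [Assembly Model](syntax-and-operands/assembly-model.md): BNF definitions of the three-level PTO-AS syntax (Assembly / SSA / DPS), operand modifiers, and immediate encoding rules
-- [Operands and Attributes](syntax-and-operands/operands-and-attributes.md): SSA type table for the seven operand kinds (Tile / GlobalTensor / Scalar / Predicate / Event / UB Pointer / GM Pointer), the complete attribute list (Compare / Rounding / Atomic / Transform / Distribution / Mask), and operand constraint rules
-
-## Reading Guide
-
-Read this chapter before diving into individual instructions, so that the syntax format and operand conventions are familiar. The BNF grammar and operand classification defined here apply to every instruction page in the manual.
-
-## Chapter Position
-
-This chapter is Chapter 4 of the manual. Together with Chapter 5 (State and Types) and Chapter 6 (Memory Model), it forms the "prerequisite knowledge band" that precedes the instruction set chapters.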
diff --git a/docs/mkdocs/src/docs/isa/syntax-and-operands/assembly-model.md b/docs/mkdocs/src/docs/isa/syntax-and-operands/assembly-model.md
deleted file mode 100644
index ba6c9ed0..00000000
--- a/docs/mkdocs/src/docs/isa/syntax-and-operands/assembly-model.md
+++ /dev/null
@@ -1,222 +0,0 @@
-<!-- Generated from `docs/isa/syntax-and-operands/assembly-model.md` -->
-
-# Assembly Spelling And Operands
-
-PTO ISA includes a textual assembly spelling — PTO-AS — but the architecture contract stays in the PTO ISA manual itself. This page defines the syntax rules, including the BNF grammar for all three forms, operand modifier rules, and attribute syntax. Per-instruction syntax pages add instruction-specific variants.
-
-## Three-Level Syntax System
-
-PTO defines three levels of textual syntax, all preserving the same ISA contract:
-
-| Level | Name | Form | Typical Use |
-|-------|------|------|-------------|
-| **Assembly Form (PTO-AS)** | Human-readable | `tadd %dst, %src0, %src1` | Documentation, pseudocode |
-| **SSA Form (AS Level 1)** | MLIR SSA | `%dst = pto.tadd %src0, %src1` | IR, code generators |
-| **DPS Form (AS Level 2)** | Functional DPS | `pto.tadd ins(...) outs(...)` | C++ intrinsic surface |
-
-All three forms are **semantically equivalent** — they describe the same ISA operation. A backend or verifier must accept any form and produce identical behavior.
-
-## BNF Grammar
-
-### Assembly Form (PTO-AS)
-
-```
-assembly-program  ::= assembly-stmt*
-assembly-stmt     ::= label? op-name operands? ":" type-ref ("#" attribute)*
-label             ::= identifier ":"
-op-name           ::= ("pto.")? identifier ("_" identifier)*
-operands          ::= operand ("," operand)*
-operand           ::= register | immediate | memory-operand | mask-operand
-register          ::= "%" identifier
-immediate         ::= integer | hex-integer | floating-point
-hex-integer       ::= "0x" [0-9a-fA-F]+
-floating-point    ::= [0-9]+ "." [0-9]+ ("e" [+-]? [0-9]+)?
-memory-operand    ::= register "[" register ("," register)* "]"
-mask-operand      ::= "%" identifier ":" "!pto.mask"
-type-ref          ::= "!pto." type-key "<" type-params ">"
-type-key          ::= "tile" | "tile_buf" | "vreg" | "ptr" | "partition_tensor_view" | "mask" | "event"
-```
-
-### SSA Form (AS Level 1)
-
-```
-ssa-program       ::= ssa-stmt*
-ssa-stmt          ::= ssa-result "=" op-name operands ":" ssa-type "->" ssa-type
-ssa-result        ::= "%" identifier
-op-name           ::= "pto." identifier ("_" identifier)*
-operands          ::= ssa-operand ("," ssa-operand)*
-ssa-operand       ::= ssa-result | immediate | memory-operand
-ssa-type          ::= "!pto." ssa-type-key "<" type-params ">"
-ssa-type-key      ::= "tile" | "tile_buf" | "vreg" | "ptr" | "partition_tensor_view" | "mask"
-```
-
-### DPS Form (AS Level 2)
-
-```
-dps-program       ::= dps-stmt*
-dps-stmt          ::= op-name "ins(" dps-ins ")" "outs(" dps-outs ")"
-dps-ins           ::= dps-ins-item ("," dps-ins-item)*
-dps-outs          ::= dps-out-item ("," dps-out-item)*
-dps-ins-item      ::= ssa-result ":" ssa-type
-dps-out-item      ::= ssa-result ":" ssa-type
-```
-
-## Operand Modifier Rules
-
-### Tile Operands
-
-A tile operand may carry optional modifiers in PTO-AS:
-
-```
-%tile                     -- bare tile register
-%tile[%r, %c]            -- tile with GM base offset (row, col offsets)
-%tile!loc=vec            -- tile with location intent annotation
-```
-
-In SSA form, location intent and valid-region information are encoded in the tile type:
-
-```
-!pto.tile<loc=vec, f32, 16, 16, RowMajor, NoneBox, None, Zero>
-!pto.tile_buf<loc=vec, f32, 16, 16, RowMajor, NoneBox, None, Zero>
-```
-
-### GlobalTensor Operands
-
-In PTO-AS, a `GlobalTensor` operand appears as a `memref` or `partition_tensor_view`:
-
-```
-%tensor                      -- bare GlobalTensor register
-%tensor[%r, %c]             -- with 2D base offset
-%tensor[%b, %h, %w, %r, %c] -- with 5D base offset (partition_tensor_view)
-```
-
-### Predicate Operands
-
-A predicate operand is written as a mask register:
-
-```
-%mask : !pto.mask           -- predicate operand in SSA form
-```
-
-Vector instructions that take a mask write it as an explicit operand:
-
-```
-%result = pto.vadd %src0, %src1, %mask : ... -> ...
-```
-
-### Immediate Operands
-
-Immediate operands are encoded directly in the instruction:
-
-```
-tadds %dst, %src, 0x3F800000   -- 32-bit float immediate (1.0f)
-tshrs %dst, %src, 16            -- 16-bit shift amount
-taddc %dst, %src0, %src1       -- carry-variant, no immediate
-```
-
-## Instruction Suffixes
-
-PTO uses suffixes to distinguish operation variants:
-
-| Suffix | Meaning | Example |
-|--------|---------|---------|
-| *(none)* | Standard binary op | `tadd` |
-| `s` | Scalar variant: second operand is an immediate scalar | `tadds %dst, %src, 0x3F800000` |
-| `c` | Carry variant: saturating arithmetic | `taddc`, `tsubc` |
-| `sc` | Scalar + carry variant | `taddsc`, `tsubsc` |
-| `_fp` | Floating-point special handling | `tstore_fp`, `tinsert_fp` |
-| `_acc` | Accumulating variant | `tmatmul_acc` |
-| `_bias` | Bias-addition variant | `tmatmul_bias` |
-| `_mx` | MX format (int8 matmul) variant | `tgemv_mx` |
-
-## Attribute Syntax
-
-Attributes modify operation behavior. In PTO-AS, they appear after `#`:
-
-```
-tstore %tile, %tensor #atomic=add    -- atomic store
-tcmps %dst, %src, 0   #cmp=gt        -- comparison mode
-tmatmul %c, %a, %b    #phase=relu    -- matmul phase mode
-```
-
-In SSA form, attributes appear as `{key = value}`:
-
-```
-%result = pto.tcmp %src0, %src1 {cmp = "lt"} : ... -> ...
-```
-
-## Complete Examples
-
-### Tile Compute: Elementwise Addition
-
-**Assembly Form (PTO-AS)**:
-```
-tadd %dst, %src0, %src1 : !pto.tile<f32, 16, 16>
-```
-
-**SSA Form (AS Level 1)**:
-```
-%dst = pto.tadd %src0, %src1 : (!pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 16>) -> !pto.tile<f32, 16, 16>
-```
-
-**DPS Form (AS Level 2)**:
-```
-pto.tadd ins(%src0, %src1 : !pto.tile_buf<f32, 16, 16>, !pto.tile_buf<f32, 16, 16>)
-          outs(%dst : !pto.tile_buf<f32, 16, 16>)
-```
-
-**C++ Intrinsic**:
-```cpp
-TADD(TileDst, TileSrc0, TileSrc1);
-```
-
-### Tile Load: From GlobalTensor
-
-**Assembly Form (PTO-AS)**:
-```
-tload %tile, %tensor[%r, %c] : (!pto.tile<f32,16,16>, !pto.memref<f32,5>) -> !pto.tile<f32,16,16>
-```
-
-**SSA Form (AS Level 1)**:
-```
-%tile = pto.tload %mem : !pto.partition_tensor_view<1x1x1x16x16xf32>
-    -> !pto.tile_buf<loc=vec, f32, 16, 16, RowMajor, NoneBox, None, Zero>
-```
-
-### Vector Compute: Vector Addition with Mask
-
-**SSA Form (AS Level 1)**:
-```
-%result = pto.vadd %src0, %src1, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32>
-```
-
-### Scalar Compare: Predicate Generation
-
-**SSA Form (AS Level 1)**:
-```
-%pred = pto.pge_b32 %src0, %src1 : (!pto.vreg<64xi32>, !pto.vreg<64xi32>) -> !pto.mask
-```
-
-## What Textual Spelling Does Not Replace
-
-Textual spelling does not replace:
-
-- the PTO ISA machine model
-- the PTO ISA memory model
-- target-profile rules for CPU, A2/A3, and A5
-- the architecture-level legality rules that are independent of textual spelling
-
-## Contract Notes
-
-- Textual assembly forms MUST preserve the same visible operation meaning as their documented intrinsic forms.
-- Assembly syntax rules MUST stay in the PTO ISA syntax-and-operands pages, not in backend-private notes.
-- Syntax variants that change semantics must be documented as explicit variants, not as undocumented assembler convenience.
-- The three syntactic levels (Assembly / SSA / DPS) are semantically equivalent; a backend MUST NOT assign different behavior to different syntactic forms of the same operation.
-
-## See Also
-
-- [Operands and Attributes](./operands-and-attributes.md)
-- [Type System](../state-and-types/type-system.md)
-- [Parallel Tile Operation ISA Version 1.0](../introduction/what-is-pto-visa.md)
-- [Instruction surfaces](../instruction-surfaces/README.md)
-- [Common conventions](../conventions.md)
diff --git a/docs/mkdocs/src/docs/isa/syntax-and-operands/assembly-model_zh.md b/docs/mkdocs/src/docs/isa/syntax-and-operands/assembly-model_zh.md
deleted file mode 100644
index bd043a3b..00000000
--- a/docs/mkdocs/src/docs/isa/syntax-and-operands/assembly-model_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Assembly Spelling And Operands
-
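-This page is an automatically generated Chinese entry page. It keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](assembly-model.md)
-- [Chinese manual entry point](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese chapter manual: PTO Assembly](../../../manual/06-assembly_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA is in place, but the corresponding Chinese text has not yet been fully written to match it. Opening this page from the Chinese navigation keeps you on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.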
diff --git a/docs/mkdocs/src/docs/isa/syntax-and-operands/operands-and-attributes.md b/docs/mkdocs/src/docs/isa/syntax-and-operands/operands-and-attributes.md
deleted file mode 100644
index dbf0e6f2..00000000
--- a/docs/mkdocs/src/docs/isa/syntax-and-operands/operands-and-attributes.md
+++ /dev/null
@@ -1,198 +0,0 @@
-<!-- Generated from `docs/isa/syntax-and-operands/operands-and-attributes.md` -->
-
-# Operands And Attributes
-
-PTO VISA operations work over a small set of operand kinds: tiles, global-memory views, scalars, predicates, synchronization events, and UB/GM pointers. Attributes and modifiers refine operation behavior, but they do not replace operand legality.
-
-## Operand Kinds
-
-PTO defines seven operand kinds. Each kind maps to a specific SSA type and has distinct legality rules:
-
-| Kind | SSA Type | C++ API | Description |
-|------|----------|---------|-------------|
-| **Tile** | `!pto.tile<...>` / `!pto.tile_buf<...>` | `Tile<TileType, DType, Rows, Cols, ...>` | Tile operand with shape, layout, valid-region metadata |
-| **GlobalTensor** | `!pto.partition_tensor_view<...>` / `!pto.memref<...>` | `GlobalTensor<DType, Shape, Stride, Layout>` | GM-facing view; the source or destination of data movement |
-| **Scalar** | `i8`–`i64`, `u8`–`u64`, `f16`, `bf16`, `f32` | Built-in C++ types | Immediate values or runtime-computed scalars |
-| **Predicate** | `!pto.mask` | (IR-level) | Per-lane mask controlling which lanes participate in vector ops |
-| **Event** | `!pto.event` | `RecordEvent` (return type) | Synchronization token; carries ordering information between operations |
-| **UB Pointer** | `!pto.ptr<T, ub>` | (IR-level) | Pointer into Unified Buffer; used by vector load/store and DMA copy ops |
-| **GM Pointer** | `!pto.ptr<T, gm>` | `__gm__ T*` | Pointer into Global Memory; used by scalar load/store and DMA copy ops |
-
-## Operand Kind Details
-
-### Tile Operands
-
-Tile operands carry shape, layout, valid-region, and location-intent metadata. They are the primary payload type for `pto.t*` operations.
-
-**SSA tile type signature**:
-```
-!pto.tile<loc=LOC, DTYPE, ROWS, COLS, BLAYOUT, SLAYOUT, FRACTAL, PAD>
-```
-
-**Components**:
-| Component | Values | Description |
-|-----------|--------|-------------|
-| `loc` | `vec`, `mat`, `acc`, `scalar`, `left`, `right` | Location intent / pipeline destination |
-| `DTYPE` | `f16`, `bf16`, `f32`, `i8`, `u8`, ... | Element type |
-| `ROWS` | positive integer | Physical row count |
-| `COLS` | positive integer | Physical column count |
-| `BLAYOUT` | `RowMajor`, `ColMajor` | Block storage layout |
-| `SLAYOUT` | `NoneBox`, `RowMajor`, `ColMajor` | Stripe layout |
-| `FRACTAL` | `None`, `NZ`, `ZN`, `FR`, `RN` | Fractal encoding |
-| `PAD` | `Zero`, `Null`, `Invalid` | Padding value |
-
-### GlobalTensor Operands
-
-`GlobalTensor` operands describe a view of GM storage. They pair a pointer with shape and stride metadata.
-
-**SSA partition_tensor_view type**:
-```
-!pto.partition_tensor_view<BxHxWxRxCxdtype>
-```
-This is always 5D: batch, height, width, tile rows, tile columns.
-
-### Scalar Operands
-
-Scalar operands are either immediates encoded directly in the instruction or values computed at runtime. They appear as:
-
-- 32-bit integer or float immediates in assembly
-- `i32`, `i64`, `f32` in SSA form
-- Standard C++ types in C++ intrinsics
-
-### Predicate Operands
-
-Predicate operands (`!pto.mask`) control which lanes participate in vector operations. They are produced by predicate-generation operations (`pset_b8`, `pge_b32`, `plt_b16`, etc.) and consumed by vector operations.
-
-A predicate with all bits set means "all lanes active". A predicate with some bits cleared means that only the lanes whose bits remain set participate.
-
-### UB Pointer Operands
-
-UB pointer operands (`!pto.ptr<T, ub>`) specify addresses within the on-chip Unified Buffer. They are used by:
-
-- Vector load/store (`vlds`, `vsld`, `vgather2`, `vsts`, `vsst`, `vscatter`)
-- DMA copy operations (`copy_gm_to_ubuf`, `copy_ubuf_to_gm`)
-
-### GM Pointer Operands
-
-GM pointer operands (`!pto.ptr<T, gm>`) specify addresses in off-chip Global Memory. They are used by:
-
-- Scalar load/store (`load_scalar`, `store_scalar`)
-- DMA copy operations
-
-## Attributes
-
-Attributes modify the behavior of an operation without changing its operand types. Every attribute MUST have a documented value domain, and invalid attribute values MUST produce deterministic diagnostics.
-
-### Compare Attributes
-
-Used by `pto.tcmp`, `pto.vcmp`, and related comparison operations:
-
-| Attribute | Values | Description |
-|-----------|--------|-------------|
-| `cmp` | `"eq"`, `"ne"`, `"lt"`, `"le"`, `"gt"`, `"ge"` | Comparison predicate mode |
-| `cmpS` | (same) | Scalar compare variant: compares each element against an immediate |
-
-### Rounding Mode Attributes
-
-Used by conversion and narrowing operations:
-
-| Attribute | Values | Description |
-|-----------|--------|-------------|
-| `rnd` | `"rne"`, `"rz"`, `"rp"`, `"rm"` | Rounding mode: nearest-even, zero, positive-infinity, negative-infinity |
-
-### Atomic Mode Attributes
-
-Used by `pto.tstore`:
-
-| Attribute | Values | Description |
-|-----------|--------|-------------|
-| `atomic` | `"none"`, `"add"`, `"max"`, `"min"` | Atomic store mode |
-
-### Transform Mode Attributes
-
-Used by `pto.timg2col`, `pto.textract`, `pto.tinsert`:
-
-| Attribute | Values | Description |
-|-----------|--------|-------------|
-| `mode` | `"hw"`, `"wh"`, `"cubic"`, ... | Transform mode; domain depends on operation |
-
-### Matmul Phase Attributes
-
-Used by `pto.tmatmul`:
-
-| Attribute | Values | Description |
-|-----------|--------|-------------|
-| `phase` | `"relu"`, `"none"` | Post-matmul activation phase |
-
-### Distribution Mode Attributes
-
-Used by vector load/store (`pto.vlds`, `pto.vsts`):
-
-| Attribute | Values | Description |
-|-----------|--------|-------------|
-| `dist` | `"NORM"`, `"BRC_B8/B16/B32"`, `"US_B8/B16"`, `"DS_B8/B16"`, `"UNPK_B8/B16/B32"`, `"DINTLV_B32"`, `"SPLT2CHN_B8/B16"`, `"SPLT4CHN_B8"` | Distribution mode |
-
-### Mask Attributes
-
-Used by vector load with alignment-state update:
-
-| Attribute | Values | Description |
-|-----------|--------|-------------|
-| `mask` | `"POST_UPDATE"`, `"NO_POST_UPDATE"` | Whether to update alignment state after masked store |
-
-## Operand Constraint Rules
-
-### Tile Operand Constraints
-
-For a binary tile operation `optile(dst, src0, src1)`:
-
-1. **Type compatibility**: `dtype(src0) == dtype(src1) == dtype(dst)` (except for `TCVT`, which explicitly changes dtype)
-2. **Shape compatibility**: `shape(src0) == shape(src1) == shape(dst)` (no implicit broadcasting)
-3. **Layout compatibility**: The combination of `BLayout`, `SLayout`, and `Fractal` MUST match the instruction family's requirements
-4. **Location intent**: Source and destination location intents MUST be compatible with the instruction (e.g., matmul requires `Left` + `Right` → `Acc`)
-
-### GlobalTensor Operand Constraints
-
-For `TLOAD(tile, tensor)`:
-
-1. **Dtype size**: `sizeof(tile.dtype) == sizeof(tensor.dtype)`
-2. **Layout compatibility**: The `tensor.Layout` (ND/DN/NZ) MUST be compatible with `tile.TileType` and `tile.SLayout`
-3. **Positive dimensions**: All shape dimensions MUST be > 0
-
-### Predicate Operand Constraints
-
-For a masked vector operation `opvec(result, src, mask)`:
-
-1. **Mask width**: The mask width MUST match the vector width of the operation
-2. **Mask production/consumption**: A predicate MUST be produced by a predicate-generation op before being consumed
-
-### Immediate/Scalar Constraints
-
-1. **Range**: Immediate values MUST be within the representable range of their declared type
-2. **Shift amounts**: Shift amounts MUST be non-negative and less than the element bit-width
-3. **Broadcast**: Scalar operands may be broadcast to match tile/vector shape; this is explicit in the operation (e.g., `tadds`)
-
-## Rule Example
-
-If an instruction accepts a tile plus a scalar mode attribute, legality still depends on both:
-
-- whether the tile tuple is legal
-- whether the attribute value is in the documented domain
-
-A legal tile does not make an illegal modifier acceptable, and a valid modifier does not repair an illegal tile tuple.
-
-## Contract Notes
-
-- Every required attribute MUST define an allowed value domain.
-- Invalid attribute values MUST produce deterministic diagnostics.
-- Operand roles and attribute meaning MUST stay aligned across intrinsics, PTO-AS, and per-op reference pages.
-- There is no implicit type promotion; a type mismatch between operands is always illegal unless an explicit conversion operation (`TCVT`, `vcvt`) is present.
-- Broadcasting is explicit: `tadds` broadcasts the scalar operand to match the tile shape; `tadd` does not broadcast.
-
-## See Also
-
-- [Assembly Model](./assembly-model.md)
-- [Type System](../state-and-types/type-system.md)
-- [Tiles And Valid Regions](../programming-model/tiles-and-valid-regions.md)
-- [GlobalTensor And Data Movement](../programming-model/globaltensor-and-data-movement.md)
-- [Instruction Contract Template](../reference/format-of-instruction-descriptions.md)
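
The operand constraint rules in the page removed above can be sketched in plain C++ with no PTO headers. `TileDesc`, `checkBinaryTileOp`, and `isLegalShiftAmount` are hypothetical names used only for illustration; they are not PTO intrinsics. The sketch covers only the dtype, shape, and shift-amount rules, and assumes a fixed message string is an acceptable stand-in for a deterministic diagnostic.

```cpp
#include <cstdint>
#include <string>

// Hypothetical tile descriptor for illustration; not a PTO type.
struct TileDesc {
    std::string dtype;  // element type name, e.g. "f32"
    int rows;           // physical row count
    int cols;           // physical column count
};

// True when both descriptors have the same physical shape.
bool sameShape(const TileDesc& a, const TileDesc& b) {
    return a.rows == b.rows && a.cols == b.cols;
}

// Returns an empty string when the binary tile op is legal under the
// dtype and shape rules, otherwise a fixed diagnostic message.
std::string checkBinaryTileOp(const TileDesc& dst, const TileDesc& src0, const TileDesc& src1) {
    if (dst.dtype != src0.dtype || src0.dtype != src1.dtype) {
        return "dtype mismatch: no implicit type promotion";
    }
    if (!sameShape(dst, src0) || !sameShape(src0, src1)) {
        return "shape mismatch: no implicit broadcasting";
    }
    return "";
}

// Shift amounts must be non-negative and less than the element bit-width.
bool isLegalShiftAmount(std::int64_t amount, int elementBits) {
    return amount >= 0 && amount < elementBits;
}
```

Returning the same message for the same violation is one simple way to keep the diagnostic deterministic; a real verifier would also check layout and location intent, which this sketch leaves out.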
diff --git a/docs/mkdocs/src/docs/isa/syntax-and-operands/operands-and-attributes_zh.md b/docs/mkdocs/src/docs/isa/syntax-and-operands/operands-and-attributes_zh.md
deleted file mode 100644
index 721c5d56..00000000
--- a/docs/mkdocs/src/docs/isa/syntax-and-operands/operands-and-attributes_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Operands And Attributes
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](operands-and-attributes.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文章节手册 PTO 汇编](../../../manual/06-assembly_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/tile/README.md b/docs/mkdocs/src/docs/isa/tile/README.md
deleted file mode 100644
index 5375ca2a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/README.md
+++ /dev/null
@@ -1,51 +0,0 @@
-<!-- Generated from `docs/isa/tile/README.md` -->
-
-# Tile ISA Reference
-
-This section documents the `pto.t*` tile instruction surface of PTO ISA. Pages are organized by family, with standalone per-op pages under `tile/ops/`.
-
-## Families
-
-| Family | Description | Operations |
-|--------|-------------|------------|
-| [Sync and Config](./sync-and-config.md) | Resource binding, event setup, mode control | 9 |
-| [Elementwise Tile-Tile](./elementwise-tile-tile.md) | Lane-wise binary and unary operations | 28 |
-| [Tile-Scalar and Immediate](./tile-scalar-and-immediate.md) | Tile combined with scalar operand | 20 |
-| [Reduce and Expand](./reduce-and-expand.md) | Row/column reductions and expansions | 28 |
-| [Memory and Data Movement](./memory-and-data-movement.md) | GM↔tile transfer, gather/scatter | 6 |
-| [Matrix and Matrix-Vector](./matrix-and-matrix-vector.md) | GEMV, matmul, and variants | 8 |
-| [Layout and Rearrangement](./layout-and-rearrangement.md) | Reshape, transpose, extract, insert | 13 |
-| [Irregular and Complex](./irregular-and-complex.md) | Sort, quantize, histogram, print | 14 |
-
-## Quick Reference
-
-### Common Tile Types
-
-| Type | Location | Typical Use |
-|------|----------|-------------|
-| `TileType::Vec` | UB | General elementwise operations |
-| `TileType::Mat` | L1 | Matrix multiply operations |
-| `TileType::Left` | L0A | Matrix multiply A operand |
-| `TileType::Right` | L0B | Matrix multiply B operand |
-| `TileType::Acc` | L0C | Matrix multiply accumulator |
-
-### Memory Capacities (A5)
-
-| Tile Type | Memory | Capacity | Alignment |
-|-----------|--------|----------|----------|
-| `Vec` | UB | 256 KB | 32 B |
-| `Mat` | L1 | 512 KB | 32 B |
-| `Left` | L0A | 64 KB | 32 B |
-| `Right` | L0B | 64 KB | 32 B |
-| `Acc` | L0C | 256 KB | 32 B |
-| `Bias` | Bias | 4 KB | 32 B |
-
-## Navigation
-
-The left sidebar provides standalone per-op pages for all tile surface instructions. Use the family overviews above to understand shared constraints and mechanisms before reading individual opcode pages.
-
-## See Also
-
-- [Tile instruction surface](../instruction-surfaces/tile-instructions.md)
-- [Tile families](../instruction-families/tile-families.md)
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/README_zh.md b/docs/mkdocs/src/docs/isa/tile/README_zh.md
deleted file mode 100644
index e194a1a5..00000000
--- a/docs/mkdocs/src/docs/isa/tile/README_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Tile ISA Reference
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](README.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [现有中文指令说明](../README_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/tile/elementwise-tile-tile.md b/docs/mkdocs/src/docs/isa/tile/elementwise-tile-tile.md
deleted file mode 100644
index 126bc988..00000000
--- a/docs/mkdocs/src/docs/isa/tile/elementwise-tile-tile.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/tile/elementwise-tile-tile.md` -->
-
-# Elementwise Tile-Tile Family
-
-Elementwise tile-tile operations perform lane-wise binary and unary operations over tile valid regions. These are the most commonly used tile compute operations in PTO programs.
-
-## Operations
-
-| Operation | Description | Category | C++ Intrinsic |
-|-----------|-------------|----------|----------------|
-| [pto.tadd](./ops/elementwise-tile-tile/tadd.md) | Elementwise addition | Binary | `TADD(dst, src0, src1)` |
-| [pto.tabs](./ops/elementwise-tile-tile/tabs.md) | Elementwise absolute value | Unary | `TABS(dst, src)` |
-| [pto.tand](./ops/elementwise-tile-tile/tand.md) | Elementwise bitwise AND | Binary | `TAND(dst, src0, src1)` |
-| [pto.tor](./ops/elementwise-tile-tile/tor.md) | Elementwise bitwise OR | Binary | `TOR(dst, src0, src1)` |
-| [pto.tsub](./ops/elementwise-tile-tile/tsub.md) | Elementwise subtraction | Binary | `TSUB(dst, src0, src1)` |
-| [pto.tmul](./ops/elementwise-tile-tile/tmul.md) | Elementwise multiplication | Binary | `TMUL(dst, src0, src1)` |
-| [pto.tmin](./ops/elementwise-tile-tile/tmin.md) | Elementwise minimum | Binary | `TMIN(dst, src0, src1)` |
-| [pto.tmax](./ops/elementwise-tile-tile/tmax.md) | Elementwise maximum | Binary | `TMAX(dst, src0, src1)` |
-| [pto.tcmp](./ops/elementwise-tile-tile/tcmp.md) | Elementwise comparison | Binary | `TCMP(dst, src0, src1, cmp)` |
-| [pto.tdiv](./ops/elementwise-tile-tile/tdiv.md) | Elementwise division | Binary | `TDIV(dst, src0, src1)` |
-| [pto.tshl](./ops/elementwise-tile-tile/tshl.md) | Elementwise shift left | Binary | `TSHL(dst, src0, src1)` |
-| [pto.tshr](./ops/elementwise-tile-tile/tshr.md) | Elementwise shift right | Binary | `TSHR(dst, src0, src1)` |
-| [pto.txor](./ops/elementwise-tile-tile/txor.md) | Elementwise bitwise XOR | Binary | `TXOR(dst, src0, src1)` |
-| [pto.tlog](./ops/elementwise-tile-tile/tlog.md) | Elementwise natural logarithm | Unary | `TLOG(dst, src)` |
-| [pto.trecip](./ops/elementwise-tile-tile/trecip.md) | Elementwise reciprocal | Unary | `TRECIP(dst, src)` |
-| [pto.tprelu](./ops/elementwise-tile-tile/tprelu.md) | Elementwise parameterized ReLU | Binary | `TPRELU(dst, src0, src1)` |
-| [pto.taddc](./ops/elementwise-tile-tile/taddc.md) | Saturating elementwise addition | Binary | `TADDC(dst, src0, src1)` |
-| [pto.tsubc](./ops/elementwise-tile-tile/tsubc.md) | Saturating elementwise subtraction | Binary | `TSUBC(dst, src0, src1)` |
-| [pto.tcvt](./ops/elementwise-tile-tile/tcvt.md) | Elementwise type conversion | Unary | `TCVT(dst, src)` |
-| [pto.tsel](./ops/elementwise-tile-tile/tsel.md) | Elementwise conditional selection | Ternary | `TSEL(dst, src0, src1, cmp)` |
-| [pto.trsqrt](./ops/elementwise-tile-tile/trsqrt.md) | Elementwise reciprocal square root | Unary | `TRSQRT(dst, src)` |
-| [pto.tsqrt](./ops/elementwise-tile-tile/tsqrt.md) | Elementwise square root | Unary | `TSQRT(dst, src)` |
-| [pto.texp](./ops/elementwise-tile-tile/texp.md) | Elementwise exponential | Unary | `TEXP(dst, src)` |
-| [pto.tnot](./ops/elementwise-tile-tile/tnot.md) | Elementwise bitwise NOT | Unary | `TNOT(dst, src)` |
-| [pto.trelu](./ops/elementwise-tile-tile/trelu.md) | Elementwise ReLU | Unary | `TRELU(dst, src)` |
-| [pto.tneg](./ops/elementwise-tile-tile/tneg.md) | Elementwise negation | Unary | `TNEG(dst, src)` |
-| [pto.trem](./ops/elementwise-tile-tile/trem.md) | Elementwise remainder | Binary | `TREM(dst, src0, src1)` |
-| [pto.tfmod](./ops/elementwise-tile-tile/tfmod.md) | Elementwise floating-point modulo | Binary | `TFMOD(dst, src0, src1)` |
-
-## Mechanism
-
-Binary operations combine two source tiles lane-by-lane. Unary operations transform one source tile lane-by-lane. The iteration domain is the destination tile's valid region.
-
-For each lane `(r, c)` in the destination's valid region:
-
-$$ \mathrm{dst}_{r,c} = f(\mathrm{src0}_{r,c}, \mathrm{src1}_{r,c}) $$
-
-For ternary selection (`TSEL`):
-
-$$ \mathrm{dst}_{r,c} = (\mathrm{cmp}_{r,c} \neq 0) \; ?\; \mathrm{src0}_{r,c} \;:\; \mathrm{src1}_{r,c} $$
-
-## Valid Region Compatibility
-
-All elementwise tile-tile operations iterate over the **destination tile's valid region**. For each lane `(r, c)` in the destination's valid region:
-
-- The corresponding lane `(r, c)` from each source tile is read, **regardless of whether that lane is within the source tile's own valid region**
-- Source tiles whose valid region does not cover `(r, c)` read **implementation-defined values**
-- Programs MUST NOT rely on any particular value being read from an out-of-region source lane unless the operation explicitly documents the behavior
-
-## Saturating Variants
-
-Operations with the `C` suffix (`TADDC`, `TSUBC`) perform saturating arithmetic instead of wrapping arithmetic:
-
-| Base Op | Saturating Op | Overflow/Underflow Behavior |
-|---------|--------------|--------------------------|
-| `TADD` | `TADDC` | Clamp to type min/max |
-| `TSUB` | `TSUBC` | Clamp to type min/max |
-
-Programs MUST NOT assume that `TADDC` and `TADD` produce identical results when overflow does not occur; they MAY differ even for in-range values due to implementation precision choices.
-
-## Type Support by Target Profile
-
-| Element Type | CPU Simulator | A2/A3 | A5 |
-|------------|:-------------:|:------:|:--:|
-| f32 (float) | Yes | Yes | Yes |
-| f16 (half) | Yes | Yes | Yes |
-| bf16 (bfloat16_t) | Yes | Yes | Yes |
-| i8 / u8 | Yes | Yes | Yes |
-| i16 / u16 | Yes | Yes | Yes |
-| i32 / u32 | Yes | Yes | Yes |
-| i64 / u64 | Yes | Yes | Yes |
-| f8e4m3 / f8e5m2 | No | No | Yes |
-
-## Constraints
-
-- Tile layout, shape, and valid-region state affect legality.
-- Type support varies by target profile (see per-op pages for exact restrictions).
-- Comparison operations (`TCMP`) produce a **predicate tile**; arithmetic operations produce a **numeric tile**.
-- Conversion operations (`TCVT`) may change element type between source and destination; dtype sizes may differ.
-- All source and destination tiles MUST have the same physical shape `(Rows, Cols)`.
-- Shift operations (`TSHL`, `TSHR`) interpret the second operand as an unsigned shift count; shift count MUST be `<` element bit-width.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** assume implicit broadcasting, reshaping, or valid-region repair.
-- **MUST NOT** rely on a defined value from a source tile lane outside its valid region.
-- **MUST NOT** assume `TADDC`/`TSUBC` are bit-identical to `TADD`/`TSUB` for all inputs.
-- **MUST NOT** use a shift count `>=` element bit-width.
-
-## C++ Intrinsic
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Binary elementwise
-template <typename TileDst, typename TileSrc0, typename TileSrc1>
-PTO_INST RecordEvent TADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1>
-PTO_INST RecordEvent TMUL(TileDst& dst, TileSrc0& src0, TileSrc1& src1);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1>
-PTO_INST RecordEvent TADDC(TileDst& dst, TileSrc0& src0, TileSrc1& src1);
-
-// Unary elementwise
-template <typename TileDst, typename TileSrc>
-PTO_INST RecordEvent TABS(TileDst& dst, TileSrc& src);
-
-template <typename TileDst, typename TileSrc>
-PTO_INST RecordEvent TEXP(TileDst& dst, TileSrc& src);
-
-// Type conversion
-template <typename TileDst, typename TileSrc>
-PTO_INST RecordEvent TCVT(TileDst& dst, TileSrc& src);
-
-// Comparison (produces predicate tile)
-template <typename TileDst, typename TileSrc0, typename TileSrc1>
-PTO_INST RecordEvent TCMP(TileDst& dst, TileSrc0& src0, TileSrc1& src1, CompareMode cmp);
-```
-
-## See Also
-
-- [Tile families](../instruction-families/tile-families.md) — Family overview
-- [Tile instruction surface](../instruction-surfaces/tile-instructions.md) — Surface description
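
The valid-region iteration and saturating behavior described in the page removed above can be sketched in plain C++. `Int8Tile`, `saturatingAdd`, and `saturatingAddTile` are hypothetical names, not PTO types or intrinsics. The sketch assumes a row-major `int8` tile, identical physical shapes for all three tiles, and that destination lanes outside the valid region are left untouched. It models only the clamp; it does not model the precision differences the page allows between `TADD` and `TADDC`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical row-major int8 tile for illustration; not a PTO type.
struct Int8Tile {
    int rows;       // physical rows
    int cols;       // physical columns
    int validRows;  // valid-region rows
    int validCols;  // valid-region columns
    std::vector<std::int8_t> data;  // rows * cols elements
};

// Widen to int, add, then clamp the sum to the int8 range.
std::int8_t saturatingAdd(std::int8_t a, std::int8_t b) {
    const int lo = std::numeric_limits<std::int8_t>::min();
    const int hi = std::numeric_limits<std::int8_t>::max();
    const int wide = static_cast<int>(a) + static_cast<int>(b);
    return static_cast<std::int8_t>(std::max(lo, std::min(hi, wide)));
}

// Iterates over the destination's valid region only.
// Assumption: dst, src0, and src1 share the same physical shape.
void saturatingAddTile(Int8Tile& dst, const Int8Tile& src0, const Int8Tile& src1) {
    for (int r = 0; r < dst.validRows; ++r) {
        for (int c = 0; c < dst.validCols; ++c) {
            const std::size_t i = static_cast<std::size_t>(r) * dst.cols + c;
            dst.data[i] = saturatingAdd(src0.data[i], src1.data[i]);
        }
    }
}
```

The loop bounds come from the destination tile, which mirrors the rule that the iteration domain is the destination's valid region rather than either source's.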
diff --git a/docs/mkdocs/src/docs/isa/tile/elementwise-tile-tile_zh.md b/docs/mkdocs/src/docs/isa/tile/elementwise-tile-tile_zh.md
deleted file mode 100644
index df06c70d..00000000
--- a/docs/mkdocs/src/docs/isa/tile/elementwise-tile-tile_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Elementwise Tile-Tile Family
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](elementwise-tile-tile.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/tile/irregular-and-complex.md b/docs/mkdocs/src/docs/isa/tile/irregular-and-complex.md
deleted file mode 100644
index 764fbb1f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/irregular-and-complex.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/irregular-and-complex.md` -->
-
-# Irregular And Complex Family
-
-Irregular operations cover tile compute that does not fit the standard elementwise, reduce, or memory models. These include debugging, sorting, quantization, index-based data movement, triangular matrix operations, and partial reductions.
-
-## Operations
-
-| Operation | Description | Category | Target Profile |
-|-----------|-------------|----------|:-------------:|
-| [pto.tprint](./ops/irregular-and-complex/tprint.md) | Print tile data for debugging | Debug | All |
-| [pto.tmrgsort](./ops/irregular-and-complex/tmrgsort.md) | Merging sort of tile rows | Sort | All |
-| [pto.tsort32](./ops/irregular-and-complex/tsort32.md) | Sort 32-bit values | Sort | All |
-| [pto.tgather](./ops/irregular-and-complex/tgather.md) | Gather tile elements by index | Gather | All |
-| [pto.tgatherb](./ops/irregular-and-complex/tgatherb.md) | Batch gather | Gather | All |
-| [pto.tscatter](./ops/irregular-and-complex/tscatter.md) | Scatter tile elements by index | Scatter | All |
-| [pto.tci](./ops/irregular-and-complex/tci.md) | Complex index operation | Index | All |
-| [pto.ttri](./ops/irregular-and-complex/ttri.md) | Triangular matrix extraction/operation | Matrix | All |
-| [pto.tpartadd](./ops/irregular-and-complex/tpartadd.md) | Partial addition | Reduce | All |
-| [pto.tpartmul](./ops/irregular-and-complex/tpartmul.md) | Partial multiplication | Reduce | All |
-| [pto.tpartmax](./ops/irregular-and-complex/tpartmax.md) | Partial maximum | Reduce | All |
-| [pto.tpartmin](./ops/irregular-and-complex/tpartmin.md) | Partial minimum | Reduce | All |
-| [pto.tquant](./ops/irregular-and-complex/tquant.md) | Quantize tile to integer format | Quantize | A2/A3, A5 |
-| [pto.tdequant](./ops/irregular-and-complex/tdequant.md) | Dequantize tile to floating-point | Quantize | A2/A3, A5 |
-| [pto.tpack](./ops/irregular-and-complex/tpack.md) | Pack tile data | Pack | A5 only |
-| [pto.trandom](./ops/irregular-and-complex/trandom.md) | Random number generation | Random | A5 only |
-| [pto.thistogram](./ops/irregular-and-complex/thistogram.md) | Histogram computation | Histogram | A5 only |
-
-## Mechanism
-
-### Sort (TMRGSORT, TSORT32)
-
-Sort elements within each row. The sort order (ascending/descending) is specified by an attribute or parameter. `TSORT32` sorts 32-bit values; `TMRGSORT` performs a merging sort across tile rows.
-
-### Gather/Scatter (TGATHER, TGATHERB, TSCATTER)
-
-Gather reads from non-contiguous GM locations based on an index tile. Scatter writes to non-contiguous GM locations. Unlike `MGATHER`/`MSCATTER` which operate on tile buffers, these operations work with tile registers directly in UB.
-
-$$ \mathrm{dst}_i = \mathrm{src}_{\mathrm{index}_i} \quad \text{(gather)} $$
-
-$$ \mathrm{dst}_{\mathrm{index}_i} = \mathrm{src}_i \quad \text{(scatter)} $$
-
-### Partial Reductions (TPARTADD, TPARTMUL, TPARTMAX, TPARTMIN)
-
-Partial reductions compute intermediate results that are later combined across tiles. Unlike full row/column reductions, partial reductions produce tiles with reduced but non-singular extent — they divide the reduction axis into segments.
-
-### Quantization (TQUANT, TDEQUANT)
-
-Convert between floating-point and quantized integer representations. Quantized formats include INT8, UINT8, INT4, UINT4, FP4, NF4. Requires scale and zero-point tensors. These operations are **not available** on the CPU simulator.
-
-### A5-Only Operations
-
-| Operation | Description |
-|-----------|-------------|
-| `TPACK` | Pack tile data into a compact format |
-| `TRANDOM` | Generate random numbers into tile |
-| `THISTOGRAM` | Compute histogram of tile elements |
-
-## Type Support by Target Profile
-
-| Element Type | CPU Simulator | A2/A3 | A5 |
-|------------|:-------------:|:------:|:--:|
-| f32 (float) | Yes | Yes | Yes |
-| f16 (half) | Yes | Yes | Yes |
-| bf16 (bfloat16_t) | Yes | Yes | Yes |
-| i8 / u8 | Yes | Yes | Yes |
-| i16 / u16 | Yes | Yes | Yes |
-| i32 / u32 | Yes | Yes | Yes |
-| i64 / u64 | Yes | Yes | Yes |
-| Quantized formats (INT4/FP4/NF4) | No | Yes | Yes |
-
-## Constraints
-
-- Sort operations require compatible element types (bit-width appropriate for the sort variant).
-- Quantization requires valid scale (non-zero) and zero-point values within representable range.
-- Scatter requires a valid index tile with non-negative indices within the destination bounds.
-- Partial reductions may have different behavior across profiles.
-- A5-only operations MUST NOT be used on CPU simulator, A2, or A3.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** use quantization with invalid scale (zero or NaN) or out-of-range zero-point.
-- **MUST NOT** scatter to indices outside the destination tile's declared shape bounds.
-- **MUST NOT** use A5-only operations (`TPACK`, `TRANDOM`, `THISTOGRAM`) on CPU simulator, A2, or A3.
-- **MUST NOT** use sort operations with element types incompatible with the sort variant (e.g., `TSORT32` on i8).
-
-## Performance Notes
-
-Irregular operations may have different performance characteristics compared to regular elementwise operations. Some backends may fall back to a sequence of simpler operations. Quantization operations on CPU simulator are emulated and may be significantly slower than hardware paths.
-
-## C++ Intrinsic
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Sort (sorting order attribute: Ascending/Descending)
-template <typename TileT>
-PTO_INST RecordEvent TMRGSORT(TileT& dst, SortOrder order = SortOrder::Ascending);
-
-template <typename TileT>
-PTO_INST RecordEvent TSORT32(TileT& dst, SortOrder order = SortOrder::Ascending);
-
-// Gather/Scatter
-template <typename TileDst, typename TileIdx, typename TileSrc>
-PTO_INST RecordEvent TGATHER(TileDst& dst, TileIdx& indices, TileSrc& src);
-
-template <typename TileDst, typename TileIdx, typename TileSrc>
-PTO_INST RecordEvent TSCATTER(TileDst& dst, TileIdx& indices, TileSrc& src);
-
-// Quantization
-template <typename TileDst, typename TileSrc, typename TileScale, typename TileZp>
-PTO_INST RecordEvent TQUANT(TileDst& dst, TileSrc& src, TileScale& scale, TileZp& zp);
-
-template <typename TileDst, typename TileSrc, typename TileScale, typename TileZp>
-PTO_INST RecordEvent TDEQUANT(TileDst& dst, TileSrc& src, TileScale& scale, TileZp& zp);
-```
-
-## See Also
-
-- [Tile families](../instruction-families/tile-families.md) — Family overview
-- [Tile instruction surface](../instruction-surfaces/tile-instructions.md) — Surface description
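
The gather and scatter formulas in the page removed above can be sketched over flat arrays in plain C++. `gather` and `scatter` here are hypothetical helper functions, not the `TGATHER`/`TSCATTER` intrinsics. The sketch assumes one-dimensional data, throws on out-of-range indices to reflect the bounds rule, and lets the last write win when scatter indices repeat, which is an assumption rather than documented PTO behavior.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Gather: dst[i] = src[index[i]].
std::vector<float> gather(const std::vector<float>& src, const std::vector<int>& index) {
    std::vector<float> dst;
    dst.reserve(index.size());
    for (int idx : index) {
        if (idx < 0 || static_cast<std::size_t>(idx) >= src.size()) {
            throw std::out_of_range("gather index out of bounds");
        }
        dst.push_back(src[static_cast<std::size_t>(idx)]);
    }
    return dst;
}

// Scatter: dst[index[i]] = src[i].
// All indices are validated first, so dst is unchanged if any index is illegal.
void scatter(std::vector<float>& dst, const std::vector<float>& src, const std::vector<int>& index) {
    if (src.size() != index.size()) {
        throw std::invalid_argument("src and index must have the same length");
    }
    for (int idx : index) {
        if (idx < 0 || static_cast<std::size_t>(idx) >= dst.size()) {
            throw std::out_of_range("scatter index out of bounds");
        }
    }
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[static_cast<std::size_t>(index[i])] = src[i];
    }
}
```

Validating every index before the first write keeps the destination intact on failure, which makes the error case easier to reason about than a partially applied scatter.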
diff --git a/docs/mkdocs/src/docs/isa/tile/irregular-and-complex_zh.md b/docs/mkdocs/src/docs/isa/tile/irregular-and-complex_zh.md
deleted file mode 100644
index 1fb261c3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/irregular-and-complex_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Irregular And Complex Family
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](irregular-and-complex.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/tile/layout-and-rearrangement.md b/docs/mkdocs/src/docs/isa/tile/layout-and-rearrangement.md
deleted file mode 100644
index 3851ee98..00000000
--- a/docs/mkdocs/src/docs/isa/tile/layout-and-rearrangement.md
+++ /dev/null
@@ -1,125 +0,0 @@
-<!-- Generated from `docs/isa/tile/layout-and-rearrangement.md` -->
-
-# Layout And Rearrangement Family
-
-Layout operations change how tile data is organized within the unified buffer. These are **pure data-movement operations** that do not modify element values.
-
-## Operations
-
-| Operation | Description | Category | C++ Intrinsic |
-|-----------|-------------|----------|---------------|
-| [pto.tmov](./ops/layout-and-rearrangement/tmov.md) | Move/copy tile data | Copy | `TMOV(dst, src)` |
-| [pto.tmov_fp](./ops/layout-and-rearrangement/tmov-fp.md) | Move/copy with fill/pad | Copy | `TMOV_FP(dst, src, fp)` |
-| [pto.treshape](./ops/layout-and-rearrangement/treshape.md) | Change tile shape | Transform | `TRESHAPE(dst, src, newShape)` |
-| [pto.ttrans](./ops/layout-and-rearrangement/ttrans.md) | Transpose tile dimensions | Transform | `TTRANS(dst, src)` |
-| [pto.textract](./ops/layout-and-rearrangement/textract.md) | Extract a subtile | Extract | `TEXTRACT(dst, src, offset)` |
-| [pto.textract_fp](./ops/layout-and-rearrangement/textract-fp.md) | Extract with fill/pad | Extract | `TEXTRACT_FP(dst, src, offset, fp)` |
-| [pto.tinsert](./ops/layout-and-rearrangement/tinsert.md) | Insert a subtile into a tile | Insert | `TINSERT(dst, src, offset)` |
-| [pto.tinsert_fp](./ops/layout-and-rearrangement/tinsert-fp.md) | Insert with fill/pad | Insert | `TINSERT_FP(dst, src, offset, fp)` |
-| [pto.tfillpad](./ops/layout-and-rearrangement/tfillpad.md) | Fill tile padding region | Fill | `TFILLPAD(dst, fp)` |
-| [pto.tfillpad_inplace](./ops/layout-and-rearrangement/tfillpad-inplace.md) | Fill padding in place | Fill | `TFILLPAD_INPLACE(dst, fp)` |
-| [pto.tfillpad_expand](./ops/layout-and-rearrangement/tfillpad-expand.md) | Fill padding and expand | Fill | `TFILLPAD_EXPAND(dst, fp)` |
-| [pto.timg2col](./ops/layout-and-rearrangement/timg2col.md) | Image to column transformation | Transform | `TIMG2COL(dst, src, cfg)` |
-
-## Mechanism
-
-### Copy (TMOV, TMOV_FP)
-
-Copy all elements from source tile to destination tile. The FP variant additionally fills padding regions with a specified fill value.
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} $$
-
-### Transform (TRESHAPE, TTRANS, TIMG2COL)
-
-Change the declared shape or layout without changing which logical elements are read/written:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{index}(i,j)} $$
-
-- `TTRANS`: swaps row and column indices: `dst[i,j] = src[j,i]`
-- `TRESHAPE`: reinterprets the flat element sequence with a new `(Rows, Cols)` shape
-- `TIMG2COL`: rearranges image patches into column format for convolution lowering
-
-### Extract/Insert (TEXTRACT, TINSERT, TEXTRACT_FP, TINSERT_FP)
-
-Extract a sub-tile from a tile, or insert a sub-tile into a tile at a specified position `(row_offset, col_offset)`. FP variants fill padding regions with a fill value.
-
-```
-TEXTRACT: dst = src[row_offset : row_offset + dst.Rv, col_offset : col_offset + dst.Cv]
-TINSERT:  dst[row_offset : row_offset + src.Rv, col_offset : col_offset + src.Cv] = src
-```
-
-### Fill (TFILLPAD, TFILLPAD_INPLACE, TFILLPAD_EXPAND)
-
-Fill the padding region (declared tile area outside the valid region) with a specified fill value. The INPLACE variant modifies the source tile directly. The EXPAND variant additionally expands the valid region.
-
-## Type Support by Target Profile
-
-| Element Type | CPU Simulator | A2/A3 | A5 |
-|------------|:-------------:|:------:|:--:|
-| f32 (float) | Yes | Yes | Yes |
-| f16 (half) | Yes | Yes | Yes |
-| bf16 (bfloat16_t) | Yes | Yes | Yes |
-| i8 / u8 | Yes | Yes | Yes |
-| i16 / u16 | Yes | Yes | Yes |
-| i32 / u32 | Yes | Yes | Yes |
-| i64 / u64 | Yes | Yes | Yes |
-| f8e4m3 / f8e5m2 | No | No | Yes |
-
-## Constraints
-
-- `TRESHAPE` requires the total element count to remain unchanged: `src.Rv × src.Cv == dst.Rv × dst.Cv`.
-- `TTRANS` requires either a square shape (`Rv == Cv`) or a destination whose declared shape is the transpose of the source's.
-- `TEXTRACT` requires the sub-tile shape to divide evenly into the source tile declared shape.
-- `TINSERT` requires the inserted tile to fit within the destination's declared shape.
-- FP variants (`*_fp`) require a valid fill value (`fp`) compatible with the tile element type.
-- `TIMG2COL` requires specific kernel/padding/stride configuration; profile-dependent behavior.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** `TRESHAPE` to a shape with a different total element count.
-- **MUST NOT** `TEXTRACT` with offsets outside the source tile's declared shape.
-- **MUST NOT** `TINSERT` such that the inserted tile extends beyond the destination's declared shape.
-- **MUST NOT** use FP8 types with `TIMG2COL` on CPU simulator or A2/A3.
-
-## C++ Intrinsic
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Basic move/copy
-template <typename TileDst, typename TileSrc>
-PTO_INST RecordEvent TMOV(TileDst& dst, TileSrc& src);
-
-// Move with fill/pad
-template <typename TileDst, typename TileSrc, typename FillT>
-PTO_INST RecordEvent TMOV_FP(TileDst& dst, TileSrc& src, FillT fp);
-
-// Reshape tile shape
-template <typename TileDst, typename TileSrc>
-PTO_INST RecordEvent TRESHAPE(TileDst& dst, TileSrc& src, ShapeRef newShape);
-
-// Transpose
-template <typename TileDst, typename TileSrc>
-PTO_INST RecordEvent TTRANS(TileDst& dst, TileSrc& src);
-
-// Extract/Insert at offset
-template <typename TileDst, typename TileSrc>
-PTO_INST RecordEvent TEXTRACT(TileDst& dst, TileSrc& src, int rowOffset, int colOffset);
-
-template <typename TileDst, typename TileSrc>
-PTO_INST RecordEvent TINSERT(TileDst& dst, TileSrc& src, int rowOffset, int colOffset);
-
-// Fill padding region
-template <typename TileT, typename FillT>
-PTO_INST RecordEvent TFILLPAD(TileT& dst, FillT fp);
-
-// Image to column (convolution lowering)
-template <typename TileDst, typename TileSrc, typename Cfg>
-PTO_INST RecordEvent TIMG2COL(TileDst& dst, TileSrc& src, Cfg cfg);
-```
-
-## See Also
-
-- [Tile families](../instruction-families/tile-families.md) — Family overview
-- [Tile instruction surface](../instruction-surfaces/tile-instructions.md) — Surface description
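
The transform and extract rules in the page removed above can be sketched on a row-major grid in plain C++. `Grid`, `transpose`, `reshape`, and `extract` are hypothetical names, not the `TTRANS`/`TRESHAPE`/`TEXTRACT` intrinsics. The sketch ignores padding and valid regions, and `extract` takes the sub-tile extent as explicit arguments instead of reading it from the destination tile.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical row-major grid for illustration; not a PTO tile type.
struct Grid {
    int rows;
    int cols;
    std::vector<int> data;  // rows * cols elements
};

// Flat row-major offset of element (r, c) in a grid with `cols` columns.
std::size_t flatIndex(int r, int c, int cols) {
    return static_cast<std::size_t>(r) * static_cast<std::size_t>(cols) + static_cast<std::size_t>(c);
}

// dst[i, j] = src[j, i]; the result has the transposed shape.
Grid transpose(const Grid& src) {
    Grid dst{src.cols, src.rows, std::vector<int>(src.data.size())};
    for (int i = 0; i < dst.rows; ++i) {
        for (int j = 0; j < dst.cols; ++j) {
            dst.data[flatIndex(i, j, dst.cols)] = src.data[flatIndex(j, i, src.cols)];
        }
    }
    return dst;
}

// Reinterprets the flat element sequence; the element count must not change.
Grid reshape(const Grid& src, int rows, int cols) {
    if (rows <= 0 || cols <= 0 || rows * cols != src.rows * src.cols) {
        throw std::invalid_argument("reshape must preserve the element count");
    }
    return Grid{rows, cols, src.data};
}

// dst = src[rowOffset : rowOffset + rows, colOffset : colOffset + cols].
Grid extract(const Grid& src, int rowOffset, int colOffset, int rows, int cols) {
    if (rowOffset < 0 || colOffset < 0 || rows <= 0 || cols <= 0 ||
        rowOffset + rows > src.rows || colOffset + cols > src.cols) {
        throw std::out_of_range("sub-tile exceeds the source shape");
    }
    Grid dst{rows, cols, std::vector<int>(flatIndex(rows, 0, cols))};
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            dst.data[flatIndex(i, j, cols)] = src.data[flatIndex(rowOffset + i, colOffset + j, src.cols)];
        }
    }
    return dst;
}
```

Note that `reshape` copies the data unchanged and only the shape metadata differs, while `transpose` and `extract` compute a new source index for every destination element.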
diff --git a/docs/mkdocs/src/docs/isa/tile/layout-and-rearrangement_zh.md b/docs/mkdocs/src/docs/isa/tile/layout-and-rearrangement_zh.md
deleted file mode 100644
index f2adf960..00000000
--- a/docs/mkdocs/src/docs/isa/tile/layout-and-rearrangement_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Layout And Rearrangement Family
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](layout-and-rearrangement.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/tile/matrix-and-matrix-vector.md b/docs/mkdocs/src/docs/isa/tile/matrix-and-matrix-vector.md
deleted file mode 100644
index 25c08fda..00000000
--- a/docs/mkdocs/src/docs/isa/tile/matrix-and-matrix-vector.md
+++ /dev/null
@@ -1,118 +0,0 @@
-<!-- Generated from `docs/isa/tile/matrix-and-matrix-vector.md` -->
-
-# Matrix And Matrix-Vector Family
-
-Matrix and matrix-vector operations perform tiled linear algebra on `TileType::Mat` tiles (cube tiles). They use specialized matrix multiply units and may have dedicated DMA paths.
-
-## Operations
-
-| Operation | Description | Variants | TileType | C++ Intrinsic |
-|-----------|-------------|----------|----------|---------------|
-| [pto.tgemv](./ops/matrix-and-matrix-vector/tgemv.md) | General matrix-vector product | basic, acc, bias, mx | `Mat` | `TGEMV(C, A, x)` |
-| [pto.tgemv_acc](./ops/matrix-and-matrix-vector/tgemv-acc.md) | GEMV with accumulation | — | `Mat` | `TGEMV_ACC(C, A, x)` |
-| [pto.tgemv_bias](./ops/matrix-and-matrix-vector/tgemv-bias.md) | GEMV with bias addition | — | `Mat` | `TGEMV_BIAS(C, A, x, bias)` |
-| [pto.tgemv_mx](./ops/matrix-and-matrix-vector/tgemv-mx.md) | MX-format GEMV | — | `Mat` | `TGEMV_MX(C, A, x, scale)` |
-| [pto.tmatmul](./ops/matrix-and-matrix-vector/tmatmul.md) | General matrix-matrix multiply | basic, acc, bias, mx | `Mat` | `TMATMUL(C, A, B)` |
-| [pto.tmatmul_acc](./ops/matrix-and-matrix-vector/tmatmul-acc.md) | Matmul with accumulation | — | `Mat` | `TMATMUL_ACC(C, A, B)` |
-| [pto.tmatmul_bias](./ops/matrix-and-matrix-vector/tmatmul-bias.md) | Matmul with bias addition | — | `Mat` | `TMATMUL_BIAS(C, A, B, bias)` |
-| [pto.tmatmul_mx](./ops/matrix-and-matrix-vector/tmatmul-mx.md) | MX-format matrix multiply | — | `Mat` | `TMATMUL_MX(C, A, B, scale)` |
-
-## Mechanism
-
-### GEMV
-
-$$ \mathbf{y} = A \times \mathbf{x} + \mathbf{b} $$
-
-- Matrix tile `A`: shape `(M, K)`
-- Vector tile `x`: shape `(K)`
-- Bias tile `b` (optional): shape `(M)`
-- Result tile `y`: shape `(M)`
-
-### Matmul
-
-$$ C = A \times B + D $$
-
-- Tile `A`: shape `(M, K)`
-- Tile `B`: shape `(K, N)`
-- Bias tile `D` (optional): shape `(M, N)`
-- Result tile `C`: shape `(M, N)`
-
-### MX Format
-
-MX (microscaling) is a block-scaled data format: elements are stored at low precision and share per-block scale factors held in a separate scale tensor. The `*_mx` variants require:
-
-1. MX-formatted input tiles (`A`, `B`) with matching MX layout.
-2. A scale tensor (`scale`) with compatible shape.
-3. Accumulator tile (`C`) with shape matching the output shape.
-
-## Tile Type
-
-Matrix operations **require** `TileType::Mat` (cube tiles). `TileType::Vec` tiles MUST NOT be used with matrix operations. Cube tiles use a different physical layout optimized for matrix multiplication and have different valid-region semantics from vector tiles.
-
-## Type Support by Target Profile
-
-| Element Type | CPU Simulator | A2/A3 | A5 |
-|------------|:-------------:|:------:|:--:|
-| f32 (float) | Yes | Yes | Yes |
-| f16 (half) | Yes | Yes | Yes |
-| bf16 (bfloat16_t) | Yes | Yes | Yes |
-| i8 / u8 | No | Yes | Yes |
-| f8e4m3 / f8e5m2 | No | No | Yes |
-| MX format | No | No | Yes |
-
-## Constraints
-
-- **Tile type MUST be `TileType::Mat`** — `TileType::Vec` tiles MUST NOT be used.
-- Shape compatibility: `(M, K) × (K) → (M)` for GEMV; `(M, K) × (K, N) → (M, N)` for matmul.
-- MX format operations require matching MX layout between `A` and `B` tiles.
-- Bias variants require compatible bias tensor shape.
-- Accumulation variants require matching accumulator tile shape.
-- On A2/A3, i8 matrix multiply requires `shape[0..1] % 16 == 0`.
-- On A5, FP8 matmul is supported but requires scale tensors.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** use `TileType::Vec` tiles with matrix operations.
-- **MUST NOT** use incompatible shape combinations (e.g., `(M, K) × (N, K) → …`).
-- **MUST NOT** mix MX and non-MX tiles in the same operation.
-- **MUST NOT** use FP8 matmul on CPU simulator or A2/A3.
-
-## C++ Intrinsic
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Basic matrix-vector
-template <typename TileC, typename TileA, typename TileX>
-PTO_INST RecordEvent TGEMV(TileC& C, TileA& A, TileX& x);
-
-// Matrix-vector with accumulation
-template <typename TileC, typename TileA, typename TileX>
-PTO_INST RecordEvent TGEMV_ACC(TileC& C, TileA& A, TileX& x);
-
-// Matrix-vector with bias
-template <typename TileC, typename TileA, typename TileX, typename TileBias>
-PTO_INST RecordEvent TGEMV_BIAS(TileC& C, TileA& A, TileX& x, TileBias& bias);
-
-// Basic matrix multiply
-template <typename TileC, typename TileA, typename TileB>
-PTO_INST RecordEvent TMATMUL(TileC& C, TileA& A, TileB& B);
-
-// Matrix multiply with accumulation
-template <typename TileC, typename TileA, typename TileB>
-PTO_INST RecordEvent TMATMUL_ACC(TileC& C, TileA& A, TileB& B);
-
-// Matrix multiply with bias
-template <typename TileC, typename TileA, typename TileB, typename TileBias>
-PTO_INST RecordEvent TMATMUL_BIAS(TileC& C, TileA& A, TileB& B, TileBias& bias);
-
-// MX-format matrix multiply
-template <typename TileC, typename TileA, typename TileB, typename TileScale>
-PTO_INST RecordEvent TMATMUL_MX(TileC& C, TileA& A, TileB& B, TileScale& scale);
-```
-
-## See Also
-
-- [Tile families](../instruction-families/tile-families.md) — Family overview
-- [Tile instruction surface](../instruction-surfaces/tile-instructions.md) — Surface description
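The shape rules in the removed Matrix And Matrix-Vector page (`(M, K) × (K, N) → (M, N)`, optional bias `D`) can be sketched in plain C++. This is a minimal reference sketch only: `ref_matmul` and `ref_gemv` are hypothetical helper names, they operate on row-major `std::vector<float>` buffers rather than PTO tiles, and they say nothing about how the hardware schedules the work.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference helper, not part of the PTO headers.
// Computes C = A x B + D on row-major buffers:
//   A is (M, K), B is (K, N), D is (M, N) or empty (no bias), C is (M, N).
std::vector<float> ref_matmul(const std::vector<float>& A,
                              const std::vector<float>& B,
                              const std::vector<float>& D,
                              std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // Start from the bias element when a bias is supplied.
            float acc = D.empty() ? 0.0f : D[i * N + j];
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
    return C;
}

// GEMV (y = A x + b) is the N == 1 case of the same loop nest.
std::vector<float> ref_gemv(const std::vector<float>& A,
                            const std::vector<float>& x,
                            const std::vector<float>& b,
                            std::size_t M, std::size_t K) {
    return ref_matmul(A, x, b, M, K, 1);
}
```

Treating GEMV as the `N == 1` case keeps a single loop nest to check against; a host-side reference like this is typically used only to validate results on small shapes.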
diff --git a/docs/mkdocs/src/docs/isa/tile/matrix-and-matrix-vector_zh.md b/docs/mkdocs/src/docs/isa/tile/matrix-and-matrix-vector_zh.md
deleted file mode 100644
index c5f012b6..00000000
--- a/docs/mkdocs/src/docs/isa/tile/matrix-and-matrix-vector_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Matrix And Matrix-Vector Family
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](matrix-and-matrix-vector.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/tile/memory-and-data-movement.md b/docs/mkdocs/src/docs/isa/tile/memory-and-data-movement.md
deleted file mode 100644
index d23b0ffc..00000000
--- a/docs/mkdocs/src/docs/isa/tile/memory-and-data-movement.md
+++ /dev/null
@@ -1,132 +0,0 @@
-<!-- Generated from `docs/isa/tile/memory-and-data-movement.md` -->
-
-# Memory And Data Movement Family
-
-Memory operations transfer data between global memory (GM) and tile buffers. These are the **only** tile operations that cross between tile-visible state and GM-visible state.
-
-## Operations
-
-| Operation | Description | Direction | C++ Intrinsic |
-|-----------|-------------|-----------|----------------|
-| [pto.tload](./ops/memory-and-data-movement/tload.md) | Load from GM into tile | GM → UB → Tile | `TLOAD(dst, gtensor)` |
-| [pto.tprefetch](./ops/memory-and-data-movement/tprefetch.md) | Prefetch from GM into tile (non-blocking) | GM → UB → Tile | `TPREFETCH(dst, gtensor)` |
-| [pto.tstore](./ops/memory-and-data-movement/tstore.md) | Store from tile to GM | Tile → UB → GM | `TSTORE(gtensor, src)` |
-| [pto.tstore_fp](./ops/memory-and-data-movement/tstore-fp.md) | Store with fill/pad | Tile → UB → GM | `TSTORE_FP(gtensor, src, fp)` |
-| [pto.mgather](./ops/memory-and-data-movement/mgather.md) | Gather scattered elements from GM | GM → UB → Tile | `MGATHER(dst, gtensor, indices)` |
-| [pto.mscatter](./ops/memory-and-data-movement/mscatter.md) | Scatter tile elements to GM | Tile → UB → GM | `MSCATTER(gtensor, indices, src)` |
-
-## Mechanism
-
-### Contiguous Transfer (TLOAD, TSTORE)
-
-Data is transferred in a rectangular region determined by the tile's valid region:
-
-```
-TLOAD:  dst[i,j] = src[ r0 + i, c0 + j ]   (i ∈ [0, dst.Rv), j ∈ [0, dst.Cv))
-TSTORE: dst[ r0 + i, c0 + j ] = src[i,j]
-```
-
-Transfer size: `dst.GetValidRow() × dst.GetValidCol()` elements.
-
-### Prefetch (TPREFETCH)
-
-`TPREFETCH` initiates a non-blocking DMA transfer from GM to the tile buffer. It does not stall the pipeline. A subsequent operation that reads the tile buffer must wait for the transfer to complete via `TSYNC` or `set_flag`/`wait_flag`.
-
-### Gather/Scatter (MGATHER, MSCATTER)
-
-An index tile specifies which GM elements to transfer:
-
-$$ \mathrm{dst}_i = \mathrm{src}_{\mathrm{index}_i} $$
-
-### Fill/Pad Variants (TSTORE_FP)
-
-In addition to transferring valid data, padding regions are filled with a specified fill value before storing.
-
-## Layout Compatibility
-
-| TileType | ND→ND | DN→DN | NZ→NZ | ND→NZ | DN→ZN | Notes |
-|----------|:-----:|:-----:|:-----:|:-----:|:-----:|-------|
-| `TileType::Vec` | Yes | Yes | Yes | No | No | |
-| `TileType::Mat` | Yes | Yes | Yes | Yes | Yes | |
-| `TileType::Acc` | Yes | No | Yes | No | No | Atomic store only |
-
-Additional constraints on A5:
-- `TileType::Vec` with `ND→NZ` or `DN→ZN`: requires `GlobalData::staticShape[0..2] == 1` and `TileData::SFractalSize == 512`.
-- `TileType::Vec` with `int64_t/uint64_t`: only `ND→ND` or `DN→DN` supported.
-
-## Type Support by Target Profile
-
-| Element Type | CPU Simulator | A2/A3 | A5 |
-|------------|:-------------:|:------:|:--:|
-| f32 (float) | Yes | Yes | Yes |
-| f16 (half) | Yes | Yes | Yes |
-| bf16 (bfloat16_t) | Yes | Yes | Yes |
-| i8 / u8 | Yes | Yes | Yes |
-| i16 / u16 | Yes | Yes | Yes |
-| i32 / u32 | Yes | Yes | Yes |
-| i64 / u64 | Yes | Yes | Yes |
-| f8e4m3 / f8e5m2 | No | No | Yes |
-| hifloat8_t / float4_e* | No | No | Yes |
-
-## Ordering
-
-Memory operations are subject to PTO's producer-consumer ordering rules. Programs MUST use explicit synchronization (`TSYNC`, `set_flag`/`wait_flag`) to ensure data is ready before use.
-
-See [Producer Consumer Ordering](../memory-model/producer-consumer-ordering.md) for the full ordering model.
-
-## Constraints
-
-- Source and destination element types MUST have the same size: `sizeof(tile.dtype) == sizeof(gtensor.dtype)`.
-- Transfer size is determined by the destination tile's valid region for `TLOAD`, or source tile's valid region for `TSTORE`.
-- Layout compatibility between GM layout and tile layout is profile-dependent (see layout compatibility table above).
-- Gather/scatter index tiles must have compatible shapes.
-- `TSTORE` with `TileType::Acc` supports `AtomicType`: `AtomicNone`, `AtomicAdd`, `AtomicMax`, `AtomicMin` (A5 only).
-- `TSTORE_FP` quantized-store is only legal for `TileType::Acc` on A2/A3 and A5.
-
-## Cases That Are Not Allowed
-
-- Transferring to or from an uninitialized tile register.
-- Using a GlobalTensor with strides incompatible with the transfer pattern.
-- Accessing GM addresses outside the tensor's declared shape.
-- Using `TSTORE_FP` with a non-Acc tile type.
-- Using atomic store variants on CPU simulator.
-
-## C++ Intrinsic
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Basic load
-template <typename TileData, typename GlobalData, typename... WaitEvents>
-PTO_INST RecordEvent TLOAD(TileData& dst, GlobalData& src, WaitEvents&... events);
-
-// Store; atomicType selects the atomic mode (default: AtomicNone)
-template <typename TileData, typename GlobalData,
-          AtomicType atomicType = AtomicType::AtomicNone, typename... WaitEvents>
-PTO_INST RecordEvent TSTORE(GlobalData& dst, TileData& src, WaitEvents&... events);
-
-// FP store (quantized, A2/A3+)
-template <typename TileData, typename GlobalData, typename FpTileData,
-          AtomicType atomicType = AtomicType::AtomicNone, typename... WaitEvents>
-PTO_INST RecordEvent TSTORE_FP(GlobalData& dst, TileData& src, FpTileData& fp,
-                               WaitEvents&... events);
-
-// Prefetch
-template <typename TileData, typename GlobalData>
-PTO_INST RecordEvent TPREFETCH(TileData& dst, GlobalData& src);
-
-// Gather/Scatter
-template <typename TileData, typename GlobalData, typename IndexData>
-PTO_INST RecordEvent MGATHER(TileData& dst, GlobalData& src, IndexData& indices);
-
-template <typename TileData, typename GlobalData, typename IndexData>
-PTO_INST RecordEvent MSCATTER(GlobalData& dst, IndexData& indices, TileData& src);
-```
-
-## See Also
-
-- [Memory model](../memory-model/consistency-baseline.md) — GM ordering and consistency
-- [Producer consumer ordering](../memory-model/producer-consumer-ordering.md) — Sync rules
-- [Tile families](../instruction-families/tile-families.md) — Family overview
-- [Tile instruction surface](../instruction-surfaces/tile-instructions.md) — Surface description
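The transfer rules in the removed Memory And Data Movement page (rectangular copy over the tile's valid region, and index-driven gather) can be sketched in plain C++. This is a minimal sketch under stated assumptions: `RefGlobal`, `RefTile`, `ref_tload`, `ref_tstore`, and `ref_gather` are hypothetical names, buffers are row-major, and synchronization, layouts, and DMA behavior are not modelled.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for a GM tensor and a tile buffer (both row-major).
struct RefGlobal {
    std::size_t rows, cols;
    std::vector<float> data;
};
struct RefTile {
    std::size_t rows, cols;          // allocated shape
    std::size_t validRow, validCol;  // valid region
    std::vector<float> data;
};

// TLOAD semantics: dst[i, j] = src[r0 + i, c0 + j] over dst's valid region.
void ref_tload(RefTile& dst, const RefGlobal& src, std::size_t r0, std::size_t c0) {
    for (std::size_t i = 0; i < dst.validRow; ++i)
        for (std::size_t j = 0; j < dst.validCol; ++j)
            dst.data[i * dst.cols + j] = src.data[(r0 + i) * src.cols + (c0 + j)];
}

// TSTORE semantics: dst[r0 + i, c0 + j] = src[i, j] over src's valid region.
void ref_tstore(RefGlobal& dst, const RefTile& src, std::size_t r0, std::size_t c0) {
    for (std::size_t i = 0; i < src.validRow; ++i)
        for (std::size_t j = 0; j < src.validCol; ++j)
            dst.data[(r0 + i) * dst.cols + (c0 + j)] = src.data[i * src.cols + j];
}

// Gather semantics: dst_i = src[index_i].
std::vector<float> ref_gather(const std::vector<float>& src,
                              const std::vector<std::size_t>& index) {
    std::vector<float> dst;
    dst.reserve(index.size());
    for (std::size_t idx : index) dst.push_back(src[idx]);
    return dst;
}
```

Elements of the tile outside its valid region are left untouched by `ref_tload` in this sketch; that is a choice made here for illustration, not something the page specifies.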
diff --git a/docs/mkdocs/src/docs/isa/tile/memory-and-data-movement_zh.md b/docs/mkdocs/src/docs/isa/tile/memory-and-data-movement_zh.md
deleted file mode 100644
index 94344911..00000000
--- a/docs/mkdocs/src/docs/isa/tile/memory-and-data-movement_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Memory And Data Movement Family
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](memory-and-data-movement.md)
-- [中文手册入口](../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tabs.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tabs.md
deleted file mode 100644
index 131078b0..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tabs.md
+++ /dev/null
@@ -1,163 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tabs.md` -->
-
-# pto.tabs
-
-Standalone reference page for `pto.tabs`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise absolute value of a tile.
-
-## Mechanism
-
-`pto.tabs` operates on tile payloads rather than scalar control state; its legality is constrained by tile shape, layout, valid region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \left|\mathrm{src}_{i,j}\right| $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tabs %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TABS(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (CPU sim)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float`.
-    - The implementation iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-- **Implementation checks (Costmodel)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `uint8_t`, `half`, `float`.
-
-- **Implementation checks (NPU)**:
-    - `TileData::DType` must be one of: `float` or `half`;
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TABS(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TABS(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tabs %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tadd](./tadd.md)
-- Next op in family: [pto.tand](./tand.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tabs_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tabs_zh.md
deleted file mode 100644
index 64bb23df..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tabs_zh.md
+++ /dev/null
@@ -1,108 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tabs_zh.md` -->
-
-# TABS
-
-## 指令示意图
-
-![TABS tile operation](../figures/isa/TABS.svg)
-
-## 简介
-
-Tile 的逐元素绝对值。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \left|\mathrm{src}_{i,j}\right| $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = tabs %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TABS(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (CPU sim)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float`.
-    - The implementation iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-- **实现检查 (Costmodel)**:
-    - `TileData::DType` must be one of: `int32_t`、`int16_t`、`int8_t`、`uint8_t`、`half`、`float`.
-- **实现检查 (NPU)**:
-    - `TileData::DType` must be one of: `float` or `half`;
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TABS(dst, src);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TABS(dst, src);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tadd.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tadd.md
deleted file mode 100644
index a1e29156..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tadd.md
+++ /dev/null
@@ -1,174 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tadd.md` -->
-
-# pto.tadd
-
-Standalone reference page for `pto.tadd`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Lane-wise addition of two source tiles into a destination tile. The iteration domain is the destination tile's valid region.
-
-## Mechanism
-
-For each element `(i, j)` in the destination tile's valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} $$
-
-Only the destination tile's valid region defines the iteration domain. Source tiles are read lane-by-lane at the same `(i, j)` coordinates; source tiles whose valid region does not cover `(i, j)` produce implementation-defined values at those lanes.
-
-## Syntax
-
-### Assembly Form (PTO-AS)
-
-```text
-%dst = tadd %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 — SSA Form
-
-PTO-AS at Level 1 uses SSA-style result binding:
-
-```mlir
-%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 — DPS Form
-
-PTO-AS at Level 2 uses destination-passing style (DPS), with explicit operand binding:
-
-```mlir
-pto.tadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>)
-          outs(%dst : !pto.tile_buf<...>)
-```
-
-The `ins(...)` clause names operands in the input position; the `outs(...)` clause names the output. The tile buffer type `!pto.tile_buf<...>` is the in-memory storage form used at Level 2.
-
-### Micro-Operation Mapping
-
-The `pto.tadd` SSA operation maps to the following micro-operation sequence on the Tile Register File (TRF):
-
-```
-TRF_READ(src0, i, j)  →  A
-TRF_READ(src1, i, j)  →  B
-A + B                  →  C
-TRF_WRITE(dst, i, j, C)
-```
-
-The micro-operation level is not exposed to the ISA author; it is the responsibility of the backend to schedule these steps subject to pipeline constraints.
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TADD(TileDataDst& dst, TileDataSrc0& src0, TileDataSrc1& src1, WaitEvents&... events);
-```
-
-## Inputs
-
-| Operand | Role | Description |
-|---------|------|-------------|
-| `%src0` | Left tile | First source tile; read at `(i, j)` for each `(i, j)` in `dst` valid region |
-| `%src1` | Right tile | Second source tile; read at `(i, j)` for each `(i, j)` in `dst` valid region |
-| `WaitEvents...` | Optional synchronisation | `RecordEvent` tokens to wait on before issuing the operation |
-
-Both source tiles and the destination tile share the same element type. Layout and shape constraints are stated under Constraints.
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.tile<...>` | Destination tile; all `(i, j)` in its valid region contain `src0[i,j] + src1[i,j]` after the operation |
-
-## Side Effects
-
-None beyond producing the destination tile. Does not implicitly fence unrelated tile traffic.
-
-## Constraints
-
-- **Type match**: All three tiles (`src0`, `src1`, `dst`) MUST have identical element types.
-- **Layout**: Both source tiles and the destination tile MUST have compatible layouts. See the TileType–Layout compatibility table in [Tiles and Valid Regions](../../programming-model/tiles-and-valid-regions.md).
-- **Valid region**: The iteration domain is `dst.GetValidRow()` × `dst.GetValidCol()`. Source tiles with smaller valid regions yield implementation-defined values outside their valid region.
-- **TileType**: The destination tile's TileType determines which pipelines execute the operation. See [Tiles and Valid Regions](../../programming-model/tiles-and-valid-regions.md) for TileType constraints.
-
-## Exceptions
-
-- Verifier rejects type mismatches between source and destination tiles.
-- Backend rejects unsupported element types, layouts, or shapes for the selected target profile.
-- Programs MUST NOT rely on the value of any destination lane that is outside `dst`'s declared valid region.
-
-## Target-Profile Restrictions
-
-| | CPU Simulator | A2/A3 | A5 |
-|-|--------------|-------|-----|
-| `f32` | Simulated | Supported | Supported |
-| `f16` | Simulated | Supported | Supported |
-| `bf16` | Simulated | Supported | Supported |
-| `i32` | Simulated | Supported | Supported |
-| `i16` | Simulated | Supported | Supported |
-| `i8` / `u8` | Simulated | No | Supported |
-| `i64` / `u64` | Simulated | No | No |
-| `f8e4m3` / `f8e5m2` | Simulated | No | Supported |
-| Layout | Any | RowMajor only | RowMajor only |
-
-A2/A3 and A5 both require `isRowMajor == true` for all operands; A5 supports more element types.
-
-## Examples
-
-### C++ — Auto Mode
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void add_tiles(Tile<TileType::Vec, float, 16, 16>& dst,
-               Tile<TileType::Vec, float, 16, 16>& src0,
-               Tile<TileType::Vec, float, 16, 16>& src1) {
-    // Compiler inserts TASSIGN and TSYNC automatically in Auto mode.
-    TADD(dst, src0, src1);
-}
-```
-
-### C++ — Manual Mode
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void add_tiles_manual(Tile<TileType::Vec, float, 16, 16>& dst,
-                      Tile<TileType::Vec, float, 16, 16>& src0,
-                      Tile<TileType::Vec, float, 16, 16>& src1) {
-    TASSIGN(src0, 0x1000);
-    TASSIGN(src1, 0x2000);
-    TASSIGN(dst,  0x3000);
-    RecordEvent e0 = TLOAD(src0, ga);  // ga, gb, gc: global tensors assumed to be declared elsewhere
-    RecordEvent e1 = TLOAD(src1, gb);
-    TSYNC(e0, e1);
-    TADD(dst, src0, src1);
-    TSYNC();
-    TSTORE(gc, dst);
-}
-```
-
-### MLIR — SSA Form
-
-```mlir
-%result = pto.tadd %src0, %src1 : (!pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 16>) -> !pto.tile<f32, 16, 16>
-```
-
-### MLIR — DPS Form
-
-```mlir
-pto.tadd ins(%src0, %src1 : !pto.tile_buf<f32, 16, 16>, !pto.tile_buf<f32, 16, 16>)
-          outs(%result : !pto.tile_buf<f32, 16, 16>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: (none)
-- Next op in family: [pto.tabs](./tabs.md)
-- Instruction surface: [Tile Instructions](../../instruction-surfaces/tile-instructions.md)
-- Type system: [Type System](../../state-and-types/type-system.md)
-- Valid regions: [Tiles and Valid Regions](../../programming-model/tiles-and-valid-regions.md)
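The valid-region rule in the removed `pto.tadd` page (the iteration domain is `dst`'s valid region, and lanes outside it must not be relied on) can be sketched in plain C++. This is a minimal sketch, assuming row-major storage and source tiles whose valid regions cover the destination's; `AddTile` and `ref_tadd` are hypothetical names, not PTO types or intrinsics.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tile: allocated shape plus a valid region.
struct AddTile {
    std::size_t rows, cols;
    std::size_t validRow, validCol;
    std::vector<float> data;  // row-major, rows * cols elements
};

// dst[i, j] = src0[i, j] + src1[i, j] for every (i, j) in dst's valid region.
// Assumption: both sources have the same allocated shape as dst and valid
// regions that cover dst's. Lanes outside dst's valid region are not written.
void ref_tadd(AddTile& dst, const AddTile& src0, const AddTile& src1) {
    for (std::size_t i = 0; i < dst.validRow; ++i) {
        for (std::size_t j = 0; j < dst.validCol; ++j) {
            const std::size_t at = i * dst.cols + j;
            dst.data[at] = src0.data[at] + src1.data[at];
        }
    }
}
```

Leaving out-of-region lanes unwritten is a choice of this sketch; the page only says programs must not depend on their values.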
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tadd_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tadd_zh.md
deleted file mode 100644
index d4bc0e4c..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tadd_zh.md
+++ /dev/null
@@ -1,104 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tadd_zh.md` -->
-
-# TADD
-
-## 指令示意图
-
-![TADD tile operation](../figures/isa/TADD.svg)
-
-## 简介
-
-两个 Tile 的逐元素加法。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = tadd %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **实现检查 (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`, `bfloat16_t`, `uint8_t`, `int8_t`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TADD(dst, src0, src1);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TADD(dst, src0, src1);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/taddc.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/taddc.md
deleted file mode 100644
index d999e9ec..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/taddc.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/taddc.md` -->
-
-# pto.taddc
-
-Standalone reference page for `pto.taddc`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise ternary add: `src0 + src1 + src2`.
-
-## Mechanism
-
-`pto.taddc` operates on tile payloads rather than scalar control state; its legality is constrained by tile shape, layout, valid region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} + \mathrm{src2}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = taddc %src0, %src1, %src2 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TADDC(TileData &dst, TileData &src0, TileData &src1, TileData &src2, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile.
-- `src1` and `src2` are the second and third source tiles.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.taddc` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c, out;
-  TADDC(out, a, b, c);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = taddc %src0, %src1, %src2 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tprelu](./tprelu.md)
-- Next op in family: [pto.tsubc](./tsubc.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/taddc_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/taddc_zh.md
deleted file mode 100644
index a59841af..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/taddc_zh.md
+++ /dev/null
@@ -1,78 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/taddc_zh.md` -->
-
-# TADDC
-
-## Instruction Diagram
-
-![TADDC tile operation](../figures/isa/TADDC.svg)
-
-## Summary
-
-Elementwise ternary add: `src0 + src1 + src2`.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} + \mathrm{src2}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = taddc %src0, %src1, %src2 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TADDC(TileData &dst, TileData &src0, TileData &src1, TileData &src2, WaitEvents &... events);
-```
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c, out;
-  TADDC(out, a, b, c);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tadds_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tadds_zh.md
deleted file mode 100644
index acadef80..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tadds_zh.md
+++ /dev/null
@@ -1,109 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tadds_zh.md` -->
-
-# TADDS
-
-## Instruction Diagram
-
-![TADDS tile operation](../figures/isa/TADDS.svg)
-
-## Summary
-
-Elementwise addition of a tile and a scalar.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} + \mathrm{scalar} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tadds %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tadds %src, %scalar : (!pto.tile<...>,dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tadds %src, %scalar : (!pto.tile<...>,dtype) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TADDS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TADDS(dst, src, 1.0f);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TADDS(dst, src, 1.0f);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tand.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tand.md
deleted file mode 100644
index 2c3182cb..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tand.md
+++ /dev/null
@@ -1,132 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tand.md` -->
-
-# pto.tand
-
-Standalone reference page for `pto.tand`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise bitwise AND of two tiles.
-
-## Mechanism
-
-Elementwise bitwise AND of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \;\&\; \mathrm{src1}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tand %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tand ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TAND(TileData &dst, TileData &src0, TileData &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are 1-byte or 2-byte integral types.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-
-- **Implementation checks (A5)**:
-    - Supported element types are 1-byte, 2-byte, or 4-byte integral types.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
-  TileT a, b, out;
-  TAND(out, a, b);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tand %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tand ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tabs](./tabs.md)
-- Next op in family: [pto.tor](./tor.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tand_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tand_zh.md
deleted file mode 100644
index 7ff66a43..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tand_zh.md
+++ /dev/null
@@ -1,104 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tand_zh.md` -->
-
-# TAND
-
-## Instruction Diagram
-
-![TAND tile operation](../figures/isa/TAND.svg)
-
-## Summary
-
-Elementwise bitwise AND of two tiles.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \;\&\; \mathrm{src1}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tand %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tand ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TAND(TileData &dst, TileData &src0, TileData &src1, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are 1-byte or 2-byte integral types.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-- **Implementation checks (A5)**:
-    - Supported element types are 1-byte, 2-byte, or 4-byte integral types.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
-  TileT a, b, out;
-  TAND(out, a, b);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tand %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tand ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md
deleted file mode 100644
index b71bc851..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md
+++ /dev/null
@@ -1,107 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md` -->
-
-# TANDS
-
-## Instruction Diagram
-
-![TANDS tile operation](../figures/isa/TANDS.svg)
-
-## Summary
-
-Elementwise bitwise AND of a tile and a scalar.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;\&\; \mathrm{scalar} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tands %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TANDS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
-```
-
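-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - Applies to integral element types.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - In manual mode, binding the source tile and the destination tile to the same memory is not supported.
-- **Implementation checks (A5)**:
-    - Applies to the integral element types supported by `TEXPANDS` and `TAND`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - In manual mode, binding the source tile and the destination tile to the same memory is not supported.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples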
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TANDS(dst, src, 0xffu);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tands %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcmp.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcmp.md
deleted file mode 100644
index 5cdee0df..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcmp.md
+++ /dev/null
@@ -1,165 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tcmp.md` -->
-
-# pto.tcmp
-
-Standalone reference page for `pto.tcmp`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Compare two tiles element-wise and write a packed predicate mask tile.
-
-## Mechanism
-
-For each element `(i, j)` in the destination's valid region:
-
-$$ \mathrm{dst}_{i,j} = \bigl(\mathrm{src0}_{i,j}\ \mathrm{cmpMode}\ \mathrm{src1}_{i,j}\bigr) $$
-
-where `cmpMode` is one of `EQ`, `NE`, `LT`, `LE`, `GT`, `GE`.
-
-The predicate mask is stored in `dst` using a target-defined packed encoding (not a boolean per lane).
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcmp ins(%src0, %src1 {cmpMode = #pto.cmp<EQ>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcmp ins(%src0, %src1 {cmpMode = #pto.cmp<EQ>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/type.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCMP(TileDataDst &dst, TileDataSrc &src0, TileDataSrc &src1,
-                          CmpMode cmpMode, WaitEvents &... events);
-```
-
-**Compare modes** (`pto::CmpMode`):
-
-| Mode | Meaning |
-|------|---------|
-| `CmpMode::EQ` | Equal |
-| `CmpMode::NE` | Not equal |
-| `CmpMode::LT` | Less than |
-| `CmpMode::LE` | Less than or equal |
-| `CmpMode::GT` | Greater than |
-| `CmpMode::GE` | Greater than or equal |
-
-## Inputs
-
-- `src0` is the first source tile (left-hand side of comparison).
-- `src1` is the second source tile (right-hand side of comparison).
-- `dst` names the destination predicate tile.
-- `cmpMode` specifies the comparison predicate.
-- The operation iterates over `dst`'s valid region; `src0` and `src1` are sampled at the same coordinates.
-
-## Expected Outputs
-
-`dst` carries the packed predicate mask tile. The mask encoding is target-defined; programs MUST NOT make assumptions about the bit layout.
-
-## Side Effects
-
-No architectural side effects beyond producing the predicate tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- The iteration domain is `dst.GetValidRow()` × `dst.GetValidCol()`.
-- `src0.GetValidRow()` and `src0.GetValidCol()` MUST match `dst`'s valid region.
-- `src1`'s shape/validity is not verified by runtime assertions; out-of-region lanes read **implementation-defined values**.
-- The output predicate tile uses a **packed encoding** (not one boolean per lane). Use `TSEL` with the predicate tile to apply it.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** assume any particular encoding of the predicate tile.
-- **MUST NOT** use `dst` with a dtype other than the target-defined predicate dtype.
-
-## Target-Profile Restrictions
-
-| Check | A2/A3 | A5 |
-|-------|:-----:|:--:|
-| Supported input types | `int32_t`, `half`, `float` | `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half` |
-| Output predicate dtype | `uint8_t` | `uint32_t` |
-| Tile location | `TileType::Vec` | `TileType::Vec` |
-| Layout | Row-major | Row-major |
-| Static valid bounds | Required | Required |
-| `src0` valid == `dst` valid | Required | Required |
-| `src1` validity | Not verified | Not verified |
-| `int32_t` with non-EQ mode | Ignores `cmpMode`, uses EQ | Full support |
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src0, src1;
-  MaskT mask(16, 2);
-  TCMP(mask, src0, src1, CmpMode::GT);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src0, src1;
-  MaskT mask(16, 2);
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(mask, 0x3000);
-  TCMP(mask, src0, src1, CmpMode::GT);
-}
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile<...>
-pto.tcmp ins(%src0, %src1 {cmpMode = #pto.cmp<EQ>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tmax](./tmax.md)
-- Next op in family: [pto.tdiv](./tdiv.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcmp_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcmp_zh.md
deleted file mode 100644
index 80170411..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcmp_zh.md
+++ /dev/null
@@ -1,117 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tcmp_zh.md` -->
-
-# TCMP
-
-## Instruction Diagram
-
-![TCMP tile operation](../figures/isa/TCMP.svg)
-
-## Summary
-
-Compare two tiles and write a packed predicate mask.
-
-## Mathematical Semantics
-
-Conceptually, for each element `(i, j)` in the valid region, define a predicate:
-
-$$ p_{i,j} = \left(\mathrm{src0}_{i,j}\ \mathrm{cmpMode}\ \mathrm{src1}_{i,j}\right) $$
-
-The predicate mask is stored in `dst` using an implementation-defined packed layout.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcmp ins(%src0, %src1{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcmp ins(%src0, %src1{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/type.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCMP(TileDataDst &dst, TileDataSrc &src0, TileDataSrc &src1, CmpMode cmpMode, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - Input type must be one of: `int32_t`, `half`, `float`.
-    - Output type must be `uint8_t`.
-    - `src0/src1/dst` tile location must be `TileType::Vec`.
-    - Static valid bounds: `TileDataSrc::ValidRow <= TileDataSrc::Rows` and `TileDataSrc::ValidCol <= TileDataSrc::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-    - Note: `src1` shape/valid is not validated by explicit runtime assertions in this implementation.
-    - For `TileDataSrc::DType == int32_t`, the implementation uses the `EQ` compare path regardless of `cmpMode`.
-- **Implementation checks (A5)**:
-    - Input type must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`,  `int8_t`, `float`, `half`.
-    - Output type must be `uint32_t`.
-    - Implemented (see `include/pto/npu/a5/TCmp.hpp`).
-    - The A5 implementation uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain and writes a packed predicate mask into `dst` (target-defined packing).
-- **Mask encoding**:
-    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src0, src1;
-  MaskT mask(16, 2);
-  TCMP(mask, src0, src1, CmpMode::GT);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src0, src1;
-  MaskT mask(16, 2);
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(mask, 0x3000);
-  TCMP(mask, src0, src1, CmpMode::GT);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcmps_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcmps_zh.md
deleted file mode 100644
index 55519667..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcmps_zh.md
+++ /dev/null
@@ -1,115 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tcmps_zh.md` -->
-
-# TCMPS
-
-## Instruction Diagram
-
-![TCMPS tile operation](../figures/isa/TCMPS.svg)
-
-## Summary
-
-Compare a tile with a scalar and write the elementwise comparison results.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \left(\mathrm{src}_{i,j}\ \mathrm{cmpMode}\ \mathrm{scalar}\right) $$
-
-The encoding/type of `dst` is implementation-defined (often a mask-like tile).
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tcmps %src, %scalar {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/type.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename T, typename... WaitEvents>
-PTO_INST RecordEvent TCMPS(TileDataDst& dst, TileDataSrc0& src0, T src1, CmpMode cmpMode, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0` and `dst` must have the same valid row/col.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-- **Comparison modes**:
-    - Supports `CmpMode::EQ`, `CmpMode::NE`, `CmpMode::LT`, `CmpMode::GT`, `CmpMode::LE`, `CmpMode::GE`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src;
-  DstT dst(16, 2);
-  TCMPS(dst, src, 0.0f, CmpMode::GT);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src;
-  DstT dst(16, 2);
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCMPS(dst, src, 0.0f, CmpMode::GT);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcvt.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcvt.md
deleted file mode 100644
index 103f8470..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcvt.md
+++ /dev/null
@@ -1,188 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tcvt.md` -->
-
-# pto.tcvt
-
-Standalone reference page for `pto.tcvt`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise type conversion with a specified rounding mode and optional saturation mode.
-
-## Mechanism
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{cast}_{\mathrm{rmode},\mathrm{satmode}}\!\left(\mathrm{src}_{i,j}\right) $$
-
-where `rmode` is the rounding policy and `satmode` (if provided) controls saturation behavior.
-
-## Rounding Modes
-
-| Mode | Behavior |
-|------|----------|
-| `RoundMode::CAST_RINT` | Round to nearest, ties to even |
-| `RoundMode::CAST_RZ` | Round toward zero |
-| `RoundMode::CAST_RP` | Round toward +∞ |
-| `RoundMode::CAST_RM` | Round toward -∞ |
-| `RoundMode::CAST_RN` | Round to nearest, ties away from zero |
-
-## Saturation Modes
-
-When `SaturationMode` is provided (overload 2), saturation behavior is explicitly controlled:
-
-| Mode | Behavior |
-|------|----------|
-| `SaturationMode::NONE` | No saturation; wraps on overflow |
-| `SaturationMode::SAT` | Clamp to destination type's representable range |
-
-When `SaturationMode` is omitted (overload 1), the implementation chooses a target-defined default for the specific source/destination type pair.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcvt ins(%src {rmode = #pto.round_mode<CAST_RINT>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcvt ins(%src {rmode = #pto.round_mode<CAST_RINT>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
-
-```cpp
-// Overload 1: implementation-chosen default saturation
-template <typename TileDataD, typename TileDataS, typename... WaitEvents>
-PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode, WaitEvents &... events);
-
-// Overload 2: explicit saturation control (A2/A3, A5 only)
-template <typename TileDataD, typename TileDataS, typename... WaitEvents>
-PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode,
-                          SaturationMode satMode, WaitEvents &... events);
-```
-
-Overload 2 (with explicit `SaturationMode`) is not currently implemented on the CPU simulator.
-
-## Inputs
-
-- `src` is the source tile whose elements are converted to the destination type.
-- `dst` names the destination tile receiving the converted values.
-- `mode` specifies the rounding mode.
-- `satMode` (overload 2 only) specifies saturation behavior.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile with converted element values. `src`'s element values are not modified.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `src` and `dst` MUST have compatible shapes (declared shape and valid region).
-- The source/destination type pair MUST be supported by the target profile.
-- The rounding mode MUST be supported for the given type pair.
-- The output tile `dst` MUST have a different element type from `src`.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** use a type pair not supported by the target profile.
-- **MUST NOT** use a rounding mode not supported for the given type pair.
-
-## Target-Profile Restrictions
-
-| Feature | CPU Simulator | A2/A3 | A5 |
-|---------|:-------------:|:------:|:--:|
-| Overload 1 (default sat) | Yes | Yes | Yes |
-| Overload 2 (explicit sat) | No | Yes | Yes |
-| f32 → f16 | Yes | Yes | Yes |
-| f16 → f32 | Yes | Yes | Yes |
-| f32 → bf16 | Yes | Yes | Yes |
-| bf16 → f32 | Yes | Yes | Yes |
-| f32 → int32_t | Yes | Yes | Yes |
-| int32_t → f32 | Yes | Yes | Yes |
-| f16 → bf16 | No | Yes | Yes |
-| FP8 types | No | No | Yes |
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, half, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TCVT(dst, src, RoundMode::CAST_RINT);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, half, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCVT(dst, src, RoundMode::CAST_RINT);
-}
-```
-
-### Explicit Saturation (A2/A3, A5)
-
-```cpp
-// A2/A3 and A5 only
-TCVT(dst, src, RoundMode::CAST_RINT, SaturationMode::SAT);
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
-pto.tcvt ins(%src {rmode = #pto.round_mode<CAST_RINT>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tsubc](./tsubc.md)
-- Next op in family: [pto.tsel](./tsel.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcvt_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcvt_zh.md
deleted file mode 100644
index 2e2481af..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tcvt_zh.md
+++ /dev/null
@@ -1,125 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tcvt_zh.md` -->
-
-# TCVT
-
-## 指令示意图
-
-![TCVT tile operation](../figures/isa/TCVT.svg)
-
-## 简介
-
-带指定舍入模式的逐元素类型转换。
-
-## 数学语义
-
-对有效区域内的每个元素 `(i, j)`：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{cast}_{\mathrm{rmode}}\!\left(\mathrm{src}_{i,j}\right) $$
-
-其中 `rmode` 是舍入策略（参见 `pto::RoundMode`）。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tcvt %src{rmode = #pto<round_mode xx>}: !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tcvt ins(%src{rmode = #pto<round_mode xx>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp` 和 `include/pto/common/constants.hpp`：
-
-```cpp
-template <typename TileDataD, typename TileDataS, typename... WaitEvents>
-PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode, SaturationMode satMode, WaitEvents &... events);
-
-template <typename TileDataD, typename TileDataS, typename... WaitEvents>
-PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode, WaitEvents &... events);
-```
-
-## 约束
-
-- `dst` 和 `src` 必须在形状/有效区域方面兼容，如实现所要求的。
-- 对于给定的 `RoundMode`，转换 `(src 元素类型) -> (dst 元素类型)` 必须被目标支持。
-- **实现说明 (A2A3/A5)**:
-    - 一种形式接受显式的 `SaturationMode`，指定的饱和行为会直接传递给实现。
-    - 另一种形式不显式给出 `SaturationMode`；此时实现会针对具体类型对选择目标定义的默认饱和行为。
-    - 在 CPU 实现中，目前仅实现了不显式传入 `SaturationMode` 的形式。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, half, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TCVT(dst, src, RoundMode::CAST_RINT);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, half, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCVT(dst, src, RoundMode::CAST_RINT);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tcvt %src{rmode = #pto<round_mode xx>}: !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcvt %src{rmode = #pto<round_mode xx>}: !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcvt ins(%src{rmode = #pto<round_mode xx>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tdiv.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tdiv.md
deleted file mode 100644
index 4764e036..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tdiv.md
+++ /dev/null
@@ -1,165 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tdiv.md` -->
-
-# pto.tdiv
-
-Standalone reference page for `pto.tdiv`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise division of two tiles.
-
-## Mechanism
-
-Elementwise division of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{i,j}} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tdiv %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc0,
-          typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-`PrecisionType` accepts the following values:
-
-* `DivAlgorithm::DEFAULT`: the default algorithm; faster, but less precise.
-* `DivAlgorithm::HIGH_PRECISION`: the high-precision algorithm; more accurate, but slower.
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` holds the elementwise quotient `src0 / src1` for every element in its valid region.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-- **Division-by-zero**:
-    - Behavior is target-defined.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-- **High-precision algorithm**:
-    - Only available on A5; the `PrecisionType` option is ignored on A3.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TDIV(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TDIV(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tdiv %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tcmp](./tcmp.md)
-- Next op in family: [pto.tshl](./tshl.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tdiv_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tdiv_zh.md
deleted file mode 100644
index 1175c4c8..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tdiv_zh.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tdiv_zh.md` -->
-
-# TDIV
-
-## 指令示意图
-
-![TDIV tile operation](../figures/isa/TDIV.svg)
-
-## 简介
-
-两个 Tile 的逐元素除法。
-
-## 数学语义
-
-对有效区域内的每个元素 `(i, j)`：
-
-$$ \mathrm{dst}_{i,j} = \frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{i,j}} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tdiv %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc0,
-          typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-`PrecisionType`可指定以下值：
-
-* `DivAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
-* `DivAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - `TileData::DType` 必须是以下之一：`half`、`float`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src0`、`src1` 和 `dst` 应具有相同的 `validRow/validCol`。
-- **实现检查 (A5)**:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`uint32_t`、`float`、`int16_t`、`uint16_t`、`half`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src0`、`src1` 和 `dst` 应具有相同的 `validRow/validCol`。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **除零**:
-    - 行为由目标定义。
-- **高精度算法**:
-    - 仅在 A5 上可用；在 A3 上 `PrecisionType` 选项将被忽略。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TDIV(dst, src0, src1);
-  TDIV<DivAlgorithm::HIGH_PRECISION>(dst, src0, src1);  // A5 Only
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TDIV(dst, src0, src1);
-  TDIV<DivAlgorithm::HIGH_PRECISION>(dst, src0, src1);  // A5 Only
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tdiv %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md
deleted file mode 100644
index ed9738c7..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md
+++ /dev/null
@@ -1,161 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md` -->
-
-# TDIVS
-
-## 指令示意图
-
-![TDIVS tile operation](../figures/isa/TDIVS.svg)
-
-## 简介
-
-与标量的逐元素除法（Tile/标量 或 标量/Tile）。
-
-## 数学语义
-
-对有效区域内的每个元素 `(i, j)`：
-
-- Tile/标量形式：
-
-  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{src}_{i,j}}{\mathrm{scalar}} $$
-
-- 标量/Tile 形式：
-
-  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{scalar}}{\mathrm{src}_{i,j}} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-Tile/标量形式：
-
-```text
-%dst = tdivs %src, %scalar : !pto.tile<...>, f32
-```
-
-标量/Tile 形式：
-
-```text
-%dst = tdivs %scalar, %src : f32, !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-%dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TDIVS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar,
-                           WaitEvents &... events);
-
-template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TDIVS(TileDataDst &dst, typename TileDataDst::DType scalar, TileDataSrc &src0,
-                           WaitEvents &... events);
-```
-
-`PrecisionType`可指定以下值：
-
-* `DivAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
-* `DivAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
-
-## 约束
-
-- **实现检查 (A2A3)**（两个重载）:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **实现检查 (A5)**（两个重载）:
-    - `TileData::DType` 必须是以下之一：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **除零**:
-    - 行为由目标定义；在 A5 上，Tile/标量形式映射到乘以倒数，并对 `scalar == 0` 使用 `1/0 -> +inf`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域.
-- **除零**:
-    - 行为由目标定义；在 A5 上，tile/标量形式映射到乘以倒数，并对 `scalar == 0` 使用 `1/0 -> +inf`。
-- **高精度算法**:
-    - 仅在 A5 上可用；在 A3 上 `PrecisionType` 选项将被忽略。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TDIVS(dst, src, 2.0f);
-  TDIVS<DivAlgorithm::HIGH_PRECISION>(dst, src, 2.0f);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TDIVS(dst, 2.0f, src);
-  TDIVS<DivAlgorithm::HIGH_PRECISION>(dst, 2.0f, src);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/texp.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/texp.md
deleted file mode 100644
index 9d608d8e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/texp.md
+++ /dev/null
@@ -1,166 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/texp.md` -->
-
-# pto.texp
-
-Standalone reference page for `pto.texp`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise exponential.
-
-## Mechanism
-
-Elementwise exponential. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \exp(\mathrm{src}_{i,j}) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = texp %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <auto PrecisionType = ExpAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TEXP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-`PrecisionType` accepts the following values:
-
-- `ExpAlgorithm::DEFAULT`: the default algorithm; faster, but less precise.
-- `ExpAlgorithm::HIGH_PRECISION`: the high-precision algorithm; more accurate, but slower.
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` holds `exp(src)` for every element in its valid region.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (NPU)**:
-    - `TileData::DType` must be one of: `float` or `half`;
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **High precision algorithm**:
-    - Only available on A5. `PrecisionType` is ignored on A3.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TEXP(dst, src);
-  TEXP<ExpAlgorithm::HIGH_PRECISION>(dst, src);  // A5 only
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TEXP(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = texp %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tsqrt](./tsqrt.md)
-- Next op in family: [pto.tnot](./tnot.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/texp_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/texp_zh.md
deleted file mode 100644
index 844f29e3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/texp_zh.md
+++ /dev/null
@@ -1,139 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/texp_zh.md` -->
-
-# TEXP
-
-## 指令示意图
-
-![TEXP tile operation](../figures/isa/TEXP.svg)
-
-## 简介
-
-逐元素指数运算。
-
-## 数学语义
-
-对有效区域内的每个元素 `(i, j)`：
-
-$$ \mathrm{dst}_{i,j} = \exp(\mathrm{src}_{i,j}) $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = texp %src : !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1（SSA）
-
-```text
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2（DPS）
-
-```text
-pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <auto PrecisionType = ExpAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TEXP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-`PrecisionType` 可指定以下取值：
-
-- `ExpAlgorithm::DEFAULT`：普通算法，速度更快但精度较低。
-- `ExpAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
-
-## 约束
-
-- **实现检查（NPU）**：
-    - `TileData::DType` 必须是 `float` 或 `half`；
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）；
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`；
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`；
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-- **有效区域**：
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **高精度算法**：
-    - 仅 A5 支持，A3 会忽略 `PrecisionType` 选项。
-
-## 示例
-
-### 自动模式
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TEXP(dst, src);
-  TEXP<ExpAlgorithm::HIGH_PRECISION>(dst, src);  // 仅 A5
-}
-```
-
-### 手动模式
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TEXP(dst, src);
-}
-```
-
-## 汇编示例
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = texp %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tfmod.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tfmod.md
deleted file mode 100644
index 4682d92b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tfmod.md
+++ /dev/null
@@ -1,136 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tfmod.md` -->
-
-# pto.tfmod
-
-Standalone reference page for `pto.tfmod`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise fmod of two tiles.
-
-## Mechanism
-
-Computes the elementwise remainder `fmod(src0, src1)` of two tiles; the result has the same sign as the dividend (`src0`). It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j})$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tfmod %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TFMOD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` holds `fmod(src0, src1)` for every element in its valid region.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-- Division-by-zero behavior is target-defined; the CPU simulator asserts in debug builds.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tfmod` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
-  TileT out, a, b;
-  TFMOD(out, a, b);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tfmod %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.trem](./trem.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tfmod_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tfmod_zh.md
deleted file mode 100644
index fff213d9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tfmod_zh.md
+++ /dev/null
@@ -1,79 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tfmod_zh.md` -->
-
-# TFMOD
-
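-## Instruction Diagram
-
-![TFMOD tile operation](../figures/isa/TFMOD.svg)
-
-## Summary
-
-Elementwise remainder of two tiles; the remainder takes the sign of the dividend.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region: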
-
-$$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j})$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tfmod %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TFMOD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-- Division-by-zero behavior is target-defined; the CPU simulator asserts in debug builds.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
-  TileT out, a, b;
-  TFMOD(out, a, b);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md
deleted file mode 100644
index 1d224aab..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md
+++ /dev/null
@@ -1,108 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md` -->
-
-# TFMODS
-
-## Instruction Diagram
-
-![TFMODS tile operation](../figures/isa/TFMODS.svg)
-
-## Summary
-
-Elementwise remainder of a tile and a scalar: `fmod(src, scalar)`.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src}_{i,j}, \mathrm{scalar})$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tfmods %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TFMODS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
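-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `dst` and `src` must use the same element type.
-    - Supported element types are `float` and `float32_t`.
-    - `dst` and `src` must be vector tiles.
-    - `dst` and `src` must be row-major.
-    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`, with both values greater than 0.
-- **Implementation checks (A5)**:
-    - `dst` and `src` must use the same element type.
-    - Supported element types are the 2-byte or 4-byte types supported by the target implementation (including `half` and `float`).
-    - `dst` and `src` must be vector tiles.
-    - The static valid bounds of both tiles must satisfy `ValidRow <= Rows` and `ValidCol <= Cols`.
-    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`.
-- **Division by zero**:
-    - Behavior is target-defined; the CPU simulator asserts in debug builds.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples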
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TFMODS(out, x, 3.0f);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tfmods %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tlog.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tlog.md
deleted file mode 100644
index e3ff283a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tlog.md
+++ /dev/null
@@ -1,151 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tlog.md` -->
-
-# pto.tlog
-
-Standalone reference page for `pto.tlog`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise natural logarithm of a tile.
-
-## Mechanism
-
-Elementwise natural logarithm of a tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \log(\mathrm{src}_{i,j}) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tlog %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <auto PrecisionType = LogAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TLOG(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-`PrecisionType` accepts the following values:
-
-- `LogAlgorithm::DEFAULT`: Normal algorithm, faster but with lower precision.
-- `LogAlgorithm::HIGH_PRECISION`: High precision algorithm, but slower.
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-- **Domain / NaN**:
-    - Domain behavior (e.g., `log(<=0)`) is target-defined.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (NPU)**:
-    - `TileData::DType` must be one of: `float` or `half`;
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **High precision algorithm**:
-    - Only available on A5. `PrecisionType` is ignored on A3.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TLOG(out, x);
-  TLOG<LogAlgorithm::HIGH_PRECISION>(out, x);  // A5 only
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tlog %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.txor](./txor.md)
-- Next op in family: [pto.trecip](./trecip.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tlog_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tlog_zh.md
deleted file mode 100644
index a44c40c7..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tlog_zh.md
+++ /dev/null
@@ -1,123 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tlog_zh.md` -->
-
-# TLOG
-
-## Instruction Diagram
-
-![TLOG tile operation](../figures/isa/TLOG.svg)
-
-## Summary
-
-Elementwise natural logarithm of a tile.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \log(\mathrm{src}_{i,j}) $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tlog %src : !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1（SSA）
-
-```text
-%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2（DPS）
-
-```text
-pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <auto PrecisionType = LogAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TLOG(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-`PrecisionType` accepts the following values:
-
-- `LogAlgorithm::DEFAULT`: normal algorithm, faster but with lower precision.
-- `LogAlgorithm::HIGH_PRECISION`: high-precision algorithm, slower.
-
-## Constraints
-
-- **Implementation checks (NPU)**:
-    - `TileData::DType` must be `float` or `half`;
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-- **Domain / NaN**:
-    - Domain behavior (e.g., `log(<=0)`) is target-defined.
-- **High precision algorithm**:
-    - Only available on A5; A3 ignores the `PrecisionType` option.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TLOG(out, x);
-  TLOG<LogAlgorithm::HIGH_PRECISION>(out, x);  // A5 only
-}
-```
-
-## Assembly Examples
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tlog %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmax.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmax.md
deleted file mode 100644
index 02b9a31d..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmax.md
+++ /dev/null
@@ -1,165 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tmax.md` -->
-
-# pto.tmax
-
-Standalone reference page for `pto.tmax`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise maximum of two tiles.
-
-## Mechanism
-
-Elementwise maximum of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tmax %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TMAX(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TMAX(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tmax %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tmin](./tmin.md)
-- Next op in family: [pto.tcmp](./tcmp.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmax_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmax_zh.md
deleted file mode 100644
index d443ce3f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmax_zh.md
+++ /dev/null
@@ -1,110 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tmax_zh.md` -->
-
-# TMAX
-
-## Instruction Diagram
-
-![TMAX tile operation](../figures/isa/TMAX.svg)
-
-## Summary
-
-Elementwise maximum of two tiles.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tmax %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TMAX(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TMAX(dst, src0, src1);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md
deleted file mode 100644
index 279fdd64..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md
+++ /dev/null
@@ -1,105 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md` -->
-
-# TMAXS
-
-## Instruction Diagram
-
-![TMAXS tile operation](../figures/isa/TMAXS.svg)
-
-## Summary
-
-Elementwise maximum of a tile and a scalar: `max(src, scalar)`.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \max(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TMAXS(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`, `bfloat16_t`, `uint8_t`, `int8_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `dst` and `src` must have the same number of valid rows and columns.
-    - The scalar type must match the tile data type.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TMAXS(out, x, 0.0f);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmin.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmin.md
deleted file mode 100644
index 2fe328ab..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmin.md
+++ /dev/null
@@ -1,165 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tmin.md` -->
-
-# pto.tmin
-
-Standalone reference page for `pto.tmin`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise minimum of two tiles.
-
-## Mechanism
-
-Elementwise minimum of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tmin %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TMIN(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TMIN(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tmin %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tmul](./tmul.md)
-- Next op in family: [pto.tmax](./tmax.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmin_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmin_zh.md
deleted file mode 100644
index bdffdc3e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmin_zh.md
+++ /dev/null
@@ -1,110 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tmin_zh.md` -->
-
-# TMIN
-
-## Instruction Diagram
-
-![TMIN tile operation](../figures/isa/TMIN.svg)
-
-## Summary
-
-Elementwise minimum of two tiles.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tmin %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TMIN(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TMIN(dst, src0, src1);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md
deleted file mode 100644
index 19b129ea..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md` -->
-
-# TMINS
-
-## Instruction Diagram
-
-![TMINS tile operation](../figures/isa/TMINS.svg)
-
-## Summary
-
-Elementwise minimum of a tile and a scalar.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \min(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tmins %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TMINS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-- **实现检查 (A5)**:
-    - `TileData::DType` 必须是以下之一：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
-    - 运行时：`src.GetValidCol() == dst.GetValidCol()`。
-- **通用约束**:
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - 标量类型必须与 Tile 数据类型一致。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TMINS(dst, src, 0.0f);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TMINS(dst, src, 0.0f);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tmins %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmul.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmul.md
deleted file mode 100644
index 7e3039d3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmul.md
+++ /dev/null
@@ -1,165 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tmul.md` -->
-
-# pto.tmul
-
-Standalone reference page for `pto.tmul`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise multiply of two tiles.
-
-## Mechanism
-
-Elementwise multiply of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tmul %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; .
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TMUL(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TMUL(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tmul %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tsub](./tsub.md)
-- Next op in family: [pto.tmin](./tmin.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmul_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmul_zh.md
deleted file mode 100644
index a70c6760..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmul_zh.md
+++ /dev/null
@@ -1,110 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tmul_zh.md` -->
-
-# TMUL
-
-## 指令示意图
-
-![TMUL tile operation](../figures/isa/TMUL.svg)
-
-## 简介
-
-两个 Tile 的逐元素乘法。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = tmul %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-- **实现检查 (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; .
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TMUL(dst, src0, src1);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TMUL(dst, src0, src1);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmuls_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmuls_zh.md
deleted file mode 100644
index 8cd8562a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tmuls_zh.md
+++ /dev/null
@@ -1,109 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tmuls_zh.md` -->
-
-# TMULS
-
-## 指令示意图
-
-![TMULS tile operation](../figures/isa/TMULS.svg)
-
-## 简介
-
-Tile 与标量的逐元素乘法。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \cdot \mathrm{scalar} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = tmuls %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TMULS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **实现检查 (A5)**:
-    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TMULS(dst, src, 2.0f);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TMULS(dst, src, 2.0f);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tneg.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tneg.md
deleted file mode 100644
index 9c27c3bf..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tneg.md
+++ /dev/null
@@ -1,134 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tneg.md` -->
-
-# pto.tneg
-
-Standalone reference page for `pto.tneg`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise negation of a tile.
-
-## Mechanism
-
-Elementwise negation of a tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = -\mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tneg %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tneg ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tneg ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TNEG(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tneg` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TNEG(out, x);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tneg %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tneg ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.trelu](./trelu.md)
-- Next op in family: [pto.trem](./trem.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tneg_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tneg_zh.md
deleted file mode 100644
index e2e676b6..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tneg_zh.md
+++ /dev/null
@@ -1,78 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tneg_zh.md` -->
-
-# TNEG
-
-## 指令示意图
-
-![TNEG tile operation](../figures/isa/TNEG.svg)
-
-## 简介
-
-Tile 的逐元素取负。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = -\mathrm{src}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = tneg %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tneg ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tneg ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TNEG(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## 约束
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TNEG(out, x);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tnot.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tnot.md
deleted file mode 100644
index 05856e36..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tnot.md
+++ /dev/null
@@ -1,145 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tnot.md` -->
-
-# pto.tnot
-
-Standalone reference page for `pto.tnot`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise bitwise NOT of a tile.
-
-## Mechanism
-
-Elementwise bitwise NOT of a tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \sim\mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tnot %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tnot ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tnot ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TNOT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src/dst` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int16_t`, `uint16_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`,  `int8_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileT x, out;
-  TNOT(out, x);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tnot %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tnot ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.texp](./texp.md)
-- Next op in family: [pto.trelu](./trelu.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tnot_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tnot_zh.md
deleted file mode 100644
index 1266740f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tnot_zh.md
+++ /dev/null
@@ -1,91 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tnot_zh.md` -->
-
-# TNOT
-
-## 指令示意图
-
-![TNOT tile operation](../figures/isa/TNOT.svg)
-
-## 简介
-
-Tile 的逐元素按位取反。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \sim\mathrm{src}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = tnot %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tnot ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tnot ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TNOT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - `TileData::DType` must be one of: `int16_t`, `uint16_t`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
-- **实现检查 (A5)**:
-    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`,  `int8_t`.
-    - Tile 布局 must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src/dst` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileT x, out;
-  TNOT(out, x);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tor.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tor.md
deleted file mode 100644
index 58d3b1ce..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tor.md
+++ /dev/null
@@ -1,132 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tor.md` -->
-
-# pto.tor
-
-Standalone reference page for `pto.tor`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise bitwise OR of two tiles.
-
-## Mechanism
-
-Elementwise bitwise OR of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \;|\; \mathrm{src1}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tor %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TOR(TileData &dst, TileData &src0, TileData &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are 1-byte or 2-byte integral types.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-
-- **Implementation checks (A5)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
-  TileT a, b, out;
-  TOR(out, a, b);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tor %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tand](./tand.md)
-- Next op in family: [pto.tsub](./tsub.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tor_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tor_zh.md
deleted file mode 100644
index 9121f2fd..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tor_zh.md
+++ /dev/null
@@ -1,104 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tor_zh.md` -->
-
-# TOR
-
-## 指令示意图
-
-![TOR tile operation](../figures/isa/TOR.svg)
-
-## 简介
-
-两个 Tile 的逐元素按位或。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \;|\; \mathrm{src1}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tor %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TOR(TileData &dst, TileData &src0, TileData &src1, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - 支持的元素类型为 1 字节或 2 字节整数类型。
-    - `dst`、`src0` 和 `src1` 必须使用相同的元素类型。
-    - `dst`、`src0` 和 `src1` 必须是行主序。
-    - 运行时：`src0.GetValidRow()/GetValidCol()` 和 `src1.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
-- **实现检查 (A5)**:
-    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
-    - `dst`、`src0` 和 `src1` 必须使用相同的元素类型。
-    - `dst`、`src0` 和 `src1` 必须是行主序。
-    - 运行时：`src0.GetValidRow()/GetValidCol()` 和 `src1.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
-  TileT a, b, out;
-  TOR(out, a, b);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tor %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md
deleted file mode 100644
index c8dd991e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md
+++ /dev/null
@@ -1,107 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md` -->
-
-# TORS
-
-## 指令示意图
-
-![TORS tile operation](../figures/isa/TORS.svg)
-
-## 简介
-
-Tile 与标量的逐元素按位或。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;|\; \mathrm{scalar} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tors %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TORS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - 适用于整数元素类型。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - 在手动模式下，不支持将源 Tile 和目标 Tile 设置为相同的内存。
-- **实现检查 (A5)**:
-    - 适用于 `TEXPANDS` 和 `TOR` 支持的整数元素类型。
-    - `dst` 和 `src` 必须使用相同的元素类型。
-    - `dst` 和 `src` 必须是向量 Tile。
-    - 在手动模式下，不支持将源 Tile 和目标 Tile 设置为相同的内存。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TORS(dst, src, 0xffu);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tors %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tprelu.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tprelu.md
deleted file mode 100644
index a3b03caf..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tprelu.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tprelu.md` -->
-
-# pto.tprelu
-
-Standalone reference page for `pto.tprelu`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise PReLU (parametric ReLU) with a per-element slope tile.
-
-## Mechanism
-
-Elementwise PReLU (parametric ReLU) with a per-element slope tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = (\mathrm{src0}_{i,j} > 0) ? \mathrm{src0}_{i,j} : (\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j}) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tprelu %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tprelu ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tprelu ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TPRELU(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- `tmp` is a required temporary working tile for PReLU slope selection.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-- For A3, the two source tiles, the destination tile, and the temporary space must occupy non-overlapping memory ranges.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- A3 requires the temporary space for computation; A5 does not use it.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, slope, out, tmp;
-  TPRELU(out, x, slope, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tprelu %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tprelu ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.trecip](./trecip.md)
-- Next op in family: [pto.taddc](./taddc.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tprelu_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tprelu_zh.md
deleted file mode 100644
index 5a22c946..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tprelu_zh.md
+++ /dev/null
@@ -1,81 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tprelu_zh.md` -->
-
-# TPRELU
-
-## 指令示意图
-
-![TPRELU tile operation](../figures/isa/TPRELU.svg)
-
-## 简介
-
-带逐元素斜率 Tile 的逐元素参数化 ReLU (PReLU)。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = (\mathrm{src0}_{i,j} > 0) ? \mathrm{src0}_{i,j} : (\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j}) $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = tprelu %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tprelu ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tprelu ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TPRELU(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-- A3 requires the temporary space for computation; A5 does not use it.
-- For A3, the two source tiles, the destination tile, and the temporary space must occupy non-overlapping memory ranges.
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, slope, out, tmp;
-  TPRELU(out, x, slope, tmp);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trecip.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trecip.md
deleted file mode 100644
index 22f58d7e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trecip.md
+++ /dev/null
@@ -1,140 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/trecip.md` -->
-
-# pto.trecip
-
-Standalone reference page for `pto.trecip`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise reciprocal of a tile.
-
-## Mechanism
-
-Elementwise reciprocal of a tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \frac{1}{\mathrm{src}_{i,j}} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trecip %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trecip ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <auto PrecisionType = RecipAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TRECIP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-`PrecisionType` has the following values available:
-
-* `RecipAlgorithm::DEFAULT`: Normal algorithm, faster but with lower precision.
-* `RecipAlgorithm::HIGH_PRECISION`: High precision algorithm, but slower.
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-- **Domain / NaN**:
-    - Division-by-zero behavior is target-defined; the CPU simulator asserts in debug builds.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (NPU)**:
-    - `TileData::DType` must be one of: `float` or `half`;
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - A3's TRECIP instruction does not support setting the source Tile and destination Tile to the same memory.
-
-- **High Precision Algorithm**
-    - Only available on A5; the `PrecisionType` option is ignored on A3.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TRECIP(out, x);
-  TRECIP<RecipAlgorithm::HIGH_PRECISION>(out, x);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trecip %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trecip ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tlog](./tlog.md)
-- Next op in family: [pto.tprelu](./tprelu.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trecip_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trecip_zh.md
deleted file mode 100644
index a0bb3e29..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trecip_zh.md
+++ /dev/null
@@ -1,112 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/trecip_zh.md` -->
-
-# TRECIP
-
-## 指令示意图
-
-![TRECIP tile operation](../figures/isa/TRECIP.svg)
-
-## 简介
-
-Tile 的逐元素倒数。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \frac{1}{\mathrm{src}_{i,j}} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = trecip %src : !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trecip ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <auto PrecisionType = RecipAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TRECIP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-`PrecisionType`可指定以下值：
-
-* `RecipAlgorithm::DEFAULT`：普通算法，速度快但精度较低。
-* `RecipAlgorithm::HIGH_PRECISION`：高精度算法，速度较慢。
-
-## 约束
-
-- **实现检查 (NPU)**:
-    - `TileData::DType` 必须是以下之一：`float` 或 `half`。
-    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`);
-    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
-    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
-    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
-    - A3 的 TRECIP 指令不支持将源 Tile 和目标 Tile 设置为相同的内存。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-- **域 / NaN**:
-    - 除零行为由目标定义；CPU 模拟器在调试构建中会断言。
-- **高精度算法**
-    - 仅在A5上有效，`PrecisionType`选项A3上将被忽略。
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TRECIP(out, x);
-  TRECIP<RecipAlgorithm::HIGH_PRECISION>(out, x);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = trecip %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trecip ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trelu.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trelu.md
deleted file mode 100644
index 1cbfa46e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trelu.md
+++ /dev/null
@@ -1,145 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/trelu.md` -->
-
-# pto.trelu
-
-Standalone reference page for `pto.trelu`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise ReLU of a tile.
-
-## Mechanism
-
-Elementwise ReLU of a tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \max(\mathrm{src}_{i,j}, 0) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trelu %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TRELU(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src/dst` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `half`, `float`, `int32_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `half`, `float`, `int32_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TRELU(out, x);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trelu %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tnot](./tnot.md)
-- Next op in family: [pto.tneg](./tneg.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trelu_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trelu_zh.md
deleted file mode 100644
index 2ac83625..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trelu_zh.md
+++ /dev/null
@@ -1,91 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/trelu_zh.md` -->
-
-# TRELU
-
-## 指令示意图
-
-![TRELU tile operation](../figures/isa/TRELU.svg)
-
-## 简介
-
-Tile 的逐元素 ReLU。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \max(\mathrm{src}_{i,j}, 0) $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = trelu %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TRELU(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - `TileData::DType` must be one of: `half`, `float`, `int32_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
-- **实现检查 (A5)**:
-    - `TileData::DType` must be one of: `half`, `float`, `int32_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src/dst` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TRELU(out, x);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trem.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trem.md
deleted file mode 100644
index 5256f426..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trem.md
+++ /dev/null
@@ -1,145 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/trem.md` -->
-
-# pto.trem
-
-Standalone reference page for `pto.trem`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise remainder of two tiles.
-
-## Mechanism
-
-Elementwise remainder of two tiles. The result has the same sign as the divisor. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \bmod \mathrm{src1}_{i,j}$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trem %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TREM(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Division by Zero**:
-    - Behavior is target-defined; the CPU simulator asserts in debug builds.
-
-- **Valid Region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation Checks (A2A3)**:
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - Supported element types: `float` and `int32_t`.
-    - `dst`, `src0`, and `src1` must be vector tiles.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `dst.GetValidRow() == src0.GetValidRow() == src1.GetValidRow() > 0` and `dst.GetValidCol() == src0.GetValidCol() == src1.GetValidCol() > 0`.
-    - **tmp Buffer Requirements**:
-      - `tmp.GetValidCol() >= dst.GetValidCol()` (at least as many columns as dst)
-      - `tmp.GetValidRow() >= 1` (at least 1 row)
-      - Data type must match `TileDataDst::DType`.
-
-- **Implementation Checks (A5)**:
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - Supported element types: `float`, `int32_t`, `uint32_t`, `half`, `int16_t`, and `uint16_t`.
-    - `dst`, `src0`, and `src1` must be vector tiles.
-    - Static valid bounds: `ValidRow <= Rows` and `ValidCol <= Cols` for all tiles.
-    - Runtime: `dst.GetValidRow() == src0.GetValidRow() == src1.GetValidRow()` and `dst.GetValidCol() == src0.GetValidCol() == src1.GetValidCol()`.
-    - Note: tmp parameter is accepted but not validated or used on A5.
-
-- **For `int32_t` Inputs (A2A3 Only)**: Both `src0` and `src1` elements must be in the range `[-2^24, 2^24]` (i.e., `[-16777216, 16777216]`) to ensure exact conversion to float32 during computation.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
-  TileT out, a, b;
-  Tile<TileType::Vec, int32_t, 16, 16> tmp;
-  TREM(out, a, b, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trem %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tneg](./tneg.md)
-- Next op in family: [pto.tfmod](./tfmod.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trem_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trem_zh.md
deleted file mode 100644
index 4f93a48e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trem_zh.md
+++ /dev/null
@@ -1,115 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/trem_zh.md` -->
-
-# TREM
-
-## 指令示意图
-
-![TREM tile operation](../figures/isa/TREM.svg)
-
-## 简介
-
-两个 Tile 的逐元素余数运算。结果符号与除数相同。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \bmod \mathrm{src1}_{i,j}$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = trem %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TREM(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - Supported element types: `float` and `int32_t`.
-    - `dst`, `src0`, and `src1` must be vector tiles.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `dst.GetValidRow() == src0.GetValidRow() == src1.GetValidRow() > 0` and `dst.GetValidCol() == src0.GetValidCol() == src1.GetValidCol() > 0`.
-    - **`tmp` buffer requirements**:
-      - `tmp.GetValidCol() >= dst.GetValidCol()` (at least as many columns as `dst`).
-      - `tmp.GetValidRow() >= 1` (at least one row).
-      - The element type must match `TileDataDst::DType`.
-- **Implementation checks (A5)**:
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - Supported element types: `float`, `int32_t`, `uint32_t`, `half`, `int16_t`, and `uint16_t`.
-    - `dst`, `src0`, and `src1` must be vector tiles.
-    - Static valid bounds: every tile must satisfy `ValidRow <= Rows` and `ValidCol <= Cols`.
-    - Runtime: `dst.GetValidRow() == src0.GetValidRow() == src1.GetValidRow()` and `dst.GetValidCol() == src0.GetValidCol() == src1.GetValidCol()`.
-    - Note: the `tmp` argument is accepted on A5 but is neither validated nor used.
-- **Division by zero**:
-    - Behavior is target-defined; the CPU simulator asserts in debug builds.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-- **For `int32_t` inputs (A2A3 only)**: all elements of `src0` and `src1` must lie in `[-2^24, 2^24]` (i.e. `[-16777216, 16777216]`) so that they convert exactly to float32 during computation.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
-  TileT out, a, b;
-  Tile<TileType::Vec, int32_t, 16, 16> tmp;
-  TREM(out, a, b, tmp);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trem %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md
deleted file mode 100644
index 428602a3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md
+++ /dev/null
@@ -1,116 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md` -->
-
-# TREMS
-
-## Instruction Diagram
-
-![TREMS tile operation](../figures/isa/TREMS.svg)
-
-## Summary
-
-Elementwise remainder of a tile and a scalar: `remainder(src, scalar)`.
-
-## Math Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$\mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \bmod \mathrm{scalar}$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = trems %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TREMS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar,
-                           TileDataTmp &tmp, WaitEvents &... events);
-```
-
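-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `dst` and `src` must use the same element type.
-    - Supported element types: `float` and `int32_t`.
-    - `dst` and `src` must be vector tiles.
-    - `dst` and `src` must be row-major.
-    - Runtime: `dst.GetValidRow() == src.GetValidRow() > 0` and `dst.GetValidCol() == src.GetValidCol() > 0`.
-    - **`tmp` buffer requirements**:
-      - `tmp.GetValidCol() >= dst.GetValidCol()` (at least as many columns as `dst`).
-      - `tmp.GetValidRow() >= 1` (at least one row).
-      - The element type must match `TileDataDst::DType`.
-- **Implementation checks (A5)**:
-    - `dst` and `src` must use the same element type.
-    - Supported element types: `float`, `int32_t`, `uint32_t`, `half`, `int16_t`, and `uint16_t`.
-    - `dst` and `src` must be vector tiles.
-    - Static valid bounds: both tiles must satisfy `ValidRow <= Rows` and `ValidCol <= Cols`.
-    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`.
-    - Note: the `tmp` argument is accepted on A5 but is neither validated nor used.
-- **Division by zero**:
-    - Behavior is target-defined; the CPU simulator asserts in debug builds.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-- **For `int32_t` inputs (A2A3 only)**: the elements of `src` and the value of `scalar` must lie in `[-2^24, 2^24]` (i.e. `[-16777216, 16777216]`) so that they convert exactly to float32 during computation.
-
-## Examples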
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  Tile<TileType::Vec, float, 16, 16> tmp;
-  TREMS(out, x, 3.0f, tmp);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trems %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trsqrt.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trsqrt.md
deleted file mode 100644
index 92c0a4a5..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trsqrt.md
+++ /dev/null
@@ -1,163 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/trsqrt.md` -->
-
-# pto.trsqrt
-
-Standalone reference page for `pto.trsqrt`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise reciprocal square root.
-
-## Mechanism
-
-Elementwise reciprocal square root. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \frac{1}{\sqrt{\mathrm{src}_{i,j}}} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trsqrt %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TRSQRT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TRSQRT(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` holds the elementwise reciprocal square root of `src` over its valid region.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-- **Domain / NaN**:
-    - Behavior is target-defined (e.g., for `src == 0` or negative inputs).
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (NPU)**:
-    - The `tmp` buffer must be at least 32 bytes. When `tmp` is provided, the high-precision version is executed.
-    - `TileData::DType` must be `float` or `half`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TRSQRT(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TRSQRT(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trsqrt %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tsel](./tsel.md)
-- Next op in family: [pto.tsqrt](./tsqrt.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trsqrt_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trsqrt_zh.md
deleted file mode 100644
index 1273d217..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/trsqrt_zh.md
+++ /dev/null
@@ -1,109 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/trsqrt_zh.md` -->
-
-# TRSQRT
-
-## Instruction Diagram
-
-![TRSQRT tile operation](../figures/isa/TRSQRT.svg)
-
-## Summary
-
-Elementwise reciprocal square root.
-
-## Math Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \frac{1}{\sqrt{\mathrm{src}_{i,j}}} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = trsqrt %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TRSQRT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TRSQRT(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (NPU)**:
-    - The `tmp` buffer must be at least 32 bytes. When `tmp` is provided, the high-precision version is executed.
-    - `TileData::DType` must be `float` or `half`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-- **Domain / NaN**:
-    - Behavior is target-defined (e.g., for `src == 0` or negative inputs).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TRSQRT(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TRSQRT(dst, src);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsel.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsel.md
deleted file mode 100644
index 4fcfe7ea..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsel.md
+++ /dev/null
@@ -1,172 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsel.md` -->
-
-# pto.tsel
-
-Standalone reference page for `pto.tsel`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Per-element conditional selection between two tiles using a predicate mask.
-
-## Mechanism
-
-For each element `(i, j)` in the destination's valid region:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src0}_{i,j} & \text{if } \mathrm{mask}_{i,j}\ \text{is true (non-zero)} \\
-\mathrm{src1}_{i,j} & \text{otherwise}
-\end{cases}
-$$
-
-The predicate mask tile uses a target-defined packed encoding. A temporary tile (`tmp`) is required as a working buffer for predicate unpacking.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tsel %mask, %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsel ins(%mask, %src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsel ins(%mask, %src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename MaskTile, typename TmpTile, typename... WaitEvents>
-PTO_INST RecordEvent TSEL(TileData &dst, MaskTile &selMask, TileData &src0,
-                          TileData &src1, TmpTile &tmp, WaitEvents &... events);
-```
-
-**Parameters:**
-- `dst`: destination tile receiving the selected values.
-- `selMask`: predicate mask tile. Lane `(i,j)` is true if non-zero; selects `src0[i,j]`.
-- `src0`: source tile selected when mask lane is true.
-- `src1`: source tile selected when mask lane is false.
-- `tmp`: required temporary working tile for predicate unpacking. Must have compatible shape.
-
-## Inputs
-
-- `dst` names the destination tile receiving the selected values.
-- `selMask` is the predicate mask tile. Lane `(i,j)` selects from `src0` if non-zero, otherwise from `src1`.
-- `src0` is the source tile selected for mask-true lanes.
-- `src1` is the source tile selected for mask-false lanes.
-- `tmp` is a required temporary working tile for predicate unpacking.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile — `src0` where the mask is true, `src1` otherwise.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `sizeof(TileData::DType)` MUST be `2` or `4` bytes.
-- `dst`, `src0`, and `src1` MUST use the **same element type**.
-- `dst`, `src0`, and `src1` MUST be row-major layout.
-- `dst`, `src0`, and `src1` MUST have the same declared shape.
-- `selMask` layout MUST be compatible with the target's predicate unpacking format.
-- The iteration domain is `dst.GetValidRow()` × `dst.GetValidCol()`.
-- `tmp` MUST have sufficient capacity to hold intermediate predicate bits; its exact requirements are target-defined.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** use non-row-major `dst`/`src0`/`src1` tiles.
-- **MUST NOT** use `dst`/`src0`/`src1` with different declared shapes.
-
-## Target-Profile Restrictions
-
-| Check | A2/A3 | A5 |
-|-------|:-----:|:--:|
-| Supported dtypes | `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float` | Same |
-| sizeof(dtype) | 2 or 4 bytes | Same |
-| Row-major layout | Required | Required |
-| Same shape (dst/src0/src1) | Required | Required |
-| `tmp` tile required | Yes | Yes |
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, uint32_t, 1, 16>;
-  TileT src0, src1, dst;
-  MaskT mask(16, 2);
-  TmpT tmp;
-  TSEL(dst, mask, src0, src1, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, uint32_t, 1, 16>;
-  TileT src0, src1, dst;
-  MaskT mask(16, 2);
-  TmpT tmp;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TASSIGN(mask, 0x4000);
-  TASSIGN(tmp,  0x5000);
-  TSEL(dst, mask, src0, src1, tmp);
-}
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsel %mask, %src0, %src1 : !pto.tile<...>
-pto.tsel ins(%mask, %src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tcvt](./tcvt.md)
-- Next op in family: [pto.trsqrt](./trsqrt.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsel_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsel_zh.md
deleted file mode 100644
index f2e9953f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsel_zh.md
+++ /dev/null
@@ -1,141 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsel_zh.md` -->
-
-# TSEL
-
-## Instruction Diagram
-
-![TSEL tile operation](../figures/isa/TSEL.svg)
-
-## Summary
-
-Elementwise selection between two tiles using a mask tile.
-
-## Math Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src0}_{i,j} & \text{if } \mathrm{mask}_{i,j}\ \text{is true} \\
-\mathrm{src1}_{i,j} & \text{otherwise}
-\end{cases}
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tsel %mask, %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsel ins(%mask, %src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename MaskTile, typename TmpTile, typename... WaitEvents>
-PTO_INST RecordEvent TSEL(TileData &dst, MaskTile &selMask, TileData &src0, TileData &src1, TmpTile &tmp, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `sizeof(TileData::DType)` must be `2` or `4` bytes.
-    - `TileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - The selection domain is determined by `dst.GetValidRow()` / `dst.GetValidCol()`.
-- **Implementation checks (A5)**:
-    - `sizeof(TileData::DType)` must be `2` or `4` bytes.
-    - `TileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - The selection domain is determined by `dst.GetValidRow()` / `dst.GetValidCol()`.
-- **Mask encoding**:
-    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, uint32_t, 1, 16>;
-  TileT src0, src1, dst;
-  MaskT mask(16, 2);
-  TmpT tmp;
-  TSEL(dst, mask, src0, src1, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  using TmpT = Tile<TileType::Vec, uint32_t, 1, 16>;
-  TileT src0, src1, dst;
-  MaskT mask(16, 2);
-  TmpT tmp;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TASSIGN(mask, 0x4000);
-  TASSIGN(tmp,  0x5000);
-  TSEL(dst, mask, src0, src1, tmp);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsel %mask, %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsel ins(%mask, %src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md
deleted file mode 100644
index 08870bdf..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md
+++ /dev/null
@@ -1,148 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md` -->
-
-# TSELS
-
-## Instruction Diagram
-
-![TSELS tile operation](../figures/isa/TSELS.svg)
-
-## Summary
-
-Elementwise selection between a source tile and a scalar using a mask tile.
-
-## Math Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src}_{i,j} & \text{if } \mathrm{mask}_{i,j}\ \text{is true} \\
-\mathrm{scalar} & \text{otherwise}
-\end{cases}
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tsels %mask, %src, %scalar : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataMask, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TSELS(TileDataDst &dst, TileDataMask &mask, TileDataSrc &src, TileDataTmp &tmp, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `sizeof(TileDataDst::DType)` must be `2` or `4` bytes.
-    - Supported data types are `half`, `float16_t`, `float`, and `float32_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be row-major.
-    - Runtime: `src.GetValidRow()/GetValidCol()` must match `dst.GetValidRow()/GetValidCol()`.
-- **Implementation checks (A5)**:
-    - `sizeof(TileDataDst::DType)` may be `1`, `2`, or `4` bytes.
-    - Supported data types are `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, and `float`.
-    - `dst` and `src` must use the same element type.
-    - `dst`, `mask`, and `src` must be row-major.
-    - Runtime: `src.GetValidRow()/GetValidCol()` must match `dst.GetValidRow()/GetValidCol()`.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-- **Mask encoding**:
-    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileDst = Tile<TileType::Vec, float, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, float, 16, 16>;
-  using TileTmp = Tile<TileType::Vec, float, 16, 16>;
-  using TileMask = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  TileDst dst;
-  TileSrc src;
-  TileTmp tmp;
-  TileMask mask(16, 2);
-  float scalar = 0.0f;
-  TSELS(dst, mask, src, tmp, scalar);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileDst = Tile<TileType::Vec, float, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, float, 16, 16>;
-  using TileTmp = Tile<TileType::Vec, float, 16, 16>;
-  using TileMask = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  TileDst dst;
-  TileSrc src;
-  TileTmp tmp;
-  TileMask mask(16, 2);
-  float scalar = 0.0f;
-  TASSIGN(src, 0x1000);
-  TASSIGN(tmp, 0x2000);
-  TASSIGN(dst, 0x3000);
-  TASSIGN(mask, 0x4000);
-  TSELS(dst, mask, src, tmp, scalar);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsels %mask, %src, %scalar : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshl.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshl.md
deleted file mode 100644
index c27ab9cc..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshl.md
+++ /dev/null
@@ -1,132 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tshl.md` -->
-
-# pto.tshl
-
-Standalone reference page for `pto.tshl`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise shift-left of two tiles.
-
-## Mechanism
-
-Elementwise shift-left of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \ll \mathrm{src1}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tshl %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tshl ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TSHL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` holds `src0` shifted left by `src1`, element by element, over its valid region.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-
-- **Implementation checks (A5)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileT x, sh, out;
-  TSHL(out, x, sh);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tshl %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tshl ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tdiv](./tdiv.md)
-- Next op in family: [pto.tshr](./tshr.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshl_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshl_zh.md
deleted file mode 100644
index a52736cc..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshl_zh.md
+++ /dev/null
@@ -1,104 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tshl_zh.md` -->
-
-# TSHL
-
-## Instruction Diagram
-
-![TSHL tile operation](../figures/isa/TSHL.svg)
-
-## Summary
-
-Elementwise shift-left of two tiles.
-
-## Math Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \ll \mathrm{src1}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tshl %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tshl ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TSHL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-- **Implementation checks (A5)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileT x, sh, out;
-  TSHL(out, x, sh);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tshl %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tshl ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshr.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshr.md
deleted file mode 100644
index b677bf03..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshr.md
+++ /dev/null
@@ -1,132 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tshr.md` -->
-
-# pto.tshr
-
-Standalone reference page for `pto.tshr`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise shift-right of two tiles.
-
-## Mechanism
-
-Elementwise shift-right of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \gg \mathrm{src1}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tshr %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tshr ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TSHR(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-
-- **Implementation checks (A5)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileT x, sh, out;
-  TSHR(out, x, sh);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tshr %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tshr ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tshl](./tshl.md)
-- Next op in family: [pto.txor](./txor.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshr_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshr_zh.md
deleted file mode 100644
index 771b570c..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tshr_zh.md
+++ /dev/null
@@ -1,104 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tshr_zh.md` -->
-
-# TSHR
-
-## Instruction Diagram
-
-![TSHR tile operation](../figures/isa/TSHR.svg)
-
-## Summary
-
-Elementwise shift-right of two tiles.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \gg \mathrm{src1}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tshr %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tshr ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TSHR(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-- **Implementation checks (A5)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must use the same element type.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileT x, sh, out;
-  TSHR(out, x, sh);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tshr %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tshr ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsqrt.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsqrt.md
deleted file mode 100644
index 0962ceee..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsqrt.md
+++ /dev/null
@@ -1,159 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsqrt.md` -->
-
-# pto.tsqrt
-
-Standalone reference page for `pto.tsqrt`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise square root.
-
-## Mechanism
-
-Elementwise square root. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \sqrt{\mathrm{src}_{i,j}} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tsqrt %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSQRT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-- **Domain / NaN**:
-    - Behavior for inputs outside the domain (e.g., negative values) is target-defined.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (NPU)**:
-    - `TileData::DType` must be one of: `float` or `half`;
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TSQRT(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TSQRT(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsqrt %src : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.trsqrt](./trsqrt.md)
-- Next op in family: [pto.texp](./texp.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsqrt_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsqrt_zh.md
deleted file mode 100644
index c16e6ea8..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsqrt_zh.md
+++ /dev/null
@@ -1,105 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsqrt_zh.md` -->
-
-# TSQRT
-
-## Instruction Diagram
-
-![TSQRT tile operation](../figures/isa/TSQRT.svg)
-
-## Summary
-
-Elementwise square root.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \sqrt{\mathrm{src}_{i,j}} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tsqrt %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSQRT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (NPU)**:
-    - `TileData::DType` must be one of: `float` or `half`;
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-- **Domain / NaN**:
-    - Behavior for inputs outside the domain (e.g., negative values) is target-defined.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TSQRT(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TSQRT(dst, src);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsub.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsub.md
deleted file mode 100644
index d2be6a29..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsub.md
+++ /dev/null
@@ -1,165 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsub.md` -->
-
-# pto.tsub
-
-Standalone reference page for `pto.tsub`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise subtract of two tiles.
-
-## Mechanism
-
-Elementwise subtract of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tsub %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TSUB(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TSUB(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsub %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tor](./tor.md)
-- Next op in family: [pto.tmul](./tmul.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsub_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsub_zh.md
deleted file mode 100644
index 19582754..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsub_zh.md
+++ /dev/null
@@ -1,110 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsub_zh.md` -->
-
-# TSUB
-
-## Instruction Diagram
-
-![TSUB tile operation](../figures/isa/TSUB.svg)
-
-## Summary
-
-Elementwise subtract of two tiles.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tsub %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TSUB(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TSUB(dst, src0, src1);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsubc.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsubc.md
deleted file mode 100644
index aa70bcc9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsubc.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsubc.md` -->
-
-# pto.tsubc
-
-Standalone reference page for `pto.tsubc`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise ternary op: `src0 - src1 + src2`.
-
-## Mechanism
-
-Elementwise ternary op: `src0 - src1 + src2`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,j} + \mathrm{src2}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tsubc %src0, %src1, %src2 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsubc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsubc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSUBC(TileData &dst, TileData &src0, TileData &src1, TileData &src2, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (the minuend).
-- `src1` is the second source tile (subtracted from `src0`); `src2` is the third source tile (added to the difference).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tsubc` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c, out;
-  TSUBC(out, a, b, c);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsubc %src0, %src1, %src2 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsubc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.taddc](./taddc.md)
-- Next op in family: [pto.tcvt](./tcvt.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsubc_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsubc_zh.md
deleted file mode 100644
index c4f20e50..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsubc_zh.md
+++ /dev/null
@@ -1,78 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsubc_zh.md` -->
-
-# TSUBC
-
-## Instruction Diagram
-
-![TSUBC tile operation](../figures/isa/TSUBC.svg)
-
-## Summary
-
-Elementwise ternary op: `src0 - src1 + src2`.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,j} + \mathrm{src2}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tsubc %src0, %src1, %src2 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsubc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsubc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSUBC(TileData &dst, TileData &src0, TileData &src1, TileData &src2, WaitEvents &... events);
-```
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c, out;
-  TSUBC(out, a, b, c);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md
deleted file mode 100644
index 2b3797c2..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md
+++ /dev/null
@@ -1,106 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md` -->
-
-# TSUBS
-
-## Instruction Diagram
-
-![TSUBS tile operation](../figures/isa/TSUBS.svg)
-
-## Summary
-
-Elementwise subtraction of a scalar from a tile.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} - \mathrm{scalar} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tsubs %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSUBS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileDataDst::Loc == TileType::Vec` and `TileDataSrc::Loc == TileType::Vec`).
-    - Static valid bounds: `TileDataDst::ValidRow <= TileDataDst::Rows`, `TileDataDst::ValidCol <= TileDataDst::Cols`, `TileDataSrc::ValidRow <= TileDataSrc::Rows`, and `TileDataSrc::ValidCol <= TileDataSrc::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-- **Common constraints**:
-    - `dst` and `src0` must use the same element type.
-    - The scalar type must match `TileDataSrc::DType`.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TSUBS(out, x, 1.0f);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsubs %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/txor.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/txor.md
deleted file mode 100644
index 269d03d8..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/txor.md
+++ /dev/null
@@ -1,139 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/txor.md` -->
-
-# pto.txor
-
-Standalone reference page for `pto.txor`. This page belongs to the [Elementwise Tile Tile](../../elementwise-tile-tile.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise bitwise XOR of two tiles.
-
-## Mechanism
-
-Elementwise bitwise XOR of two tiles. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \oplus \mathrm{src1}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = txor %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.txor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TXOR(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (left operand).
-- `src1` is the second source tile (right operand).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A5)**:
-    - `dst`, `src0`, and `src1` element types must match.
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst`, `src0`, and `src1` must be row-major.
-    - `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
-
-- **Implementation checks (A2A3)**:
-    - `dst`, `src0`, `src1`, and `tmp` element types must match.
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, and `int16_t`.
-    - `dst`, `src0`, `src1`, and `tmp` must be row-major.
-    - `src0`, `src1`, and `tmp` valid shapes must match `dst`.
-    - In manual mode, `dst`, `src0`, `src1`, and `tmp` must not overlap in memory.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileSrc0 = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileSrc1 = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileTmp = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileDst dst;
-  TileSrc0 src0;
-  TileSrc1 src1;
-  TileTmp tmp;
-  TXOR(dst, src0, src1, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = txor %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.txor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md)
-- Previous op in family: [pto.tshr](./tshr.md)
-- Next op in family: [pto.tlog](./tlog.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/txor_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/txor_zh.md
deleted file mode 100644
index dd040291..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/txor_zh.md
+++ /dev/null
@@ -1,111 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/txor_zh.md` -->
-
-# TXOR
-
-## 指令示意图
-
-![TXOR tile operation](../figures/isa/TXOR.svg)
-
-## 简介
-
-两个 Tile 的逐元素按位异或。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \oplus \mathrm{src1}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = txor %src0, %src1 : !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.txor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TXOR(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- 该操作在 `dst.GetValidRow()` / `dst.GetValidCol()` 上迭代。
-- **实现检查 (A5)**:
-    - `dst`、`src0` 和 `src1` 的元素类型必须一致。
-    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
-    - `dst`、`src0` 和 `src1` 必须是行主序。
-    - `src0.GetValidRow()/GetValidCol()` 和 `src1.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
-- **实现检查 (A2A3)**:
-    - `dst`、`src0`、`src1` 和 `tmp` 的元素类型必须一致。
-    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t` 和 `int16_t`。
-    - `dst`、`src0`、`src1` 和 `tmp` 必须是行主序。
-    - `src0`、`src1` 和 `tmp` 的有效形状必须与 `dst` 一致。
-    - 在手动模式下，`dst`、`src0`、`src1` 和 `tmp` 的内存区域不得重叠。
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileSrc0 = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileSrc1 = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileTmp = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileDst dst;
-  TileSrc0 src0;
-  TileSrc1 src1;
-  TileTmp tmp;
-  TXOR(dst, src0, src1, tmp);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = txor %src0, %src1 : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.txor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md
deleted file mode 100644
index ff96efb6..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md
+++ /dev/null
@@ -1,106 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md` -->
-
-# TXORS
-
-## 指令示意图
-
-![TXORS tile operation](../figures/isa/TXORS.svg)
-
-## 简介
-
-Tile 与标量的逐元素按位异或。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \oplus \mathrm{scalar} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = txors %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TXORS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t` 和 `int16_t`。
-    - `dst`、`src` 和 `tmp` 必须使用相同的元素类型。
-    - 在手动模式下，源、目标和临时存储的内存区域不得重叠。
-- **实现检查 (A5)**:
-    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
-    - `dst` 和 `src` 的元素类型必须一致。
-    - `src.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
-- **有效区域**:
-    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileTmp = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TileTmp tmp;
-  TXORS(dst, src, 0x1u, tmp);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = txors %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tci.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tci.md
deleted file mode 100644
index 59257792..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tci.md
+++ /dev/null
@@ -1,161 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tci.md` -->
-
-# pto.tci
-
-Standalone reference page for `pto.tci`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Generate a contiguous integer sequence into a destination tile.
-
-## Mechanism
-
-Generate a contiguous integer sequence into a destination tile. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-For a linearized index `k` over the valid elements:
-
-- Ascending:
-
-  $$ \mathrm{dst}_{k} = S + k $$
-
-- Descending:
-
-  $$ \mathrm{dst}_{k} = S - k $$
-
-The linearization order depends on the tile layout (implementation-defined).
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tci %S {descending = false} : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename T, int descending, typename... WaitEvents>
-PTO_INST RecordEvent TCI(TileData &dst, T start, WaitEvents &... events);
-```
-
-## Inputs
-
-- `start` is the starting integer value for the sequence.
-- `descending` (template parameter): if true, generates a descending sequence.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds a contiguous integer sequence starting from `start`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The implementation uses `dst.GetValidCol()` as the sequence length and does not consult `dst.GetValidRow()`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3/A5)**:
-    - `TileData::DType` must be exactly the same type as the scalar template parameter `T`.
-    - `dst/scalar` element types must be identical, and must be one of: `int32_t`, `uint32_t`, `int16_t`, `uint16_t`.
-    - `TileData::Cols != 1` (this is the condition enforced by the implementation).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, int32_t, 1, 16>;
-  TileT dst;
-  TCI<TileT, int32_t, /*descending=*/0>(dst, /*S=*/0);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, int32_t, 1, 16>;
-  TileT dst;
-  TASSIGN(dst, 0x1000);
-  TCI<TileT, int32_t, /*descending=*/1>(dst, /*S=*/100);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tci %S {descending = false} : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tgather](./tgather.md)
-- Next op in family: [pto.ttri](./ttri.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tci_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tci_zh.md
deleted file mode 100644
index 023401c6..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tci_zh.md
+++ /dev/null
@@ -1,108 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tci_zh.md` -->
-
-# TCI
-
-## 指令示意图
-
-![TCI tile operation](../figures/isa/TCI.svg)
-
-## 简介
-
-生成连续整数序列到目标 Tile 中。
-
-## 数学语义
-
-For a linearized index `k` over the valid elements:
-
-- Ascending:
-
-  $$ \mathrm{dst}_{k} = S + k $$
-
-- Descending:
-
-  $$ \mathrm{dst}_{k} = S - k $$
-
-The linearization order depends on the tile layout (implementation-defined).
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = tci %S {descending = false} : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename T, int descending, typename... WaitEvents>
-PTO_INST RecordEvent TCI(TileData &dst, T start, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3/A5)**:
-    - `TileData::DType` must be exactly the same type as the scalar template parameter `T`.
-    - `dst/scalar` element types must be identical, and must be one of: `int32_t`, `uint32_t`, `int16_t`, `uint16_t`.
-    - `TileData::Cols != 1` (this is the condition enforced by the implementation).
-- **有效区域**:
-    - The implementation uses `dst.GetValidCol()` as the sequence length and does not consult `dst.GetValidRow()`.
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, int32_t, 1, 16>;
-  TileT dst;
-  TCI<TileT, int32_t, /*descending=*/0>(dst, /*S=*/0);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, int32_t, 1, 16>;
-  TileT dst;
-  TASSIGN(dst, 0x1000);
-  TCI<TileT, int32_t, /*descending=*/1>(dst, /*S=*/100);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgather.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgather.md
deleted file mode 100644
index 52467a3d..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgather.md
+++ /dev/null
@@ -1,188 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tgather.md` -->
-
-# pto.tgather
-
-Standalone reference page for `pto.tgather`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Gather/select elements using either an index tile or a compile-time mask pattern.
-
-## Mechanism
-
-Gather/select elements using either an index tile or a compile-time mask pattern. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Index-based gather (conceptual):
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}\!\left[\mathrm{indices}_{i,j}\right] $$
-
-Exact index interpretation and bounds behavior are implementation-defined.
-
-Mask-pattern gather is an implementation-defined selection/reduction controlled by `pto::MaskPattern`.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Index-based gather:
-
-```text
-%dst = tgather %src0, %indices : !pto.tile<...> -> !pto.tile<...>
-```
-
-Mask-pattern gather:
-
-```text
-%dst = tgather %src {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%dst = pto.tgather %src {maskPattern = #pto.mask_pattern<P0101>}: !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-pto.tgather ins(%src, {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataD, typename TileDataS0, typename TileDataS1, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(TileDataD &dst, TileDataS0 &src0, TileDataS1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, MaskPattern maskPattern, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the source tile.
-- `indices` (index-based gather): index tile providing gather indices.
-- `tmp` (optional): temporary tile for index-based gather.
-- `maskPattern` (mask-pattern gather): compile-time mask pattern.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds gathered elements from `src0` at positions specified by `indices` or `maskPattern`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Bounds / validity**:
-    - Index bounds are not validated by explicit runtime assertions; out-of-range indices are target-defined.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Index-based gather: implementation checks (A2A3)**:
-    - `DstTileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, or `float`.
-    - `Src1TileData::DType` must be `int32_t` or `uint32_t`.
-    - `DstTileData::DType` must be the same type as `Src0TileData::DType`.
-    - `src1.GetValidCol() == Src1TileData::Cols` and `dst.GetValidCol() == DstTileData::Cols`.
-
-- **Index-based gather: implementation checks (A5)**:
-    - `DstTileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, or `float`.
-    - `Src1TileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, or `uint32_t`.
-    - `DstTileData::DType` must be the same type as `Src0TileData::DType`.
-    - `src1.GetValidCol() == Src1TileData::Cols` and `dst.GetValidCol() == DstTileData::Cols`.
-
-- **Mask-pattern gather: implementation checks (A2A3)**:
-    - Source element size must be `2` or `4` bytes.
-    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`,
-      `half`, `bfloat16_t`, or `float`.
-    - `dst` and `src` must both be `TileType::Vec` and row-major.
-    - `sizeof(dst element) == sizeof(src element)` and `dst.GetValidCol() == DstTileData::Cols` (contiguous dst storage).
-
-- **Mask-pattern gather: implementation checks (A5)**:
-    - Source element size must be `1` or `2` or `4` bytes.
-    - `dst` and `src` must both be `TileType::Vec` and row-major.
-    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`,
-      `half`, `bfloat16_t`, `float`, `float8_e4m3_t`, `float8_e5m2_t`, or `hifloat8_t`.
-    - Supported dtypes are restricted to a target-defined set (checked via `static_assert` in the implementation); in addition, `sizeof(dst element) == sizeof(src element)` and `dst.GetValidCol() == DstTileData::Cols` (contiguous dst storage).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using IdxT = Tile<TileType::Vec, int32_t, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src0;
-  IdxT idx;
-  DstT dst;
-  TGATHER(dst, src0, idx);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TGATHER<DstT, SrcT, MaskPattern::P0101>(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tsort32](./tsort32.md)
-- Next op in family: [pto.tci](./tci.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md
deleted file mode 100644
index 1d871180..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md
+++ /dev/null
@@ -1,155 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tgather_zh.md` -->
-
-# TGATHER
-
-## 指令示意图
-
-![TGATHER tile operation](../figures/isa/TGATHER.svg)
-
-## 简介
-
-使用索引 Tile 或编译时掩码模式来收集/选择元素。
-
-## 数学语义
-
-基于索引的 gather（概念性定义）：
-
-设 `R = dst.GetValidRow()`，`C = dst.GetValidCol()`。对于 `0 <= i < R` 且 `0 <= j < C`：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}\!\left[\mathrm{indices}_{i,j}\right] $$
-
-确切的索引解释和边界行为由实现定义。
-
-基于掩码模式的 gather 是由 `pto::MaskPattern` 控制的实现定义的选择/归约操作。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-基于索引的 gather：
-
-```text
-%dst = tgather %src0, %indices : !pto.tile<...> -> !pto.tile<...>
-```
-
-基于掩码模式的 gather：
-
-```text
-%dst = tgather %src {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%dst = pto.tgather %src {maskPattern = #pto.mask_pattern<P0101>}: !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-pto.tgather ins(%src, {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataD, typename TileDataS0, typename TileDataS1, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(TileDataD &dst, TileDataS0 &src0, TileDataS1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, MaskPattern maskPattern, typename... WaitEvents>
-PTO_INST RecordEvent TGATHER(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-```
-
-## 约束
-
-- **基于索引的 gather：实现检查 (A2A3)**:
-    - `sizeof(DstTileData::DType)` 对应类型必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`float` 之一。
-    - `sizeof(Src1TileData::DType)` 对应类型必须是 `int32_t`、`uint32_t` 之一。
-    - `DstTileData::DType` 必须与 `Src0TileData::DType` 类型相同。
-    - `src1.GetValidCol() == Src1TileData::Cols` 且 `dst.GetValidCol() == DstTileData::Cols`。
-- **基于索引的 gather：实现检查 (A5)**:
-    - `sizeof(DstTileData::DType)` 对应类型必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`float` 之一。
-    - `sizeof(Src1TileData::DType)` 对应类型必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t` 之一。
-    - `DstTileData::DType` 必须与 `Src0TileData::DType` 类型相同。
-    - `src1.GetValidCol() == Src1TileData::Cols` 且 `dst.GetValidCol() == DstTileData::Cols`。
-- **基于掩码模式的 gather：实现检查 (A2A3)**:
-    - 源元素大小必须是 `2` 或 `4` 字节。
-    - `SrcTileData::DType`/`DstTileData::DType` 必须是 `int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t` 或 `float` 之一。
-    - `dst` 和 `src` 必须都是 `TileType::Vec` 且行主序。
-    - `sizeof(dst element) == sizeof(src element)` 且 `dst.GetValidCol() == DstTileData::Cols`（连续的目标存储）。
-- **基于掩码模式的 gather：实现检查 (A5)**:
-    - 源元素大小必须是 `1`、`2` 或 `4` 字节。
-    - `dst` 和 `src` 必须都是 `TileType::Vec` 且行主序。
-    - `SrcTileData::DType`/`DstTileData::DType` 必须是 `int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`、`float8_e4m3_t`、`float8_e5m2_t` 或 `hifloat8_t` 之一。
-    - 支持的数据类型限制为目标定义的集合（通过实现中的 `static_assert` 强制执行），且 `sizeof(dst element) == sizeof(src element)`，`dst.GetValidCol() == DstTileData::Cols`（连续的目标存储）。
-- **边界 / 有效性**:
-    - 索引边界不通过显式运行时断言进行验证；超出范围的索引行为由目标定义。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using IdxT = Tile<TileType::Vec, int32_t, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src0;
-  IdxT idx;
-  DstT dst;
-  TGATHER(dst, src0, idx);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TGATHER<DstT, SrcT, MaskPattern::P0101>(dst, src);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgatherb.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgatherb.md
deleted file mode 100644
index 5c2ef2c3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgatherb.md
+++ /dev/null
@@ -1,170 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tgatherb.md` -->
-
-# pto.tgatherb
-
-Standalone reference page for `pto.tgatherb`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Gather elements using byte offsets.
-
-## Mechanism
-
-Gather elements using byte offsets. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-For each element in the valid region:
-
-$$ \mathrm{dst}_{i,j} = *\left(\mathrm{srcBase} + \mathrm{offset}_{i,j}\right) $$
-
-Exact bounds behavior is implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tgatherb %src, %offsets : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename TileDataOffset, typename... WaitEvents>
-PTO_INST RecordEvent TGATHERB(TileDataDst &dst, TileDataSrc &src, TileDataOffset &offset, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `offset` is an offset tile providing byte offsets for each destination element.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds elements gathered from `src` using byte offsets from `offset`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Offset interpretation**:
-    - Offsets are interpreted as `uint32_t` values (byte offsets) by the implementation.
-    - Offset bounds are not validated by explicit runtime assertions; out-of-range offsets are target-defined.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Destination layout must be row-major (`TileDataDst::isRowMajor`).
-    - Destination element size must be `1`, `2`, or `4` bytes (enforced via `static_assert` in the helper).
-    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
-
-- **Implementation checks (A5)**:
-    - Destination element size must be `1`, `2`, or `4` bytes.
-    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, uint8_t, 1, 256>;
-  using OffT = Tile<TileType::Vec, uint32_t, 1, 256>;
-  using DstT = Tile<TileType::Vec, uint8_t, 1, 256>;
-  SrcT src;
-  OffT off;
-  DstT dst;
-  TGATHERB(dst, src, off);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, uint8_t, 1, 256>;
-  using OffT = Tile<TileType::Vec, uint32_t, 1, 256>;
-  using DstT = Tile<TileType::Vec, uint8_t, 1, 256>;
-  SrcT src;
-  OffT off;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(off, 0x2000);
-  TASSIGN(dst, 0x3000);
-  TGATHERB(dst, src, off);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tgatherb %src, %offsets : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tpartmin](./tpartmin.md)
-- Next op in family: [pto.tscatter](./tscatter.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgatherb_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgatherb_zh.md
deleted file mode 100644
index e1e6f82f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tgatherb_zh.md
+++ /dev/null
@@ -1,116 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tgatherb_zh.md` -->
-
-# TGATHERB
-
-## 指令示意图
-
-![TGATHERB tile operation](../figures/isa/TGATHERB.svg)
-
-## 简介
-
-使用字节偏移量收集元素。
-
-## 数学语义
-
-对有效区域内的每个元素：
-
-$$ \mathrm{dst}_{i,j} = *\left(\mathrm{srcBase} + \mathrm{offset}_{i,j}\right) $$
-
-Exact bounds behavior is implementation-defined.
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = tgatherb %src, %offsets : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1（SSA）
-
-```text
-%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2（DPS）
-
-```text
-pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename TileDataOffset, typename... WaitEvents>
-PTO_INST RecordEvent TGATHERB(TileDataDst &dst, TileDataSrc &src, TileDataOffset &offset, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - Destination layout must be row-major (`TileDataDst::isRowMajor`).
-    - Destination element size must be `1`, `2`, or `4` bytes (enforced via `static_assert` in the helper).
-    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
-- **实现检查 (A5)**:
-    - Destination element size must be `1`, `2`, or `4` bytes.
-    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
-- **Offset interpretation**:
-    - Offsets are interpreted as `uint32_t` values (byte offsets) by the implementation.
-    - Offset bounds are not validated by explicit runtime assertions; out-of-range offsets are target-defined.
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, uint8_t, 1, 256>;
-  using OffT = Tile<TileType::Vec, uint32_t, 1, 256>;
-  using DstT = Tile<TileType::Vec, uint8_t, 1, 256>;
-  SrcT src;
-  OffT off;
-  DstT dst;
-  TGATHERB(dst, src, off);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, uint8_t, 1, 256>;
-  using OffT = Tile<TileType::Vec, uint32_t, 1, 256>;
-  using DstT = Tile<TileType::Vec, uint8_t, 1, 256>;
-  SrcT src;
-  OffT off;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(off, 0x2000);
-  TASSIGN(dst, 0x3000);
-  TGATHERB(dst, src, off);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tmrgsort.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tmrgsort.md
deleted file mode 100644
index c72d8d21..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tmrgsort.md
+++ /dev/null
@@ -1,186 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tmrgsort.md` -->
-
-# pto.tmrgsort
-
-Standalone reference page for `pto.tmrgsort`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Merge sort for multiple sorted lists (implementation-defined element format and layout).
-
-## Mechanism
-
-Merge sort for multiple sorted lists (implementation-defined element format and layout). It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Merges sorted input lists into `dst`. Ordering, element format (e.g., value/index pairs), and the meaning of executed counts depend on the implementation.
-
-$$ \mathrm{dst} = \mathrm{merge}(\mathrm{src}_0, \mathrm{src}_1, \ldots) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form (conceptual):
-
-```text
-%dst, %executed = tmrgsort %src0, %src1 {exhausted = false}
-    : !pto.tile<...>, !pto.tile<...> -> (!pto.tile<...>, vector<4xi16>)
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
-%dst, %executed = pto.tmrgsort %src0, %src1, %src2, %src3 {exhausted = false}
- : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, vector<4xi16>)
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
-pto.tmrgsort ins(%src0, %src1, %src2, %src3 {exhausted = false} : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-outs(%dst, %executed : !pto.tile_buf<...>, vector<4xi16>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
-%dst, %executed = pto.tmrgsort %src0, %src1, %src2, %src3 {exhausted = false}
- : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, vector<4xi16>)
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
-pto.tmrgsort ins(%src0, %src1, %src2, %src3 {exhausted = false} : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-outs(%dst, %executed : !pto.tile_buf<...>, vector<4xi16>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData,
-          typename Src2TileData, typename Src3TileData, bool exhausted, typename... WaitEvents>
-PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, Src2TileData &src2, Src3TileData &src3, WaitEvents &... events);
-
-template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData,
-          typename Src2TileData, bool exhausted, typename... WaitEvents>
-PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, Src2TileData &src2, WaitEvents &... events);
-
-template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData, bool exhausted,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TMRGSORT(DstTileData &dst, SrcTileData &src, uint32_t blockLen, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0...src3` are source tiles (sorted lists to merge).
-- `tmp` is a temporary tile used during merge.
-- `executedNumList` outputs the number of consumed elements from each source.
-- `blockLen` (single-list variant): length of each sorted block.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the merged sorted output. `executedNumList` reports consumed counts.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Single-list variant (`TMRGSORT(dst, src, blockLen)`)**:
-    - `blockLen` must be a multiple of 64 (as checked by the implementation).
-    - `src.GetValidCol()` must be an integer multiple of `blockLen * 4`.
-    - `repeatTimes = src.GetValidCol() / (blockLen * 4)` must be in `[1, 255]`.
-
-- **Multi-list variants**:
-    - `tmp` is required and `executedNumList` is written by the implementation; supported list counts and exact semantics are target-defined.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3/A5)**:
-    - Element type must be `half` or `float` and must match across `dst/tmp/src*` tiles.
-    - All tiles must be `TileType::Vec`, row-major, and have `Rows == 1` (list stored in a single row).
-    - UB memory usage is checked (compile-time and runtime) against target limits (single `Cols` across inputs plus `tmp`/`dst`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 1, 256>;
-  using DstT = Tile<TileType::Vec, float, 1, 256>;
-  SrcT src;
-  DstT dst;
-  TMRGSORT(dst, src, /*blockLen=*/64);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 1, 256>;
-  using DstT = Tile<TileType::Vec, float, 1, 256>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TMRGSORT(dst, src, /*blockLen=*/64);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tprint](./tprint.md)
-- Next op in family: [pto.tsort32](./tsort32.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tmrgsort_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tmrgsort_zh.md
deleted file mode 100644
index fc14e1aa..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tmrgsort_zh.md
+++ /dev/null
@@ -1,130 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tmrgsort_zh.md` -->
-
-# TMRGSORT
-
-## 指令示意图
-
-![TMRGSORT tile operation](../figures/isa/TMRGSORT.svg)
-
-## 简介
-
-用于多个已排序列表的归并排序（实现定义的元素格式和布局）。
-
-## 数学语义
-
-Merges sorted input lists into `dst`. Ordering, element format (e.g., value/index pairs), and the meaning of executed counts depend on the implementation.
-
-$$ \mathrm{dst} = \mathrm{merge}(\mathrm{src}_0, \mathrm{src}_1, \ldots) $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form (conceptual):
-
-```text
-%dst, %executed = tmrgsort %src0, %src1 {exhausted = false}
-    : !pto.tile<...>, !pto.tile<...> -> (!pto.tile<...>, vector<4xi16>)
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
-%dst, %executed = pto.tmrgsort %src0, %src1, %src2, %src3 {exhausted = false}
- : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, vector<4xi16>)
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
-pto.tmrgsort ins(%src0, %src1, %src2, %src3 {exhausted = false} : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-outs(%dst, %executed : !pto.tile_buf<...>, vector<4xi16>)
-```
-
-### IR Level 1（SSA）
-
-```text
-%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
-%dst, %executed = pto.tmrgsort %src0, %src1, %src2, %src3 {exhausted = false}
- : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, vector<4xi16>)
-```
-
-### IR Level 2（DPS）
-
-```text
-pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
-pto.tmrgsort ins(%src0, %src1, %src2, %src3 {exhausted = false} : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-outs(%dst, %executed : !pto.tile_buf<...>, vector<4xi16>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData,
-          typename Src2TileData, typename Src3TileData, bool exhausted, typename... WaitEvents>
-PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, Src2TileData &src2, Src3TileData &src3, WaitEvents &... events);
-
-template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData,
-          typename Src2TileData, bool exhausted, typename... WaitEvents>
-PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, Src2TileData &src2, WaitEvents &... events);
-
-template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData, bool exhausted,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TMRGSORT(DstTileData &dst, SrcTileData &src, uint32_t blockLen, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3/A5)**:
-    - Element type must be `half` or `float` and must match across `dst/tmp/src*` tiles.
-    - All tiles must be `TileType::Vec`, row-major, and have `Rows == 1` (list stored in a single row).
-    - UB memory usage is checked (compile-time and runtime) against target limits (single `Cols` across inputs plus `tmp`/`dst`).
-- **Single-list variant (`TMRGSORT(dst, src, blockLen)`)**:
-    - `blockLen` must be a multiple of 64 (as checked by the implementation).
-    - `src.GetValidCol()` must be an integer multiple of `blockLen * 4`.
-    - `repeatTimes = src.GetValidCol() / (blockLen * 4)` must be in `[1, 255]`.
-- **Multi-list variants**:
-    - `tmp` is required and `executedNumList` is written by the implementation; supported list counts and exact semantics are target-defined.
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 1, 256>;
-  using DstT = Tile<TileType::Vec, float, 1, 256>;
-  SrcT src;
-  DstT dst;
-  TMRGSORT(dst, src, /*blockLen=*/64);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 1, 256>;
-  using DstT = Tile<TileType::Vec, float, 1, 256>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TMRGSORT(dst, src, /*blockLen=*/64);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartadd.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartadd.md
deleted file mode 100644
index cfc18c0a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartadd.md
+++ /dev/null
@@ -1,168 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tpartadd.md` -->
-
-# pto.tpartadd
-
-Standalone reference page for `pto.tpartadd`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Partial elementwise add with implementation-defined handling of mismatched valid regions.
-
-## Mechanism
-
-Performs elementwise addition over the destination valid region. When both `src0` and `src1` are valid at an element, the result is their sum; when only one input is valid there, the result copies that input value. Handling of other mismatched-validity cases is implementation-defined. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-For each element `(i, j)` in the destination valid region:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} & \text{if both inputs are defined at } (i,j) \\
-\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
-\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
-\end{cases}
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TPARTADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile.
-- `src1` is the second source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the elementwise partial sum: where both inputs are valid, the result is their sum; where only one input is valid, the result is that input's value.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst`, `src0`, and `src1` must use the same element type.
-
-- The destination valid region defines the result domain.
-
-- For each element in the destination valid region:
-  - if both inputs are valid, the result is their elementwise sum;
-  - if only one input is valid, the result copies that input value.
-
-- If `dst` has a zero valid region, the instruction returns early.
-
-- Supported partial-validity patterns require at least one source tile to have a valid region exactly equal to `dst`, while the other source tile's valid region must not exceed `dst` in either dimension.
-
-- Supported element types (A2A3): `int32_t`, `int16_t`, `half`, `float`.
-
-- Supported element types (A5): `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- Handling of any validity pattern not explicitly listed above is implementation-defined.
-
-### A2A3 implementation checks
-
-- `dst`, `src0`, and `src1` must all be row-major (`isRowMajor`).
-
-### A5 implementation checks
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TPARTADD(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TPARTADD(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.ttri](./ttri.md)
-- Next op in family: [pto.tpartmul](./tpartmul.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md
deleted file mode 100644
index bf463fc3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md` -->
-
-# TPARTADD
-
-## 指令示意图
-
-![TPARTADD tile operation](../figures/isa/TPARTADD.svg)
-
-## 简介
-
-在目标有效区域内执行逐元素加法。若某个位置上 `src0` 和 `src1` 都有效，则结果为两者之和；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
-
-## 数学语义
-
-对目标有效区域内的每个元素 `(i, j)`：
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\
-\mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
-\end{cases}
-$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TPARTADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## 约束
-
-### 通用约束或检查
-
-- `dst`、`src0` 和 `src1` 的元素类型必须一致。
-- 目标有效区域定义结果的计算范围。
-- 对目标有效区域内的每个元素：
-  - 若两个输入都有效，则执行该指令对应的逐元素运算；
-  - 若只有一个输入有效，则结果直接取该输入的值。
-- 若 `dst` 的有效区域为零，指令直接返回。
-- 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
-- 上述范围之外的有效区域组合，其行为均由具体实现定义。
-
-### A2A3 实现检查
-
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
-
-### A5 实现检查
-
-- 支持的元素类型：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TPARTADD(dst, src0, src1);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TPARTADD(dst, src0, src1);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmax.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmax.md
deleted file mode 100644
index 9676deba..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmax.md
+++ /dev/null
@@ -1,168 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tpartmax.md` -->
-
-# pto.tpartmax
-
-Standalone reference page for `pto.tpartmax`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Partial elementwise max with implementation-defined handling of mismatched valid regions.
-
-## Mechanism
-
-Performs elementwise maximum selection over the destination valid region. When both `src0` and `src1` are valid at an element, the result is `max(src0, src1)`; when only one input is valid there, the result copies that input value. Handling of other mismatched-validity cases is implementation-defined. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-For each element `(i, j)` in the destination valid region:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{if both inputs are defined at } (i,j) \\
-\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
-\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
-\end{cases}
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TPARTMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile.
-- `src1` is the second source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the elementwise partial maximum: where both inputs are valid, the result is `max(src0, src1)`; where only one input is valid, the result is that input's value.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst`, `src0`, and `src1` must use the same element type.
-
-- The destination valid region defines the result domain.
-
-- For each element in the destination valid region:
-  - if both inputs are valid, the instruction applies the elementwise maximum;
-  - if only one input is valid, the result copies that input value.
-
-- If `dst` has a zero valid region, the instruction returns early.
-
-- Supported partial-validity patterns require at least one source tile to have a valid region exactly equal to `dst`, while the other source tile's valid region must not exceed `dst` in either dimension.
-
-- Supported element types (A2A3): `int32_t`, `int16_t`, `half`, `float`.
-
-- Supported element types (A5): `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- Handling of any validity pattern not explicitly listed above is implementation-defined.
-
-### A2A3 implementation checks
-
-- `dst`, `src0`, and `src1` must all be row-major (`isRowMajor`).
-
-### A5 implementation checks
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TPARTMAX(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TPARTMAX(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tpartmul](./tpartmul.md)
-- Next op in family: [pto.tpartmin](./tpartmin.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md
deleted file mode 100644
index 5dd80aee..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md` -->
-
-# TPARTMAX
-
-## 指令示意图
-
-![TPARTMAX tile operation](../figures/isa/TPARTMAX.svg)
-
-## 简介
-
-在目标有效区域内执行逐元素最大值选择。若某个位置上 `src0` 和 `src1` 都有效，则结果为 `max(src0, src1)`；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
-
-## 数学语义
-
-对目标有效区域内的每个元素 `(i, j)`：
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\\\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\\\
-\mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
-\end{cases}
-$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TPARTMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## 约束
-
-### 通用约束或检查
-
-- `dst`、`src0` 和 `src1` 的元素类型必须一致。
-- 目标有效区域定义结果的计算范围。
-- 对目标有效区域内的每个元素：
-  - 若两个输入都有效，则执行逐元素最大值运算；
-  - 若只有一个输入有效，则结果直接取该输入的值。
-- 若 `dst` 的有效区域为零，指令直接返回。
-- 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
-- 上述范围之外的有效区域组合，其行为均由具体实现定义。
-
-### A2A3 实现检查
-
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
-
-### A5 实现检查
-
-- 支持的元素类型：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TPARTMAX(dst, src0, src1);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TPARTMAX(dst, src0, src1);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmin.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmin.md
deleted file mode 100644
index d2450d40..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmin.md
+++ /dev/null
@@ -1,168 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tpartmin.md` -->
-
-# pto.tpartmin
-
-Standalone reference page for `pto.tpartmin`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Partial elementwise min with implementation-defined handling of mismatched valid regions.
-
-## Mechanism
-
-Performs elementwise minimum selection over the destination valid region. When both `src0` and `src1` are valid at an element, the result is `min(src0, src1)`; when only one input is valid there, the result copies that input value. Handling of other mismatched-validity cases is implementation-defined. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-For each element `(i, j)` in the destination valid region:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{if both inputs are defined at } (i,j) \\
-\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
-\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
-\end{cases}
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TPARTMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile.
-- `src1` is the second source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the elementwise partial minimum: where both inputs are valid, the result is their minimum; where only one input is valid, the result is that input's value.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst`, `src0`, and `src1` must use the same element type.
-
-- The destination valid region defines the result domain.
-
-- For each element in the destination valid region:
-  - if both inputs are valid, the instruction applies the elementwise minimum;
-  - if only one input is valid, the result copies that input value.
-
-- If `dst` has a zero valid region, the instruction returns early.
-
-- Supported partial-validity patterns require at least one source tile to have a valid region exactly equal to `dst`, while the other source tile's valid region must not exceed `dst` in either dimension.
-
-- Supported element types on A2A3: `int32_t`, `int16_t`, `half`, `float`.
-
-- Supported element types on A5: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- Handling of any validity pattern not explicitly listed above is implementation-defined.
-
-### A2A3 implementation checks
-
-- `dst`, `src0`, and `src1` must all be row-major (`isRowMajor`).
-
-### A5 implementation checks
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TPARTMIN(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TPARTMIN(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tpartmax](./tpartmax.md)
-- Next op in family: [pto.tgatherb](./tgatherb.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md
deleted file mode 100644
index c0437186..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md` -->
-
-# TPARTMIN
-
-## 指令示意图
-
-![TPARTMIN tile operation](../figures/isa/TPARTMIN.svg)
-
-## 简介
-
-在目标有效区域内执行逐元素最小值选择。若某个位置上 `src0` 和 `src1` 都有效，则结果为 `min(src0, src1)`；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
-
-## 数学语义
-
-对目标有效区域内的每个元素 `(i, j)`：
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\\\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\\\
-\mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
-\end{cases}
-$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TPARTMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## 约束
-
-### 通用约束或检查
-
-- `dst`、`src0` 和 `src1` 的元素类型必须一致。
-- 目标有效区域定义结果的计算范围。
-- 对目标有效区域内的每个元素：
-  - 若两个输入都有效，则执行逐元素最小值运算；
-  - 若只有一个输入有效，则结果直接取该输入的值。
-- 若 `dst` 的有效区域为零，指令直接返回。
-- 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
-- 上述范围之外的有效区域组合，其行为均由具体实现定义。
-
-### A2A3 实现检查
-
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
-
-### A5 实现检查
-
-- 支持的元素类型：`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`half`、`bfloat16_t`、`float`。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TPARTMIN(dst, src0, src1);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TPARTMIN(dst, src0, src1);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmul.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmul.md
deleted file mode 100644
index bc275b86..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmul.md
+++ /dev/null
@@ -1,178 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tpartmul.md` -->
-
-# pto.tpartmul
-
-Standalone reference page for `pto.tpartmul`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Partial elementwise multiply with implementation-defined handling of mismatched valid regions.
-
-## Mechanism
-
-Performs elementwise multiplication over the destination valid region. When both `src0` and `src1` are valid at an element, the result is their product; when only one input is valid there, the result copies that input value. Handling of other mismatched-validity cases is implementation-defined. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-For each element `(i, j)` in the destination valid region:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} & \text{if both inputs are defined at } (i,j) \\
-\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
-\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
-\end{cases}
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TPARTMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile.
-- `src1` is the second source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the elementwise partial product: where both inputs are valid, the result is their product; where only one input is valid, the result is that input's value.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst`, `src0`, and `src1` must use the same element type.
-
-- The destination valid region defines the result domain.
-
-- For each element in the destination valid region:
-  - if both inputs are valid, the instruction applies the elementwise product;
-  - if only one input is valid, the result copies that input value.
-
-- If `dst` has a zero valid region, the instruction returns early.
-
-- Supported partial-validity patterns require at least one source tile to have a valid region exactly equal to `dst`, while the other source tile's valid region must not exceed `dst` in either dimension.
-
-- Supported element types on A2A3: `int32_t`, `int16_t`, `half`, `float`.
-
-- Supported element types on A5: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- Handling of any validity pattern not explicitly listed above is implementation-defined.
-
-### A2A3 implementation checks
-
-- `dst`, `src0`, and `src1` must all be row-major (`isRowMajor`).
-
-### A5 implementation checks
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TPARTMUL(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TPARTMUL(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tpartadd](./tpartadd.md)
-- Next op in family: [pto.tpartmax](./tpartmax.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md
deleted file mode 100644
index 8e1a7883..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md` -->
-
-# TPARTMUL
-
-## 指令示意图
-
-![TPARTMUL tile operation](../figures/isa/TPARTMUL.svg)
-
-## 简介
-
-在目标有效区域内执行逐元素乘法。若某个位置上 `src0` 和 `src1` 都有效，则结果为两者之积；若只有一个输入在该位置有效，则结果直接取该输入的值。其余有效区域不匹配的情况由具体实现定义。
-
-## 数学语义
-
-对目标有效区域内的每个元素 `(i, j)`：
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} & \text{若两个输入在 } (i,j) \text{ 处均有定义} \\\\
-\mathrm{src0}_{i,j} & \text{若仅 src0 在 } (i,j) \text{ 处有定义} \\\\
-\mathrm{src1}_{i,j} & \text{若仅 src1 在 } (i,j) \text{ 处有定义}
-\end{cases}
-$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TPARTMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## 约束
-
-### 通用约束或检查
-
-- `dst`、`src0` 和 `src1` 的元素类型必须一致。
-- 目标有效区域定义结果的计算范围。
-- 对目标有效区域内的每个元素：
-  - 若两个输入都有效，则执行该指令对应的逐元素运算；
-  - 若只有一个输入有效，则结果直接取该输入的值。
-- 若 `dst` 的有效区域为零，指令直接返回。
-- 支持的部分有效区域模式要求至少有一个源 Tile 的有效区域与 `dst` 完全一致，另一个源 Tile 的有效区域在两个维度上都不能超过 `dst`。
-- 上述范围之外的有效区域组合，其行为均由具体实现定义。
-
-### A2A3 实现检查
-
-- 支持的元素类型：`int32_t`、`int16_t`、`half`、`float`。
-- `dst`、`src0` 和 `src1` 必须全部为行主序（`isRowMajor`）。
-
-### A5 实现检查
-
-- 支持的元素类型：`uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t`、`int32_t`、`half`、`float`、`bfloat16_t`。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TPARTMUL(dst, src0, src1);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src0, src1, dst;
-  TASSIGN(src0, 0x1000);
-  TASSIGN(src1, 0x2000);
-  TASSIGN(dst,  0x3000);
-  TPARTMUL(dst, src0, src1);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tprint.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tprint.md
deleted file mode 100644
index be42222e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tprint.md
+++ /dev/null
@@ -1,192 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tprint.md` -->
-
-# pto.tprint
-
-Standalone reference page for `pto.tprint`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Debug/print elements from a tile (implementation-defined).
-
-## Mechanism
-
-Print the contents of a Tile or GlobalTensor for debugging purposes directly from device code.
-
-The `TPRINT` instruction outputs the logical view of data stored in a Tile or GlobalTensor. It supports common data types (e.g., `float`, `half`, `int8`, `uint32`) and multiple memory layouts (`ND`, `DN`, `NZ` for GlobalTensor; vector tiles for on-chip buffers).
-
-> **Important**:
-> - This instruction is **for development and debugging ONLY**.
-> - It incurs **significant runtime overhead** and **must not be used in production kernels**.
-> - Output may be **truncated** if it exceeds the internal print buffer. The print buffer can be adjusted with `-DCCEBlockMaxSize=16384`; the default is 16 KiB.
-> - **Requires the CCE compilation options `-D_DEBUG --cce-enable-print`**. `TPRINT` belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-- **Mandatory Compilation Flag**:
-
-  On A2/A3/A5 devices, `TPRINT` uses `cce::printf` to emit output via the device-to-host debug channel. **You must enable the CCE option `-D_DEBUG --cce-enable-print`**.
-
-- **Buffer Limitation:**
-
-  The internal print buffer of `cce::printf` is limited in size. If the output exceeds this buffer, a warning message such as `"Warning: out of bound! try best to print"` may appear, and **only partial data will be printed**.
-
-- **Synchronization**:
-
-  `TPRINT` automatically inserts a `pipe_barrier(PIPE_ALL)` before printing so that all prior operations complete and the data is consistent.
-
-- **Formatting**:
-
-    - Floating-point values: printed as `%6.2f`
-    - Integer values: printed as `%6d`
-    - For `GlobalTensor`, due to data size and buffer limitations, only elements within its logical shape (defined by `Shape`) are printed.
-    - For `Tile`, invalid regions (beyond `validRows`/`validCols`) are still printed but marked with a `|` separator when partial validity is specified.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-```text
-tprint %src : !pto.tile<...> | !pto.global<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-```cpp
-// For printing GlobalTensor or Vec-type Tile
-template <typename TileData>
-PTO_INST void TPRINT(TileData &src);
-
-// For printing Acc-type Tile and Mat-type Tile (Mat printing is currently A3-only)
-template <typename TileData, typename GlobalData>
-PTO_INTERNAL void TPRINT(TileData &src, GlobalData &tmp);
-```
-
-### Supported Types for `TileData`
-- **Tile**: `TileType` may be `Vec`, `Acc`, or `Mat` (Mat printing is currently supported on A3 only).
-- **GlobalTensor**: Must use layout `ND`, `DN`, or `NZ`, and have a supported element type.
-
-## Inputs
-
-- `src` is the Tile or GlobalTensor to print.
-
-## Expected Outputs
-
-Debug output is emitted to the device-to-host debug channel. The tile data is not modified.
-
-## Side Effects
-
-This operation emits debug output via `cce::printf`. It synchronizes by inserting a `pipe_barrier(PIPE_ALL)` before printing. Significant runtime overhead is expected.
-
-## Constraints
-
-- **Supported element type**:
-    - Floating-point: `float`, `half`
-    - Signed integers: `int8_t`, `int16_t`, `int32_t`
-    - Unsigned integers: `uint8_t`, `uint16_t`, `uint32_t`
-
-- **For GlobalTensor**: Layout must be one of `Layout::ND`, `Layout::DN`, or `Layout::NZ`.
-
-- **For temporary space**: Printing a `Tile` with `TileType::Mat` or `TileType::Acc` requires GM temporary space. The temporary buffer must be at least `TileData::Numel * sizeof(T)`.
-
-- When `TileType` is `Mat`, the output is formatted according to `Layout::ND`; other layouts may appear misaligned.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- A5 does not yet support printing `TileType::Mat`.
-
-## Examples
-
-### Print a Tile
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-PTO_INTERNAL void DebugTile(__gm__ float *src) {
-  using ValidSrcShape = TileShape2D<float, 16, 16>;
-  using NDSrcShape = BaseShape2D<float, 32, 32>;
-  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
-  GlobalDataSrc srcGlobal(src);
-
-  using srcTileData = Tile<TileType::Vec, float, 16, 16>;
-  srcTileData srcTile;
-  TASSIGN(srcTile, 0x0);
-
-  TLOAD(srcTile, srcGlobal);
-  TPRINT(srcTile);
-}
-```
-
-### Print a GlobalTensor
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-PTO_INTERNAL void DebugGlobalTensor(__gm__ float *src) {
-  using ValidSrcShape = TileShape2D<float, 16, 16>;
-  using NDSrcShape = BaseShape2D<float, 32, 32>;
-  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
-  GlobalDataSrc srcGlobal(src);
-
-  TPRINT(srcGlobal);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### PTO Assembly Form
-
-```text
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-# AS Level 2 (DPS)
-pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Next op in family: [pto.tmrgsort](./tmrgsort.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md
deleted file mode 100644
index 62f0d6eb..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tprint_zh.md` -->
-
-# TPRINT
-
-## 指令示意图
-
-![TPRINT tile operation](../figures/isa/TPRINT.svg)
-
-## 简介
-
-调试/打印 Tile 中的元素（实现定义）。
-
-从设备代码直接打印 Tile 或 GlobalTensor 的内容以用于调试目的。
-
-`TPRINT` 指令输出存储在 Tile 或 GlobalTensor 中的数据的逻辑视图。它支持常见的数据类型（例如 `float`、`half`、`int8`、`uint32`）和多种内存布局（GlobalTensor 的 `ND`、`DN`、`NZ`；片上缓冲区的向量 tiles）。
-
-> **重要**:
-> - 此指令**仅用于开发和调试**。
-> - 它会产生**显著的运行时开销**，**不得在生产 kernel 中使用**。
-> - 如果输出超过内部打印缓冲区，可能会被**截断**。可以通过在编译选项中添加`-DCCEBlockMaxSize=16384`来修改打印缓冲区，默认为16KB。
-> - **需要 CCE 编译选项 `-D_DEBUG --cce-enable-print`**（参见 [行为](#behavior)）。
-
-## 数学语义
-
-除非另有说明，语义在有效区域上定义，目标相关的行为标记为实现定义。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-```text
-tprint %src : !pto.tile<...> | !pto.global<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-```cpp
-// 适用于打印GlobalTensor或Vec类型Tile
-template <typename TileData>
-PTO_INST void TPRINT(TileData &src);
-
-// 适用于打印Acc类型Tile和Mat类型Tile(Mat打印仅适用于A3，A5暂不支持)
-template <typename TileData, typename GlobalData>
-PTO_INTERNAL void TPRINT(TileData &src, GlobalData &tmp);
-```
-
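-### Supported T Types
-- **Tile**: the TileType must be `Vec`, `Acc`, or `Mat` (A3 only), with a supported element type.
-- **GlobalTensor**: must use layout `ND`, `DN`, or `NZ`, with a supported element type.
-
-## Constraints
-
-- **Supported element types**:
-    - Floating point: `float`, `half`
-    - Signed integers: `int8_t`, `int16_t`, `int32_t`
-    - Unsigned integers: `uint8_t`, `uint16_t`, `uint32_t`
-- **For GlobalTensor**: the layout must be one of `Layout::ND`, `Layout::DN`, or `Layout::NZ`.
-- **For temporary space**: printing a Tile whose `TileType` is `Mat` or `Acc` requires a temporary buffer in GM, and that buffer must be at least `TileData::Numel * sizeof(T)` bytes.
-- A5 does not yet support printing Tiles whose `TileType` is `Mat`.
-- **Echoed output**: when `TileType` is `Mat`, the data is printed as `Layout::ND`; other layouts may produce misaligned output.
-
-## Examples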
-
-### Print a Tile
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-PTO_INTERNAL void DebugTile(__gm__ float *src) {
-  using ValidSrcShape = TileShape2D<float, 16, 16>;
-  using NDSrcShape = BaseShape2D<float, 32, 32>;
-  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
-  GlobalDataSrc srcGlobal(src);
-
-  using srcTileData = Tile<TileType::Vec, float, 16, 16>;
-  srcTileData srcTile;
-  TASSIGN(srcTile, 0x0);
-
-  TLOAD(srcTile, srcGlobal);
-  TPRINT(srcTile);
-}
-```
-
-### Print a GlobalTensor
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-PTO_INTERNAL void DebugGlobalTensor(__gm__ float *src) {
-  using ValidSrcShape = TileShape2D<float, 16, 16>;
-  using NDSrcShape = BaseShape2D<float, 32, 32>;
-  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
-  GlobalDataSrc srcGlobal(src);
-
-  TPRINT(srcGlobal);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-```
-
-### PTO Assembly Form
-
-```text
-pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
-# AS Level 2 (DPS)
-pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tquant.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tquant.md
deleted file mode 100644
index 9cbbae87..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tquant.md
+++ /dev/null
@@ -1,128 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tquant.md` -->
-
-# pto.tquant
-
-Standalone reference page for `pto.tquant`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Quantize a tile (e.g. FP32 to FP8) producing exponent/scaling/max outputs.
-
-## Mechanism
-
-Quantize an FP32 tile into a lower-precision format (e.g. FP8), producing auxiliary exponent/scaling/max tiles. The quantization mode is a compile-time template parameter (`mode`). It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <auto quant_type, typename TileDataOut, typename TileDataSrc, typename TileDataExp, typename TileDataMax,
-          typename... WaitEvents>
-PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataExp *exp, TileDataMax *max, TileDataSrc *scaling, WaitEvents &... events);
-
-template <auto quant_type, auto store_mode, typename TileDataOut, typename TileDataSrc, typename TileDataExp,
-          typename TileDataMax, typename TileDataIdx, typename... WaitEvents>
-PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataExp *exp, TileDataMax *max, TileDataSrc *scaling, TileDataExp *exp_zz, TileDataIdx *vgather_idx, WaitEvents &... events);
-
-template <auto quant_type, typename TileDataOut, typename TileDataSrc, typename TileDataPara, typename... WaitEvents>
-PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataPara &scale, TileDataPara *offset = nullptr, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile to quantize.
-- `qp` is the quantization parameter tile.
-- `exp` (output): exponent tile.
-- `max` (output): max tile.
-- `scaling` (output): scaling tile.
-- `dst` names the destination tile. The operation iterates over src's valid region.
-
-## Expected Outputs
-
-`dst` holds the quantized output. `exp`, `max`, `scaling` hold quantization metadata.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- This instruction is currently implemented for specific targets (see `include/pto/npu/*/TQuant.hpp`).
-
-- Input type requirements and output tile types are mode/target-dependent.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tquant` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tscatter](./tscatter.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tquant_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tquant_zh.md
deleted file mode 100644
index 0bb596b0..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tquant_zh.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tquant_zh.md` -->
-
-# TQUANT
-
-## Instruction Diagram
-
-![TQUANT tile operation](../figures/isa/TQUANT.svg)
-
-## Summary
-
-Quantize a tile (e.g. FP32 to FP8), producing exponent/scaling/max outputs.
-
-## Math Semantics
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <auto quant_type, typename TileDataOut, typename TileDataSrc, typename TileDataExp, typename TileDataMax,
-          typename... WaitEvents>
-PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataExp *exp, TileDataMax *max, TileDataSrc *scaling, WaitEvents &... events);
-
-template <auto quant_type, auto store_mode, typename TileDataOut, typename TileDataSrc, typename TileDataExp,
-          typename TileDataMax, typename TileDataIdx, typename... WaitEvents>
-PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataExp *exp, TileDataMax *max, TileDataSrc *scaling, TileDataExp *exp_zz, TileDataIdx *vgather_idx, WaitEvents &... events);
-
-template <auto quant_type, typename TileDataOut, typename TileDataSrc, typename TileDataPara, typename... WaitEvents>
-PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataPara &scale, TileDataPara *offset = nullptr, WaitEvents &... events);
-```
-
-## Constraints
-
-- This instruction is currently implemented for specific targets (see `include/pto/npu/*/TQuant.hpp`).
-- Input type requirements and output tile types are mode/target-dependent.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tscatter.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tscatter.md
deleted file mode 100644
index 5fb235e0..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tscatter.md
+++ /dev/null
@@ -1,169 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tscatter.md` -->
-
-# pto.tscatter
-
-Standalone reference page for `pto.tscatter`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Scatter elements of a source tile into a destination tile using per-element flattened destination offsets.
-
-## Mechanism
-
-Scatter source elements into a destination tile using per-element flattened destination offsets. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-For each source element `(i, j)`, let `k = idx[i,j]` and write:
-
-$$ \mathrm{dst\_flat}_{k} = \mathrm{src}_{i,j} $$
-
-Here `dst_flat` denotes the destination tile viewed as a single linear storage sequence. `TSCATTER` does **not** interpret `idx[i,j]` as a destination row selector. On the standard row-major tile layout, this is equivalent to writing the `k`-th flattened destination element.
-
-If multiple elements map to the same destination location, the final value is implementation-defined (last writer wins in the current implementation).
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tscatter %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataD, typename TileDataS, typename TileDataI, typename... WaitEvents>
-PTO_INST RecordEvent TSCATTER(TileDataD &dst, TileDataS &src, TileDataI &indexes, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `indexes` is an index tile providing flattened destination offsets.
-- `dst` names the destination tile. The operation iterates over src's valid region.
-
-## Expected Outputs
-
-Elements from `src` are scattered to positions in `dst` specified by `indexes`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Concurrent writes to the same location produce implementation-defined results.
-
-## Constraints
-
-- Operand shape, mode, and state tuples MUST match the documented contract of this operation and its family overview.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-  - `TileDataD::Loc`, `TileDataS::Loc`, `TileDataI::Loc` must be `TileType::Vec`.
-  - `TileDataD::DType`, `TileDataS::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `half`, `float32_t`, `uint32_t`, `uint16_t`, `uint8_t`, `bfloat16_t`.
-  - `TileDataI::DType` must be one of: `int16_t`, `int32_t`, `uint16_t` or `uint32_t`.
-  - `indexes` values are interpreted as flattened destination element offsets in destination tile storage order.
-  - No bounds checks are enforced on `indexes` values.
-  - Static valid bounds: `TileDataD::ValidRow <= TileDataD::Rows`, `TileDataD::ValidCol <= TileDataD::Cols`, `TileDataS::ValidRow <= TileDataS::Rows`, `TileDataS::ValidCol <= TileDataS::Cols`, `TileDataI::ValidRow <= TileDataI::Rows`, `TileDataI::ValidCol <= TileDataI::Cols`.
-  - `TileDataD::DType` and `TileDataS::DType` must be the same.
-  - When size of `TileDataD::DType` is 4 bytes, the size of `TileDataI::DType` must be 4 bytes.
-  - When size of `TileDataD::DType` is 2 bytes, the size of `TileDataI::DType` must be 2 bytes.
-  - When size of `TileDataD::DType` is 1 byte, the size of `TileDataI::DType` must be 2 bytes.
-
-- **Implementation checks (A5)**:
-  - `TileDataD::Loc`, `TileDataS::Loc`, `TileDataI::Loc` must be `TileType::Vec`.
-  - `TileDataD::DType`, `TileDataS::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `half`, `float32_t`, `uint32_t`, `uint16_t`, `uint8_t`, `bfloat16_t`.
-  - `TileDataI::DType` must be one of: `int16_t`, `int32_t`, `uint16_t` or `uint32_t`.
-  - `indexes` values are interpreted as flattened destination element offsets in destination tile storage order.
-  - No bounds checks are enforced on `indexes` values.
-  - Static valid bounds: `TileDataD::ValidRow <= TileDataD::Rows`, `TileDataD::ValidCol <= TileDataD::Cols`, `TileDataS::ValidRow <= TileDataS::Rows`, `TileDataS::ValidCol <= TileDataS::Cols`, `TileDataI::ValidRow <= TileDataI::Rows`, `TileDataI::ValidCol <= TileDataI::Cols`.
-  - `TileDataD::DType` and `TileDataS::DType` must be the same.
-  - When size of `TileDataD::DType` is 4 bytes, the size of `TileDataI::DType` must be 4 bytes.
-  - When size of `TileDataD::DType` is 2 bytes, the size of `TileDataI::DType` must be 2 bytes.
-  - When size of `TileDataD::DType` is 1 byte, the size of `TileDataI::DType` must be 2 bytes.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using IdxT = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileT src, dst;
-  IdxT idx;
-  TSCATTER(dst, src, idx);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using IdxT = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileT src, dst;
-  IdxT idx;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(idx, 0x3000);
-  TSCATTER(dst, src, idx);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tscatter %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tgatherb](./tgatherb.md)
-- Next op in family: [pto.tquant](./tquant.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tscatter_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tscatter_zh.md
deleted file mode 100644
index d8adfca2..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tscatter_zh.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tscatter_zh.md` -->
-
-# TSCATTER
-
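-## Instruction Diagram
-
-![TSCATTER tile operation](../figures/isa/TSCATTER.svg)
-
-## Summary
-
-Scatter elements of a source tile into a destination tile using per-element flattened destination offsets.
-
-## Math Semantics
-
-For each source element `(i, j)`, let `k = idx[i,j]` and write: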
-
-$$ \mathrm{dst\_flat}_{k} = \mathrm{src}_{i,j} $$
-
-Here `dst_flat` denotes the destination tile viewed as a single linear storage sequence. `TSCATTER` does **not** interpret `idx[i,j]` as a destination row selector. On the standard row-major tile layout, this is equivalent to writing the `k`-th flattened destination element.
-
-If multiple elements map to the same destination location, the final value is implementation-defined (last writer wins in the current implementation).
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tscatter %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataD, typename TileDataS, typename TileDataI, typename... WaitEvents>
-PTO_INST RecordEvent TSCATTER(TileDataD &dst, TileDataS &src, TileDataI &indexes, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-  - `TileDataD::Loc`, `TileDataS::Loc`, `TileDataI::Loc` must be `TileType::Vec`.
-  - `TileDataD::DType`, `TileDataS::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `half`, `float32_t`, `uint32_t`, `uint16_t`, `uint8_t`, `bfloat16_t`.
-  - `TileDataI::DType` must be one of: `int16_t`, `int32_t`, `uint16_t` or `uint32_t`.
-  - `indexes` values are interpreted as flattened destination element offsets in destination tile storage order.
-  - No bounds checks are enforced on `indexes` values.
-  - Static valid bounds: `TileDataD::ValidRow <= TileDataD::Rows`, `TileDataD::ValidCol <= TileDataD::Cols`, `TileDataS::ValidRow <= TileDataS::Rows`, `TileDataS::ValidCol <= TileDataS::Cols`, `TileDataI::ValidRow <= TileDataI::Rows`, `TileDataI::ValidCol <= TileDataI::Cols`.
-  - `TileDataD::DType` and `TileDataS::DType` must be the same.
-  - When size of `TileDataD::DType` is 4 bytes, the size of `TileDataI::DType` must be 4 bytes.
-  - When size of `TileDataD::DType` is 2 bytes, the size of `TileDataI::DType` must be 2 bytes.
-  - When size of `TileDataD::DType` is 1 byte, the size of `TileDataI::DType` must be 2 bytes.
-- **Implementation checks (A5)**:
-  - `TileDataD::Loc`, `TileDataS::Loc`, `TileDataI::Loc` must be `TileType::Vec`.
-  - `TileDataD::DType`, `TileDataS::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `half`, `float32_t`, `uint32_t`, `uint16_t`, `uint8_t`, `bfloat16_t`.
-  - `TileDataI::DType` must be one of: `int16_t`, `int32_t`, `uint16_t` or `uint32_t`.
-  - `indexes` values are interpreted as flattened destination element offsets in destination tile storage order.
-  - No bounds checks are enforced on `indexes` values.
-  - Static valid bounds: `TileDataD::ValidRow <= TileDataD::Rows`, `TileDataD::ValidCol <= TileDataD::Cols`, `TileDataS::ValidRow <= TileDataS::Rows`, `TileDataS::ValidCol <= TileDataS::Cols`, `TileDataI::ValidRow <= TileDataI::Rows`, `TileDataI::ValidCol <= TileDataI::Cols`.
-  - `TileDataD::DType` and `TileDataS::DType` must be the same.
-  - When size of `TileDataD::DType` is 4 bytes, the size of `TileDataI::DType` must be 4 bytes.
-  - When size of `TileDataD::DType` is 2 bytes, the size of `TileDataI::DType` must be 2 bytes.
-  - When size of `TileDataD::DType` is 1 byte, the size of `TileDataI::DType` must be 2 bytes.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using IdxT = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileT src, dst;
-  IdxT idx;
-  TSCATTER(dst, src, idx);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using IdxT = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileT src, dst;
-  IdxT idx;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(idx, 0x3000);
-  TSCATTER(dst, src, idx);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tsort32.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tsort32.md
deleted file mode 100644
index 510f8f78..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tsort32.md
+++ /dev/null
@@ -1,181 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/tsort32.md` -->
-
-# pto.tsort32
-
-Standalone reference page for `pto.tsort32`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Sort a fixed-size 32-element block and produce an index mapping.
-
-## Mechanism
-
-Sort each 32-element block of `src` together with the corresponding indices from `idx`, and write the sorted value-index pairs into `dst`. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-For each row, `TSORT32` processes `src` in independent 32-element blocks. Let `C = src.GetValidCol()` be the number of elements that participate in sorting in each row, let block `b` cover columns `32b ... 32b+31`, and let `n_b = min(32, C - 32b)` be the valid element count of that block.
-
-For each valid element in the block, form a pair
-
-$$
-(v_k, i_k) = (\mathrm{src}_{r,32b+k}, \mathrm{idx}_{r,32b+k}), \quad 0 \le k < n_b
-$$
-
-Then sort the pairs by value and write the sorted value-index pairs to `dst`. The exact packing layout in `dst` is target-defined, but semantically the output of each block is the reordered sequence
-
-$$
-[(v_{\pi(0)}, i_{\pi(0)}), (v_{\pi(1)}, i_{\pi(1)}), \ldots, (v_{\pi(n_b-1)}, i_{\pi(n_b-1)})]
-$$
-
-where `π` is the permutation produced by the implementation for that 32-element block.
-
-Notes:
-
-- `idx` is an input tile, not an output tile.
-- `dst` stores sorted value-index pairs, not just sorted values.
-- The CPU simulation sorts in descending order by value, and for equal values keeps smaller indices first.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename IdxTileData>
-PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx);
-
-template <typename DstTileData, typename SrcTileData, typename IdxTileData, typename TmpTileData>
-PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx, TmpTileData &tmp);
-```
-
-## Inputs
-
-- `src` is the source tile containing values to sort.
-- `idx` is the index tile providing initial indices.
-- `tmp` (optional): temporary tile for non-32-aligned tails.
-- `dst` names the destination tile. The row count is taken from `dst.GetValidRow()` and the per-row element count from `src.GetValidCol()`.
-
-## Expected Outputs
-
-`dst` holds the value-index pairs formed from `src` and `idx`, sorted by value within each 32-element block.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TSORT32` does not take `WaitEvents&...` and does not call `TSYNC(...)` internally; synchronize explicitly if needed.
-
-- `idx` is a required input operand in both overloads; it provides the indices that are permuted together with `src`.
-
-- **Valid region**:
-    - The implementation uses `dst.GetValidRow()` as the row count.
-    - The implementation uses `src.GetValidCol()` to determine how many elements participate in sorting in each row.
-    - Sorting is performed independently per 32-element block; the 4-argument overload additionally supports non-32-aligned tails with `tmp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3/A5)**:
-    - `DstTileData::DType` must be `half` or `float`.
-    - `SrcTileData::DType` must match `DstTileData::DType`.
-    - `IdxTileData::DType` must be `uint32_t`.
-    - `dst/src/idx` tile location must be `TileType::Vec`, and all must be row-major (`isRowMajor`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 1, 32>;
-  using IdxT = Tile<TileType::Vec, uint32_t, 1, 32>;
-  using DstT = Tile<TileType::Vec, float, 1, 64>;
-  SrcT src;
-  IdxT idx;
-  DstT dst;
-  TSORT32(dst, src, idx);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 1, 32>;
-  using IdxT = Tile<TileType::Vec, uint32_t, 1, 32>;
-  using DstT = Tile<TileType::Vec, float, 1, 64>;
-  SrcT src;
-  IdxT idx;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(idx, 0x2000);
-  TASSIGN(dst, 0x3000);
-  TSORT32(dst, src, idx);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-# pto.tassign %arg2, @tile(0x3000)
-%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tmrgsort](./tmrgsort.md)
-- Next op in family: [pto.tgather](./tgather.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md
deleted file mode 100644
index afc2e0c8..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# pto.tsort32
-
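-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tsort32.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Existing Chinese instruction reference](../../../TSORT32_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.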
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/ttri.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/ttri.md
deleted file mode 100644
index 51257294..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/ttri.md
+++ /dev/null
@@ -1,130 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/ttri.md` -->
-
-# pto.ttri
-
-Standalone reference page for `pto.ttri`. This page belongs to the [Irregular And Complex](../../irregular-and-complex.md) family in the PTO ISA manual.
-
-## Summary
-
-Generate a triangular (lower/upper) mask tile.
-
-## Mechanism
-
-Generate a (lower/upper) triangular mask tile with ones and zeros. The triangular orientation is controlled by the compile-time template parameter `isUpperOrLower` (0 = lower, 1 = upper). It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `d = diagonal`.
-
-Lower-triangular (`isUpperOrLower=0`) conceptually produces:
-
-$$
-\mathrm{dst}_{i,j} = \begin{cases}1 & j \le i + d \\\\ 0 & \text{otherwise}\end{cases}
-$$
-
-Upper-triangular (`isUpperOrLower=1`) conceptually produces:
-
-$$
-\mathrm{dst}_{i,j} = \begin{cases}0 & j < i + d \\\\ 1 & \text{otherwise}\end{cases}
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, int isUpperOrLower, typename... WaitEvents>
-PTO_INST RecordEvent TTRI(TileData &dst, int diagonal, WaitEvents &... events);
-```
-
-## Inputs
-
-- `diagonal` is the diagonal offset.
-- `isUpperOrLower` (template parameter): 0 for lower triangular, 1 for upper triangular.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds a triangular mask (1s on one side of the diagonal, 0s elsewhere).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `isUpperOrLower` must be `0` (lower) or `1` (upper).
-
-- Destination tile must be row-major on some targets (see `include/pto/npu/*/TTri.hpp`).
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.ttri` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Irregular And Complex](../../irregular-and-complex.md)
-- Previous op in family: [pto.tci](./tci.md)
-- Next op in family: [pto.tpartadd](./tpartadd.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/ttri_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/ttri_zh.md
deleted file mode 100644
index 88ac7181..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/irregular-and-complex/ttri_zh.md
+++ /dev/null
@@ -1,73 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/irregular-and-complex/ttri_zh.md` -->
-
-# TTRI
-
-## Instruction Diagram
-
-![TTRI tile operation](../figures/isa/TTRI.svg)
-
-## Summary
-
-Generate a triangular (lower/upper) mask tile.
-
-## Mathematical Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `d = diagonal`.
-
-Lower-triangular (`isUpperOrLower=0`) conceptually produces:
-
-$$
-\mathrm{dst}_{i,j} = \begin{cases}1 & j \le i + d \\\\ 0 & \text{otherwise}\end{cases}
-$$
-
-Upper-triangular (`isUpperOrLower=1`) conceptually produces:
-
-$$
-\mathrm{dst}_{i,j} = \begin{cases}0 & j < i + d \\\\ 1 & \text{otherwise}\end{cases}
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, int isUpperOrLower, typename... WaitEvents>
-PTO_INST RecordEvent TTRI(TileData &dst, int diagonal, WaitEvents &... events);
-```
-
-## Constraints
-
-- `isUpperOrLower` must be `0` (lower) or `1` (upper).
-- Destination tile must be row-major on some targets (see `include/pto/npu/*/TTri.hpp`).
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract-fp.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract-fp.md
deleted file mode 100644
index 36e15c6a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract-fp.md
+++ /dev/null
@@ -1,120 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/textract-fp.md` -->
-
-# pto.textract_fp
-
-Standalone reference page for `pto.textract_fp`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Extract with fp/scaling tile (vector-quantization parameters).
-
-## Mechanism
-
-Extract a sub-tile from a source tile, while also providing an `fp` (scaling) tile used for vector quantization parameters (target/implementation-defined). It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `fp` is the scaling tile for vector quantization.
-- `indexRow` is the starting row offset in `src`.
-- `indexCol` is the starting column offset in `src`.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-- `reluMode` (optional): specifies ReLU mode.
-
-## Expected Outputs
-
-`dst` holds the extracted sub-tile from `src` at position (indexRow, indexCol), converted using `fp` scaling parameters.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.textract_fp` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.textract](./textract.md)
-- Next op in family: [pto.timg2col](./timg2col.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract-fp_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract-fp_zh.md
deleted file mode 100644
index 3c09ad0d..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract-fp_zh.md
+++ /dev/null
@@ -1,61 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/textract-fp_zh.md` -->
-
-# TEXTRACT_FP
-
-## Instruction Diagram
-
-![TEXTRACT_FP tile operation](../figures/isa/TEXTRACT_FP.svg)
-
-## Summary
-
-Extract with an fp/scaling tile (vector-quantization parameters).
-
-## Mathematical Semantics
-
-Unless otherwise specified, semantics are defined over the valid region, and target-dependent behavior is marked as implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-```
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract.md
deleted file mode 100644
index e68d1af8..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract.md
+++ /dev/null
@@ -1,189 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/textract.md` -->
-
-# pto.textract
-
-Standalone reference page for `pto.textract`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Extract a sub-tile from a source tile.
-
-## Mechanism
-
-Extract a smaller sub-tile from a larger source tile. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Conceptually copies a smaller window starting at `(indexRow, indexCol)` from the larger `src` tile into `dst`. Exact mapping depends on tile layouts.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{indexRow}+i,\; \mathrm{indexCol}+j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint16_t indexRow = 0, uint16_t indexCol = 0, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `indexRow` is the starting row offset in `src`.
-- `indexCol` is the starting column offset in `src`.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-- `fp` (`TEXTRACT_FP` form only): scaling tile for vector quantization.
-- `reluMode` (optional): specifies ReLU mode.
-- `preQuantScalar` (optional): scalar for pre-quantization.
-
-## Expected Outputs
-
-`dst` holds the extracted sub-tile from `src` at position (indexRow, indexCol), with optional conversion.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `DstTileData::DType` must equal `SrcTileData::DType`.
-
-- Supported element types on A2A3: `int8_t`, `half`, `bfloat16_t`, `float`.
-
-- In A2A3 GEMV scenarios targeting `TileType::Left`, the checked source layout also allows `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)`.
-
-- Supported element types on A5: `int8_t`, `hifloat8_t`, `float8_e5m2_t`, `float8_e4m3_t`, `half`, `bfloat16_t`, `float`, `float4_e2m1x2_t`, `float4_e1m2x2_t`, `float8_e8m0_t`.
-
-- In A5 GEMV scenarios targeting `Left`, the checked source layout also allows `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)`.
-
-- On A5, the destination supports `TileType::Mat -> TileType::Left/Right/Scale`, `TileType::Acc -> TileType::Mat` (including relu, scalar-quant, and vector-quantized forms), and specific `TileType::Vec -> TileType::Mat` extraction paths.
-
-- On A5, the vector-quantized form additionally requires an `FpTileData` scaling operand, matching the `TEXTRACT_FP(...)` interface.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- Runtime bounds checks:
-  - `indexRow + DstTileData::Rows <= SrcTileData::Rows`
-  - `indexCol + DstTileData::Cols <= SrcTileData::Cols`
-
-### A2A3 implementation checks
-
-- Source layout must satisfy one of the checked A2A3 extraction layouts:
-  - `(SFractal == ColMajor && isRowMajor)`, or
-  - `(SFractal == RowMajor && !isRowMajor)`.
-
-- Destination must be `TileType::Left` or `TileType::Right` with a target-supported fractal configuration.
-
-### A5 implementation checks
-
-- Source layout must satisfy one of the checked A5 extraction layouts:
-  - for `Left` / `Right`: `(SFractal == ColMajor && isRowMajor)` or `(SFractal == RowMajor && !isRowMajor)`
-  - for `ScaleLeft`: `(SFractal == RowMajor && isRowMajor)`
-  - for `ScaleRight`: `(SFractal == ColMajor && !isRowMajor)`
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
-  using DstT = TileLeft<float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TEXTRACT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
-  using DstT = TileLeft<float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TEXTRACT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Next op in family: [pto.textract_fp](./textract-fp.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md
deleted file mode 100644
index 3cf2665e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md
+++ /dev/null
@@ -1,153 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md` -->
-
-# TEXTRACT
-
-## Instruction Diagram
-
-![TEXTRACT tile operation](../figures/isa/TEXTRACT.svg)
-
-## Summary
-
-Extract a smaller sub-tile from a larger source tile.
-
-## Mathematical Semantics
-
-Conceptually copies a smaller window starting at `(indexRow, indexCol)` from the larger `src` tile into `dst`. The exact mapping depends on the tile layouts.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{indexRow}+i,\; \mathrm{indexCol}+j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint16_t indexRow = 0, uint16_t indexCol = 0, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-```
-
-## Constraints
-
-### General constraints / checks
-
-- `DstTileData::DType` must equal `SrcTileData::DType`.
-- Runtime bounds checks:
-  - `indexRow + DstTileData::Rows <= SrcTileData::Rows`
-  - `indexCol + DstTileData::Cols <= SrcTileData::Cols`
-
-### A2A3 implementation checks
-
-- Supported element types: `int8_t`, `half`, `bfloat16_t`, `float`.
-- Source layout must satisfy one of the checked A2A3 extraction layouts:
-  - `(SFractal == ColMajor && isRowMajor)`, or
-  - `(SFractal == RowMajor && !isRowMajor)`.
-- In GEMV scenarios targeting `TileType::Left`, the checked source layout also allows `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)`.
-- Destination must be `TileType::Left` or `TileType::Right` with a target-supported layout configuration.
-
-### A5 implementation checks
-
-- Supported element types: `int8_t`, `hifloat8_t`, `float8_e5m2_t`, `float8_e4m3_t`, `half`, `bfloat16_t`, `float`, `float4_e2m1x2_t`, `float4_e1m2x2_t`, `float8_e8m0_t`.
-- Source layout must satisfy one of the checked A5 extraction layouts:
-  - for `Left` / `Right`: `(SFractal == ColMajor && isRowMajor)` or `(SFractal == RowMajor && !isRowMajor)`
-  - for `ScaleLeft`: `(SFractal == RowMajor && isRowMajor)`
-  - for `ScaleRight`: `(SFractal == ColMajor && !isRowMajor)`
-- In GEMV scenarios targeting `Left`, the checked source layout also allows `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)`.
-- Destination supports `TileType::Mat -> TileType::Left/Right/Scale`, `TileType::Acc -> TileType::Mat` (including relu, scalar-quant, and vector-quantized forms), and specific `TileType::Vec -> TileType::Mat` extraction paths.
-- The vector-quantized form additionally requires an `FpTileData` scaling operand, matching the `TEXTRACT_FP(...)` interface.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
-  using DstT = TileLeft<float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TEXTRACT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
-  using DstT = TileLeft<float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TEXTRACT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md
deleted file mode 100644
index b9577853..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md
+++ /dev/null
@@ -1,116 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md` -->
-
-# pto.tfillpad_expand
-
-Standalone reference page for `pto.tfillpad_expand`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Fill/pad while allowing dst to be larger than src.
-
-## Mechanism
-
-Expand fill/pad variant of TFILLPAD (allows dst to be larger than src; implementation-defined). It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TFILLPAD_EXPAND(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. May be larger than `src`.
-- `PadVal` is the compile-time pad value for elements outside the valid region.
-
-## Expected Outputs
-
-`dst` holds a copy of `src`: the valid region is copied, and the padded region is filled with the specified pad value.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tfillpad_expand` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.tfillpad_inplace](./tfillpad-inplace.md)
-- Next op in family: [pto.tmov](./tmov.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand_zh.md
deleted file mode 100644
index 447f8993..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand_zh.md
+++ /dev/null
@@ -1,60 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand_zh.md` -->
-
-# TFILLPAD_EXPAND
-
-## Instruction Diagram
-
-![TFILLPAD_EXPAND tile operation](../figures/isa/TFILLPAD_EXPAND.svg)
-
-## Summary
-
-Fill/pad while allowing dst to be larger than src.
-
-## Mathematical Semantics
-
-Unless otherwise specified, semantics are defined over the valid region, and target-dependent behavior is marked as implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TFILLPAD_EXPAND(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-```
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md
deleted file mode 100644
index a38bc7d4..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md
+++ /dev/null
@@ -1,116 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md` -->
-
-# pto.tfillpad_inplace
-
-Standalone reference page for `pto.tfillpad_inplace`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-In-place fill/pad variant.
-
-## Mechanism
-
-In-place fill/pad variant of TFILLPAD (implementation-defined). It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TFILLPAD_INPLACE(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. Must have same shape as `src`.
-- `PadVal` is the compile-time pad value for elements outside the valid region.
-
-## Expected Outputs
-
-`dst` holds the in-place copy of `src`: the valid region is copied, and the padded region is filled with the specified pad value.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tfillpad_inplace` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.tfillpad](./tfillpad.md)
-- Next op in family: [pto.tfillpad_expand](./tfillpad-expand.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace_zh.md
deleted file mode 100644
index 06b1b83d..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace_zh.md
+++ /dev/null
@@ -1,60 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace_zh.md` -->
-
-# TFILLPAD_INPLACE
-
-## Instruction Diagram
-
-![TFILLPAD_INPLACE tile operation](../figures/isa/TFILLPAD_INPLACE.svg)
-
-## Summary
-
-In-place fill/pad variant.
-
-## Mathematical Semantics
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TFILLPAD_INPLACE(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-```
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad.md
deleted file mode 100644
index ec1655e8..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad.md
+++ /dev/null
@@ -1,168 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tfillpad.md` -->
-
-# pto.tfillpad
-
-Standalone reference page for `pto.tfillpad`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Copy a tile and pad the elements outside its valid region with a compile-time pad value.
-
-## Mechanism
-
-Copy a source tile into a destination tile and fill the remaining (padded) elements with a compile-time pad value
-selected by `TileDataDst::PadVal` (e.g., `PadValue::Min`/`PadValue::Max`).
-
-This is commonly used to materialize deterministic values outside the runtime valid region so that subsequent ops can
-operate on a full static tile shape. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Let `VR = src.GetValidRow()` and `VC = src.GetValidCol()`. For each destination element `(i, j)`:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src}_{i,j} & \text{if } i < VR \text{ and } j < VC \\
-\mathrm{pad}       & \text{otherwise}
-\end{cases}
-$$
-
-`pad` is determined by `TileDataDst::PadVal` and the element type (e.g., `+inf/-inf` for floating types when available,
-otherwise `std::numeric_limits<T>::max()/min()`).
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form (conceptual):
-
-```text
-%dst = tfillpad %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Implemented in the backend headers pulled in by `include/pto/common/pto_instr_impl.hpp`:
-
-```cpp
-template <typename TileData, PadValue PadVal = PadValue::Zero, typename... WaitEvents>
-PTO_INST RecordEvent TFILLPAD(TileData &dst, TileData &src, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TFILLPAD(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. It must have the same shape as `src`.
-- `PadVal` is the compile-time pad value for elements outside the valid region.
-
-## Expected Outputs
-
-`dst` holds the valid region of `src`, with every element outside that region filled with the specified pad value.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::PadVal != PadValue::Null`.
-
-- `sizeof(TileDataDst::DType) == sizeof(TileDataSrc::DType)` and element size must be `1`, `2`, or `4` bytes.
-
-- `TFILLPAD`: `TileDataDst::Rows/Cols` must match `TileDataSrc::Rows/Cols`.
-
-- `TFILLPAD_EXPAND`: `TileDataDst::Rows >= TileDataSrc::Rows` and `TileDataDst::Cols >= TileDataSrc::Cols`.
-
-- `TFILLPAD(TileData &dst, TileData &src)`: if `TileData::TileType` is `Mat`, the only supported layout is `(!TileData::isRowMajor && TileData::Slayout::RowMajor)` and the only supported `PadVal` is `PadValue::Zero`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tfillpad` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example1() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::NoneBox, TileConfig::fractalABSize, PadValue::Min>;
-
-  SrcT src;
-  DstT dst;
-  TFILLPAD(dst, src);
-}
-
-void example2() {
-  using TileMatData = Tile<TileType::Mat, float, 16, 256, BLayout::ColMajor, 1, 224, SLayout::RowMajor, 512>;
-
-  TileMatData matTile;
-  TFILLPAD(matTile, matTile);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.tinsert_fp](./tinsert-fp.md)
-- Next op in family: [pto.tfillpad_inplace](./tfillpad-inplace.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad_zh.md
deleted file mode 100644
index 8d58d777..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tfillpad_zh.md
+++ /dev/null
@@ -1,110 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tfillpad_zh.md` -->
-
-# TFILLPAD
-
-## Instruction Diagram
-
-![TFILLPAD tile operation](../figures/isa/TFILLPAD.svg)
-
-## Summary
-
-Copy a tile and fill the elements outside its valid region with a compile-time pad value.
-
-Copy a source tile into a destination tile and fill the remaining (padded) elements with a compile-time pad value
-selected by `TileDataDst::PadVal` (e.g., `PadValue::Min`/`PadValue::Max`).
-
-This is commonly used to materialize deterministic values outside the runtime valid region so that subsequent ops can
-operate on a full static tile shape.
-
-## Mathematical Semantics
-
-Let `VR = src.GetValidRow()` and `VC = src.GetValidCol()`. For each destination element `(i, j)`:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src}_{i,j} & \text{if } i < VR \text{ and } j < VC \\
-\mathrm{pad}       & \text{otherwise}
-\end{cases}
-$$
-
-`pad` is determined by `TileDataDst::PadVal` and the element type (e.g., `+inf/-inf` for floating types when available,
-otherwise `std::numeric_limits<T>::max()/min()`).
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form (conceptual):
-
-```text
-%dst = tfillpad %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Implemented in the backend headers pulled in by `include/pto/common/pto_instr_impl.hpp`:
-
-```cpp
-template <typename TileData, PadValue PadVal = PadValue::Zero, typename... WaitEvents>
-PTO_INST RecordEvent TFILLPAD(TileData &dst, TileData &src, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TFILLPAD(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-```
-
-## Constraints
-
-- `TileDataDst::PadVal != PadValue::Null`.
-- `sizeof(TileDataDst::DType) == sizeof(TileDataSrc::DType)` and element size must be `1`, `2`, or `4` bytes.
-- `TFILLPAD`: `TileDataDst::Rows/Cols` must match `TileDataSrc::Rows/Cols`.
-- `TFILLPAD_EXPAND`: `TileDataDst::Rows >= TileDataSrc::Rows` and `TileDataDst::Cols >= TileDataSrc::Cols`.
-- `TFILLPAD(TileData &dst, TileData &src)`: if `TileData::TileType` is `Mat`, the only supported layout is `(!TileData::isRowMajor && TileData::Slayout::RowMajor)` and the only supported `PadVal` is `PadValue::Zero`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example1() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::NoneBox, TileConfig::fractalABSize, PadValue::Min>;
-
-  SrcT src;
-  DstT dst;
-  TFILLPAD(dst, src);
-}
-
-void example2() {
-  using TileMatData = Tile<TileType::Mat, float, 16, 256, BLayout::ColMajor, 1, 224, SLayout::RowMajor, 512>;
-
-  TileMatData matTile;
-  TFILLPAD(matTile, matTile);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/timg2col.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/timg2col.md
deleted file mode 100644
index 5ac63ff7..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/timg2col.md
+++ /dev/null
@@ -1,117 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/timg2col.md` -->
-
-# pto.timg2col
-
-Standalone reference page for `pto.timg2col`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Image-to-column transform for convolution-like workloads.
-
-## Mechanism
-
-Transform an input feature-map tile (e.g. NC1HWC0 layout) into an im2col-style matrix tile for convolution-like workloads. Parameters are provided via `Img2colTileConfig` and `(posM, posK)` offsets. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-PTO_INST RecordEvent TIMG2COL(TileData &dst, ConvTileData &src, uint16_t posM = 0, uint16_t posK = 0,
-                              WaitEvents&... events);
-```
-
-## Inputs
-
-- `src` is the source ConvTileData (feature-map tile in NC1HWC0 layout).
-- `dst` names the destination im2col matrix tile.
-- `posM` is the output row offset.
-- `posK` is the output column offset.
-
-## Expected Outputs
-
-`dst` holds the im2col transformed data from `src` according to the Img2colTileConfig parameters.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- This instruction is target/implementation-specific. See `include/pto/npu/*/TImg2col.hpp` for the supported tile types/layouts and config fields.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.timg2col` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.textract_fp](./textract-fp.md)
-- Next op in family: [pto.tinsert](./tinsert.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/timg2col_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/timg2col_zh.md
deleted file mode 100644
index 4168a5d8..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/timg2col_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# pto.timg2col
-
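-This page is an auto-generated Chinese entry page that keeps Chinese navigation within the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](timg2col.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Existing Chinese instruction reference](../../../TIMG2COL_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA is already in place, but the corresponding Chinese text has not yet been fully brought in line with it. Opening this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.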
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp.md
deleted file mode 100644
index a32a6e28..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp.md
+++ /dev/null
@@ -1,120 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp.md` -->
-
-# pto.tinsert_fp
-
-Standalone reference page for `pto.tinsert_fp`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Insert with fp/scaling tile (vector-quantization parameters).
-
-## Mechanism
-
-Vector-quantization variant of `TINSERT` that also takes an `fp` (scaling) tile. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `fp` is the scaling tile for vector quantization.
-- `indexRow` is the starting row offset in `dst`.
-- `indexCol` is the starting column offset in `dst`.
-- `dst` names the destination tile. The operation iterates over src's valid region.
-- `reluMode` (optional): specifies ReLU mode.
-
-## Expected Outputs
-
-`dst` holds the result of inserting `src` into `dst` at position (indexRow, indexCol), converted using `fp` scaling parameters.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tinsert_fp` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.tinsert](./tinsert.md)
-- Next op in family: [pto.tfillpad](./tfillpad.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp_zh.md
deleted file mode 100644
index 2723c55b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp_zh.md
+++ /dev/null
@@ -1,61 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tinsert-fp_zh.md` -->
-
-# TINSERT_FP
-
-## Instruction Diagram
-
-![TINSERT_FP tile operation](../figures/isa/TINSERT_FP.svg)
-
-## Summary
-
-Insert with an fp/scaling tile (vector-quantization parameters).
-
-## Mathematical Semantics
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-```
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert.md
deleted file mode 100644
index c6ed0fb3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert.md
+++ /dev/null
@@ -1,138 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tinsert.md` -->
-
-# pto.tinsert
-
-Standalone reference page for `pto.tinsert`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Insert a sub-tile into a destination tile at an (indexRow, indexCol) offset.
-
-## Mechanism
-
-Insert a source sub-tile into a destination tile at `(indexRow, indexCol)`. This is conceptually the inverse of `TEXTRACT` for many layouts. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. Conceptually, for `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{\mathrm{indexRow}+i,\;\mathrm{indexCol}+j} = \mathrm{src}_{i,j}
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
-PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-#ifdef PTO_NPU_ARCH_A5
-template <TInsertMode mode, typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint32_t indexRow = 0, uint32_t indexCol = 0, WaitEvents &... events);
-#endif
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `indexRow` is the starting row offset in `dst`.
-- `indexCol` is the starting column offset in `dst`.
-- `dst` names the destination tile. The operation iterates over src's valid region.
-- `fp` (optional for TINSERT_FP): scaling tile for vector quantization.
-- `reluMode` (optional): specifies ReLU mode.
-- `preQuantScalar` (optional): scalar for pre-quantization.
-
-## Expected Outputs
-
-`dst` holds the result of inserting `src` into `dst` at position (indexRow, indexCol), with optional conversion.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **A2/A3**:
-    - The documented overloads map to `Acc -> Mat` insertion paths, including plain, `reluMode`, scalar pre-quant, and vector pre-quant (`TINSERT_FP`) forms.
-    - Runtime bounds must satisfy `indexRow + src.Rows <= dst.Rows` and `indexCol + src.Cols <= dst.Cols`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **A5**:
-    - In addition to the `Acc -> Mat` insertion paths above, A5 also exposes `template <TInsertMode mode, ...> TINSERT(...)` for `Vec -> Mat` and `Vec -> Vec` insertion variants.
-    - `mode == TInsertMode::ND` requires a row-major source vector tile and inserts into a matrix tile in ND layout.
-    - `mode == TInsertMode::ND_VEC` requires both source and destination to be row-major vector tiles.
-    - NZ-family modes (`NZ`, `NZ_PLUS_1`, `SPLIT2_NZ_PLUS_1`, `SPLIT4_NZ_PLUS_1`) require an NZ-format source vector tile and a matrix destination tile.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.timg2col](./timg2col.md)
-- Next op in family: [pto.tinsert_fp](./tinsert-fp.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md
deleted file mode 100644
index 2d32f8fe..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md
+++ /dev/null
@@ -1,108 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md` -->
-
-# TINSERT
-
-## Instruction Diagram
-
-![TINSERT tile operation](../figures/isa/TINSERT.svg)
-
-## Summary
-
-Insert a sub-tile into a destination tile at an (indexRow, indexCol) offset.
-
-## Mathematical Semantics
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. Conceptually, for `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{\mathrm{indexRow}+i,\;\mathrm{indexCol}+j} = \mathrm{src}_{i,j}
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
-PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
-
-#ifdef PTO_NPU_ARCH_A5
-template <TInsertMode mode, typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint32_t indexRow = 0, uint32_t indexCol = 0, WaitEvents &... events);
-#endif
-```
-
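-## Constraints
-
-- **A2/A3**:
-    - The overloads listed in this document correspond to the `Acc -> Mat` insertion paths, including the plain form, the `reluMode` form, the scalar pre-quantization form, and the vector pre-quantization (`TINSERT_FP`) form.
-    - Runtime bounds must satisfy `indexRow + src.Rows <= dst.Rows` and `indexCol + src.Cols <= dst.Cols`.
-- **A5**:
-    - In addition to the `Acc -> Mat` insertion paths above, A5 also provides `template <TInsertMode mode, ...> TINSERT(...)` for the `Vec -> Mat` and `Vec -> Vec` insertion variants.
-    - `mode == TInsertMode::ND` requires a row-major source vector tile and inserts into a matrix tile in ND layout.
-    - `mode == TInsertMode::ND_VEC` requires both source and destination to be row-major vector tiles.
-    - NZ-family modes (`NZ`, `NZ_PLUS_1`, `SPLIT2_NZ_PLUS_1`, `SPLIT4_NZ_PLUS_1`) require an NZ-format source vector tile and a matrix destination tile.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-## Assembly Examples (ASM)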
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp.md
deleted file mode 100644
index 4aad47ec..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp.md
+++ /dev/null
@@ -1,172 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tmov-fp.md` -->
-
-# pto.tmov_fp
-
-Standalone reference page for `pto.tmov_fp`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Move/convert from an accumulator tile into a destination tile, using a scaling (`fp`) tile for vector quantization parameters.
-
-## Mechanism
-
-Move/convert from an accumulator tile into a destination tile, using a scaling (`fp`) tile for vector quantization parameters.
-
-`TMOV_FP` is a named wrapper around the `TMOV_IMPL(..., fp)` path and is part of the [`pto.tmov`](./tmov.md) family. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Conceptually converts each element using an implementation-defined quantization/dequantization configuration derived from `fp`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMOV_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source accumulator tile.
-- `fp` is the scaling tile for vector quantization.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-- `reluMode` (optional): specifies ReLU mode.
-
-## Expected Outputs
-
-`dst` holds the converted values from `src` using `fp` quantization parameters.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- Operand shape, mode, and state tuples MUST match the documented contract of this operation and its family overview.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - The fp path is only supported for accumulator conversion and is validated by internal compile-time checks in `TMOV_IMPL(dst, src, fp)`.
-    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
-
-- **Implementation checks (A5)**:
-    - Validated by `CheckTMovAccValid(...)` and related compile-time checks in `TMOV_IMPL(dst, src, fp)`.
-    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
-    - Destination location is target-dependent (`Vec` or `Mat` are supported in the fp path).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using AccT = TileAcc<float, 16, 16>;
-  using DstT = Tile<TileType::Vec, int8_t, 16, 16>;
-  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, 16, SLayout::NoneBox>;
-
-  AccT acc;
-  DstT dst;
-  FpT fp;
-  TMOV_FP(dst, acc, fp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using AccT = TileAcc<float, 16, 16>;
-  using DstT = Tile<TileType::Vec, int8_t, 16, 16>;
-  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, 16, SLayout::NoneBox>;
-
-  AccT acc;
-  DstT dst;
-  FpT fp;
-  TASSIGN(acc, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(fp,  0x3000);
-  TMOV_FP(dst, acc, fp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.tmov](./tmov.md)
-- Next op in family: [pto.treshape](./treshape.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp_zh.md
deleted file mode 100644
index 7d2217f3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov-fp_zh.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tmov-fp_zh.md` -->
-
-# TMOV_FP
-
-## Instruction Diagram
-
-![TMOV_FP tile operation](../figures/isa/TMOV_FP.svg)
-
-## Summary
-
-Move/convert from an accumulator tile into a destination tile, using a scaling (`fp`) tile for vector quantization parameters.
-
-## Mathematical Semantics
-
-Conceptually converts each element using an implementation-defined quantization/dequantization configuration derived from `fp`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMOV_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - The fp path is only supported for accumulator conversion and is validated by internal compile-time checks in `TMOV_IMPL(dst, src, fp)`.
-    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
-- **Implementation checks (A5)**:
-    - Validated by `CheckTMovAccValid(...)` and related compile-time checks in `TMOV_IMPL(dst, src, fp)`.
-    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
-    - Destination location is target-dependent (`Vec` or `Mat` are supported in the fp path).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using AccT = TileAcc<float, 16, 16>;
-  using DstT = Tile<TileType::Vec, int8_t, 16, 16>;
-  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, 16, SLayout::NoneBox>;
-
-  AccT acc;
-  DstT dst;
-  FpT fp;
-  TMOV_FP(dst, acc, fp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using AccT = TileAcc<float, 16, 16>;
-  using DstT = Tile<TileType::Vec, int8_t, 16, 16>;
-  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, 16, SLayout::NoneBox>;
-
-  AccT acc;
-  DstT dst;
-  FpT fp;
-  TASSIGN(acc, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(fp,  0x3000);
-  TMOV_FP(dst, acc, fp);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov.md
deleted file mode 100644
index 437e168f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov.md
+++ /dev/null
@@ -1,245 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tmov.md` -->
-
-# pto.tmov
-
-Standalone reference page for `pto.tmov`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Move/copy between tiles, optionally applying implementation-defined conversion modes.
-
-## Mechanism
-
-Move/copy between tiles, optionally applying implementation-defined conversion modes selected by template parameters and overloads.
-
-`TMOV` is used for:
-
-- Vec -> Vec moves
-- Mat -> Left/Right/Bias/Scaling/Scale(Microscaling) moves (target-dependent)
-- Acc -> Mat/Vec moves (target-dependent). `TMOV` belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Conceptually copies or transforms elements from `src` into `dst` over the valid region. Exact transformation depends on the selected mode and target.
-
-For the pure copy case:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-The PTO AS design recommends splitting `TMOV` into a family of ops:
-
-```text
-%left  = tmov.m2l %mat  : !pto.tile<...> -> !pto.tile<...>
-%right = tmov.m2r %mat  : !pto.tile<...> -> !pto.tile<...>
-%bias  = tmov.m2b %mat  : !pto.tile<...> -> !pto.tile<...>
-%scale = tmov.m2s %mat  : !pto.tile<...> -> !pto.tile<...>
-%vec   = tmov.a2v %acc  : !pto.tile<...> -> !pto.tile<...>
-%v1    = tmov.v2v %v0   : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename FpTileData, AccToVecMode mode,
-          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-- `fp` (optional for TMOV_FP): scaling tile for vector quantization parameters.
-
-## Expected Outputs
-
-`dst` holds a copy or transformed version of `src`, with optional conversion applied (relu, quantization, etc.).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `TMOV` has these overload families:
-  - plain move: `TMOV(dst, src)`
-  - relu form: `TMOV<..., reluMode>(dst, src)`
-  - accumulator-to-vector form: `TMOV<..., mode, reluMode>(dst, src)`
-  - vector-quant form: `TMOV<..., FpTileData, mode, reluMode>(dst, src, fp)`
-  - scalar-quant form: `TMOV<..., reluMode>(dst, src, preQuantScalar)` and `TMOV<..., mode, reluMode>(dst, src, preQuantScalar)`
-
-- `reluMode` is `ReluPreMode::{NoRelu, NormalRelu}`.
-
-- Shape must match: `SrcTileData::Rows == DstTileData::Rows` and `SrcTileData::Cols == DstTileData::Cols`.
-
-- On A2A3, supported tile-type pairs are compile-time restricted to:
-  - `TileType::Mat -> TileType::Left/Right/Bias/Scaling`
-  - `TileType::Vec -> TileType::Vec`
-  - `TileType::Acc -> TileType::Mat`
-
-- For `TileType::Mat -> TileType::Bias` on A2A3:
-  - supported source/destination dtype pairs are `int32_t -> int32_t`, `float -> float`, and `half -> float`
-  - source row must be `1`
-  - `SrcTileData::Cols * sizeof(SrcType)` must be aligned to `64` bytes
-
-- For `TileType::Mat -> TileType::Scaling` on A2A3:
-  - destination dtype must equal source dtype and must be `uint64_t`
-  - source row must be `1`
-  - `SrcTileData::Cols * sizeof(SrcType)` must be aligned to `128` bytes
-
-- `CommonCheck()` requires:
-  - destination/source dtype must be identical
-  - supported element types are `int8_t`, `hifloat8_t`, `float8_e5m2_t`, `float8_e4m3_t`, `half`, `bfloat16_t`, `float`, `float4_e2m1x2_t`, `float4_e1m2x2_t`
-  - source layout must satisfy one of:
-    - `(SrcTileData::SFractal == SLayout::ColMajor && SrcTileData::isRowMajor)`
-    - `(SrcTileData::SFractal == SLayout::RowMajor && !SrcTileData::isRowMajor)`
-    - `SrcTileData::isRowMajor`
-
-- `CommonCheckMX()` for MX paths requires identical source/destination dtype and supports `float8_e8m0_t`.
-
-- For `TileType::Mat -> TileType::Bias` on A5:
-  - supported dtype pairs are `int32_t -> int32_t`, `float -> float`, `half -> float`, `bfloat16_t -> float`
-  - source row must be `1`
-  - `DstTileData::Cols * sizeof(DstType)` must be aligned to `64` bytes
-  - bias-table footprint `DstTileData::Cols * sizeof(DstType)` must not exceed `4096` bytes
-
-- For `TileType::Mat -> TileType::Scaling` on A5:
-  - source row must be `1`
-  - `DstTileData::Cols * sizeof(DstType)` must be aligned to `128` bytes
-  - fixpipe-buffer footprint `DstTileData::Cols * sizeof(DstType)` must not exceed `4096` bytes
-
-- For `TileType::Acc -> TileType::Vec`:
-  - `mode` selects `SingleModeVec0`, `SingleModeVec1`, `DualModeSplitM`, or `DualModeSplitN`
-  - dual-destination modes require `QuantMode_t::NoQuant`
-  - dual-destination modes do not support the `nz2dn` path
-  - destination stride must be non-zero and `dstStride * sizeof(dstType)` must be a multiple of `32` bytes
-
-- For `TileType::Acc -> TileType::Mat`:
-  - destination stride must be non-zero and `dstStride * sizeof(dstType)` must be a multiple of `32` bytes
-  - relu/scalar-quant/vector-quant forms are supported through the corresponding overloads
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `mode` is `AccToVecMode::{SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN}`.
-
-### A2A3 implementation checks
-
-- For `TileType::Acc -> TileType::Mat`:
-  - additional `CheckTMovAccToMat<...>` compile-time checks are enforced
-  - plain/relu forms use cast pre-quant mode derived by `GetCastPreQuantMode<SrcDType, DstDType>()`
-  - scalar-quant forms use `GetScalarPreQuantMode<SrcDType, DstDType>()`
-  - vector-quant forms require an `FpTileData` operand with `FpTileData::Loc == TileType::Scaling`, and use `GetVectorPreQuantMode<SrcDType, DstDType>()`
-
-### A5 implementation checks
-
-- Supported paths include:
-  - `TileType::Mat -> TileType::Left/Right/Bias/Scaling/ScaleLeft/ScaleRight`
-  - `TileType::Vec -> TileType::Vec/TileType::Mat`
-  - `TileType::Acc -> TileType::Vec/TileType::Mat`
-  - specific `ND -> ZZ` and related internal path variants handled by the A5 implementation
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TMOV(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
-  using DstT = TileLeft<float, 16, 16>;
-  SrcT mat;
-  DstT left;
-  TASSIGN(mat, 0x1000);
-  TASSIGN(left, 0x2000);
-  TMOV(left, mat);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.tfillpad_expand](./tfillpad-expand.md)
-- Next op in family: [pto.tmov_fp](./tmov-fp.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov_zh.md
deleted file mode 100644
index 2c7bdafb..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/tmov_zh.md
+++ /dev/null
@@ -1,208 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/tmov_zh.md` -->
-
-# TMOV
-
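-## Instruction Diagram
-
-![TMOV tile operation](../figures/isa/TMOV.svg)
-
-## Summary
-
-Move/copy between tiles, optionally applying implementation-defined conversion modes selected by template parameters and overloads.
-
-`TMOV` is used for:
-
-- Vec -> Vec moves
-- Mat -> Left/Right/Bias/Scaling/Scale (Microscaling) moves (target-dependent)
-- Acc -> Mat/Vec moves (target-dependent)
-
-## Mathematical Semantics
-
-Conceptually copies or transforms elements from `src` into `dst` over the valid region. The exact transformation depends on the selected mode and target.
-
-For the pure copy case:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-The PTO AS design recommends splitting `TMOV` into a family of ops: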
-
-```text
-%left  = tmov.m2l %mat  : !pto.tile<...> -> !pto.tile<...>
-%right = tmov.m2r %mat  : !pto.tile<...> -> !pto.tile<...>
-%bias  = tmov.m2b %mat  : !pto.tile<...> -> !pto.tile<...>
-%scale = tmov.m2s %mat  : !pto.tile<...> -> !pto.tile<...>
-%vec   = tmov.a2v %acc  : !pto.tile<...> -> !pto.tile<...>
-%v1    = tmov.v2v %v0   : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
-
-```cpp
-template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, typename FpTileData, AccToVecMode mode,
-          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
-
-template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
-```
-
-## Constraints
-
-### General constraints / checks
-
-- `TMOV` has these overload families:
-  - plain move: `TMOV(dst, src)`
-  - relu form: `TMOV<..., reluMode>(dst, src)`
-  - accumulator-to-vector form: `TMOV<..., mode, reluMode>(dst, src)`
-  - vector-quant form: `TMOV<..., FpTileData, mode, reluMode>(dst, src, fp)`
-  - scalar-quant form: `TMOV<..., reluMode>(dst, src, preQuantScalar)` and `TMOV<..., mode, reluMode>(dst, src, preQuantScalar)`
-- `reluMode` is `ReluPreMode::{NoRelu, NormalRelu}`.
-- `mode` is `AccToVecMode::{SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN}`.
-
-### A2A3 implementation checks
-
-- Shape must match: `SrcTileData::Rows == DstTileData::Rows` and `SrcTileData::Cols == DstTileData::Cols`.
-- Supported tile-type pairs are compile-time restricted to:
-  - `TileType::Mat -> TileType::Left/Right/Bias/Scaling`
-  - `TileType::Vec -> TileType::Vec`
-  - `TileType::Acc -> TileType::Mat`
-- For `TileType::Mat -> TileType::Bias`:
-  - supported source/destination dtype pairs are `int32_t -> int32_t`, `float -> float`, and `half -> float`
-  - the source row count must be `1`
-  - `SrcTileData::Cols * sizeof(SrcType)` must be aligned to `64` bytes
-- For `TileType::Mat -> TileType::Scaling`:
-  - destination dtype must equal source dtype and must be `uint64_t`
-  - the source row count must be `1`
-  - `SrcTileData::Cols * sizeof(SrcType)` must be aligned to `128` bytes
-- For `TileType::Acc -> TileType::Mat`:
-  - additional `CheckTMovAccToMat<...>` compile-time checks are enforced
-  - plain/relu forms use the cast pre-quant mode derived by `GetCastPreQuantMode<SrcDType, DstDType>()`
-  - scalar-quant forms use `GetScalarPreQuantMode<SrcDType, DstDType>()`
-  - vector-quant forms require an `FpTileData` operand with `FpTileData::Loc == TileType::Scaling`, and use `GetVectorPreQuantMode<SrcDType, DstDType>()`
-
-### A5 implementation checks
-
-- `CommonCheck()` requires:
-  - destination/source dtype must be identical
-  - supported element types are `int8_t`, `hifloat8_t`, `float8_e5m2_t`, `float8_e4m3_t`, `half`, `bfloat16_t`, `float`, `float4_e2m1x2_t`, `float4_e1m2x2_t`
-  - source layout must satisfy one of:
-    - `(SrcTileData::SFractal == SLayout::ColMajor && SrcTileData::isRowMajor)`
-    - `(SrcTileData::SFractal == SLayout::RowMajor && !SrcTileData::isRowMajor)`
-    - `SrcTileData::isRowMajor`
-- `CommonCheckMX()` for MX paths requires identical source/destination dtype and supports `float8_e8m0_t`.
-- Supported paths include:
-  - `TileType::Mat -> TileType::Left/Right/Bias/Scaling/ScaleLeft/ScaleRight`
-  - `TileType::Vec -> TileType::Vec/TileType::Mat`
-  - `TileType::Acc -> TileType::Vec/TileType::Mat`
-  - specific `ND -> ZZ` and related internal path variants handled by the A5 implementation
-- For `TileType::Mat -> TileType::Bias`:
-  - supported dtype pairs are `int32_t -> int32_t`, `float -> float`, `half -> float`, `bfloat16_t -> float`
-  - the source row count must be `1`
-  - `DstTileData::Cols * sizeof(DstType)` must be aligned to `64` bytes
-  - bias-table footprint `DstTileData::Cols * sizeof(DstType)` must not exceed `4096` bytes
-- For `TileType::Mat -> TileType::Scaling`:
-  - the source row count must be `1`
-  - `DstTileData::Cols * sizeof(DstType)` must be aligned to `128` bytes
-  - fixpipe-buffer footprint `DstTileData::Cols * sizeof(DstType)` must not exceed `4096` bytes
-- For `TileType::Acc -> TileType::Vec`:
-  - `mode` selects `SingleModeVec0`, `SingleModeVec1`, `DualModeSplitM`, or `DualModeSplitN`
-  - dual-destination modes require `QuantMode_t::NoQuant`
-  - dual-destination modes do not support the `nz2dn` path
-  - destination stride must be non-zero and `dstStride * sizeof(dstType)` must be a multiple of `32` bytes
-- For `TileType::Acc -> TileType::Mat`:
-  - destination stride must be non-zero and `dstStride * sizeof(dstType)` must be a multiple of `32` bytes
-  - relu/scalar-quant/vector-quant forms are supported through the corresponding overloads
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TMOV(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
-  using DstT = TileLeft<float, 16, 16>;
-  SrcT mat;
-  DstT left;
-  TASSIGN(mat, 0x1000);
-  TASSIGN(left, 0x2000);
-  TMOV(left, mat);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/treshape.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/treshape.md
deleted file mode 100644
index 7b8f5193..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/treshape.md
+++ /dev/null
@@ -1,142 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/treshape.md` -->
-
-# pto.treshape
-
-Standalone reference page for `pto.treshape`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Reinterpret a tile as another tile type/shape while preserving the underlying bytes.
-
-## Mechanism
-
-Reinterpret a tile as another tile type/shape while preserving the underlying bytes.
-
-This is a *bitwise* reshape: it does not change values, it only changes how the same byte buffer is viewed. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-```text
-%dst = treshape %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TRESHAPE(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. Must have same total byte size as `src`.
-
-## Expected Outputs
-
-`dst` holds the same byte data as `src`, reinterpreted with different tile type/shape.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-Enforced by `TRESHAPE_IMPL`:
-
-- **Tile type must match**: `TileDataIn::Loc == TileDataOut::Loc`.
-
-- **Total byte size must match**: `sizeof(InElem) * InNumel == sizeof(OutElem) * OutNumel`.
-
-- **No boxed/non-boxed conversion**:
-    - cannot reshape between `SLayout::NoneBox` and boxed layouts.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.treshape` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using Src = Tile<TileType::Vec, float, 16, 16>;
-  using Dst = Tile<TileType::Vec, float, 8, 32>;
-  static_assert(Src::Numel == Dst::Numel);
-
-  Src src;
-  Dst dst;
-  TRESHAPE(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.tmov_fp](./tmov-fp.md)
-- Next op in family: [pto.ttrans](./ttrans.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/treshape_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/treshape_zh.md
deleted file mode 100644
index 412474d5..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/treshape_zh.md
+++ /dev/null
@@ -1,83 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/treshape_zh.md` -->
-
-# TRESHAPE
-
-## Instruction Diagram
-
-![TRESHAPE tile operation](../figures/isa/TRESHAPE.svg)
-
-## Summary
-
-Reinterpret a tile as another tile type/shape while preserving the underlying bytes.
-
-## Mathematical Semantics
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-```text
-%dst = treshape %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TRESHAPE(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-```
-
-## Constraints
-
-Enforced by `TRESHAPE_IMPL`:
-
-- **Tile type must match**: `TileDataIn::Loc == TileDataOut::Loc`.
-- **Total byte size must match**: `sizeof(InElem) * InNumel == sizeof(OutElem) * OutNumel`.
-- **No boxed/non-boxed conversion**:
-    - cannot reshape between `SLayout::NoneBox` and boxed layouts.
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using Src = Tile<TileType::Vec, float, 16, 16>;
-  using Dst = Tile<TileType::Vec, float, 8, 32>;
-  static_assert(Src::Numel == Dst::Numel);
-
-  Src src;
-  Dst dst;
-  TRESHAPE(dst, src);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/ttrans.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/ttrans.md
deleted file mode 100644
index 992eaf1d..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/ttrans.md
+++ /dev/null
@@ -1,172 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/ttrans.md` -->
-
-# pto.ttrans
-
-Standalone reference page for `pto.ttrans`. This page belongs to the [Layout And Rearrangement](../../layout-and-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Transpose with an implementation-defined temporary tile.
-
-## Mechanism
-
-Transpose with an implementation-defined temporary tile. It belongs to the tile surface and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern.
-
-For a 2D tile, over the effective transpose domain:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{j,i} $$
-
-Exact shape/layout and the transpose domain depend on the target (see Constraints).
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TTRANS(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `tmp` is a temporary tile used during transpose (may not be used by all implementations).
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the transposed version of `src`: `dst[i,j]` = `src[j,i]`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Temporary tile**:
-    - The C++ API requires `tmp`, but some implementations may not use it.
-
-- **ConvTile**:
-    - Transpose of ConvTile is supported for `TileType::Vec`. Element size must be `1`, `2`, or `4` bytes. Supported element types are `uint32_t`, `int32_t`, `float`, `uint16_t`, `int16_t`, `half`, `bfloat16_t`, `uint8_t`, and `int8_t`.
-    - Format transformation from `NCHW` to `NC1HWC0` is supported, where `C1 == (C + C0 - 1)/C0` and HW must satisfy the alignment constraint, that is, `H*W*sizeof(T)==0`. C0 is the `c0_size`, chosen so that `C0 * sizeof(T) == 32`; C0 can also be 4.
-    - Format transformation from `NC1HWC0` to `FRACTAL_Z` is supported, where `N1 == (N + N0 - 1)/N0`. N0 should be 16.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`.
-    - Source layout must be row-major (`TileDataSrc::isRowMajor`).
-    - Element size must be `1`, `2`, or `4` bytes.
-    - Supported element types are restricted per element width:
-        - 4 bytes: `uint32_t`, `int32_t`, `float`
-        - 2 bytes: `uint16_t`, `int16_t`, `half`, `bfloat16_t`
-        - 1 byte: `uint8_t`, `int8_t`
-    - The transpose size is taken from `src.GetValidRow()` / `src.GetValidCol()`.
-
-- **Implementation checks (A5)**:
-    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`.
-    - 32-byte alignment constraints are enforced on the major dimension of both input and output (row-major checks `Cols * sizeof(T) % 32 == 0`, col-major checks `Rows * sizeof(T) % 32 == 0`).
-    - Supported element types are restricted per element width:
-        - 4 bytes: `uint32_t`, `int32_t`, `float`
-        - 2 bytes: `uint16_t`, `int16_t`, `half`, `bfloat16_t`
-        - 1 byte: `uint8_t`, `int8_t`
-    - The implementation operates over the static tile shape (`TileDataSrc::Rows/Cols`) and does not consult `GetValidRow/GetValidCol`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TTRANS(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TTRANS(dst, src, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Layout And Rearrangement](../../layout-and-rearrangement.md)
-- Previous op in family: [pto.treshape](./treshape.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md
deleted file mode 100644
index a32fa473..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md
+++ /dev/null
@@ -1,145 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md` -->
-
-# TTRANS
-
-## 指令示意图
-
-![TTRANS tile operation](../figures/isa/TTRANS.svg)
-
-## 简介
-
-使用实现定义的临时 Tile 进行转置。
-
-## 数学语义
-
-对于二维 Tile，在有效转置域上：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{j,i} $$
-
-确切的形状/布局及转置域取决于目标硬件（参见约束）。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TTRANS(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`。
-    - 源布局必须是行主序（`TileDataSrc::isRowMajor`）。
-    - 元素大小必须是 `1`、`2` 或 `4` 字节。
-    - 支持的元素类型按元素宽度限制如下：
-        - 4 字节：`uint32_t`、`int32_t`、`float`
-        - 2 字节：`uint16_t`、`int16_t`、`half`、`bfloat16_t`
-        - 1 字节：`uint8_t`、`int8_t`
-    - 转置大小取自 `src.GetValidRow()` / `src.GetValidCol()`。
-- **实现检查 (A5)**:
-    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`。
-    - 对输入和输出的主维度强制执行 32 字节对齐约束（行主序检查 `Cols * sizeof(T) % 32 == 0`，列主序检查 `Rows * sizeof(T) % 32 == 0`）。
-    - 支持的元素类型按元素宽度限制如下：
-        - 4 字节：`uint32_t`、`int32_t`、`float`
-        - 2 字节：`uint16_t`、`int16_t`、`half`、`bfloat16_t`
-        - 1 字节：`uint8_t`、`int8_t`
-    - 实现在静态 Tile 形状（`TileDataSrc::Rows/Cols`）上运算，不参考 `GetValidRow/GetValidCol`。
-- **临时 Tile**:
-    - C++ API 需要 `tmp`，但某些实现可能不使用它。
-- **ConvTile**:
-    - 支持在`TileType::Vec`上的ConvTile的格式转换。其元素大小必须是 `1`、`2` 或 `4` 字节。元素类型限制为`uint32_t`、`int32_t`、`float`、`uint16_t`、`int16_t`、`half`、`bfloat16_t`、`uint8_t`、`int8_t`。
-    - 支持ConvTile从`NCHW`到`NC1HWC0`的变换，其中`C1 == (C + C0 - 1)/C0`，HW满足对齐要求，即`H*W*sizeof(T)==0`. C0对应`c0_size`, 即`C0 * sizeof(T) == 32`。C0也可以为4。
-    - 支持ConvTile从`NC1HWC0`到`FRACTAL_Z`的变换, 其中`N1 == (N + N0 - 1)/N0`。N0为16。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TTRANS(dst, src, tmp);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TTRANS(dst, src, tmp);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc.md
deleted file mode 100644
index 66ab7b01..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc.md
+++ /dev/null
@@ -1,191 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc.md` -->
-
-# pto.tgemv_acc
-
-Standalone reference page for `pto.tgemv_acc`. This page belongs to the [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md) family in the PTO ISA manual.
-
-## Summary
-
-GEMV with explicit accumulator input/output tiles.
-
-## Mechanism
-
-Tile-based GEMV with explicit accumulator input tile (`cInMatrix`) and output tile (`cOutMatrix`). It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let:
-
-- `M = 1`
-- `K = bMatrix.GetValidRow()`
-- `N = bMatrix.GetValidCol()`
-
-For `0 <= j < N` (accumulates into the existing output tile):
-
-$$ \mathrm{C}_{0,j} \gets \mathrm{C}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
-**Note:** Exact accumulator behavior and datatype promotion are target/implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-```
-
-## Inputs
-
-- `cInMatrix` is the input accumulator tile.
-- `aMatrix` is the left operand tile (`TileLeft::Loc == Left`).
-- `bMatrix` is the right operand tile (`TileRight::Loc == Right`).
-- `cOutMatrix` is the output accumulator tile. The operation iterates over its valid region.
-
-## Expected Outputs
-
-`cOutMatrix` holds the accumulated matrix-vector product: `cOutMatrix[0,j]` = `cInMatrix[0,j]` + sum over `k` of `aMatrix[0,k] * bMatrix[k,j]`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### Common shape and location constraints
-
-- Static shape constraints:
-    - `TileLeft::Rows == TileRes::Rows`
-    - `TileLeft::Cols == TileRight::Rows`
-    - `TileRight::Cols == TileRes::Cols`
-
-- Tile locations:
-    - `TileLeft::Loc == Left`
-    - `TileRight::Loc == Right`
-    - `TileRes::Loc == Acc`
-
-- Runtime valid-size constraints:
-    - `m` must be `1`
-    - `k` and `n` (taken from `bMatrix.GetValidRow()` and `bMatrix.GetValidCol()`) must be in `[1, 4095]`
-
-### Datatype constraints
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported `(CType, AType, BType)` triples:
-        - `(int32_t, int8_t, int8_t)`
-        - `(float, half, half)`
-        - `(float, float, float)`
-        - `(float, bfloat16_t, bfloat16_t)`
-
-- **Implementation checks (A5)**:
-    - Accumulator type must be `int32_t` or `float`.
-    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
-    - If `float`: supports `half`, `bfloat16_t`, `float`, and selected fp8 pairs (target-defined).
-    - Fractal/layout constraints are enforced:
-        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
-        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
-        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
-    - No separate explicit `m/k/n` runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TGEMV_ACC(c1, c0, a, b);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c0, 0x3000);
-  TASSIGN(c1, 0x4000);
-  TGEMV_ACC(c1, c0, a, b);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md)
-- Previous op in family: [pto.tgemv](./tgemv.md)
-- Next op in family: [pto.tgemv_bias](./tgemv-bias.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc_zh.md
deleted file mode 100644
index 8d3cccb7..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc_zh.md
+++ /dev/null
@@ -1,167 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-acc_zh.md` -->
-
-# TGEMV_ACC
-
-## 指令示意图
-
-![TGEMV_ACC tile operation](../figures/isa/TGEMV_ACC.svg)
-
-## 简介
-
-带显式累加器输入 Tile（`cInMatrix`）和输出 Tile（`cOutMatrix`）的 GEMV。
-
-## 另请参见
-
-- 基础 GEMV 指令：`docs/isa/TGEMV.md`。
-- 偏置变体：`docs/isa/TGEMV_BIAS.md`。
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-```
-
-## 数学语义
-
-设：
-
-- `M = 1`
-- `K = bMatrix.GetValidRow()`
-- `N = bMatrix.GetValidCol()`
-
-对于 `0 <= j < N`（累加到已有输出 Tile）：
-
-$$ \mathrm{C}_{0,j} \gets \mathrm{C}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
-**注意：** 精确的累加器行为和数据类型提升由目标/实现定义。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
-
-## 约束
-
-### 通用形状与位置约束
-
-- 静态形状约束：
-    - `TileLeft::Rows == TileRes::Rows`
-    - `TileLeft::Cols == TileRight::Rows`
-    - `TileRight::Cols == TileRes::Cols`
-- Tile 位置约束：
-    - `TileLeft::Loc == Left`
-    - `TileRight::Loc == Right`
-    - `TileRes::Loc == Acc`
-- 运行时有效尺寸约束：
-    - `m` 必须为 `1`
-    - `k` 和 `n`（取自 `bMatrix.GetValidRow()` 与 `bMatrix.GetValidCol()`）必须位于 `[1, 4095]`
-
-### 数据类型约束
-
-- **实现检查 (A2A3)**:
-    - 支持的 `(CType, AType, BType)` 三元组：
-        - `(int32_t, int8_t, int8_t)`
-        - `(float, half, half)`
-        - `(float, float, float)`
-        - `(float, bfloat16_t, bfloat16_t)`
-- **实现检查 (A5)**:
-    - 累加器类型必须是 `int32_t` 或 `float`。
-    - 如果为 `int32_t`：`AType == int8_t` 且 `BType == int8_t`。
-    - 如果为 `float`：支持 `half`、`bfloat16_t`、`float` 以及选定的 fp8 组合（目标定义）。
-    - 会强制执行以下分形/布局约束：
-        - Left：`Loc == Left`、`!isRowMajor`、`SFractal == RowMajor`
-        - Right：`Loc == Right`、`isRowMajor`、`SFractal == ColMajor`
-        - Acc：`Loc == Acc`、`!isRowMajor`、`SFractal == RowMajor`
-    - 除上述 GEMV 约定外，底层 A5 matmul 实现不会再单独补充一组显式的 `m/k/n` 运行时断言。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TGEMV_ACC(c1, c0, a, b);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c0, 0x3000);
-  TASSIGN(c1, 0x4000);
-  TGEMV_ACC(c1, c0, a, b);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias.md
deleted file mode 100644
index 4d9c0a9b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias.md
+++ /dev/null
@@ -1,205 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias.md` -->
-
-# pto.tgemv_bias
-
-Standalone reference page for `pto.tgemv_bias`. This page belongs to the [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md) family in the PTO ISA manual.
-
-## Summary
-
-GEMV with bias add.
-
-## Mechanism
-
-Tile-based GEMV with bias add. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let:
-
-- `M = 1`
-- `K = bMatrix.GetValidRow()`
-- `N = bMatrix.GetValidCol()`
-
-For `0 <= j < N` (adds a bias term to the matrix product):
-
-$$ \mathrm{C}_{0,j} = \mathrm{Bias}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
-**Note:** Exact accumulator behavior and datatype promotion are target/implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename TileBias,
-          typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
-```
-
-## Inputs
-
-- `aMatrix` is the left operand tile (`TileLeft::Loc == Left`).
-- `bMatrix` is the right operand tile (`TileRight::Loc == Right`).
-- `biasData` is the bias tile (location must be `TileType::Bias`, single row).
-- `cMatrix` is the destination accumulator tile. The operation iterates over its valid region.
-
-## Expected Outputs
-
-`cMatrix` holds the biased matrix-vector product: `cMatrix[0,j]` = `biasData[0,j]` + sum over `k` of `aMatrix[0,k] * bMatrix[k,j]`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### Common shape and location constraints
-
-- Static shape constraints:
-    - `TileLeft::Rows == TileRes::Rows`
-    - `TileLeft::Cols == TileRight::Rows`
-    - `TileRight::Cols == TileRes::Cols`
-
-- Tile locations:
-    - `TileLeft::Loc == Left`
-    - `TileRight::Loc == Right`
-    - `TileRes::Loc == Acc`
-
-- Runtime valid-size constraints:
-    - `m` must be `1`
-    - `k` and `n` (taken from `bMatrix.GetValidRow()` and `bMatrix.GetValidCol()`) must be in `[1, 4095]`
-
-### Datatype constraints
-
-- Bias tile datatype must exactly match `TileRes::DType`.
-
-- Bias tile must be configured as a single row.
-
-- Bias tile location must be `TileType::Bias`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported `(CType, AType, BType)` triples:
-        - `(int32_t, int8_t, int8_t)`
-        - `(float, half, half)`
-        - `(float, float, float)`
-        - `(float, bfloat16_t, bfloat16_t)`
-
-- **Implementation checks (A5)**:
-    - Accumulator type must be `int32_t` or `float`.
-    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
-    - If `float`: supports `half`, `bfloat16_t`, `float`, and selected fp8 pairs (target-defined).
-    - Fractal/layout constraints are enforced:
-        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
-        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
-        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
-
-### Bias-specific constraints
-
-- **Additional A5 note**:
-    - No separate explicit `m/k/n` runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, float, 1, 16>;  // bias dtype must match TileRes::DType (float)
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TGEMV_BIAS(c, a, b, bias);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, float, 1, 16>;  // bias dtype must match TileRes::DType (float)
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(bias, 0x3000);
-  TASSIGN(c, 0x4000);
-  TGEMV_BIAS(c, a, b, bias);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md)
-- Previous op in family: [pto.tgemv_acc](./tgemv-acc.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias_zh.md
deleted file mode 100644
index a17b6265..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias_zh.md
+++ /dev/null
@@ -1,179 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-bias_zh.md` -->
-
-# TGEMV_BIAS
-
-## 指令示意图
-
-![TGEMV_BIAS tile operation](../figures/isa/TGEMV_BIAS.svg)
-
-## 简介
-
-带偏置加法的 GEMV。
-
-## 另请参见
-
-- 基础 GEMV 指令：`docs/isa/TGEMV.md`。
-- 累加变体：`docs/isa/TGEMV_ACC.md`。
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename TileBias,
-          typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
-```
-
-## 数学语义
-
-设：
-
-- `M = 1`
-- `K = bMatrix.GetValidRow()`
-- `N = bMatrix.GetValidCol()`
-
-对于 `0 <= j < N`（将偏置项加入矩阵乘积）：
-
-$$ \mathrm{C}_{0,j} = \mathrm{Bias}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
-**注意：** 精确的累加器行为和数据类型提升由目标/实现定义。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## 约束
-
-### 通用形状与位置约束
-
-- 静态形状约束：
-    - `TileLeft::Rows == TileRes::Rows`
-    - `TileLeft::Cols == TileRight::Rows`
-    - `TileRight::Cols == TileRes::Cols`
-- Tile 位置约束：
-    - `TileLeft::Loc == Left`
-    - `TileRight::Loc == Right`
-    - `TileRes::Loc == Acc`
-- 运行时有效尺寸约束：
-    - `m` 必须为 `1`
-    - `k` 和 `n`（取自 `bMatrix.GetValidRow()` 与 `bMatrix.GetValidCol()`）必须位于 `[1, 4095]`
-
-### 数据类型约束
-
-- **实现检查 (A2A3)**:
-    - 支持的 `(CType, AType, BType)` 三元组：
-        - `(int32_t, int8_t, int8_t)`
-        - `(float, half, half)`
-        - `(float, float, float)`
-        - `(float, bfloat16_t, bfloat16_t)`
-- **实现检查 (A5)**:
-    - 累加器类型必须是 `int32_t` 或 `float`。
-    - 如果为 `int32_t`：`AType == int8_t` 且 `BType == int8_t`。
-    - 如果为 `float`：支持 `half`、`bfloat16_t`、`float` 以及选定的 fp8 组合（目标定义）。
-    - 会强制执行以下分形/布局约束：
-        - Left：`Loc == Left`、`!isRowMajor`、`SFractal == RowMajor`
-        - Right：`Loc == Right`、`isRowMajor`、`SFractal == ColMajor`
-        - Acc：`Loc == Acc`、`!isRowMajor`、`SFractal == RowMajor`
-
-### 偏置专属约束
-
-- 偏置 tile 的数据类型必须与 `TileRes::DType` 完全一致。
-- 偏置 tile 必须配置为单行。
-- 偏置 tile 的位置必须为 `TileType::Bias`。
-- **A5 附加说明**：
-    - 除上述 GEMV 约定外，底层 A5 matmul 实现不会再单独补充一组显式的 `m/k/n` 运行时断言。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, float, 1, 16>;  // bias dtype must match TileRes::DType (float)
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TGEMV_BIAS(c, a, b, bias);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, float, 1, 16>;  // bias dtype must match TileRes::DType (float)
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(bias, 0x3000);
-  TASSIGN(c, 0x4000);
-  TGEMV_BIAS(c, a, b, bias);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx.md
deleted file mode 100644
index 51afe987..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx.md
+++ /dev/null
@@ -1,163 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx.md` -->
-
-# pto.tgemv_mx
-
-Standalone reference page for `pto.tgemv_mx`. This page belongs to the [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md) family in the PTO ISA manual.
-
-## Summary
-
-GEMV with additional scaling tiles for mixed-precision / quantized matrix-vector compute.
-
-## Mechanism
-
-GEMV with scaling tiles for mixed-precision / quantized matrix-vector compute on supported targets.
-
-This instruction family extends `TGEMV` with additional scale operands (mx path). Accumulator and scale handling are target-dependent. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Conceptually (base GEMV path):
-
-$$
-\mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j}
-$$
-
-For `TGEMV_MX`, scale tiles participate in implementation-defined mixed-precision reconstruction / scaling. The architectural contract is that output corresponds to the target-defined mx GEMV semantics.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Schematic form:
-
-```text
-%acc = tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
-```
-
-Additional overloads support accumulation/bias variants and `AccPhase` selection.
-
-## Inputs
-
-- `a` is the left operand tile (must be TileLeft location).
-- `aScale` is the left scaling tile for mixed-precision reconstruction.
-- `b` is the right operand tile (must be TileRight location).
-- `bScale` is the right scaling tile for mixed-precision reconstruction.
-- `bias` (optional): bias tile (must be TileType::Bias).
-- `cIn` (optional): input accumulator tile for accumulation variants.
-- `dst` names the destination accumulator tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the mx matrix-vector product with mixed-precision scaling applied.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- Uses backend-specific mx legality checks for data types, tile locations, fractal/layout combinations, and scaling formats.
-
-- Scale tile compatibility and accumulator promotion are implementation-defined by target backend.
-
-- For portability, validate the exact `(A, B, scaleA, scaleB, C)` type tuple and tile layout against target implementation constraints.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tgemv_mx` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-For practical usage patterns, see:
-
-- [pto.tmatmul_mx](./tmatmul-mx.md)
-- [pto.tgemv](./tgemv.md)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md)
-- Next op in family: [pto.tmatmul_mx](./tmatmul-mx.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md
deleted file mode 100644
index eee9630f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md
+++ /dev/null
@@ -1,100 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tgemv-mx_zh.md` -->
-
-# TGEMV_MX
-
-## Instruction Diagram
-
-![TGEMV_MX tile operation](../figures/isa/TGEMV_MX.svg)
-
-## Summary
-
-GEMV variant with scaling tiles, supporting mixed-precision / quantized matrix-vector compute.
-
-## Mathematical Semantics
-
-Conceptually (base GEMV path):
-
-$$
-\mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j}
-$$
-
-For `TGEMV_MX`, scale tiles participate in implementation-defined mixed-precision reconstruction / scaling. The architectural contract is that output corresponds to the target-defined mx GEMV semantics.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Schematic form:
-
-```text
-%acc = tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
-```
-
-Additional overloads support accumulation/bias variants and `AccPhase` selection.
-
-## Constraints
-
-- Uses backend-specific mx legality checks for data types, tile locations, fractal/layout combinations, and scaling formats.
-- Scale tile compatibility and accumulator promotion are implementation-defined by target backend.
-- For portability, validate the exact `(A, B, scaleA, scaleB, C)` type tuple and tile layout against target implementation constraints.
-
-## Examples
-
-For practical usage patterns, see:
-
-- `docs/isa/TMATMUL_MX.md`
-- `docs/isa/TGEMV.md`
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv.md
deleted file mode 100644
index f952c8be..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv.md
+++ /dev/null
@@ -1,314 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tgemv.md` -->
-
-# pto.tgemv
-
-Standalone reference page for `pto.tgemv`. This page belongs to the [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md) family in the PTO ISA manual.
-
-## Summary
-
-General Matrix-Vector multiplication producing an accumulator/output tile.
-
-## Mechanism
-
-General Matrix-Vector multiplication (GEMV) producing an accumulator/output tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let:
-
-- `M = 1`
-- `K = bMatrix.GetValidRow()`
-- `N = bMatrix.GetValidCol()`
-
-### 1. TGEMV (Tile-based GEMV)
-
-For `0 <= j < N` (output elements in the effective matmul domain):
-
-$$ \mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
-### 2. TGEMV_ACC (Tile-based GEMV with Accumulation)
-
-For `0 <= j < N` (accumulates into existing tile):
-
-$$ \mathrm{C}_{0,j} \gets \mathrm{C}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
-### 3. TGEMV_BIAS (Tile-based GEMV with Bias)
-
-For `0 <= j < N` (adds bias term to matrix product):
-
-$$ \mathrm{C}_{0,j} = \mathrm{Bias}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
-**Note:** Exact accumulator behavior and datatype promotion are target/implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%acc = tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-
-%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-
-%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tgemv ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents&... events);
-
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents&... events);
-
-template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents&... events);
-```
-
-## Inputs
-
-- `a` is the left operand tile (must be TileLeft location).
-- `b` is the right operand tile (must be TileRight location).
-- `dst` names the destination accumulator tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the matrix-vector product: `dst[0,j]` = sum over `k` of `a[0,k] * b[k,j]`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### Common shape and location constraints
-
-These constraints apply to `TGEMV`, `TGEMV_ACC`, and `TGEMV_BIAS` unless otherwise noted.
-
-- Static shape constraints:
-    - `TileLeft::Rows == TileRes::Rows`
-    - `TileLeft::Cols == TileRight::Rows`
-    - `TileRight::Cols == TileRes::Cols`
-
-- Tile locations:
-    - `TileLeft::Loc == Left`
-    - `TileRight::Loc == Right`
-    - `TileRes::Loc == Acc`
-
-- Runtime valid-size constraints:
-    - `m` must be `1`
-    - `k` and `n` (taken from `bMatrix.GetValidRow()` and `bMatrix.GetValidCol()`) must be in `[1, 4095]`
-
-### TGEMV_BIAS bias-tile constraints
-
-- Bias tile datatype must exactly match `TileRes::DType`.
-
-- Bias tile must be configured as a single row.
-
-- Bias tile location must be `TileType::Bias`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported `(CType, AType, BType)` triples:
-        - `(int32_t, int8_t, int8_t)`
-        - `(float, half, half)`
-        - `(float, float, float)`
-        - `(float, bfloat16_t, bfloat16_t)`
-
-- **Implementation checks (A5)**:
-    - Accumulator type must be `int32_t` or `float`.
-    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
-    - If `float`: supports `half`, `bfloat16_t`, `float`, and selected fp8 pairs (target-defined).
-    - Fractal/layout constraints are enforced:
-        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
-        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
-        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
-
-### TGEMV_BIAS additional constraints
-
-- **Additional A5 note**:
-    - No separate explicit `m/k/n` runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
-
-## Examples
-
-### Auto
-
-#### 1. TGEMV
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c;
-  TGEMV(c, a, b);
-}
-```
-
-#### 2. TGEMV_ACC
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TGEMV_ACC(c1, c0, a, b);
-}
-```
-
-#### 3. TGEMV_BIAS
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, half, 1, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TGEMV_BIAS(c, a, b, bias);
-}
-```
-
-### Manual
-
-#### 1. TGEMV
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c, 0x3000);
-  TGEMV(c, a, b);
-}
-```
-
-#### 2. TGEMV_ACC
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c0, 0x3000);
-  TASSIGN(c1, 0x4000);
-  TGEMV_ACC(c1, c0, a, b);
-}
-```
-
-#### 3. TGEMV_BIAS
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, half, 1, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(bias, 0x3000);
-  TASSIGN(c, 0x4000);
-  TGEMV_BIAS(c, a, b, bias);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%acc = tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgemv ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md)
-- Previous op in family: [pto.tmatmul_bias](./tmatmul-bias.md)
-- Next op in family: [pto.tgemv_acc](./tgemv-acc.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv_zh.md
deleted file mode 100644
index 404a8121..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tgemv_zh.md
+++ /dev/null
@@ -1,283 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tgemv_zh.md` -->
-
-# TGEMV
-
-## Instruction Diagram
-
-![TGEMV tile operation](../figures/isa/TGEMV.svg)
-
-## Summary
-
-General matrix-vector multiplication producing an accumulator/output tile.
-
-## Mathematical Semantics
-
-Let:
-
-- `M = 1`
-- `K = bMatrix.GetValidRow()`
-- `N = bMatrix.GetValidCol()`
-
-### 1. TGEMV (Tile-based GEMV)
-
-For `0 <= j < N` (output elements in the effective matmul domain):
-
-$$ \mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
-### 2. TGEMV_ACC (Tile-based GEMV with Accumulation)
-
-For `0 <= j < N` (accumulates into the existing tile):
-
-$$ \mathrm{C}_{0,j} \gets \mathrm{C}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
-### 3. TGEMV_BIAS (Tile-based GEMV with Bias)
-
-For `0 <= j < N` (adds the bias term to the matrix product):
-
-$$ \mathrm{C}_{0,j} = \mathrm{Bias}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
-
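-**Note:** Exact accumulator behavior and datatype promotion are target/implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form: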
-
-```text
-%acc = tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-
-%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-
-%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tgemv ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents&... events);
-
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents&... events);
-
-template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents&... events);
-```
-
-## Constraints
-
-### Common shape and location constraints
-
-These constraints apply to `TGEMV`, `TGEMV_ACC`, and `TGEMV_BIAS` unless otherwise noted.
-
-- Static shape constraints:
-    - `TileLeft::Rows == TileRes::Rows`
-    - `TileLeft::Cols == TileRight::Rows`
-    - `TileRight::Cols == TileRes::Cols`
-- Tile locations:
-    - `TileLeft::Loc == Left`
-    - `TileRight::Loc == Right`
-    - `TileRes::Loc == Acc`
-- Runtime valid-size constraints:
-    - `m` must be `1`
-    - `k` and `n` (taken from `bMatrix.GetValidRow()` and `bMatrix.GetValidCol()`) must be in `[1, 4095]`
-
-### TGEMV / TGEMV_ACC datatype constraints
-
-- **Implementation checks (A2A3)**:
-    - Supported `(CType, AType, BType)` triples:
-        - `(int32_t, int8_t, int8_t)`
-        - `(float, half, half)`
-        - `(float, float, float)`
-        - `(float, bfloat16_t, bfloat16_t)`
-- **Implementation checks (A5)**:
-    - Accumulator type must be `int32_t` or `float`.
-    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
-    - If `float`: supports `half`, `bfloat16_t`, `float`, and selected fp8 pairs (target-defined).
-    - The following fractal/layout constraints are enforced:
-        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
-        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
-        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
-
-### TGEMV_BIAS additional constraints
-
-- Bias tile datatype must exactly match `TileRes::DType`.
-- Bias tile must be configured as a single row.
-- Bias tile location must be `TileType::Bias`.
-- **Additional A5 note**:
-    - No separate explicit `m/k/n` runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
-
-## Examples
-
-### Auto
-
-#### 1. TGEMV
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c;
-  TGEMV(c, a, b);
-}
-```
-
-#### 2. TGEMV_ACC
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TGEMV_ACC(c1, c0, a, b);
-}
-```
-
-#### 3. TGEMV_BIAS
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, half, 1, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TGEMV_BIAS(c, a, b, bias);
-}
-```
-
-### Manual
-
-#### 1. TGEMV
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c, 0x3000);
-  TGEMV(c, a, b);
-}
-```
-
-#### 2. TGEMV_ACC
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c0, 0x3000);
-  TASSIGN(c1, 0x4000);
-  TGEMV_ACC(c1, c0, a, b);
-}
-```
-
-#### 3. TGEMV_BIAS
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 1, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, half, 1, 16>;
-  using C = TileAcc<float, 1, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(bias, 0x3000);
-  TASSIGN(c, 0x4000);
-  TGEMV_BIAS(c, a, b, bias);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%acc = tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tgemv ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc.md
deleted file mode 100644
index b04fb8be..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc.md
+++ /dev/null
@@ -1,176 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc.md` -->
-
-# pto.tmatmul_acc
-
-Standalone reference page for `pto.tmatmul_acc`. This page belongs to the [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md) family in the PTO ISA manual.
-
-## Summary
-
-Matrix multiply with accumulator input (fused accumulate).
-
-## Mechanism
-
-Matrix multiply with accumulator input (fused accumulate). It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let:
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-For `0 <= i < M` and `0 <= j < N`:
-
-$$ \mathrm{C1}_{i,j} = \mathrm{C0}_{i,j} + \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%acc1 = tmatmul.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-
-template <AccPhase Phase = AccPhase::Unspecified, typename TileRes, typename TileLeft, typename TileRight,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_ACC(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-```
-
-## Inputs
-
-- `cIn` is the input accumulator tile.
-- `a` is the left operand tile (must be TileLeft location).
-- `b` is the right operand tile (must be TileRight location).
-- `dst` names the output accumulator tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the accumulated matrix multiply result: `dst[i,j]` = `cIn[i,j]` + sum over `k` of `a[i,k] * b[k,j]`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- All constraints from `TMATMUL` apply to the `(cOutMatrix, aMatrix, bMatrix)` triple.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation notes (A2A3/A5)**:
-    - `TMATMUL_ACC_IMPL` uses `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, and `bMatrix.GetValidCol()` for `m/k/n`.
-    - `cInMatrix` is not validated by explicit assertions in the current implementations (target-defined behavior).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TMATMUL_ACC(c1, c0, a, b);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c0, 0x3000);
-  TASSIGN(c1, 0x4000);
-  TMATMUL_ACC(c1, c0, a, b);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%acc1 = tmatmul.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md)
-- Previous op in family: [pto.tmatmul](./tmatmul.md)
-- Next op in family: [pto.tmatmul_bias](./tmatmul-bias.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md
deleted file mode 100644
index 83119feb..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-acc_zh.md` -->
-
-# TMATMUL_ACC
-
-## Instruction Diagram
-
-![TMATMUL_ACC tile operation](../figures/isa/TMATMUL_ACC.svg)
-
-## Summary
-
-Matrix multiply with accumulator input (fused accumulate).
-
-## Mathematical Semantics
-
-Let:
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-For `0 <= i < M` and `0 <= j < N`:
-
-$$ \mathrm{C1}_{i,j} = \mathrm{C0}_{i,j} + \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%acc1 = tmatmul.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-
-template <AccPhase Phase = AccPhase::Unspecified, typename TileRes, typename TileLeft, typename TileRight,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_ACC(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-```
-
-## Constraints
-
-- All constraints from `TMATMUL` apply to the `(cOutMatrix, aMatrix, bMatrix)` triple.
-- **Implementation notes (A2A3/A5)**:
-    - `TMATMUL_ACC_IMPL` uses `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, and `bMatrix.GetValidCol()` for `m/k/n`.
-    - `cInMatrix` is not validated by explicit assertions in the current implementations (target-defined behavior).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TMATMUL_ACC(c1, c0, a, b);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  C c0, c1;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c0, 0x3000);
-  TASSIGN(c1, 0x4000);
-  TMATMUL_ACC(c1, c0, a, b);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias.md
deleted file mode 100644
index e1739616..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias.md
+++ /dev/null
@@ -1,183 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias.md` -->
-
-# pto.tmatmul_bias
-
-Standalone reference page for `pto.tmatmul_bias`. This page belongs to the [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md) family in the PTO ISA manual.
-
-## Summary
-
-Matrix multiply with bias add.
-
-## Mechanism
-
-Matrix multiply with bias add. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let:
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-For `0 <= i < M` and `0 <= j < N`:
-
-$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} + \mathrm{Bias}_{0,j} $$
-
-Bias broadcasting behavior is implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%acc = tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename TileBias,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
-```
-
-## Inputs
-
-- `a` is the left operand tile (must be TileLeft location).
-- `b` is the right operand tile (must be TileRight location).
-- `bias` is the bias tile (must be TileType::Bias, single row).
-- `dst` names the destination accumulator tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the biased matrix multiply result: `dst[i,j]` = `bias[0,j]` + sum over `k` of `a[i,k] * b[k,j]`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- All constraints from `TMATMUL` apply to the `(cMatrix, aMatrix, bMatrix)` triple.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Bias constraints (A2A3)**:
-    - `TileBias::DType` must match `TileRes::DType`.
-    - `TileBias::Loc == TileType::Bias` and `TileBias::Rows == 1`.
-
-- **Bias constraints (A5)**:
-    - `TileBias::DType` must match `TileRes::DType`.
-    - `TileBias::Loc == TileType::Bias`, `TileBias::Rows == 1`, and `TileBias::isRowMajor`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, float, 1, 16>;  // bias dtype must match TileRes::DType
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TMATMUL_BIAS(c, a, b, bias);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, float, 1, 16>;  // bias dtype must match TileRes::DType
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(bias, 0x3000);
-  TASSIGN(c, 0x4000);
-  TMATMUL_BIAS(c, a, b, bias);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%acc = tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md)
-- Previous op in family: [pto.tmatmul_acc](./tmatmul-acc.md)
-- Next op in family: [pto.tgemv](./tgemv.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md
deleted file mode 100644
index 05b7924b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md
+++ /dev/null
@@ -1,128 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-bias_zh.md` -->
-
-# TMATMUL_BIAS
-
-## Instruction Diagram
-
-![TMATMUL_BIAS tile operation](../figures/isa/TMATMUL_BIAS.svg)
-
-## Summary
-
-Matrix multiply with bias add.
-
-## Mathematical Semantics
-
-Let:
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-For `0 <= i < M` and `0 <= j < N`:
-
-$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} + \mathrm{Bias}_{0,j} $$
-
-Bias broadcasting behavior is implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%acc = tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename TileBias,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
-```
-
-## Constraints
-
-- All constraints from `TMATMUL` apply to the `(cMatrix, aMatrix, bMatrix)` triple.
-- **Bias constraints (A2A3)**:
-    - `TileBias::DType` must match `TileRes::DType`.
-    - `TileBias::Loc == TileType::Bias` and `TileBias::Rows == 1`.
-- **Bias constraints (A5)**:
-    - `TileBias::DType` must match `TileRes::DType`.
-    - `TileBias::Loc == TileType::Bias`, `TileBias::Rows == 1`, and `TileBias::isRowMajor`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, float, 1, 16>;  // bias dtype must match TileRes::DType
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TMATMUL_BIAS(c, a, b, bias);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using Bias = Tile<TileType::Bias, float, 1, 16>;  // bias dtype must match TileRes::DType
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  Bias bias;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(bias, 0x3000);
-  TASSIGN(c, 0x4000);
-  TMATMUL_BIAS(c, a, b, bias);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx.md
deleted file mode 100644
index ab0ac7ac..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx.md
+++ /dev/null
@@ -1,238 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx.md` -->
-
-# pto.tmatmul_mx
-
-Standalone reference page for `pto.tmatmul_mx`. This page belongs to the [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md) family in the PTO ISA manual.
-
-## Summary
-
-Matrix multiply (GEMM) with additional scaling tiles for mixed-precision / quantized matmul on supported targets.
-
-## Mechanism
-
-Matrix multiply (GEMM) with additional scaling tiles for mixed-precision / quantized matmul on supported targets.
-
-This instruction is currently implemented on A5 (see `include/pto/npu/a5/TMatmul.hpp`). It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let:
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-Conceptually, the result corresponds to a matrix multiply over the effective matmul domain (`0 <= i < M`, `0 <= j < N`), with the scaling tiles `aScaleMatrix` / `bScaleMatrix` configuring implementation-defined mixed-precision behavior:
-
-$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
-
-The exact role of `aScaleMatrix` / `bScaleMatrix` (and any dequant/quant semantics) is target-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous forms (conceptual):
-
-```text
-%c = tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%c_out = tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%c = tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
--> !pto.tile<...>
-%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
-%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-outs(%c :  !pto.tile_buf<...>)
-pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
--> !pto.tile<...>
-%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
-%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-outs(%c :  !pto.tile_buf<...>)
-pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
-```
-
-## Inputs
-
-- `a` is the left operand tile (must be TileLeft location).
-- `aScale` is the left scaling tile for mixed-precision reconstruction.
-- `b` is the right operand tile (must be TileRight location).
-- `bScale` is the right scaling tile for mixed-precision reconstruction.
-- `bias` (optional): bias tile (must be TileType::Bias).
-- `cIn` (optional): input accumulator tile for accumulation variants.
-- `dst` names the destination accumulator tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the mx matrix multiply result with mixed-precision scaling applied.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- Source and destination shapes, layouts, and element types MUST satisfy the legality rules documented by the family and target profile.
-
-- Programs must not assume implicit broadcasting, reshaping, or valid-region repair unless the operation documents it.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A5)**:
-    - `m/k/n` are taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`.
-    - Static legality checks are enforced via `CheckMadMxValid<...>()` (types, shapes, fractals, and scaling tile legality).
-
-- **Bias form**:
-    - `TileBias::DType` must be `float` and `TileBias::Loc == TileType::Bias` with `TileBias::Rows == 1` (A5 checks via `static_assert`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<float8_e5m2_t, 16, 64>;
-  using B = TileRight<float8_e5m2_t, 64, 32>;
-  using ScaleA = TileLeftScale<float8_e8m0_t, 16, 2>;
-  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
-  using Bias = Tile<TileType::Bias, float, 1, 32>;
-  using C = TileAcc<float, 16, 32>;
-  A a;
-  B b;
-  ScaleA scaleA;
-  ScaleB scaleB;
-  Bias bias;
-  C c;
-  TMATMUL_MX(c, a, scaleA, b, scaleB, bias);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<float8_e5m2_t, 16, 64>;
-  using B = TileRight<float8_e5m2_t, 64, 32>;
-  using ScaleA = TileLeftScale<float8_e8m0_t, 16, 2>;
-  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
-  using Bias = Tile<TileType::Bias, float, 1, 32>;
-  using C = TileAcc<float, 16, 32>;
-  A a;
-  B b;
-  ScaleA scaleA;
-  ScaleB scaleB;
-  Bias bias;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(scaleA, GetScaleAddr(a.data()));
-  TASSIGN(scaleB, GetScaleAddr(b.data()));
-  TASSIGN(bias, 0x3000);
-  TASSIGN(c, 0x4000);
-  TMATMUL_MX(c, a, scaleA, b, scaleB, bias);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md)
-- Previous op in family: [pto.tgemv_mx](./tgemv-mx.md)
-- Next op in family: [pto.tmatmul](./tmatmul.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md
deleted file mode 100644
index f610936b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md
+++ /dev/null
@@ -1,175 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul-mx_zh.md` -->
-
-# TMATMUL_MX
-
-## Instruction Diagram
-
-![TMATMUL_MX tile operation](../figures/isa/TMATMUL_MX.svg)
-
-## Summary
-
-Matrix multiply (GEMM) with additional scaling tiles for mixed-precision / quantized matmul on supported targets.
-
-## Mathematical Semantics
-
-Let:
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-Conceptually, the result corresponds to a matrix multiply over the effective matmul domain (`0 <= i < M`, `0 <= j < N`), with the scaling tiles `aScaleMatrix` / `bScaleMatrix` configuring implementation-defined mixed-precision behavior:
-
-$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
-
-The exact role of `aScaleMatrix` / `bScaleMatrix` (and any dequant/quant semantics) is target-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous forms (conceptual):
-
-```text
-%c = tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%c_out = tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-%c = tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
--> !pto.tile<...>
-%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
-%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-outs(%c :  !pto.tile_buf<...>)
-pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
--> !pto.tile<...>
-%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
-%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>,
-!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
-outs(%c :  !pto.tile_buf<...>)
-pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
-pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
-!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
-
-template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
-          typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
-          typename TileRightScale, typename TileBias, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A5)**:
-    - `m/k/n` are taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`.
-    - Static legality checks are enforced via `CheckMadMxValid<...>()` (types, shapes, fractals, and scaling tile legality).
-- **Bias form**:
-    - `TileBias::DType` must be `float` and `TileBias::Loc == TileType::Bias` with `TileBias::Rows == 1` (A5 checks via `static_assert`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<float8_e5m2_t, 16, 64>;
-  using B = TileRight<float8_e5m2_t, 64, 32>;
-  using ScaleA = TileLeftScale<float8_e8m0_t, 16, 2>;
-  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
-  using Bias = Tile<TileType::Bias, float, 1, 32>;
-  using C = TileAcc<float, 16, 32>;
-  A a;
-  B b;
-  ScaleA scaleA;
-  ScaleB scaleB;
-  Bias bias;
-  C c;
-  TMATMUL_MX(c, a, scaleA, b, scaleB, bias);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<float8_e5m2_t, 16, 64>;
-  using B = TileRight<float8_e5m2_t, 64, 32>;
-  using ScaleA = TileLeftScale<float8_e8m0_t, 16, 2>;
-  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
-  using Bias = Tile<TileType::Bias, float, 1, 32>;
-  using C = TileAcc<float, 16, 32>;
-  A a;
-  B b;
-  ScaleA scaleA;
-  ScaleB scaleB;
-  Bias bias;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(scaleA, GetScaleAddr(a.data()));
-  TASSIGN(scaleB, GetScaleAddr(b.data()));
-  TASSIGN(bias, 0x3000);
-  TASSIGN(c, 0x4000);
-  TMATMUL_MX(c, a, scaleA, b, scaleB, bias);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul.md
deleted file mode 100644
index 6fa90a6d..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul.md
+++ /dev/null
@@ -1,191 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul.md` -->
-
-# pto.tmatmul
-
-Standalone reference page for `pto.tmatmul`. This page belongs to the [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md) family in the PTO ISA manual.
-
-## Summary
-
-Matrix multiply (GEMM) producing an accumulator/output tile.
-
-## Mechanism
-
-Matrix multiply (GEMM) producing an accumulator/output tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let:
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-For `0 <= i < M` and `0 <= j < N` (output elements in the effective matmul domain):
-
-$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
-
-Exact accumulator behavior and datatype promotion are target/implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%acc = tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-```
-
-## Inputs
-
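-- `a` is the left operand tile (`aMatrix` in the C++ intrinsic); it must be in the `Left` tile location.
-- `b` is the right operand tile (`bMatrix` in the C++ intrinsic); it must be in the `Right` tile location.
-- `dst` names the destination accumulator tile (`cMatrix` in the C++ intrinsic). The operation iterates over dst's valid region.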
-
-## Expected Outputs
-
-`dst` holds the matrix multiply result: `dst[i,j]` = sum over `k` of `a[i,k] * b[k,j]`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. The operation does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- Source and destination shapes, layouts, and element types MUST satisfy the legality rules documented by the family and target profile.
-
-- Programs must not assume implicit broadcasting, reshaping, or valid-region repair unless the operation documents it.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported `(CType, AType, BType)` triples:
-        - `(int32_t, int8_t, int8_t)`
-        - `(float, half, half)`
-        - `(float, float, float)`
-        - `(float, bfloat16_t, bfloat16_t)`
-    - Static shape constraints: `TileLeft::Rows == TileRes::Rows`, `TileLeft::Cols == TileRight::Rows`, `TileRight::Cols == TileRes::Cols`.
-    - Tile locations: `TileLeft::Loc == Left`, `TileRight::Loc == Right`, `TileRes::Loc == Acc`.
-    - Runtime: `m/k/n` (taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`) must be in `[1, 4095]`.
-
-- **Implementation checks (A5)**:
-    - Accumulator type must be `int32_t` or `float`.
-    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
-    - If `float`: supports `half/bfloat16_t/float` and selected fp8 pairs (target-defined).
-    - Static shape constraints: `TileLeft::Rows == TileRes::Rows`, `TileLeft::Cols == TileRight::Rows`, `TileRight::Cols == TileRes::Cols`.
-    - Fractal/layout constraints are enforced:
-        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
-        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
-        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
-    - Runtime: `m/k/n` (taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`) must be in `[1, 4095]`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  C c;
-  TMATMUL(c, a, b);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c, 0x3000);
-  TMATMUL(c, a, b);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%acc = tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Matrix And Matrix Vector](../../matrix-and-matrix-vector.md)
-- Previous op in family: [pto.tmatmul_mx](./tmatmul-mx.md)
-- Next op in family: [pto.tmatmul_acc](./tmatmul-acc.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul_zh.md
deleted file mode 100644
index 7240c511..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul_zh.md
+++ /dev/null
@@ -1,134 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/matrix-and-matrix-vector/tmatmul_zh.md` -->
-
-# TMATMUL
-
-## Instruction Diagram
-
-![TMATMUL tile operation](../figures/isa/TMATMUL.svg)
-
-## Summary
-
-Matrix multiply (GEMM) producing an accumulator/output tile.
-
-## Math Semantics
-
-Let:
-
-- `M = aMatrix.GetValidRow()`
-- `K = aMatrix.GetValidCol()`
-- `N = bMatrix.GetValidCol()`
-
-For `0 <= i < M` and `0 <= j < N` (output elements in the effective matmul domain):
-
-$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
-
-Exact accumulator behavior and datatype promotion are target/implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%acc = tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-
-template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
-PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - Supported `(CType, AType, BType)` triples:
-        - `(int32_t, int8_t, int8_t)`
-        - `(float, half, half)`
-        - `(float, float, float)`
-        - `(float, bfloat16_t, bfloat16_t)`
-    - Static shape constraints: `TileLeft::Rows == TileRes::Rows`, `TileLeft::Cols == TileRight::Rows`, `TileRight::Cols == TileRes::Cols`.
-    - Tile locations: `TileLeft::Loc == Left`, `TileRight::Loc == Right`, `TileRes::Loc == Acc`.
-    - Runtime: `m/k/n` (taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`) must be in `[1, 4095]`.
-- **Implementation checks (A5)**:
-    - Accumulator type must be `int32_t` or `float`.
-    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
-    - If `float`: supports `half/bfloat16_t/float` and selected fp8 pairs (target-defined).
-    - Static shape constraints: `TileLeft::Rows == TileRes::Rows`, `TileLeft::Cols == TileRight::Rows`, `TileRight::Cols == TileRes::Cols`.
-    - Fractal/layout constraints are enforced:
-        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
-        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
-        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
-    - Runtime: `m/k/n` (taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`) must be in `[1, 4095]`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  C c;
-  TMATMUL(c, a, b);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using A = TileLeft<half, 16, 16>;
-  using B = TileRight<half, 16, 16>;
-  using C = TileAcc<float, 16, 16>;
-  A a;
-  B b;
-  C c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c, 0x3000);
-  TMATMUL(c, a, b);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mgather.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mgather.md
deleted file mode 100644
index 3f0c9867..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mgather.md
+++ /dev/null
@@ -1,133 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/mgather.md` -->
-
-# pto.mgather
-
-Standalone reference page for `pto.mgather`. This page belongs to the [Memory And Data Movement](../../memory-and-data-movement.md) family in the PTO ISA manual.
-
-## Summary
-
-Gather-load elements from global memory into a tile using per-element indices.
-
-## Mechanism
-
-Gather-load elements from global memory into a tile using per-element indices. It is part of the tile memory/data-movement surface, so the visible behavior includes explicit transfer between GM-visible data and tile-visible state.
-
-For each element `(i, j)` in the destination valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{mem}[\mathrm{idx}_{i,j}] $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>)
--> !pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDst, typename GlobalData, typename TileInd, typename... WaitEvents>
-PTO_INST RecordEvent MGATHER(TileDst &dst, GlobalData &src, TileInd &indexes, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source GlobalTensor.
-- `indexes` is an index tile providing per-element indices into `src`.
-- `dst` names the destination tile. The operation uses dst's valid region for the transfer shape.
-
-## Expected Outputs
-
-`dst` contains gathered elements from `src` at positions specified by `indexes`.
-
-## Side Effects
-
-This operation reads from global memory. Index bounds are target-defined.
-
-## Constraints
-
-- **Supported data types**:
-    - `dst`/`src` element type must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
-    - On AICore targets, `float8_e4m3_t` and `float8_e5m2_t` are also supported.
-    - `indexes` element type must be `int32_t` or `uint32_t`.
-
-- **Tile and memory types**:
-    - `dst` must be a vector tile (`TileType::Vec`).
-    - `indexes` must be a vector tile (`TileType::Vec`).
-    - `dst` and `indexes` must use row-major layout.
-    - `src` must be a `GlobalTensor` in GM memory.
-    - `src` must use `ND` layout.
-
-- **Shape constraints**:
-    - `dst.Rows == indexes.Rows`.
-    - `indexes` must be shaped as `[N, 1]` for row-indexed gather or `[N, M]` for element-indexed gather.
-    - `dst` row width must be 32-byte aligned, that is, `dst.Cols * sizeof(DType)` must be a multiple of 32.
-    - `src` static shape must satisfy `Shape<1, 1, 1, TableRows, RowWidth>`.
-
-- **Index interpretation**:
-    - Index interpretation is target-defined. The CPU simulator treats indices as linear element indices into `src.data()`.
-    - The CPU simulator does not enforce bounds checks on `indexes`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.mgather` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Memory And Data Movement](../../memory-and-data-movement.md)
-- Previous op in family: [pto.tstore_fp](./tstore-fp.md)
-- Next op in family: [pto.mscatter](./mscatter.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mgather_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mgather_zh.md
deleted file mode 100644
index cbc5db6f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mgather_zh.md
+++ /dev/null
@@ -1,101 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/mgather_zh.md` -->
-
-# MGATHER
-
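-## Instruction Diagram
-
-![MGATHER tile operation](../figures/isa/MGATHER.svg)
-
-## Summary
-
-Gather-load elements from global memory into a tile using per-element indices.
-
-## Math Semantics
-
-For each element `(i, j)` in the destination valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{mem}[\mathrm{idx}_{i,j}] $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form: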
-
-```text
-%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>)
--> !pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDst, typename GlobalData, typename TileInd, typename... WaitEvents>
-PTO_INST RecordEvent MGATHER(TileDst &dst, GlobalData &src, TileInd &indexes, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Supported data types**:
-    - `dst`/`src` element type must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
-    - On AICore targets, `float8_e4m3_t` and `float8_e5m2_t` are also supported.
-    - `indexes` element type must be `int32_t` or `uint32_t`.
-- **Tile and memory types**:
-    - `dst` must be a vector tile (`TileType::Vec`).
-    - `indexes` must be a vector tile (`TileType::Vec`).
-    - `dst` and `indexes` must use row-major layout.
-    - `src` must be a `GlobalTensor` in GM memory.
-    - `src` must use `ND` layout.
-- **Shape constraints**:
-    - `dst.Rows == indexes.Rows`.
-    - `indexes` must be shaped as `[N, 1]` for row-indexed gather or `[N, M]` for element-indexed gather.
-    - `dst` row width must be 32-byte aligned, that is, `dst.Cols * sizeof(DType)` must be a multiple of 32.
-    - `src` static shape must satisfy `Shape<1, 1, 1, TableRows, RowWidth>`.
-- **Index interpretation**:
-    - Index interpretation is target-defined. The CPU simulator treats indices as linear element indices into `src.data()`.
-    - The CPU simulator does not enforce bounds checks on `indexes`.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mscatter.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mscatter.md
deleted file mode 100644
index 1f6d388e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mscatter.md
+++ /dev/null
@@ -1,138 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/mscatter.md` -->
-
-# pto.mscatter
-
-Standalone reference page for `pto.mscatter`. This page belongs to the [Memory And Data Movement](../../memory-and-data-movement.md) family in the PTO ISA manual.
-
-## Summary
-
-Scatter-store elements from a tile into global memory using per-element indices.
-
-## Mechanism
-
-Scatter-store elements from a tile into global memory using per-element indices. It is part of the tile memory/data-movement surface, so the visible behavior includes explicit transfer between GM-visible data and tile-visible state.
-
-For each element `(i, j)` in the source valid region:
-
-$$ \mathrm{mem}[\mathrm{idx}_{i,j}] = \mathrm{src}_{i,j} $$
-
-If multiple elements map to the same destination location, the final value is implementation-defined (CPU simulator: last writer wins in row-major iteration order).
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-mscatter %src, %mem, %idx : !pto.tile<...>, !pto.memref<...>, !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename GlobalData, typename TileSrc, typename TileInd, typename... WaitEvents>
-PTO_INST RecordEvent MSCATTER(GlobalData &dst, TileSrc &src, TileInd &indexes, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `indexes` is an index tile providing per-element indices into `dst`.
-- `dst` is the destination GlobalTensor.
-
-## Expected Outputs
-
-Elements from `src` are scattered to positions in `dst` specified by `indexes`.
-
-## Side Effects
-
-This operation writes to global memory. Concurrent writes to the same location produce implementation-defined results.
-
-## Constraints
-
-- **Supported data types**:
-    - `src`/`dst` element type must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
-    - On AICore targets, `float8_e4m3_t` and `float8_e5m2_t` are also supported.
-    - `indexes` element type must be `int32_t` or `uint32_t`.
-
-- **Tile and memory types**:
-    - `src` must be a vector tile (`TileType::Vec`).
-    - `indexes` must be a vector tile (`TileType::Vec`).
-    - `src` and `indexes` must use row-major layout.
-    - `dst` must be a `GlobalTensor` in GM memory.
-    - `dst` must use `ND` layout.
-
-- **Atomic operation constraints**:
-    - Non-atomic scatter is supported for all supported element types.
-    - `Add` atomic mode requires `int32_t`, `uint32_t`, `float`, or `half`.
-    - `Max`/`Min` atomic mode requires `int32_t` or `float`.
-
-- **Shape constraints**:
-    - `src.Rows == indexes.Rows`.
-    - `indexes` must be shaped as `[N, 1]` for row-indexed scatter or `[N, M]` for element-indexed scatter.
-    - `src` row width must be 32-byte aligned, that is, `src.Cols * sizeof(DType)` must be a multiple of 32.
-    - `dst` static shape must satisfy `Shape<1, 1, 1, TableRows, RowWidth>`.
-
-- **Index interpretation**:
-    - Index interpretation is target-defined. The CPU simulator treats indices as linear element indices into `dst.data()`.
-    - The CPU simulator does not enforce bounds checks on `indexes`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.mscatter` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### PTO Assembly Form
-
-```text
-mscatter %src, %mem, %idx : !pto.tile<...>, !pto.memref<...>, !pto.tile<...>
-# AS Level 2 (DPS)
-pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Memory And Data Movement](../../memory-and-data-movement.md)
-- Previous op in family: [pto.mgather](./mgather.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mscatter_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mscatter_zh.md
deleted file mode 100644
index e47e9331..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/mscatter_zh.md
+++ /dev/null
@@ -1,106 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/mscatter_zh.md` -->
-
-# MSCATTER
-
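-## Instruction Diagram
-
-![MSCATTER tile operation](../figures/isa/MSCATTER.svg)
-
-## Summary
-
-Scatter-store elements from a tile into global memory using per-element indices.
-
-## Math Semantics
-
-For each element `(i, j)` in the source valid region:
-
-$$ \mathrm{mem}[\mathrm{idx}_{i,j}] = \mathrm{src}_{i,j} $$
-
-If multiple elements map to the same destination location, the final value is implementation-defined (CPU simulator: last writer wins in row-major iteration order).
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form: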
-
-```text
-mscatter %src, %mem, %idx : !pto.tile<...>, !pto.memref<...>, !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename GlobalData, typename TileSrc, typename TileInd, typename... WaitEvents>
-PTO_INST RecordEvent MSCATTER(GlobalData &dst, TileSrc &src, TileInd &indexes, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Supported data types**:
-    - `src`/`dst` element type must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
-    - On AICore targets, `float8_e4m3_t` and `float8_e5m2_t` are also supported.
-    - `indexes` element type must be `int32_t` or `uint32_t`.
-- **Tile and memory types**:
-    - `src` must be a vector tile (`TileType::Vec`).
-    - `indexes` must be a vector tile (`TileType::Vec`).
-    - `src` and `indexes` must use row-major layout.
-    - `dst` must be a `GlobalTensor` in GM memory.
-    - `dst` must use `ND` layout.
-- **Atomic operation constraints**:
-    - Non-atomic scatter is supported for all supported element types.
-    - `Add` atomic mode requires `int32_t`, `uint32_t`, `float`, or `half`.
-    - `Max`/`Min` atomic mode requires `int32_t` or `float`.
-- **Shape constraints**:
-    - `src.Rows == indexes.Rows`.
-    - `indexes` must be shaped as `[N, 1]` for row-indexed scatter or `[N, M]` for element-indexed scatter.
-    - `src` row width must be 32-byte aligned, that is, `src.Cols * sizeof(DType)` must be a multiple of 32.
-    - `dst` static shape must satisfy `Shape<1, 1, 1, TableRows, RowWidth>`.
-- **Index interpretation**:
-    - Index interpretation is target-defined. The CPU simulator treats indices as linear element indices into `dst.data()`.
-    - The CPU simulator does not enforce bounds checks on `indexes`.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### PTO Assembly Form
-
-```text
-mscatter %src, %mem, %idx : !pto.tile<...>, !pto.memref<...>, !pto.tile<...>
-# AS Level 2 (DPS)
-pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tload.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tload.md
deleted file mode 100644
index ef6594d7..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tload.md
+++ /dev/null
@@ -1,185 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/tload.md` -->
-
-# pto.tload
-
-Standalone reference page for `pto.tload`. This page belongs to the [Memory And Data Movement](../../memory-and-data-movement.md) family in the PTO ISA manual.
-
-## Summary
-
-Load data from a GlobalTensor (GM) into a Tile.
-
-## Mechanism
-
-Load data from a GlobalTensor (GM) into a Tile. It is part of the tile memory/data-movement surface, so the visible behavior includes explicit transfer between GM-visible data and tile-visible state.
-
-Notation depends on the `GlobalTensor` shape/stride and the `Tile` layout. Conceptually (2D view, with a base offset):
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{r_0 + i,\; c_0 + j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%t0 = tload %sv[%c0, %c0] : (!pto.memref<...>, index, index) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
-!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
-!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename GlobalData, typename... WaitEvents>
-PTO_INST RecordEvent TLOAD(TileData &dst, GlobalData &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source GlobalTensor to load from.
-- `dst` names the destination tile. The operation uses dst's valid region for the transfer shape.
-
-## Expected Outputs
-
-`dst` contains the loaded data from `src`, with element layout determined by the tile layout and global tensor stride.
-
-## Side Effects
-
-This operation reads from global memory and writes to tile-local storage. It does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The implementation uses `dst.GetValidRow()` / `dst.GetValidCol()` as the transfer size.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`.
-    - Destination tile location must be `TileType::Vec` or `TileType::Mat`.
-    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
-    - Runtime: all `src.GetShape(dim)` values and `dst.GetValidRow()/GetValidCol()` must be `> 0`.
-    - `TileType::Vec` loads only support matching layouts: ND->ND, DN->DN, NZ->NZ.
-    - `TileType::Mat` loads support: ND->ND, DN->DN, NZ->NZ, plus ND->NZ and DN->ZN.
-    - For ND->NZ or DN->ZN: `GlobalData::staticShape[0..2] == 1` and `TileData::SFractalSize == 512`.
-    - For `int64_t/uint64_t`, only ND->ND or DN->DN are supported.
-
-- **Implementation checks (A5)**:
-    - `sizeof(TileData::DType)` must be `1`, `2`, `4`, or `8` bytes, and must match `sizeof(GlobalData::DType)`.
-    - For `int64_t/uint64_t`, `TileData::PadVal` must be `PadValue::Null` or `PadValue::Zero`.
-    - `TileType::Vec` loads require one of the following layout pairs:
-        - ND with row-major + `SLayout::NoneBox` (ND->ND),
-        - DN with col-major + `SLayout::NoneBox` (DN->DN),
-        - NZ with `SLayout::RowMajor` (NZ->NZ).
-    - For row-major ND->ND with compile-time-known shapes, `TileData::ValidCol` must equal `GlobalData::staticShape[4]`, and `TileData::ValidRow` must equal the product of `GlobalData::staticShape[0..3]`.
-    - `TileType::Mat` loads are additionally constrained by `TLoadCubeCheck` (e.g., only specific ND/DN/NZ conversions and L1-size limits).
-    - `TileType::Mat` loads also handle MX-format loads: `MX_A_ZZ/MX_A_ND/MX_A_DN` to ZZ for scaleA and `MX_B_NN/MX_B_ND/MX_B_DN` to NN for scaleB.
-        - For `MX_A_ZZ/MX_B_NN`: `GlobalData::staticShape[3] == 16` and `GlobalData::staticShape[4] == 2`.
-        - For `MX_A_ND/MX_A_DN/MX_B_ND/MX_B_DN`: `GlobalData::staticShape[0] == 1`, `GlobalData::staticShape[1] == 1`, and `GlobalData::staticShape[4] == 2`.
-        - For scaleA, `dst.GetValidCol() % 2 == 0`.
-        - For scaleB, `dst.GetValidRow() % 2 == 0`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_auto(__gm__ T* in) {
-  using TileT = Tile<TileType::Vec, T, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-  GTensor gin(in);
-  TileT t;
-  TLOAD(t, gin);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_manual(__gm__ T* in) {
-  using TileT = Tile<TileType::Vec, T, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-  GTensor gin(in);
-  TileT t;
-  TASSIGN(t, 0x1000);
-  TLOAD(t, gin);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
-```
-
-### PTO Assembly Form
-
-```text
-%t0 = tload %sv[%c0, %c0] : (!pto.memref<...>, index, index) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Memory And Data Movement](../../memory-and-data-movement.md)
-- Next op in family: [pto.tprefetch](./tprefetch.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tload_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tload_zh.md
deleted file mode 100644
index 18d79090..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tload_zh.md
+++ /dev/null
@@ -1,134 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/tload_zh.md` -->
-
-# TLOAD
-
-## Instruction Diagram
-
-![TLOAD tile operation](../figures/isa/TLOAD.svg)
-
-## Summary
-
-Load data from a GlobalTensor (GM) into a Tile.
-
-## Mathematical Semantics
-
-Notation depends on the `GlobalTensor` shape/stride and the `Tile` layout. Conceptually (2D view, with a base offset):
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{r_0 + i,\; c_0 + j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%t0 = tload %sv[%c0, %c0] : (!pto.memref<...>, index, index) -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
-!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
-!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename GlobalData, typename... WaitEvents>
-PTO_INST RecordEvent TLOAD(TileData &dst, GlobalData &src, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`.
-    - Destination tile location must be `TileType::Vec` or `TileType::Mat`.
-    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
-    - Runtime: all `src.GetShape(dim)` values and `dst.GetValidRow()/GetValidCol()` must be `> 0`.
-    - `TileType::Vec` loads only support matching layouts: ND->ND, DN->DN, NZ->NZ.
-    - `TileType::Mat` loads support: ND->ND, DN->DN, NZ->NZ, plus ND->NZ and DN->ZN.
-    - For ND->NZ or DN->ZN: `GlobalData::staticShape[0..2] == 1` and `TileData::SFractalSize == 512`.
-    - For `int64_t/uint64_t`, only ND->ND or DN->DN are supported.
-- **Implementation checks (A5)**:
-    - `sizeof(TileData::DType)` must be `1`, `2`, `4`, or `8` bytes, and must match `sizeof(GlobalData::DType)`.
-    - For `int64_t/uint64_t`, `TileData::PadVal` must be `PadValue::Null` or `PadValue::Zero`.
-    - `TileType::Vec` loads require one of the following layout pairs:
-        - ND with row-major + `SLayout::NoneBox` (ND->ND),
-        - DN with col-major + `SLayout::NoneBox` (DN->DN),
-        - NZ with `SLayout::RowMajor` (NZ->NZ).
-    - For row-major ND->ND with compile-time-known shapes, `TileData::ValidCol` must equal `GlobalData::staticShape[4]`, and `TileData::ValidRow` must equal the product of `GlobalData::staticShape[0..3]`.
-    - `TileType::Mat` loads are additionally constrained by `TLoadCubeCheck` (e.g., only specific ND/DN/NZ conversions and L1-size limits).
-    - `TileType::Mat` loads also handle MX-format loads: `MX_A_ZZ/MX_A_ND/MX_A_DN` to ZZ for scaleA and `MX_B_NN/MX_B_ND/MX_B_DN` to NN for scaleB.
-        - For `MX_A_ZZ/MX_B_NN`: `GlobalData::staticShape[3] == 16` and `GlobalData::staticShape[4] == 2`.
-        - For `MX_A_ND/MX_A_DN/MX_B_ND/MX_B_DN`: `GlobalData::staticShape[0] == 1`, `GlobalData::staticShape[1] == 1`, and `GlobalData::staticShape[4] == 2`.
-        - For scaleA, `dst.GetValidCol() % 2 == 0`.
-        - For scaleB, `dst.GetValidRow() % 2 == 0`.
-
-- **Valid region**:
-    - The implementation uses `dst.GetValidRow()` / `dst.GetValidCol()` as the transfer size.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_auto(__gm__ T* in) {
-  using TileT = Tile<TileType::Vec, T, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-  GTensor gin(in);
-  TileT t;
-  TLOAD(t, gin);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_manual(__gm__ T* in) {
-  using TileT = Tile<TileType::Vec, T, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-  GTensor gin(in);
-  TileT t;
-  TASSIGN(t, 0x1000);
-  TLOAD(t, gin);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tprefetch.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tprefetch.md
deleted file mode 100644
index 46c2cf37..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tprefetch.md
+++ /dev/null
@@ -1,125 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/tprefetch.md` -->
-
-# pto.tprefetch
-
-Standalone reference page for `pto.tprefetch`. This page belongs to the [Memory And Data Movement](../../memory-and-data-movement.md) family in the PTO ISA manual.
-
-## Summary
-
-Prefetch data from global memory into a tile-local cache/buffer (hint).
-
-## Mechanism
-
-Prefetch data from global memory into a tile-local cache/buffer (implementation-defined). This is typically used to reduce latency before a subsequent `TLOAD`.
-
-Note: unlike most PTO instructions, `TPREFETCH` does **not** implicitly call `TSYNC(events...)` in the C++ wrapper. It is part of the tile memory/data-movement surface, so the visible behavior includes explicit transfer between GM-visible data and tile-visible state.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tprefetch %src : !pto.global<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename GlobalData>
-PTO_INST RecordEvent TPREFETCH(TileData &dst, GlobalData &src);
-```
-
-## Inputs
-
-- `src` is the source GlobalTensor to prefetch.
-- `dst` names the destination tile (cache buffer).
-
-## Expected Outputs
-
-`dst` may hold the prefetched data from `src`. Because this operation is a hint, whether and when the data is actually fetched is implementation-defined.
-
-## Side Effects
-
-This operation may read from global memory. Prefetch hints may be ignored by some targets.
-
-## Constraints
-
-- Semantics and caching behavior are target/implementation-defined.
-
-- Some targets may ignore prefetches or treat them as hints.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tprefetch` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tprefetch %src : !pto.global<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Memory And Data Movement](../../memory-and-data-movement.md)
-- Previous op in family: [pto.tload](./tload.md)
-- Next op in family: [pto.tstore](./tstore.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tprefetch_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tprefetch_zh.md
deleted file mode 100644
index 7d43bf28..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tprefetch_zh.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/tprefetch_zh.md` -->
-
-# TPREFETCH
-
-## Instruction Diagram
-
-![TPREFETCH tile operation](../figures/isa/TPREFETCH.svg)
-
-## Summary
-
-Prefetch data from global memory into a tile-local cache/buffer (hint).
-
-## Mathematical Semantics
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tprefetch %src : !pto.global<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename GlobalData>
-PTO_INST RecordEvent TPREFETCH(TileData &dst, GlobalData &src);
-```
-
-## Constraints
-
-- Semantics and caching behavior are target/implementation-defined.
-- Some targets may ignore prefetches or treat them as hints.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore-fp.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore-fp.md
deleted file mode 100644
index 957f71df..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore-fp.md
+++ /dev/null
@@ -1,180 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/tstore-fp.md` -->
-
-# pto.tstore_fp
-
-Standalone reference page for `pto.tstore_fp`. This page belongs to the [Memory And Data Movement](../../memory-and-data-movement.md) family in the PTO ISA manual.
-
-## Summary
-
-Store an accumulator tile into global memory using a scaling (`fp`) tile for vector quantization parameters.
-
-## Mechanism
-
-Store an accumulator tile into global memory using a scaling (`fp`) tile for vector quantization parameters.
-
-`TSTORE_FP` is the fp-quantization overload of [`pto.tstore`](./tstore.md). It is part of the tile memory/data-movement surface, so the visible behavior includes explicit transfer between GM-visible data and tile-visible state.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. Conceptually (2D view, with a base offset), for `0 <= i < R` and `0 <= j < C`:
-
-$$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-tstore.fp %src, %fp, %sv_out[%c0, %c0]
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
-
-```cpp
-template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
-          ReluPreMode reluPreMode = ReluPreMode::NoRelu, typename... WaitEvents>
-PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source accumulator tile.
-- `fp` is the scaling tile containing vector quantization parameters.
-- `dst` is the destination GlobalTensor.
-- `reluPreMode` (optional): specifies ReLU pre-processing mode.
-
-## Expected Outputs
-
-`src` is converted using `fp` scaling parameters and written to `dst`.
-
-## Side Effects
-
-This operation writes to global memory with quantization conversion.
-
-## Constraints
-
-- Addressing, layout, and transfer shape MUST satisfy the legality rules of the selected target profile.
-
-- Programs must not assume hidden ordering stronger than the documented PTO synchronization rules.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - The fp store path is implemented via `TSTORE_IMPL(dst, src, fp)` and uses the same accumulator-to-GM legality checks as quantized accumulator stores:
-        - Destination layout must be ND or NZ.
-        - Source dtype must be `int32_t` or `float`.
-        - Static shape constraints: `1 <= TileData::Cols <= 4095`; if ND then `1 <= TileData::Rows <= 8192`; if NZ then `1 <= TileData::Rows <= 65535` and `TileData::Cols % 16 == 0`.
-        - Runtime: `1 <= src.GetValidCol() <= 4095`.
-    - No explicit `static_assert` is enforced on `FpTileData` (the implementation uses `fp` to set FPC state).
-
-- **Implementation checks (A5)**:
-    - Implemented via `TSTORE_IMPL(dst, src, fp)` and validated by `CheckStaticAcc<..., true>()` for the accumulator path (ND/NZ only, `int32_t/float` source dtype, rows/cols ranges).
-    - No explicit `static_assert` is enforced on `FpTileData` (the implementation uses `fp` to set FPC state).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto(__gm__ int8_t* out) {
-  using AccT = TileAcc<float, 16, 16>;
-  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, DYNAMIC, SLayout::NoneBox>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<int8_t, 16, 16, Layout::ND>;
-  using GT = GlobalTensor<int8_t, GShape, GStride, Layout::ND>;
-
-  GT gout(out);
-  AccT acc;
-  FpT fp(16);
-  TSTORE_FP(gout, acc, fp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual(__gm__ int8_t* out) {
-  using AccT = TileAcc<float, 16, 16>;
-  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, DYNAMIC, SLayout::NoneBox>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<int8_t, 16, 16, Layout::ND>;
-  using GT = GlobalTensor<int8_t, GShape, GStride, Layout::ND>;
-
-  GT gout(out);
-  AccT acc;
-  FpT fp(16);
-  TASSIGN(acc, 0x1000);
-  TASSIGN(fp,  0x2000);
-  TSTORE_FP(gout, acc, fp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### PTO Assembly Form
-
-```text
-tstore.fp %src, %fp, %sv_out[%c0, %c0]
-# AS Level 2 (DPS)
-pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Memory And Data Movement](../../memory-and-data-movement.md)
-- Previous op in family: [pto.tstore](./tstore.md)
-- Next op in family: [pto.mgather](./mgather.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore-fp_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore-fp_zh.md
deleted file mode 100644
index 57eab884..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore-fp_zh.md
+++ /dev/null
@@ -1,120 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/tstore-fp_zh.md` -->
-
-# TSTORE_FP
-
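-## Instruction Diagram
-
-![TSTORE_FP tile operation](../figures/isa/TSTORE_FP.svg)
-
-## Summary
-
-Store an accumulator tile into global memory using a scaling (`fp`) tile for vector quantization parameters.
-
-## Mathematical Semantics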
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. Conceptually (2D view, with a base offset), for `0 <= i < R` and `0 <= j < C`:
-
-$$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
-
-## Assembly Syntax
-
-PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-tstore.fp %src, %fp, %sv_out[%c0, %c0]
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-### AS Level 1（SSA）
-
-```text
-pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
-
-```cpp
-template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
-          ReluPreMode reluPreMode = ReluPreMode::NoRelu, typename... WaitEvents>
-PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - The fp store path is implemented via `TSTORE_IMPL(dst, src, fp)` and uses the same accumulator-to-GM legality checks as quantized accumulator stores:
-        - Destination layout must be ND or NZ.
-        - Source dtype must be `int32_t` or `float`.
-        - Static shape constraints: `1 <= TileData::Cols <= 4095`; if ND then `1 <= TileData::Rows <= 8192`; if NZ then `1 <= TileData::Rows <= 65535` and `TileData::Cols % 16 == 0`.
-        - Runtime: `1 <= src.GetValidCol() <= 4095`.
-    - No explicit `static_assert` is enforced on `FpTileData` (the implementation uses `fp` to set FPC state).
-- **Implementation checks (A5)**:
-    - Implemented via `TSTORE_IMPL(dst, src, fp)` and validated by `CheckStaticAcc<..., true>()` for the accumulator path (ND/NZ only, `int32_t/float` source dtype, rows/cols ranges).
-    - No explicit `static_assert` is enforced on `FpTileData` (the implementation uses `fp` to set FPC state).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto(__gm__ int8_t* out) {
-  using AccT = TileAcc<float, 16, 16>;
-  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, DYNAMIC, SLayout::NoneBox>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<int8_t, 16, 16, Layout::ND>;
-  using GT = GlobalTensor<int8_t, GShape, GStride, Layout::ND>;
-
-  GT gout(out);
-  AccT acc;
-  FpT fp(16);
-  TSTORE_FP(gout, acc, fp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual(__gm__ int8_t* out) {
-  using AccT = TileAcc<float, 16, 16>;
-  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, DYNAMIC, SLayout::NoneBox>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<int8_t, 16, 16, Layout::ND>;
-  using GT = GlobalTensor<int8_t, GShape, GStride, Layout::ND>;
-
-  GT gout(out);
-  AccT acc;
-  FpT fp(16);
-  TASSIGN(acc, 0x1000);
-  TASSIGN(fp,  0x2000);
-  TSTORE_FP(gout, acc, fp);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore.md
deleted file mode 100644
index ec9eacb6..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore.md
+++ /dev/null
@@ -1,189 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/tstore.md` -->
-
-# pto.tstore
-
-Standalone reference page for `pto.tstore`. This page belongs to the [Memory And Data Movement](../../memory-and-data-movement.md) family in the PTO ISA manual.
-
-## Summary
-
-Store data from a Tile into a GlobalTensor (GM), optionally using atomic write or quantization parameters.
-
-## Mechanism
-
-Store data from a Tile into a GlobalTensor (GM), optionally using atomic write or quantization parameters. It is part of the tile memory/data-movement surface, so the visible behavior includes explicit transfer between GM-visible data and tile-visible state.
-
-Notation depends on the `GlobalTensor` shape/stride and the `Tile` layout. Conceptually (2D view, with a base offset):
-
-$$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-tstore %t1, %sv_out[%c0, %c0]
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
-
-```cpp
-template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
-          typename... WaitEvents>
-PTO_INST RecordEvent TSTORE(GlobalData &dst, TileData &src, WaitEvents &... events);
-
-template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
-          typename... WaitEvents>
-PTO_INST RecordEvent TSTORE(GlobalData &dst, TileData &src, uint64_t preQuantScalar, WaitEvents &... events);
-
-template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
-          typename... WaitEvents>
-PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, WaitEvents &... events);
-```
-
-The `preQuantScalar` and `TSTORE_FP` quantized-store overloads are only legal for `TileType::Acc` on current A2/A3 and A5 backends. They do not provide a native vec-tile quantized store contract.
-
-## Inputs
-
-- `src` is the source tile to store.
-- `dst` is the destination GlobalTensor.
-- `atomicType` (optional): specifies atomic store mode (e.g., `AtomicAdd`).
-- `preQuantScalar` (optional): scalar for pre-quantization.
-- `fp` (optional, for TSTORE_FP): scaling tile for vector quantization parameters.
-
-## Expected Outputs
-
-Data is written from `src` to `dst`. With an atomic mode such as `AtomicAdd`, the stored values are accumulated into the existing contents of `dst`. With quantization, data is converted using the quantization parameters before it is written.
-
-## Side Effects
-
-This operation writes to global memory. With atomic modes, concurrent access may produce implementation-defined results.
-
-## Constraints
-
-- **Valid region**:
-  - The implementation uses `src.GetValidRow()` / `src.GetValidCol()` as the transfer size.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-  - Source tile location must be one of: `TileType::Vec`, `TileType::Mat`, `TileType::Acc`.
-  - Runtime: all `dst.GetShape(dim)` values and `src.GetValidRow()/GetValidCol()` must be `> 0`.
-  - For `TileType::Vec` / `TileType::Mat`:
-    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`.
-    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
-    - Layouts must match ND/DN/NZ (or a special case where `TileData::Rows == 1` or `TileData::Cols == 1`).
-    - For `int64_t/uint64_t`, only ND->ND or DN->DN are supported.
-    - A2/A3 does not expose a native vec quantized-store path. Frontends that need `vec -> GM` dtype conversion or quantization MUST first materialize the converted vec tile (for example via `TCVT`) and then issue a same-dtype `TSTORE`.
-  - For `TileType::Acc` (including quantized/atomic variants):
-    - Destination layout must be ND or NZ.
-    - Source dtype must be `int32_t` or `float`.
-    - When not using quantization, destination dtype must be `__gm__ int32_t/float/half/bfloat16_t`.
-    - Static shape constraints: `1 <= TileData::Cols <= 4095`; if ND then `1 <= TileData::Rows <= 8192`; if NZ then `1 <= TileData::Rows <= 65535` and `TileData::Cols % 16 == 0`.
-    - Runtime: `1 <= src.GetValidCol() <= 4095`.
-
-- **Implementation checks (A5)**:
-  - Source tile location must be `TileType::Vec` or `TileType::Acc` (no `Mat` store on this target).
-  - For `TileType::Vec`:
-    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
-    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`, `float8_e4m3_t`, `float8_e5m2_t`, `hifloat8_t`, `float4_e1m2x2_t`, `float4_e2m1x2_t`.
-    - Layouts must match ND/DN/NZ (or a special case where `TileData::Rows == 1` or `TileData::Cols == 1`).
-    - Additional alignment constraints are enforced (e.g., for ND the row-major width in bytes must be a multiple of 32; for DN the column-major height in bytes must be a multiple of 32, with special-case exceptions).
-  - For `TileType::Acc`:
-    - Destination layout must be ND or NZ; source dtype must be `int32_t` or `float`.
-    - When not using quantization, destination dtype must be `__gm__ int32_t/float/half/bfloat16_t`.
-    - Static shape constraints on rows/cols match those for A2/A3; `AtomicAdd` additionally restricts the destination dtype to supported atomic types.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_auto(__gm__ T* out) {
-  using TileT = Tile<TileType::Vec, T, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-  GTensor gout(out);
-  TileT t;
-  TSTORE(gout, t);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_manual(__gm__ T* out) {
-  using TileT = Tile<TileType::Vec, T, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-  GTensor gout(out);
-  TileT t;
-  TASSIGN(t, 0x1000);
-  TSTORE<TileT, GTensor, AtomicType::AtomicAdd>(gout, t);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### PTO Assembly Form
-
-```text
-tstore %t1, %sv_out[%c0, %c0]
-# IR Level 2 (DPS)
-pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Memory And Data Movement](../../memory-and-data-movement.md)
-- Previous op in family: [pto.tprefetch](./tprefetch.md)
-- Next op in family: [pto.tstore_fp](./tstore-fp.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore_zh.md
deleted file mode 100644
index cf923776..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/memory-and-data-movement/tstore_zh.md
+++ /dev/null
@@ -1,133 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/memory-and-data-movement/tstore_zh.md` -->
-
-# TSTORE
-
-## Instruction Diagram
-
-![TSTORE tile operation](../figures/isa/TSTORE.svg)
-
-## Summary
-
-Store data from a Tile to a GlobalTensor (GM), optionally using an atomic write or quantization parameters.
-
-## Math Semantics
-
-Notation depends on the `GlobalTensor` shape/stride and the `Tile` layout. Conceptually (2D view, with a base offset):
-
-$$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-tstore %t1, %sv_out[%c0, %c0]
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
-
-```cpp
-template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
-          typename... WaitEvents>
-PTO_INST RecordEvent TSTORE(GlobalData &dst, TileData &src, WaitEvents &... events);
-
-template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
-          typename... WaitEvents>
-PTO_INST RecordEvent TSTORE(GlobalData &dst, TileData &src, uint64_t preQuantScalar, WaitEvents &... events);
-
-template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
-          typename... WaitEvents>
-PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, WaitEvents &... events);
-```
-
-The `preQuantScalar` and `TSTORE_FP` quantized-store overloads are only legal for `TileType::Acc` on current A2/A3 and A5 backends. They do not provide a native vec-tile quantized store contract.
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-  - Source tile location must be one of: `TileType::Vec`, `TileType::Mat`, `TileType::Acc`.
-  - Runtime: all `dst.GetShape(dim)` values and `src.GetValidRow()/GetValidCol()` must be `> 0`.
-  - For `TileType::Vec` / `TileType::Mat`:
-    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`.
-    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
-    - Layouts must match ND/DN/NZ (or a special case where `TileData::Rows == 1` or `TileData::Cols == 1`).
-    - For `int64_t/uint64_t`, only ND->ND or DN->DN are supported.
-    - A2/A3 does not expose a native vec quantized-store path. Frontends that need `vec -> GM` dtype conversion or quantization MUST first materialize the converted vec tile (for example via `TCVT`) and then issue a same-dtype `TSTORE`.
-  - For `TileType::Acc` (including quantized/atomic variants):
-    - Destination layout must be ND or NZ.
-    - Source dtype must be `int32_t` or `float`.
-    - When not using quantization, destination dtype must be `__gm__ int32_t/float/half/bfloat16_t`.
-    - Static shape constraints: `1 <= TileData::Cols <= 4095`; if ND then `1 <= TileData::Rows <= 8192`; if NZ then `1 <= TileData::Rows <= 65535` and `TileData::Cols % 16 == 0`.
-    - Runtime: `1 <= src.GetValidCol() <= 4095`.
-- **Implementation checks (A5)**:
-  - Source tile location must be `TileType::Vec` or `TileType::Acc` (no `Mat` store on this target).
-  - For `TileType::Vec`:
-    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
-    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`, `float8_e4m3_t`, `float8_e5m2_t`, `hifloat8_t`, `float4_e1m2x2_t`, `float4_e2m1x2_t`.
-    - Layouts must match ND/DN/NZ (or a special case where `TileData::Rows == 1` or `TileData::Cols == 1`).
-    - Additional alignment constraints are enforced (e.g., for ND the row-major width in bytes must be a multiple of 32; for DN the column-major height in bytes must be a multiple of 32, with special-case exceptions).
-  - For `TileType::Acc`:
-    - Destination layout must be ND or NZ; source dtype must be `int32_t` or `float`.
-    - When not using quantization, destination dtype must be `__gm__ int32_t/float/half/bfloat16_t`.
-    - Static shape constraints on rows/cols match those for A2/A3; `AtomicAdd` additionally restricts the destination dtype to supported atomic types.
-- **Valid region**:
-  - The implementation uses `src.GetValidRow()` / `src.GetValidCol()` as the transfer size.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_auto(__gm__ T* out) {
-  using TileT = Tile<TileType::Vec, T, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-  GTensor gout(out);
-  TileT t;
-  TSTORE(gout, t);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T>
-void example_manual(__gm__ T* out) {
-  using TileT = Tile<TileType::Vec, T, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
-  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
-
-  GTensor gout(out);
-  TileT t;
-  TASSIGN(t, 0x1000);
-  TSTORE<TileT, GTensor, AtomicType::AtomicAdd>(gout, t);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpand.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpand.md
deleted file mode 100644
index 4a165ee1..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpand.md
+++ /dev/null
@@ -1,133 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpand.md` -->
-
-# pto.tcolexpand
-
-Standalone reference page for `pto.tcolexpand`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Broadcast the first element of each source column across the destination column.
-
-## Mechanism
-
-`pto.tcolexpand` broadcasts the first element of each source column across the corresponding destination column. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{0,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcolexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPAND(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. The operation iterates over the valid region of `dst`.
-
-## Expected Outputs
-
-`dst` holds the column-wise broadcast: each column `j` of `dst` is filled with `src[0,j]`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tcolexpand` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TCOLEXPAND(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolmin](./tcolmin.md)
-- Next op in family: [pto.tcolexpanddiv](./tcolexpanddiv.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpand_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpand_zh.md
deleted file mode 100644
index 7c1ae125..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpand_zh.md
+++ /dev/null
@@ -1,78 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpand_zh.md` -->
-
-# TCOLEXPAND
-
-## Instruction Diagram
-
-![TCOLEXPAND tile operation](../figures/isa/TCOLEXPAND.svg)
-
-## Summary
-
-Broadcast the first element of each source column across the destination column.
-
-## Math Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{0,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPAND(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Constraints
-
-- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TCOLEXPAND(dst, src);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd.md
deleted file mode 100644
index a184449b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandadd.md` -->
-
-# pto.tcolexpandadd
-
-Standalone reference page for `pto.tcolexpandadd`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Column-wise broadcast add: add a per-column scalar vector to each column.
-
-## Mechanism
-
-Column-wise broadcast add: add the per-column scalar taken from `src1` to each element of `src0`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + s_j
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (the tile to which the per-column scalars are added).
-- `src1` is the second source tile providing per-column scalar values.
-- `dst` names the destination tile. The operation iterates over the valid region of `dst`.
-
-## Expected Outputs
-
-`dst[i,j]` = `src0[i,j]` + `src1[0,j]` (column-wise broadcast add of the per-column scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tcolexpandadd` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolexpandmul](./tcolexpandmul.md)
-- Next op in family: [pto.tcolexpandmax](./tcolexpandmax.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd_zh.md
deleted file mode 100644
index 6b5f16e4..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd_zh.md
+++ /dev/null
@@ -1,90 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandadd_zh.md` -->
-
-# TCOLEXPANDADD
-
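-## Instruction Diagram
-
-![TCOLEXPANDADD tile operation](../figures/isa/TCOLEXPANDADD.svg)
-
-## Summary
-
-Column-wise broadcast add: add a per-column scalar vector to each column.
-
-## Math Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`: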
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + s_j
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
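-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-## Assembly Examples (ASM)
-
-### Auto Mode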
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv.md
deleted file mode 100644
index 8ad515ff..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv.md` -->
-
-# pto.tcolexpanddiv
-
-Standalone reference page for `pto.tcolexpanddiv`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Column-wise broadcast divide: divide each column by a per-column scalar vector.
-
-## Mechanism
-
-Column-wise broadcast divide: divide each element of `src0` by the per-column scalar taken from `src1`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} / s_j
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (the tile whose elements are divided by the per-column scalars).
-- `src1` is the second source tile providing per-column scalar values.
-- `dst` names the destination tile. The operation iterates over the valid region of `dst`.
-
-## Expected Outputs
-
-`dst[i,j]` = `src0[i,j]` / `src1[0,j]` (column-wise broadcast divide by the per-column scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tcolexpanddiv` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolexpand](./tcolexpand.md)
-- Next op in family: [pto.tcolexpandmul](./tcolexpandmul.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv_zh.md
deleted file mode 100644
index ca0e6638..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv_zh.md
+++ /dev/null
@@ -1,90 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv_zh.md` -->
-
-# TCOLEXPANDDIV
-
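-## Instruction Diagram
-
-![TCOLEXPANDDIV tile operation](../figures/isa/TCOLEXPANDDIV.svg)
-
-## Summary
-
-Column-wise broadcast divide: divide each column by a per-column scalar vector.
-
-## Math Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`: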
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} / s_j
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
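-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-## Assembly Examples (ASM)
-
-### Auto Mode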
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md
deleted file mode 100644
index 7a7a3a19..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md
+++ /dev/null
@@ -1,121 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md` -->
-
-# pto.tcolexpandexpdif
-
-Standalone reference page for `pto.tcolexpandexpdif`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Column-wise exp-diff: compute exp(src0 - src1) with per-column scalars.
-
-## Mechanism
-
-Column-wise exp-diff: compute `exp(src0 - src1)` using a per-column scalar vector `src1`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \exp(\mathrm{src0}_{i,j} - s_j)
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (the tile from whose elements the per-column scalars are subtracted before exponentiation).
-- `src1` is the second source tile providing per-column scalar values.
-- `dst` names the destination tile. The operation iterates over the valid region of `dst`.
-
-## Expected Outputs
-
-`dst[i,j]` = exp(`src0[i,j]` - `src1[0,j]`) (column-wise exp-diff against the per-column scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tcolexpandexpdif` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolexpandsub](./tcolexpandsub.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif_zh.md
deleted file mode 100644
index 32389cf2..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif_zh.md
+++ /dev/null
@@ -1,90 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif_zh.md` -->
-
-# TCOLEXPANDEXPDIF
-
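-## Instruction Diagram
-
-![TCOLEXPANDEXPDIF tile operation](../figures/isa/TCOLEXPANDEXPDIF.svg)
-
-## Introduction
-
-Column-wise exponential difference: computes exp(src0 - src1), where src1 supplies one scalar per column.
-
-## Mathematical Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`: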
-
-$$
-\mathrm{dst}_{i,j} = \exp(\mathrm{src0}_{i,j} - s_j)
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
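-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-- Exact layout/fractal constraints are target-specific; see the backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Examples
-
-See the related examples in `docs/isa/` and `docs/coding/tutorials/`.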
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: placement and scheduling are managed by the compiler/runtime.
-%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax.md
deleted file mode 100644
index 16e56b74..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandmax.md` -->
-
-# pto.tcolexpandmax
-
-Standalone reference page for `pto.tcolexpandmax`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Column-wise broadcast max with per-column scalar vector.
-
-## Mechanism
-
-Column-wise broadcast max: take `max(src0, src1)` where `src1` provides one scalar per column. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, s_j)
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile, whose elements are combined with the per-column scalars.
-- `src1` is the second source tile providing per-column scalar values.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = max(`src0[i,j]`, `src1[0,j]`): each element is compared with the scalar for its column and the larger value is kept.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tcolexpandmax` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolexpandadd](./tcolexpandadd.md)
-- Next op in family: [pto.tcolexpandmin](./tcolexpandmin.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax_zh.md
deleted file mode 100644
index 6138af8a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax_zh.md
+++ /dev/null
@@ -1,90 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandmax_zh.md` -->
-
-# TCOLEXPANDMAX
-
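-## Instruction Diagram
-
-![TCOLEXPANDMAX tile operation](../figures/isa/TCOLEXPANDMAX.svg)
-
-## Introduction
-
-Column-wise broadcast max: takes the element-wise maximum with a per-column scalar vector.
-
-## Mathematical Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`: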
-
-$$
-\mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, s_j)
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
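-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-- Exact layout/fractal constraints are target-specific; see the backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Examples
-
-See the related examples in `docs/isa/` and `docs/coding/tutorials/`.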
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: placement and scheduling are managed by the compiler/runtime.
-%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin.md
deleted file mode 100644
index 31d43586..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandmin.md` -->
-
-# pto.tcolexpandmin
-
-Standalone reference page for `pto.tcolexpandmin`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Column-wise broadcast min with per-column scalar vector.
-
-## Mechanism
-
-Column-wise broadcast min: take `min(src0, src1)` where `src1` provides one scalar per column. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, s_j)
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile, whose elements are combined with the per-column scalars.
-- `src1` is the second source tile providing per-column scalar values.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = min(`src0[i,j]`, `src1[0,j]`): each element is compared with the scalar for its column and the smaller value is kept.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tcolexpandmin` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolexpandmax](./tcolexpandmax.md)
-- Next op in family: [pto.tcolexpandsub](./tcolexpandsub.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin_zh.md
deleted file mode 100644
index 25f229ac..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin_zh.md
+++ /dev/null
@@ -1,90 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandmin_zh.md` -->
-
-# TCOLEXPANDMIN
-
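-## Instruction Diagram
-
-![TCOLEXPANDMIN tile operation](../figures/isa/TCOLEXPANDMIN.svg)
-
-## Introduction
-
-Column-wise broadcast min: takes the element-wise minimum with a per-column scalar vector.
-
-## Mathematical Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`: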
-
-$$
-\mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, s_j)
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
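-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-- Exact layout/fractal constraints are target-specific; see the backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Examples
-
-See the related examples in `docs/isa/` and `docs/coding/tutorials/`.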
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: placement and scheduling are managed by the compiler/runtime.
-%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul.md
deleted file mode 100644
index 7fea3475..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandmul.md` -->
-
-# pto.tcolexpandmul
-
-Standalone reference page for `pto.tcolexpandmul`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Column-wise broadcast multiply: multiply each column by a per-column scalar vector.
-
-## Mechanism
-
-Column-wise broadcast multiply: multiply each element of `src0` by a per-column scalar vector `src1`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot s_j
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile, whose elements are combined with the per-column scalars.
-- `src1` is the second source tile providing per-column scalar values.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = `src0[i,j]` * `src1[0,j]`: each element is multiplied by the scalar for its column.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tcolexpandmul` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolexpanddiv](./tcolexpanddiv.md)
-- Next op in family: [pto.tcolexpandadd](./tcolexpandadd.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul_zh.md
deleted file mode 100644
index 0f054a61..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul_zh.md
+++ /dev/null
@@ -1,90 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandmul_zh.md` -->
-
-# TCOLEXPANDMUL
-
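-## Instruction Diagram
-
-![TCOLEXPANDMUL tile operation](../figures/isa/TCOLEXPANDMUL.svg)
-
-## Introduction
-
-Column-wise broadcast multiply: multiplies each column by the corresponding value of a per-column scalar vector.
-
-## Mathematical Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`: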
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot s_j
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
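-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-- Exact layout/fractal constraints are target-specific; see the backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Examples
-
-See the related examples in `docs/isa/` and `docs/coding/tutorials/`.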
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: placement and scheduling are managed by the compiler/runtime.
-%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub.md
deleted file mode 100644
index 62a08fe9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandsub.md` -->
-
-# pto.tcolexpandsub
-
-Standalone reference page for `pto.tcolexpandsub`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Column-wise broadcast subtract: subtract a per-column scalar vector from each column.
-
-## Mechanism
-
-Column-wise broadcast subtract: subtract a per-column scalar vector `src1` from `src0`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - s_j
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile, whose elements are combined with the per-column scalars.
-- `src1` is the second source tile providing per-column scalar values.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = `src0[i,j]` - `src1[0,j]`: the scalar for each column is subtracted from every element in that column.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tcolexpandsub` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolexpandmin](./tcolexpandmin.md)
-- Next op in family: [pto.tcolexpandexpdif](./tcolexpandexpdif.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub_zh.md
deleted file mode 100644
index 9ed59926..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub_zh.md
+++ /dev/null
@@ -1,90 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolexpandsub_zh.md` -->
-
-# TCOLEXPANDSUB
-
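-## Instruction Diagram
-
-![TCOLEXPANDSUB tile operation](../figures/isa/TCOLEXPANDSUB.svg)
-
-## Introduction
-
-Column-wise broadcast subtract: subtracts the corresponding value of a per-column scalar vector from each column.
-
-## Mathematical Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
-
-For `0 <= i < R` and `0 <= j < C`: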
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - s_j
-$$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-```
-
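-## Constraints
-
-- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
-- Exact layout/fractal constraints are target-specific; see the backend headers under `include/pto/npu/*/TColExpand*.hpp`.
-
-## Examples
-
-See the related examples in `docs/isa/` and `docs/coding/tutorials/`.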
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: placement and scheduling are managed by the compiler/runtime.
-%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmax.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmax.md
deleted file mode 100644
index 237fc69e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmax.md
+++ /dev/null
@@ -1,158 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolmax.md` -->
-
-# pto.tcolmax
-
-Standalone reference page for `pto.tcolmax`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Reduce each column by taking the maximum across rows.
-
-## Mechanism
-
-Reduce each column by taking the maximum across rows. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
-
-$$ \mathrm{dst}_{0,j} = \max_{0 \le i < R} \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolmax ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TCOLMAX(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the column-wise maximum: for each column `j`, `dst[0,j]` = max of all elements in column `j` of `src`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-
-- `dst` and `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `dst` and `src` must use the same element type.
-
-- Runtime checks:
-  - `src.GetValidCol() == dst.GetValidCol()`
-
-- Supported element types: `half`, `float`, `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `bfloat16_t`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
-
-### A2A3 implementation checks
-
-- Supported element types: `half`, `float`, `int16_t`, `int32_t`.
-
-### A5 implementation checks
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TCOLMAX(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCOLMAX(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolmax ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolprod](./tcolprod.md)
-- Next op in family: [pto.trowmax](./trowmax.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmax_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmax_zh.md
deleted file mode 100644
index e89eb93d..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmax_zh.md
+++ /dev/null
@@ -1,130 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolmax_zh.md` -->
-
-# TCOLMAX
-
-## 指令示意图
-
-![TCOLMAX tile operation](../figures/isa/TCOLMAX.svg)
-
-## 简介
-
-通过取行间最大值来归约每一列。
-
-## 数学语义
-
-设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
-
-$$ \mathrm{dst}_{0,j} = \max_{0 \le i < R} \mathrm{src}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tcolmax ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TCOLMAX(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-```
-
-## 约束
-
-### 通用约束或检查
-
-- `dst` 和 `src` 必须为 `TileType::Vec`。
-- `dst` 和 `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `dst` 和 `src` 的元素类型必须一致。
-- 运行时检查：
-  - `src.GetValidCol() == dst.GetValidCol()`
-- 若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
-
-### A2A3 实现检查
-
-- 支持的元素类型：`half`、`float`、`int16_t`、`int32_t`。
-
-### A5 实现检查
-
-- 支持的元素类型：`half`、`float`、`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`bfloat16_t`。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TCOLMAX(dst, src);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCOLMAX(dst, src);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tcolmax %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolmax ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmin.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmin.md
deleted file mode 100644
index afa0e337..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmin.md
+++ /dev/null
@@ -1,158 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolmin.md` -->
-
-# pto.tcolmin
-
-Standalone reference page for `pto.tcolmin`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Reduce each column by taking the minimum across rows.
-
-## Mechanism
-
-Reduce each column by taking the minimum across rows. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
-
-$$ \mathrm{dst}_{0,j} = \min_{0 \le i < R} \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolmin ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TCOLMIN(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the column-wise minimum: for each column `j`, `dst[0,j]` = min of all elements in column `j` of `src`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-
-- `dst` and `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `dst` and `src` must use the same element type.
-
-- Runtime checks:
-  - `src.GetValidCol() == dst.GetValidCol()`
-
-- Supported element types: `half`, `float`, `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `bfloat16_t`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
-
-### A2A3 implementation checks
-
-- Supported element types: `half`, `float`, `int16_t`, `int32_t`.
-
-### A5 implementation checks
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TCOLMIN(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCOLMIN(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolmin ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowexpandexpdif](./trowexpandexpdif.md)
-- Next op in family: [pto.tcolexpand](./tcolexpand.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmin_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmin_zh.md
deleted file mode 100644
index 5b197cc1..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolmin_zh.md
+++ /dev/null
@@ -1,130 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolmin_zh.md` -->
-
-# TCOLMIN
-
-## 指令示意图
-
-![TCOLMIN tile operation](../figures/isa/TCOLMIN.svg)
-
-## 简介
-
-通过取行间最小值来归约每一列。
-
-## 数学语义
-
-设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
-
-$$ \mathrm{dst}_{0,j} = \min_{0 \le i < R} \mathrm{src}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tcolmin ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TCOLMIN(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-```
-
-## 约束
-
-### 通用约束或检查
-
-- `dst` 和 `src` 必须为 `TileType::Vec`。
-- `dst` 和 `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `dst` 和 `src` 的元素类型必须一致。
-- 运行时检查：
-  - `src.GetValidCol() == dst.GetValidCol()`
-- 若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
-
-### A2A3 实现检查
-
-- 支持的元素类型：`half`、`float`、`int16_t`、`int32_t`。
-
-### A5 实现检查
-
-- 支持的元素类型：`half`、`float`、`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`bfloat16_t`。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TCOLMIN(dst, src);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCOLMIN(dst, src);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tcolmin %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolmin ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolprod.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolprod.md
deleted file mode 100644
index 7232b4c4..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolprod.md
+++ /dev/null
@@ -1,158 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolprod.md` -->
-
-# pto.tcolprod
-
-Standalone reference page for `pto.tcolprod`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Reduce each column by multiplying across rows.
-
-## Mechanism
-
-Reduce each column by multiplying across rows. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
-
-$$ \mathrm{dst}_{0,j} = \prod_{i=0}^{R-1} \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TCOLPROD(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the column-wise product: for each column `j`, `dst[0,j]` = product of all elements in column `j` of `src`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-
-- `dst` and `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `dst` and `src` must use the same element type.
-
-- Runtime checks:
-  - `src.GetValidCol() == dst.GetValidCol()`
-
-- Supported element types: `half`, `float`, `bfloat16_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
-
-### A2A3 implementation checks
-
-- Supported element types: `half`, `float`, `int16_t`, `int32_t`.
-
-### A5 implementation checks
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TCOLPROD(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCOLPROD(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolsum](./tcolsum.md)
-- Next op in family: [pto.tcolmax](./tcolmax.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolprod_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolprod_zh.md
deleted file mode 100644
index 78e8424c..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolprod_zh.md
+++ /dev/null
@@ -1,130 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolprod_zh.md` -->
-
-# TCOLPROD
-
-## 指令示意图
-
-![TCOLPROD tile operation](../figures/isa/TCOLPROD.svg)
-
-## 简介
-
-通过跨行乘积来归约每一列。
-
-## 数学语义
-
-设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
-
-$$ \mathrm{dst}_{0,j} = \prod_{i=0}^{R-1} \mathrm{src}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tcolprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TCOLPROD(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-```
-
-## 约束
-
-### 通用约束或检查
-
-- `dst` 和 `src` 必须为 `TileType::Vec`。
-- `dst` 和 `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `dst` 和 `src` 的元素类型必须一致。
-- 运行时检查：
-  - `src.GetValidCol() == dst.GetValidCol()`
-- 若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
-
-### A2A3 实现检查
-
-- 支持的元素类型：`half`、`float`、`int16_t`、`int32_t`。
-
-### A5 实现检查
-
-- 支持的元素类型：`half`、`float`、`bfloat16_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TCOLPROD(dst, src);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCOLPROD(dst, src);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tcolprod %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolsum.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolsum.md
deleted file mode 100644
index f28876ab..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolsum.md
+++ /dev/null
@@ -1,186 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolsum.md` -->
-
-# pto.tcolsum
-
-Standalone reference page for `pto.tcolsum`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Reduce each column by summing across rows.
-
-## Mechanism
-
-Reduce each column by summing across rows. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
-
-$$ \mathrm{dst}_{0,j} = \sum_{i=0}^{R-1} \mathrm{src}_{i,j} $$
-
-`isBinary` selects the implementation path (binary-tree accumulation vs. sequential accumulation).
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcolsum %src {isBinary = false} : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
-%dst = pto.tcolsum %src, %tmp {isBinary = false} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-pto.tcolsum ins(%src, %tmp {isBinary = false} : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TCOLSUM(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, bool isBinary, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-- `isBinary` (bool): selects the implementation path—binary-tree accumulation (`true`) or sequential accumulation (`false`).
-
-## Expected Outputs
-
-`dst` holds the column-wise sum: for each column `j`, `dst[0,j]` = sum of all elements in column `j` of `src`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-
-- `dst` and `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `dst` and `src` must use the same element type.
-
-- Runtime checks:
-  - `src.GetValidCol() == dst.GetValidCol()`
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidCol() <= tmp` row stride measured in `src` elements
-
-- Supported element types: `half`, `float`, `int16_t`, `int32_t`.
-
-- `tmp` must be `TileType::Vec` and use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `tmp` must use the same element type as `src` and `dst`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `isBinary` selects the checked backend path:
-  - `true`: binary-tree accumulation using `tmp`
-  - `false`: sequential accumulation into `dst`
-
-### A2A3 implementation checks
-
-- If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
-
-### A5 implementation checks
-
-- Shared A5 column-reduce checks allow `half`, `float`, `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `bfloat16_t`.
-
-- The checked A5 `TCOLSUM` path still takes `tmp` only for the binary accumulation path; no extra compile-time `tmp` type/layout assertions are explicitly enforced in `TCOLSUM_IMPL`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TCOLSUM(dst, src, tmp, /*isBinary=*/false);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TCOLSUM(dst, src, tmp, /*isBinary=*/false);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcolsum %src {isBinary = false} : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowsum](./trowsum.md)
-- Next op in family: [pto.tcolprod](./tcolprod.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolsum_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolsum_zh.md
deleted file mode 100644
index ae8e1c91..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/tcolsum_zh.md
+++ /dev/null
@@ -1,153 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/tcolsum_zh.md` -->
-
-# TCOLSUM
-
-## 指令示意图
-
-![TCOLSUM tile operation](../figures/isa/TCOLSUM.svg)
-
-## 简介
-
-通过对行求和来归约每一列。
-
-## 数学语义
-
-设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= j < C`：
-
-$$ \mathrm{dst}_{0,j} = \sum_{i=0}^{R-1} \mathrm{src}_{i,j} $$
-
-`isBinary` 选择实现路径（二叉树累加 vs. 顺序累加）。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = tcolsum %src {isBinary = false} : !pto.tile<...> -> !pto.tile<...>
-```
-
-降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
-%dst = pto.tcolsum %src, %tmp {isBinary = false} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-pto.tcolsum ins(%src, %tmp {isBinary = false} : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
-PTO_INST RecordEvent TCOLSUM(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
-
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, bool isBinary, WaitEvents &... events);
-```
-
-## 约束
-
-### 通用约束或检查
-
-- `dst` 和 `src` 必须为 `TileType::Vec`。
-- `dst` 和 `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `dst` 和 `src` 的元素类型必须一致。
-- 运行时检查：
-  - `src.GetValidCol() == dst.GetValidCol()`
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidCol()` 必须不大于按 `src` 元素计的 `tmp` 行跨度
-- `isBinary` 选择已检查到的后端路径：
-  - `true`：使用 `tmp` 做二叉树累加
-  - `false`：直接在 `dst` 上做顺序累加
-
-### A2A3 实现检查
-
-- 支持的元素类型：`half`、`float`、`int16_t`、`int32_t`。
-- `tmp` 必须为 `TileType::Vec`，且使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `tmp` 的元素类型必须与 `src` 和 `dst` 一致。
-- 若 `src.GetValidRow() == 0` 或 `src.GetValidCol() == 0`，实现会直接返回。
-
-### A5 实现检查
-
-- A5 共享列归约检查允许的元素类型为：`half`、`float`、`int8_t`、`uint8_t`、`int16_t`、`uint16_t`、`int32_t`、`uint32_t`、`bfloat16_t`。
-- 已检查到的 A5 `TCOLSUM` 路径中，`tmp` 仍只用于二叉累加路径；`TCOLSUM_IMPL` 中没有额外显式加入 `tmp` 的编译期类型/布局断言。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TCOLSUM(dst, src, tmp, /*isBinary=*/false);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 1, 16>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TCOLSUM(dst, src, tmp, /*isBinary=*/false);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = tcolsum %src {isBinary = false} : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmax.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmax.md
deleted file mode 100644
index 8ebdafd9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmax.md
+++ /dev/null
@@ -1,187 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowargmax.md` -->
-
-# pto.trowargmax
-
-Standalone reference page for `pto.trowargmax`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Get the column index of the maximum element for each row.
-
-## Mechanism
-
-Get the column index of the maximum element for each row. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \underset{0 \le j < C}{\operatorname{argmax}} \; \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `tmp` is a temporary tile used for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the column index of the row-wise maximum: for each row `i`, `dst[i,0]` = argmax of elements in row `i` of `src`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-
-- Supported source element types: `half`, `float`.
-
-- Supported destination element types: `uint32_t`, `int32_t`.
-
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `dst` and `src` must satisfy the shared row-reduce-index check path used by `TRowArgMax`.
-
-- `tmp` is unused when `srcValidCol <= ElementPerRepeat` and is required when `srcValidCol > ElementPerRepeat`.
-
-- `tmp` must have the same number of rows as `src`.
-
-- When `src` is small, it is simplest to give `tmp` the same tile size as `src`.
-
-- The stride of `tmp` can be computed from `src`'s `validCol` using the following formula:
-
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- Runtime checks follow the shared row-reduce check path:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-### A2A3 implementation checks
-
-- `dst` is checked through the shared row-reduce-index path and may use either of these non-fractal layouts:
-  - DN layout with one column (`BLayout::ColMajor`, `Cols == 1`), or
-  - ND layout whose valid column count is 1.
-
-### A5 implementation checks
-
-- In the checked A5 implementation path, `tmp` is accepted by the interface but not used by `TROWARGMAX_IMPL`.
-
-### About temporary tile `tmp` for A3
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWARGMAX(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWARGMAX(dst, src, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowmin](./trowmin.md)
-- Next op in family: [pto.trowargmin](./trowargmin.md)
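Note on the removed `pto.trowargmax` page: the documented semantics, `dst[i,0] = argmax_j src[i,j]` over the source's valid region, can be sketched as a plain scalar C++ reference. This is a minimal sketch and not the PTO implementation. `row_argmax_ref` and its parameters are hypothetical names, a dense row-major buffer stands in for a `Tile`, and resolving ties to the smallest column index is an assumption that the page does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for dst[i,0] = argmax over 0 <= j < validCol of src[i,j].
// src is a dense row-major buffer whose physical row length is `cols`;
// only the first validRow x validCol elements are read.
// Assumption (not from the page): ties resolve to the smallest column index.
std::vector<std::uint32_t> row_argmax_ref(const std::vector<float>& src,
                                          std::size_t cols,
                                          std::size_t validRow,
                                          std::size_t validCol) {
  // Mirrors the documented runtime checks: both valid extents are non-zero.
  assert(validRow != 0 && validCol != 0);
  assert(validCol <= cols && validRow * cols <= src.size());

  std::vector<std::uint32_t> dst(validRow);
  for (std::size_t i = 0; i < validRow; ++i) {
    std::size_t best = 0;
    for (std::size_t j = 1; j < validCol; ++j) {
      // Strict comparison keeps the first maximum when values repeat.
      if (src[i * cols + j] > src[i * cols + best]) best = j;
    }
    dst[i] = static_cast<std::uint32_t>(best);
  }
  return dst;
}
```

A reference like this is mainly useful for checking a kernel's output on small tiles; elements outside the valid region are never read.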
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmax_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmax_zh.md
deleted file mode 100644
index 79037aa6..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmax_zh.md
+++ /dev/null
@@ -1,153 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowargmax_zh.md` -->
-
-# TROWARGMAX
-
-## Instruction Diagram
-
-![TROWARGMAX tile operation](../figures/isa/TROWARGMAX.svg)
-
-## Overview
-
-Get the column index of the maximum element for each row.
-
-## Math Semantics
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \underset{0 \le j < C}{\operatorname{argmax}} \; \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
-```
-
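-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-- Supported source element types: `half`, `float`.
-- Supported destination element types: `uint32_t`, `int32_t`.
-- Runtime checks follow the shared row-reduce check path:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-### A2A3 implementation checks
-
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- `dst` is checked through the shared row-reduce-index path and may use either of these non-fractal layouts:
-  - DN layout with one column (`BLayout::ColMajor`, `Cols == 1`), or
-  - ND layout whose valid column count is 1.
-
-### A5 implementation checks
-
-- `dst` and `src` must satisfy the shared row-reduce-index check path used by `TRowArgMax`.
-- In the checked A5 implementation path, `tmp` is accepted by the interface but not used by `TROWARGMAX_IMPL`.
-
-### About temporary tile `tmp` for A3
-
-- `tmp` is unused when `srcValidCol <= ElementPerRepeat` and is required when `srcValidCol > ElementPerRepeat`.
-- `tmp` must have the same number of rows as `src`.
-- The stride of `tmp` can be computed from `src`'s `validCol` using the following formula: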
-
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWARGMAX(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWARGMAX(dst, src, tmp);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmin.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmin.md
deleted file mode 100644
index 13605b9c..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmin.md
+++ /dev/null
@@ -1,187 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowargmin.md` -->
-
-# pto.trowargmin
-
-Standalone reference page for `pto.trowargmin`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Get the column index of the minimum element for each row.
-
-## Mechanism
-
-Get the column index of the minimum element for each row. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \underset{0 \le j < C}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `tmp` is a temporary tile used for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the column index of the row-wise minimum: for each row `i`, `dst[i,0]` = argmin of elements in row `i` of `src`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-
-- Supported source element types: `half`, `float`.
-
-- Supported destination element types: `uint32_t`, `int32_t`.
-
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `dst` and `src` must satisfy the shared row-reduce-index check path used by `TRowArgMin`.
-
-- `tmp` is unused when `srcValidCol <= ElementPerRepeat` and is required when `srcValidCol > ElementPerRepeat`.
-
-- `tmp` must have the same number of rows as `src`.
-
-- When `src` is small, it is simplest to give `tmp` the same tile size as `src`.
-
-- The stride of `tmp` can be computed from `src`'s `validCol` using the following formula:
-
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- Runtime checks follow the shared row-reduce check path:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-### A2A3 implementation checks
-
-- `dst` is checked through the shared row-reduce-index path and may use either of these non-fractal layouts:
-  - DN layout with one column (`BLayout::ColMajor`, `Cols == 1`), or
-  - ND layout whose valid column count is 1.
-
-### A5 implementation checks
-
-- In the checked A5 implementation path, `tmp` is accepted by the interface but not used by `TROWARGMIN_IMPL`.
-
-### About temporary tile `tmp` for A3
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWARGMIN(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWARGMIN(dst, src, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowargmax](./trowargmax.md)
-- Next op in family: [pto.trowexpand](./trowexpand.md)
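Note on the `tmp` stride formula shared by the removed `pto.trowargmax` and `pto.trowargmin` pages: the two-line formula can be evaluated with integer ceiling division, as in the sketch below. `ceil_div` and `tmp_stride` are hypothetical helper names. The values 64 (elements per repeat) and 8 (elements per block) used in the usage note are illustrative assumptions for a 4-byte element type, not figures taken from the page.

```cpp
#include <cassert>
#include <cstddef>

// Integer ceiling division for a non-zero divisor.
inline std::size_t ceil_div(std::size_t a, std::size_t b) {
  assert(b != 0);
  return (a + b - 1) / b;
}

// Evaluates the documented formula:
//   repeats = ceil(validCol / elementPerRepeat)
//   stride  = ceil(repeats * 2 / elementPerBlock) * elementPerBlock
//           + ceil(repeats / elementPerBlock) * elementPerBlock
inline std::size_t tmp_stride(std::size_t validCol,
                              std::size_t elementPerRepeat,
                              std::size_t elementPerBlock) {
  const std::size_t repeats = ceil_div(validCol, elementPerRepeat);
  return ceil_div(repeats * 2, elementPerBlock) * elementPerBlock +
         ceil_div(repeats, elementPerBlock) * elementPerBlock;
}
```

With the assumed values, `tmp_stride(256, 64, 8)` gives `repeats = 4`, so the two terms are `ceil(8 / 8) * 8 = 8` and `ceil(4 / 8) * 8 = 8`, for a stride of 16 elements.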
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmin_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmin_zh.md
deleted file mode 100644
index 0bb8115c..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowargmin_zh.md
+++ /dev/null
@@ -1,153 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowargmin_zh.md` -->
-
-# TROWARGMIN
-
-## Instruction Diagram
-
-![TROWARGMIN tile operation](../figures/isa/TROWARGMIN.svg)
-
-## Overview
-
-Get the column index of the minimum element for each row.
-
-## Math Semantics
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \underset{0 \le j < C}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
-```
-
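-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must be `TileType::Vec`.
-- Supported source element types: `half`, `float`.
-- Supported destination element types: `uint32_t`, `int32_t`.
-- Runtime checks follow the shared row-reduce check path:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-### A2A3 implementation checks
-
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- `dst` is checked through the shared row-reduce-index path and may use either of these non-fractal layouts:
-  - DN layout with one column (`BLayout::ColMajor`, `Cols == 1`), or
-  - ND layout whose valid column count is 1.
-
-### A5 implementation checks
-
-- `dst` and `src` must satisfy the shared row-reduce-index check path used by `TRowArgMin`.
-- In the checked A5 implementation path, `tmp` is accepted by the interface but not used by `TROWARGMIN_IMPL`.
-
-### About temporary tile `tmp` for A3
-
-- `tmp` is unused when `srcValidCol <= ElementPerRepeat` and is required when `srcValidCol > ElementPerRepeat`.
-- `tmp` must have the same number of rows as `src`.
-- The stride of `tmp` can be computed from `src`'s `validCol` using the following formula: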
-
-```text
-repeats = ceil(validCol / elementPerRepeat)
-stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
-```
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWARGMIN(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWARGMIN(dst, src, tmp);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
-# IR Level 2 (DPS)
-pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpand.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpand.md
deleted file mode 100644
index 4310bcb9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpand.md
+++ /dev/null
@@ -1,161 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpand.md` -->
-
-# pto.trowexpand
-
-Standalone reference page for `pto.trowexpand`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Broadcast the first element of each source row across the destination row.
-
-## Mechanism
-
-Broadcast the first element of each source row across the destination row. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,0} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPAND(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the row-wise broadcast: each row `i` of `dst` is filled with `src[i,0]`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- Tile Type: `dst` and `src` must be `TileType::Vec`.
-
-- Tile layout: ND layout (row-major and non-fractal: `isRowMajor` and `SLayout::NoneBox`) for both `src` and `dst`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-Implementation checks (NPU):
-
-- Data type: on A2A3 and A5, the element type must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
-
-- Runtime valid checks:
-    - A2A3: returns early if any of `dstValidRow`, `dstValidCol`, `srcValidRow`, `srcValidCol` is zero.
-    - A5: asserts `srcValidRow == dstValidRow` and asserts `srcValidRow != 0 && srcValidCol != 0`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TROWEXPAND(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TROWEXPAND(dst, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowexpand %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowargmin](./trowargmin.md)
-- Next op in family: [pto.trowexpanddiv](./trowexpanddiv.md)
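Note on the removed `pto.trowexpand` page: the broadcast rule `dst[i,j] = src[i,0]` over the destination's valid region can be sketched as a scalar C++ reference. This is a minimal sketch under stated assumptions, not the PTO implementation. `row_expand_ref` and its parameters are hypothetical names, and dense row-major buffers stand in for `Tile` objects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for dst[i,j] = src[i,0].
// src is row-major with physical row length srcCols; dst is returned as a
// dense validRow x dstValidCol row-major buffer.
std::vector<float> row_expand_ref(const std::vector<float>& src,
                                  std::size_t srcCols,
                                  std::size_t validRow,
                                  std::size_t dstValidCol) {
  assert(srcCols != 0 && validRow * srcCols <= src.size());
  std::vector<float> dst(validRow * dstValidCol);
  for (std::size_t i = 0; i < validRow; ++i) {
    const float first = src[i * srcCols];  // first element of source row i
    for (std::size_t j = 0; j < dstValidCol; ++j) {
      dst[i * dstValidCol + j] = first;
    }
  }
  return dst;
}
```

Only column 0 of each source row is read, which is why the source and destination valid column counts are allowed to differ in this sketch.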
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpand_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpand_zh.md
deleted file mode 100644
index 283de717..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpand_zh.md
+++ /dev/null
@@ -1,107 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpand_zh.md` -->
-
-# TROWEXPAND
-
-## Instruction Diagram
-
-![TROWEXPAND tile operation](../figures/isa/TROWEXPAND.svg)
-
-## Overview
-
-Broadcast the first element of each source row across the destination row.
-
-## Math Semantics
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,0} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = trowexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPAND(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
-```
-
-## Constraints
-
-Implementation checks (NPU):
-
-- Tile type: `dst` and `src` must be `TileType::Vec`.
-- Tile layout: ND layout (row-major and non-fractal: `isRowMajor` and `SLayout::NoneBox`) for both `src` and `dst`.
-- Data type: on A2A3 and A5, the element type must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
-- Runtime valid-region checks:
-    - A2A3: returns early if any of `dstValidRow`, `dstValidCol`, `srcValidRow`, `srcValidCol` is zero.
-    - A5: asserts `srcValidRow == dstValidRow` and asserts `srcValidRow != 0 && srcValidCol != 0`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TROWEXPAND(dst, src);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TROWEXPAND(dst, src);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandadd.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandadd.md
deleted file mode 100644
index 041df419..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandadd.md
+++ /dev/null
@@ -1,143 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandadd.md` -->
-
-# pto.trowexpandadd
-
-Standalone reference page for `pto.trowexpandadd`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Row-wise broadcast add: add a per-row scalar vector.
-
-## Mechanism
-
-Row-wise broadcast add: add a per-row scalar vector `src1` to each row of `src0`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + s_i
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (the tile to be modified).
-- `src1` is the second source tile providing per-row scalar values.
-- `tmp` (optional): temporary tile for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = `src0[i,j]` + `src1[i,0]` (row-wise broadcast add of per-row scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-
-- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-
-- Mode 2: `src1` is expected to provide **32 bytes of data per row**.
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.trowexpandadd` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowexpandsub](./trowexpandsub.md)
-- Next op in family: [pto.trowexpandmax](./trowexpandmax.md)
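Note on the removed `pto.trowexpandadd` page: in Mode 1, where `src1` supplies one scalar per row, the rule `dst[i,j] = src0[i,j] + s_i` can be sketched as a scalar C++ reference. This is a minimal sketch, not the PTO implementation. `row_expand_add_ref` is a hypothetical name, `src1` is modelled as a plain vector holding one value per row, and Mode 2 (32 bytes of data per row) is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for dst[i,j] = src0[i,j] + src1[i] (Mode 1 only).
// src0 and dst are dense validRow x validCol row-major buffers;
// src1 holds one scalar per row.
std::vector<float> row_expand_add_ref(const std::vector<float>& src0,
                                      const std::vector<float>& src1,
                                      std::size_t validRow,
                                      std::size_t validCol) {
  assert(src0.size() >= validRow * validCol);
  assert(src1.size() >= validRow);  // src1 must cover R per-row values
  std::vector<float> dst(validRow * validCol);
  for (std::size_t i = 0; i < validRow; ++i) {
    for (std::size_t j = 0; j < validCol; ++j) {
      dst[i * validCol + j] = src0[i * validCol + j] + src1[i];
    }
  }
  return dst;
}
```

For a 2 x 2 `src0` of `{1, 2, 3, 4}` and per-row scalars `{10, 20}`, the first row has 10 added to every element and the second row has 20 added.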
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandadd_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandadd_zh.md
deleted file mode 100644
index ea327570..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandadd_zh.md
+++ /dev/null
@@ -1,81 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandadd_zh.md` -->
-
-# TROWEXPANDADD
-
-## 指令示意图
-
-![TROWEXPANDADD tile operation](../figures/isa/TROWEXPANDADD.svg)
-
-## 简介
-
-行广播加法：加上一个每行标量向量。
-
-## 数学语义
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + s_i
-$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile 形状/布局约束 (compile-time): `TileDataDst::isRowMajor`.
-- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-- Mode 2: `src1` is expected to provide **32 bytes data per row**.
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
-
-## 示例
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv.md
deleted file mode 100644
index 1f5b82ee..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv.md
+++ /dev/null
@@ -1,157 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpanddiv.md` -->
-
-# pto.trowexpanddiv
-
-Standalone reference page for `pto.trowexpanddiv`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Row-wise broadcast divide: divide each row of `src0` by a per-row scalar vector `src1`.
-
-## Mechanism
-
-Row-wise broadcast divide: divide each row of `src0` by a per-row scalar vector `src1`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{i,0}} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (the dividend).
-- `src1` is the second source tile providing per-row scalar values.
-- `tmp` (optional): temporary tile for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = `src0[i,j]` / `src1[i,0]` (row-wise broadcast divide by the per-row scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- Source and destination shapes, layouts, and element types MUST satisfy the legality rules documented by the family and target profile.
-
-- Programs must not assume implicit broadcasting, reshaping, or valid-region repair unless the operation documents it.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks**:
-    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType` (compile-time).
-    - `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-    - Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-    - Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-    - Mode 2: `src1` is expected to provide **32 bytes data per row**.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TROWEXPANDDIV(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TASSIGN(src0, 0x1000);
-  TASSIGN(dst,  0x2000);
-  TASSIGN(src1, 0x3000);
-  TROWEXPANDDIV(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowexpand](./trowexpand.md)
-- Next op in family: [pto.trowexpandmul](./trowexpandmul.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv_zh.md
deleted file mode 100644
index b25f4d77..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv_zh.md
+++ /dev/null
@@ -1,127 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpanddiv_zh.md` -->
-
-# TROWEXPANDDIV
-
-## 指令示意图
-
-![TROWEXPANDDIV tile operation](../figures/isa/TROWEXPANDDIV.svg)
-
-## 简介
-
-行广播除法：将 `src0` 的每一行除以一个每行标量向量 `src1`。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{i,0}} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trowexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查**:
-    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`（编译时）。
-    - `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
-    - Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
-    - 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
-    - 模式 2：`src1` 预期提供**每行 32 字节数据**。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TROWEXPANDDIV(dst, src0, src1);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TASSIGN(src0, 0x1000);
-  TASSIGN(dst,  0x2000);
-  TASSIGN(src1, 0x3000);
-  TROWEXPANDDIV(dst, src0, src1);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif.md
deleted file mode 100644
index 2f04b559..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif.md
+++ /dev/null
@@ -1,143 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif.md` -->
-
-# pto.trowexpandexpdif
-
-Standalone reference page for `pto.trowexpandexpdif`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Row-wise exp-diff: compute `exp(src0 - src1)`, where `src1` provides one scalar per row.
-
-## Mechanism
-
-Row-wise exp-diff: compute `exp(src0 - src1)` where `src1` provides one scalar per row. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \exp(\mathrm{src0}_{i,j} - s_i)
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (the minuend inside the exponent).
-- `src1` is the second source tile providing per-row scalar values.
-- `tmp` (optional): temporary tile for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = exp(`src0[i,j]` - `src1[i,0]`) (row-wise exp-diff against the per-row scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-
-- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-
-- Mode 2: `src1` is expected to provide **32 bytes data per row**.
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.trowexpandexpdif` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowexpandmin](./trowexpandmin.md)
-- Next op in family: [pto.tcolmin](./tcolmin.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif_zh.md
deleted file mode 100644
index f3d912f8..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif_zh.md
+++ /dev/null
@@ -1,81 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif_zh.md` -->
-
-# TROWEXPANDEXPDIF
-
-## 指令示意图
-
-![TROWEXPANDEXPDIF tile operation](../figures/isa/TROWEXPANDEXPDIF.svg)
-
-## 简介
-
-行指数差运算：计算 exp(src0 - src1)，其中 src1 为每行标量。
-
-## 数学语义
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \exp(\mathrm{src0}_{i,j} - s_i)
-$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile 形状/布局约束 (compile-time): `TileDataDst::isRowMajor`.
-- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-- Mode 2: `src1` is expected to provide **32 bytes data per row**.
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
-
-## 示例
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmax.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmax.md
deleted file mode 100644
index b277f734..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmax.md
+++ /dev/null
@@ -1,143 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandmax.md` -->
-
-# pto.trowexpandmax
-
-Standalone reference page for `pto.trowexpandmax`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Row-wise broadcast max with a per-row scalar vector.
-
-## Mechanism
-
-Row-wise broadcast max: take `max(src0, src1)` where `src1` provides one scalar per row. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, s_i)
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (the tile compared element-wise against the per-row scalar).
-- `src1` is the second source tile providing per-row scalar values.
-- `tmp` (optional): temporary tile for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = max(`src0[i,j]`, `src1[i,0]`) (row-wise broadcast max against the per-row scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-
-- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-
-- Mode 2: `src1` is expected to provide **32 bytes data per row**.
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.trowexpandmax` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowexpandadd](./trowexpandadd.md)
-- Next op in family: [pto.trowexpandmin](./trowexpandmin.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmax_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmax_zh.md
deleted file mode 100644
index 7028ac30..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmax_zh.md
+++ /dev/null
@@ -1,81 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandmax_zh.md` -->
-
-# TROWEXPANDMAX
-
-## 指令示意图
-
-![TROWEXPANDMAX tile operation](../figures/isa/TROWEXPANDMAX.svg)
-
-## 简介
-
-行广播最大值：与每行标量向量取最大值。
-
-## 数学语义
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, s_i)
-$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile 形状/布局约束 (compile-time): `TileDataDst::isRowMajor`.
-- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-- Mode 2: `src1` is expected to provide **32 bytes data per row**.
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
-
-## 示例
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmin.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmin.md
deleted file mode 100644
index 6cad22f9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmin.md
+++ /dev/null
@@ -1,143 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandmin.md` -->
-
-# pto.trowexpandmin
-
-Standalone reference page for `pto.trowexpandmin`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Row-wise broadcast min with a per-row scalar vector.
-
-## Mechanism
-
-Row-wise broadcast min: take `min(src0, src1)` where `src1` provides one scalar per row. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, s_i)
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile (the tile compared element-wise against the per-row scalar).
-- `src1` is the second source tile providing per-row scalar values.
-- `tmp` (optional): temporary tile for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = min(`src0[i,j]`, `src1[i,0]`) (row-wise broadcast min against the per-row scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-
-- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-
-- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-
-- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-
-- Mode 2: `src1` is expected to provide **32 bytes data per row**.
-
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.trowexpandmin` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowexpandmax](./trowexpandmax.md)
-- Next op in family: [pto.trowexpandexpdif](./trowexpandexpdif.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmin_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmin_zh.md
deleted file mode 100644
index 726c3a1f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmin_zh.md
+++ /dev/null
@@ -1,81 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandmin_zh.md` -->
-
-# TROWEXPANDMIN
-
-## 指令示意图
-
-![TROWEXPANDMIN tile operation](../figures/isa/TROWEXPANDMIN.svg)
-
-## 简介
-
-行广播最小值：与每行标量向量取最小值。
-
-## 数学语义
-
-Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
-
-For `0 <= i < R` and `0 <= j < C`:
-
-$$
-\mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, s_i)
-$$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
-- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-- Tile 形状/布局约束 (compile-time): `TileDataDst::isRowMajor`.
-- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-- Mode 2: `src1` is expected to provide **32 bytes data per row**.
-- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
-
-## 示例
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmul.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmul.md
deleted file mode 100644
index 2fde7388..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmul.md
+++ /dev/null
@@ -1,157 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandmul.md` -->
-
-# pto.trowexpandmul
-
-Standalone reference page for `pto.trowexpandmul`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Row-wise broadcast multiply: multiply each row of `src0` by a per-row scalar vector `src1`.
-
-## Mechanism
-
-Row-wise broadcast multiply: multiply each row of `src0` by a per-row scalar vector `src1`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,0} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile; its elements are combined with the per-row scalars and the result is written to `dst`.
-- `src1` is the second source tile providing per-row scalar values.
-- `tmp` (optional): temporary tile for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = `src0[i,j]` * `src1[i,0]` (row-wise broadcast multiply of per-row scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- Source and destination shapes, layouts, and element types MUST satisfy the legality rules documented by the family and target profile.
-
-- Programs must not assume implicit broadcasting, reshaping, or valid-region repair unless the operation documents it.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks**:
-    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType` (compile-time).
-    - `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-    - Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-    - Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-    - Mode 2: `src1` is expected to provide **32 bytes data per row**.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TROWEXPANDMUL(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TASSIGN(src0, 0x1000);
-  TASSIGN(dst,  0x2000);
-  TASSIGN(src1, 0x3000);
-  TROWEXPANDMUL(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowexpanddiv](./trowexpanddiv.md)
-- Next op in family: [pto.trowexpandsub](./trowexpandsub.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmul_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmul_zh.md
deleted file mode 100644
index 11fdde6a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandmul_zh.md
+++ /dev/null
@@ -1,127 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandmul_zh.md` -->
-
-# TROWEXPANDMUL
-
-## 指令示意图
-
-![TROWEXPANDMUL tile operation](../figures/isa/TROWEXPANDMUL.svg)
-
-## 简介
-
-行广播乘法：将 `src0` 的每一行乘以一个每行标量向量 `src1`。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,0} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trowexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查**:
-    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`（编译时）。
-    - `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
-    - Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
-    - 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
-    - 模式 2：`src1` 预期提供**每行 32 字节数据**。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TROWEXPANDMUL(dst, src0, src1);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TASSIGN(src0, 0x1000);
-  TASSIGN(dst,  0x2000);
-  TASSIGN(src1, 0x3000);
-  TROWEXPANDMUL(dst, src0, src1);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandsub.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandsub.md
deleted file mode 100644
index 5c551506..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandsub.md
+++ /dev/null
@@ -1,157 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandsub.md` -->
-
-# pto.trowexpandsub
-
-Standalone reference page for `pto.trowexpandsub`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Row-wise broadcast subtract: subtract a per-row scalar vector `src1` from each row of `src0`.
-
-## Mechanism
-
-Row-wise broadcast subtract: subtract a per-row scalar vector `src1` from each row of `src0`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,0} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src0` is the first source tile; its elements are combined with the per-row scalars and the result is written to `dst`.
-- `src1` is the second source tile providing per-row scalar values.
-- `tmp` (optional): temporary tile for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst[i,j]` = `src0[i,j]` - `src1[i,0]` (row-wise broadcast subtract of per-row scalar).
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- Source and destination shapes, layouts, and element types MUST satisfy the legality rules documented by the family and target profile.
-
-- Programs must not assume implicit broadcasting, reshaping, or valid-region repair unless the operation documents it.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks**:
-    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType` (compile-time).
-    - `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
-    - Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
-    - Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
-    - Mode 2: `src1` is expected to provide **32 bytes data per row**.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TROWEXPANDSUB(dst, src0, src1);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TASSIGN(src0, 0x1000);
-  TASSIGN(dst,  0x2000);
-  TASSIGN(src1, 0x3000);
-  TROWEXPANDSUB(dst, src0, src1);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowexpandmul](./trowexpandmul.md)
-- Next op in family: [pto.trowexpandadd](./trowexpandadd.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandsub_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandsub_zh.md
deleted file mode 100644
index f50d9292..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowexpandsub_zh.md
+++ /dev/null
@@ -1,127 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowexpandsub_zh.md` -->
-
-# TROWEXPANDSUB
-
-## 指令示意图
-
-![TROWEXPANDSUB tile operation](../figures/isa/TROWEXPANDSUB.svg)
-
-## 简介
-
-行广播减法：从 `src0` 的每一行中减去一个每行标量向量 `src1`。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,0} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trowexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
-
-template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
-          typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查**:
-    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`（编译时）。
-    - `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
-    - Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
-    - 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
-    - 模式 2：`src1` 预期提供**每行 32 字节数据**。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TROWEXPANDSUB(dst, src0, src1);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, half, 16, 16>;
-  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
-
-  TileT src0, dst;
-  RowVecT src1(16);
-  TASSIGN(src0, 0x1000);
-  TASSIGN(dst,  0x2000);
-  TASSIGN(src1, 0x3000);
-  TROWEXPANDSUB(dst, src0, src1);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmax.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmax.md
deleted file mode 100644
index f71df56e..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmax.md
+++ /dev/null
@@ -1,177 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowmax.md` -->
-
-# pto.trowmax
-
-Standalone reference page for `pto.trowmax`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Reduce each row by taking the maximum across columns.
-
-## Mechanism
-
-Reduce each row by taking the maximum across columns. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \max_{0 \le j < C} \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWMAX(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `tmp` is a temporary tile used for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the row-wise maximum: for each row `i`, `dst[i,0]` = max of all elements in row `i` of `src`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must both be `TileType::Vec`.
-
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `dst` must use one of the following non-fractal layouts:
-  - ND layout (`BLayout::RowMajor`, `SLayout::NoneBox`), or
-  - DN layout with exactly one column (`BLayout::ColMajor`, `SLayout::NoneBox`, `Cols == 1`).
-
-- `dst` and `src` must use the same element type.
-
-- Runtime valid-region checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- The intrinsic signature requires an explicit `tmp` operand.
-
-### A2A3 implementation checks
-
-- Supported element types: `half`, `float`, `int32_t`, `int16_t`.
-
-- The implementation accepts both ND output and DN output with `Cols == 1`.
-
-- Runtime checks follow the shared row-reduce check path:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-- The current implementation path passes `tmp` into the backend call, but this document does not add extra `tmp` shape/layout constraints beyond what is explicitly enforced by the checked implementation.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWMAX(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWMAX(dst, src, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.tcolmax](./tcolmax.md)
-- Next op in family: [pto.trowmin](./trowmin.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmax_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmax_zh.md
deleted file mode 100644
index fbab516b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmax_zh.md
+++ /dev/null
@@ -1,144 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowmax_zh.md` -->
-
-# TROWMAX
-
-## 指令示意图
-
-![TROWMAX tile operation](../figures/isa/TROWMAX.svg)
-
-## 简介
-
-通过取列间最大值来归约每一行。
-
-## 数学语义
-
-设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= i < R`：
-
-$$ \mathrm{dst}_{i,0} = \max_{0 \le j < C} \mathrm{src}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-同步形式：
-
-```text
-%dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
-
-### AS Level 1（SSA）
-
-```text
-%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWMAX(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## 约束
-
-### 通用约束或检查
-
-- `dst` 和 `src` 必须均为 `TileType::Vec`。
-- `src` 必须使用标准 ND 布局：行主且非分形（`BLayout::RowMajor`、`SLayout::NoneBox`）。
-- `dst` 必须使用以下两种非分形布局之一：
-  - ND 布局（`BLayout::RowMajor`、`SLayout::NoneBox`），或
-  - 列数严格为 1 的 DN 布局（`BLayout::ColMajor`、`SLayout::NoneBox`、`Cols == 1`）。
-- `dst` 和 `src` 的元素类型必须一致。
-- 运行时有效区域检查：
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-- 内建接口签名要求显式传入 `tmp` 操作数。
-
-### A2A3 实现检查
-
-- 支持的元素类型：`half`、`float`、`int32_t`、`int16_t`。
-- 实现同时接受 ND 输出和 `Cols == 1` 的 DN 输出。
-- 运行时检查遵循共享的行归约检查路径：
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-- 当前实现路径会将 `tmp` 传入后端调用，但本文档不额外补充 checked implementation 未显式约束的 `tmp` shape/layout 要求。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWMAX(dst, src, tmp);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWMAX(dst, src, tmp);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO 汇编形式
-
-```text
-%dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmin.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmin.md
deleted file mode 100644
index 1036601b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmin.md
+++ /dev/null
@@ -1,177 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowmin.md` -->
-
-# pto.trowmin
-
-Standalone reference page for `pto.trowmin`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Reduce each row by taking the minimum across columns.
-
-## Mechanism
-
-Reduce each row by taking the minimum across columns. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \min_{0 \le j < C} \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWMIN(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `tmp` is a temporary tile used for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the row-wise minimum: for each row `i`, `dst[i,0]` = min of all elements in row `i` of `src`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must both be `TileType::Vec`.
-
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `dst` must use one of the following non-fractal layouts:
-  - ND layout (`BLayout::RowMajor`, `SLayout::NoneBox`), or
-  - DN layout with exactly one column (`BLayout::ColMajor`, `SLayout::NoneBox`, `Cols == 1`).
-
-- `dst` and `src` must use the same element type.
-
-- Runtime valid-region checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-- Supported element types: `half`, `float`, `int32_t`, `int16_t`.
-
-- The implementation accepts both ND output and DN output with `Cols == 1`.
-
-- Runtime checks follow the shared row-reduce check path:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-- The current implementation path passes `tmp` into the backend call, but this document does not add extra `tmp` shape/layout constraints beyond what is explicitly enforced by the checked implementation.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- The intrinsic signature requires an explicit `tmp` operand.
-
-### A2A3 implementation checks
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWMIN(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWMIN(dst, src, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Previous op in family: [pto.trowmax](./trowmax.md)
-- Next op in family: [pto.trowargmax](./trowargmax.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmin_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmin_zh.md
deleted file mode 100644
index 580b010b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowmin_zh.md
+++ /dev/null
@@ -1,144 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowmin_zh.md` -->
-
-# TROWMIN
-
-## Instruction Diagram
-
-![TROWMIN tile operation](../figures/isa/TROWMIN.svg)
-
-## Summary
-
-Reduce each row by taking the minimum across columns.
-
-## Math Semantics
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \min_{0 \le j < C} \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWMIN(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
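-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must both be `TileType::Vec`.
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- `dst` must use one of the following non-fractal layouts:
-  - ND layout (`BLayout::RowMajor`, `SLayout::NoneBox`), or
-  - DN layout with exactly one column (`BLayout::ColMajor`, `SLayout::NoneBox`, `Cols == 1`).
-- `dst` and `src` must use the same element type.
-- Runtime valid-region checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-- The intrinsic signature requires an explicit `tmp` operand.
-
-### A2A3 implementation checks
-
-- Supported element types: `half`, `float`, `int32_t`, `int16_t`.
-- The implementation accepts both ND output and DN output with `Cols == 1`.
-- Runtime checks follow the shared row-reduce check path:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-- The current implementation path passes `tmp` into the backend call, but this document does not add extra `tmp` shape/layout constraints beyond what is explicitly enforced by the checked implementation.
-
-## Examples
-
-### Auto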
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWMIN(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWMIN(dst, src, tmp);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowsum.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowsum.md
deleted file mode 100644
index ac7ad540..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowsum.md
+++ /dev/null
@@ -1,176 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowsum.md` -->
-
-# pto.trowsum
-
-Standalone reference page for `pto.trowsum`. This page belongs to the [Reduce And Expand](../../reduce-and-expand.md) family in the PTO ISA manual.
-
-## Summary
-
-Reduce each row by summing across columns.
-
-## Mechanism
-
-Reduce each row by summing across columns. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \sum_{j=0}^{C-1} \mathrm{src}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trowsum %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `tmp` is a temporary tile used for intermediate storage.
-- `dst` names the destination tile. The operation iterates over dst's valid region.
-
-## Expected Outputs
-
-`dst` holds the row-wise sum: for each row `i`, `dst[i,0]` = sum of all elements in row `i` of `src`.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must both be `TileType::Vec`.
-
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-
-- `dst` must use one of the following non-fractal layouts:
-  - ND layout (`BLayout::RowMajor`, `SLayout::NoneBox`), or
-  - DN layout with exactly one column (`BLayout::ColMajor`, `SLayout::NoneBox`, `Cols == 1`).
-
-- `dst` and `src` must use the same element type.
-
-- Runtime valid-region checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-- Supported element types: `half`, `float`, `int32_t`, `int16_t`.
-
-- The implementation accepts both ND output and DN output with `Cols == 1`; it is not limited to DN output.
-
-- Runtime checks follow the shared row-reduce check path:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-
-- The current implementation path passes `tmp` into the backend call, but this document does not add extra `tmp` shape/layout constraints beyond what is explicitly enforced by the checked implementation.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- The intrinsic signature requires an explicit `tmp` operand.
-
-### A2A3 implementation checks
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWSUM(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWSUM(dst, src, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowsum %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduce And Expand](../../reduce-and-expand.md)
-- Next op in family: [pto.tcolsum](./tcolsum.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowsum_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowsum_zh.md
deleted file mode 100644
index 3d324bd4..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/reduce-and-expand/trowsum_zh.md
+++ /dev/null
@@ -1,144 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/reduce-and-expand/trowsum_zh.md` -->
-
-# TROWSUM
-
-## Instruction Diagram
-
-![TROWSUM tile operation](../figures/isa/TROWSUM.svg)
-
-## Summary
-
-Reduce each row by summing across columns.
-
-## Math Semantics
-
-Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
-
-$$ \mathrm{dst}_{i,0} = \sum_{j=0}^{C-1} \mathrm{src}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = trowsum %src : !pto.tile<...> -> !pto.tile<...>
-```
-
-Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
-```
-
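-## Constraints
-
-### General constraints / checks
-
-- `dst` and `src` must both be `TileType::Vec`.
-- `src` must use standard ND layout: row-major and non-fractal (`BLayout::RowMajor`, `SLayout::NoneBox`).
-- `dst` must use one of the following non-fractal layouts:
-  - ND layout (`BLayout::RowMajor`, `SLayout::NoneBox`), or
-  - DN layout with exactly one column (`BLayout::ColMajor`, `SLayout::NoneBox`, `Cols == 1`).
-- `dst` and `src` must use the same element type.
-- Runtime valid-region checks:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-- The intrinsic signature requires an explicit `tmp` operand.
-
-### A2A3 implementation checks
-
-- Supported element types: `half`, `float`, `int32_t`, `int16_t`.
-- The implementation accepts both ND output and DN output with `Cols == 1`; it is not limited to DN output.
-- Runtime checks follow the shared row-reduce check path:
-  - `src.GetValidRow() != 0`
-  - `src.GetValidCol() != 0`
-  - `src.GetValidRow() == dst.GetValidRow()`
-- The current implementation path passes `tmp` into the backend call, but this document does not add extra `tmp` shape/layout constraints beyond what is explicitly enforced by the checked implementation.
-
-## Examples
-
-### Auto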
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TROWSUM(dst, src, tmp);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
-  using TmpT = Tile<TileType::Vec, float, 16, 16>;
-  SrcT src;
-  DstT dst;
-  TmpT tmp;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TASSIGN(tmp, 0x3000);
-  TROWSUM(dst, src, tmp);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trowsum %src : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tassign.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tassign.md
deleted file mode 100644
index fb8b5b56..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tassign.md
+++ /dev/null
@@ -1,249 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tassign.md` -->
-
-# pto.tassign
-
-Standalone reference page for `pto.tassign`. This page belongs to the [Sync And Config](../../sync-and-config.md) family in the PTO ISA manual.
-
-## Summary
-
-Bind a Tile object to an implementation-defined on-chip address (manual placement).
-
-## Mechanism
-
-Bind a Tile object to an implementation-defined on-chip address (manual placement). It is part of the tile synchronization or configuration shell, so the visible effect is ordering or state setup rather than arithmetic payload transformation.
-
-Not applicable.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-`TASSIGN` is typically introduced by bufferization/lowering when mapping SSA tiles to physical storage.
-
-Synchronous form:
-
-```text
-tassign %tile, %addr : !pto.tile<...>, index
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tassign %tile, %addr : !pto.tile<...>, dtype
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tassign %tile, %addr : !pto.tile<...>, dtype
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-### Form 1: Runtime address
-
-```cpp
-template <typename T, typename AddrType>
-PTO_INST void TASSIGN(T& obj, AddrType addr);
-```
-
-Binds `obj` to the on-chip address `addr`. No compile-time bounds checking is
-performed (the address value is not available at compile time).
-
-### Form 2: Compile-time address (with static bounds check)
-
-```cpp
-template <std::size_t Addr, typename T>
-PTO_INST void TASSIGN(T& obj);
-```
-
-Binds `obj` to the on-chip address `Addr`. Because `Addr` is a non-type
-template parameter, the compiler performs the following **compile-time** checks
-via `static_assert`:
-
-| Check | Condition | Assertion ID | Error message |
-|-------|-----------|--------------|---------------|
-| Memory space exists | `capacity > 0` | SA-0351 | Memory space is not available on this architecture. |
-| Tile fits in memory | `tile_size <= capacity` | SA-0352 | Tile storage size exceeds memory space capacity. |
-| Address in bounds | `Addr + tile_size <= capacity` | SA-0353 | addr + tile_size exceeds memory space capacity (out of bounds). |
-| Address aligned | `Addr % alignment == 0` | SA-0354 | addr is not properly aligned for the target memory space. |
-
-See `docs/coding/debug.md` (fix recipe `FIX-A12`) for suggested remedies.
-
-The memory space, capacity, and alignment are determined automatically from the
-Tile's `TileType` (i.e. `Loc` template parameter):
-
-| TileType | Memory | Capacity (A2A3) | Capacity (A5) | Capacity (Kirin9030) | Capacity (KirinX90) | Alignment |
-|----------|--------|-----------------|---------------|----------------------|---------------------|-----------|
-| Vec | UB | 192 KB | 256 KB | 128 KB | 128 KB | 32 B |
-| Mat | L1 | 512 KB | 512 KB | 512 KB | 1024 KB | 32 B |
-| Left | L0A | 64 KB | 64 KB | 32 KB | 64 KB | 32 B |
-| Right | L0B | 64 KB | 64 KB | 32 KB | 64 KB | 32 B |
-| Acc | L0C | 128 KB | 256 KB | 64 KB | 128 KB | 32 B |
-| Bias | Bias | 1 KB | 4 KB | 1 KB | 1 KB | 32 B |
-| Scaling | FBuffer | 2 KB | 4 KB | 7 KB | 6 KB | 32 B |
-| ScaleLeft | L0A | N/A | 4 KB | N/A | N/A | 32 B |
-| ScaleRight | L0B | N/A | 4 KB | N/A | N/A | 32 B |
-
-Capacities can be overridden at build time via `-D` flags (e.g.
-`-DPTO_UBUF_SIZE_BYTES=262144`). See `include/pto/common/buffer_limits.hpp`.
-
-**Note:** This overload is only available for `Tile` and `ConvTile` types. For
-`GlobalTensor`, use `TASSIGN(obj, pointer)` (Form 1).
-
-## Inputs
-
-- `tile` is the tile to bind.
-- `addr` is the on-chip address to bind the tile to.
-
-## Expected Outputs
-
-This form is defined primarily by its ordering or configuration effect. It does not introduce a new payload tile beyond any explicit state object named by the syntax.
-
-## Side Effects
-
-This operation may establish a synchronization edge, bind or configure architectural tile state, or update implementation-defined configuration that later tile instructions consume.
-
-## Constraints
-
-- Configuration and synchronization state MUST only be used where later instructions document the dependency.
-
-- Programs must not treat implementation-defined manual placement as a portable substitute for legal PTO behavior.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks**:
-    - If `obj` is a Tile:
-        - In manual mode (when `__PTO_AUTO__` is not defined), `addr` must be an integral type and is reinterpreted as the tile's storage address.
-        - In auto mode (when `__PTO_AUTO__` is defined), `TASSIGN(tile, addr)` is a no-op.
-    - If `obj` is a `GlobalTensor`:
-        - `addr` must be a pointer type.
-        - The pointed-to element type must match `GlobalTensor::DType`.
-
-## Examples
-
-### Runtime address (no compile-time check)
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_runtime() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c, 0x3000);
-  TADD(c, a, b);
-}
-```
-
-### Compile-time address (with static bounds check)
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_checked() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c;
-
-  TASSIGN<0x0000>(a);   // OK: 0x0000 + 1024 <= 192KB
-  TASSIGN<0x0400>(b);   // OK: 0x0400 + 1024 <= 192KB
-  TASSIGN<0x0800>(c);   // OK: 0x0800 + 1024 <= 192KB
-  TADD(c, a, b);
-}
-```
-
-The following triggers a compile error:
-
-```cpp
-void example_oob() {
-  // Tile<Vec, float, 256, 256> occupies 256*256*4 = 256KB
-  using BigTile = Tile<TileType::Vec, float, 256, 256>;
-  BigTile t;
-
-  // static_assert fires: tile_size (256KB) > UB capacity (192KB on A2A3)
-  TASSIGN<0x0>(t);
-}
-```
-
-```cpp
-void example_oob_addr() {
-  using TileT = Tile<TileType::Vec, float, 128, 128>;  // 64KB
-  TileT t;
-
-  // 0x20000 (128KB) + 64KB = 192KB would fit exactly, but 0x20001 + 64KB > 192KB
-  // (A2A3 UB capacity), so static_assert fires; 0x20001 is also not 32 B aligned.
-  TASSIGN<0x20001>(t);
-}
-```
-
-### Ping-pong L0 buffer allocation
-
-```cpp
-void example_pingpong() {
-  using L0ATile = TileLeft<half, 64, 128>;   // L0A tile
-  using L0BTile = TileRight<half, 128, 64>;  // L0B tile
-
-  L0ATile a0, a1;
-  L0BTile b0, b1;
-
-  TASSIGN<0x0000>(a0);   // L0A ping
-  TASSIGN<0x8000>(a1);   // L0A pong
-  TASSIGN<0x0000>(b0);   // L0B ping  (separate physical memory from L0A)
-  TASSIGN<0x8000>(b1);   // L0B pong
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.tassign %tile, %addr : !pto.tile<...>, dtype
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.tassign %tile, %addr : !pto.tile<...>, dtype
-```
-
-### PTO Assembly Form
-
-```text
-tassign %tile, %addr : !pto.tile<...>, index
-# AS Level 2 (DPS)
-pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Sync And Config](../../sync-and-config.md)
-- Previous op in family: [pto.tsync](./tsync.md)
-- Next op in family: [pto.tsethf32mode](./tsethf32mode.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tassign_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tassign_zh.md
deleted file mode 100644
index 6bd2345a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tassign_zh.md
+++ /dev/null
@@ -1,194 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tassign_zh.md` -->
-
-# TASSIGN
-
-## Instruction Diagram
-
-![TASSIGN tile operation](../figures/isa/TASSIGN.svg)
-
-## Summary
-
-Bind a Tile object to an implementation-defined on-chip address (manual placement).
-
-## Math Semantics
-
-Not applicable.
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-`TASSIGN` is typically introduced by bufferization/lowering when mapping SSA tiles to physical storage.
-
-Synchronous form:
-
-```text
-tassign %tile, %addr : !pto.tile<...>, index
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tassign %tile, %addr : !pto.tile<...>, dtype
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tassign %tile, %addr : !pto.tile<...>, dtype
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`.
-
-### Form 1: Runtime address
-
-```cpp
-template <typename T, typename AddrType>
-PTO_INST void TASSIGN(T& obj, AddrType addr);
-```
-
-Binds `obj` to the on-chip address `addr`. No compile-time bounds checking is
-performed (the address value is not available at compile time).
-
-### Form 2: Compile-time address (with static bounds check)
-
-```cpp
-template <std::size_t Addr, typename T>
-PTO_INST void TASSIGN(T& obj);
-```
-
-Binds `obj` to the on-chip address `Addr`. Because `Addr` is a non-type
-template parameter, the compiler performs the following **compile-time** checks
-via `static_assert`:
-
-| Check | Condition | Assertion ID | Error message |
-|-------|-----------|--------------|---------------|
-| Memory space exists | `capacity > 0` | SA-0351 | Memory space is not available on this architecture. |
-| Tile fits in memory | `tile_size <= capacity` | SA-0352 | Tile storage size exceeds memory space capacity. |
-| Address in bounds | `Addr + tile_size <= capacity` | SA-0353 | addr + tile_size exceeds memory space capacity (out of bounds). |
-| Address aligned | `Addr % alignment == 0` | SA-0354 | addr is not properly aligned for the target memory space. |
-
-See `docs/coding/debug.md` (fix recipe `FIX-A12`) for suggested remedies.
-
-The memory space, capacity, and alignment are determined automatically from the
-Tile's `TileType` (i.e. `Loc` template parameter):
-
-| TileType | Memory | Capacity (A2A3) | Capacity (A5) | Capacity (Kirin9030) | Capacity (KirinX90) | Alignment |
-|----------|--------|-----------------|---------------|----------------------|---------------------|-----------|
-| Vec | UB | 192 KB | 256 KB | 128 KB | 128 KB | 32 B |
-| Mat | L1 | 512 KB | 512 KB | 512 KB | 1024 KB | 32 B |
-| Left | L0A | 64 KB | 64 KB | 32 KB | 64 KB | 32 B |
-| Right | L0B | 64 KB | 64 KB | 32 KB | 64 KB | 32 B |
-| Acc | L0C | 128 KB | 256 KB | 64 KB | 128 KB | 32 B |
-| Bias | Bias | 1 KB | 4 KB | 1 KB | 1 KB | 32 B |
-| Scaling | FBuffer | 2 KB | 4 KB | 7 KB | 6 KB | 32 B |
-| ScaleLeft | L0A | N/A | 4 KB | N/A | N/A | 32 B |
-| ScaleRight | L0B | N/A | 4 KB | N/A | N/A | 32 B |
-
-Capacities can be overridden at build time via `-D` flags (e.g.
-`-DPTO_UBUF_SIZE_BYTES=262144`). See `include/pto/common/buffer_limits.hpp`.
-
-**Note:** This overload is only available for `Tile` and `ConvTile` types. For
-`GlobalTensor`, use `TASSIGN(obj, pointer)` (Form 1).
-
-## Constraints
-
-- **Implementation checks**:
-    - If `obj` is a Tile:
-        - In manual mode (when `__PTO_AUTO__` is not defined), `addr` must be an integral type and is reinterpreted as the tile's storage address.
-        - In auto mode (when `__PTO_AUTO__` is defined), `TASSIGN(tile, addr)` is a no-op.
-    - If `obj` is a `GlobalTensor`:
-        - `addr` must be a pointer type.
-        - The pointed-to element type must match `GlobalTensor::DType`.
-
-## Examples
-
-### Runtime address (no compile-time check)
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_runtime() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c;
-  TASSIGN(a, 0x1000);
-  TASSIGN(b, 0x2000);
-  TASSIGN(c, 0x3000);
-  TADD(c, a, b);
-}
-```
-
-### Compile-time address (with static bounds check)
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_checked() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c;
-
-  TASSIGN<0x0000>(a);   // OK: 0x0000 + 1024 <= 192KB
-  TASSIGN<0x0400>(b);   // OK: 0x0400 + 1024 <= 192KB
-  TASSIGN<0x0800>(c);   // OK: 0x0800 + 1024 <= 192KB
-  TADD(c, a, b);
-}
-```
-
-The following triggers a compile error:
-
-```cpp
-void example_oob() {
-  // Tile<Vec, float, 256, 256> occupies 256*256*4 = 256KB
-  using BigTile = Tile<TileType::Vec, float, 256, 256>;
-  BigTile t;
-
-  // static_assert fires: tile_size (256KB) > UB capacity (192KB on A2A3)
-  TASSIGN<0x0>(t);
-}
-```
-
-```cpp
-void example_oob_addr() {
-  using TileT = Tile<TileType::Vec, float, 128, 128>;  // 64KB
-  TileT t;
-
-  // 0x20000 (128KB) + 64KB = 192KB would fit exactly, but 0x20001 + 64KB > 192KB
-  // (A2A3 UB capacity), so static_assert fires; 0x20001 is also not 32 B aligned.
-  TASSIGN<0x20001>(t);
-}
-```
-
-### Ping-pong L0 buffer allocation
-
-```cpp
-void example_pingpong() {
-  using L0ATile = TileLeft<half, 64, 128>;   // L0A tile
-  using L0BTile = TileRight<half, 128, 64>;  // L0B tile
-
-  L0ATile a0, a1;
-  L0BTile b0, b1;
-
-  TASSIGN<0x0000>(a0);   // L0A ping
-  TASSIGN<0x8000>(a1);   // L0A pong
-  TASSIGN<0x0000>(b0);   // L0B ping  (separate physical memory from L0A)
-  TASSIGN<0x8000>(b1);   // L0B pong
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tget-scale-addr.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tget-scale-addr.md
deleted file mode 100644
index b78d3191..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tget-scale-addr.md
+++ /dev/null
@@ -1,122 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tget-scale-addr.md` -->
-
-# pto.tget_scale_addr
-
-Standalone reference page for `pto.tget_scale_addr`. This page belongs to the [Sync And Config](../../sync-and-config.md) family in the PTO ISA manual.
-
-## Summary
-
-Bind the output tile's on-chip address to a scaled version of the input tile's address.
-
-## Mechanism
-
-Bind the output tile's on-chip address to a scaled version of the input tile's address.
-
-The scaling factor is defined by a right-shift amount `SHIFT_MX_ADDR` in `include/pto/npu/a5/utils.hpp`. It is part of the tile synchronization or configuration shell, so the visible effect is ordering or state setup rather than arithmetic payload transformation.
-
-Address(`dst`) = Address(`src`) >> `SHIFT_MX_ADDR`
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tget_scale_addr %src : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tget_scale_addr %src : (!pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tget_scale_addr ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tget_scale_addr %src : (!pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tget_scale_addr ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TGET_SCALE_ADDR(TileDataDst &dst, TileDataSrc &src, WaitEvents&... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `dst` names the destination tile that holds the scaled address.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-This operation may establish a synchronization edge, bind or configure architectural tile state, or update implementation-defined configuration that later tile instructions consume.
-
-## Constraints
-
-Enforced by `TGET_SCALE_ADDR_IMPL`:
-
-- **Both `src` and `dst` must be Tile instances**
-
-- **Currently works only in auto mode** (manual mode support is planned)
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tget_scale_addr` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-template <typename T, int ARows, int ACols, int BRows, int BCols>
-void example() {
-    using LeftTile = TileLeft<T, ARows, ACols>;
-    using RightTile = TileRight<T, BRows, BCols>;
-
-    using LeftScaleTile = TileLeftScale<T, ARows, ACols>;
-    using RightScaleTile = TileRightScale<T, BRows, BCols>;
-
-    LeftTile aTile;
-    RightTile bTile;
-    LeftScaleTile aScaleTile;
-    RightScaleTile bScaleTile;
-
-    TGET_SCALE_ADDR(aScaleTile, aTile);
-    TGET_SCALE_ADDR(bScaleTile, bTile);
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Sync And Config](../../sync-and-config.md)
-- Previous op in family: [pto.tsubview](./tsubview.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tget-scale-addr_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tget-scale-addr_zh.md
deleted file mode 100644
index 0d2ff81f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tget-scale-addr_zh.md
+++ /dev/null
@@ -1,83 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tget-scale-addr_zh.md` -->
-
-# TGET_SCALE_ADDR
-
-## Tile Operation Diagram
-
-![TGET_SCALE_ADDR tile operation](../figures/isa/TGET_SCALE_ADDR.svg)
-
-## Introduction
-
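-Scale the on-chip address value of the input Tile and bind the result as the on-chip address of the output Tile.
-
-The scaling factor is defined by the right-shift amount `SHIFT_MX_ADDR` in `include/pto/npu/a5/utils.hpp`.
-
-## Math Semantics
-
-Address(`dst`) = Address(`src`) >> `SHIFT_MX_ADDR`
-
-## Assembly Syntax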
-
-PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### IR Level 1 (SSA)
-
-TODO
-
-### IR Level 2 (DPS)
-
-TODO
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TGET_SCALE_ADDR(TileDataDst &dst, TileDataSrc &src, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Both the input and the output must be Tile objects**
-- **Currently works only in auto mode** (manual mode support is planned)
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-// The pto namespace provides the tile types and intrinsics used below.
-using namespace pto;
-
-template <typename T, int ARows, int ACols, int BRows, int BCols>
-void example() {
-    using LeftTile = TileLeft<T, ARows, ACols>;
-    using RightTile = TileRight<T, BRows, BCols>;
-
-    using LeftScaleTile = TileLeftScale<T, ARows, ACols>;
-    using RightScaleTile = TileRightScale<T, BRows, BCols>;
-
-    LeftTile aTile;
-    RightTile bTile;
-    LeftScaleTile aScaleTile;
-    RightScaleTile bScaleTile;
-
-    TGET_SCALE_ADDR(aScaleTile, aTile);
-    TGET_SCALE_ADDR(bScaleTile, bTile);
-}
-```
-
-## asm form examples
-
-### Auto Mode
-
-TODO
-
-### Manual Mode
-
-TODO
-
-### PTO Assembly Form
-
-TODO
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-padding.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-padding.md
deleted file mode 100644
index 6ded6065..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-padding.md
+++ /dev/null
@@ -1,139 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tset-img2col-padding.md` -->
-
-# pto.tset_img2col_padding
-
-Standalone reference page for `pto.tset_img2col_padding`. This page belongs to the [Sync And Config](../../sync-and-config.md) family in the PTO ISA manual.
-
-## Summary
-
-Set IMG2COL padding metadata from an IMG2COL configuration tile.
-
-## Mechanism
-
-Set IMG2COL padding metadata from an IMG2COL configuration tile (implementation-defined). It is part of the tile synchronization or configuration shell, so the visible effect is ordering or state setup rather than arithmetic payload transformation.
-
-No direct tensor arithmetic is produced by this instruction. It updates IMG2COL padding control state consumed by subsequent data-movement operations.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Schematic form:
-
-```text
-tset_img2col_padding %cfg
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tset_img2col_padding ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tset_img2col_padding %cfg
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tset_img2col_padding ins(%cfg) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename ConvTileData, typename... WaitEvents>
-PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... events);
-
-template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
-PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... events);
-```
-
-For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
-
-## Inputs
-
-- `src` is the ConvTileData (IMG2COL configuration tile) containing padding metadata.
-
-## Expected Outputs
-
-This form is defined primarily by its ordering or configuration effect. It does not introduce a new payload tile beyond any explicit state object named by the syntax.
-
-## Side Effects
-
-This operation may establish a synchronization edge, bind or configure architectural tile state, or update implementation-defined configuration that later tile instructions consume.
-
-## Constraints
-
-- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
-
-- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
-
-- The exact padding fields updated by this instruction are implementation-defined.
-
-- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tset_img2col_padding` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_set_img2col_padding(Img2colTileConfig<uint64_t>& cfg) {
-  TSET_IMG2COL_PADDING(cfg);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
-```
-
-### PTO Assembly Form
-
-```text
-pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
-# AS Level 2 (DPS)
-pto.tset_img2col_padding ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Sync And Config](../../sync-and-config.md)
-- Previous op in family: [pto.tset_img2col_rpt](./tset-img2col-rpt.md)
-- Next op in family: [pto.tsubview](./tsubview.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-padding_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-padding_zh.md
deleted file mode 100644
index eca5f494..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-padding_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# pto.tset_img2col_padding
-
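-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tset-img2col-padding.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Existing Chinese instruction reference](../../../TSET_IMG2COL_PADDING_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference page.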
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt.md
deleted file mode 100644
index 9497788a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt.md
+++ /dev/null
@@ -1,139 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tset-img2col-rpt.md` -->
-
-# pto.tset_img2col_rpt
-
-Standalone reference page for `pto.tset_img2col_rpt`. This page belongs to the [Sync And Config](../../sync-and-config.md) family in the PTO ISA manual.
-
-## Summary
-
-Set IMG2COL repeat metadata from an IMG2COL configuration tile.
-
-## Mechanism
-
-Set IMG2COL repeat metadata from an IMG2COL configuration tile (implementation-defined). It is part of the tile synchronization or configuration shell, so the visible effect is ordering or state setup rather than arithmetic payload transformation.
-
-No direct tensor arithmetic is produced by this instruction. It updates IMG2COL control state used by subsequent data-movement operations.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Schematic form:
-
-```text
-tset_img2col_rpt %cfg
-```
-
-### AS Level 1 (SSA)
-
-```text
-pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tset_img2col_rpt ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tset_img2col_rpt %cfg
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tset_img2col_rpt ins(%cfg) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename ConvTileData, typename... WaitEvents>
-PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events);
-
-template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
-PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events);
-```
-
-For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
-
-## Inputs
-
-- `src` is the ConvTileData (IMG2COL configuration tile) containing repeat metadata.
-
-## Expected Outputs
-
-This form is defined primarily by its ordering or configuration effect. It does not introduce a new payload tile beyond any explicit state object named by the syntax.
-
-## Side Effects
-
-This operation may establish a synchronization edge, bind or configure architectural tile state, or update implementation-defined configuration that later tile instructions consume.
-
-## Constraints
-
-- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
-
-- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
-
-- The exact register/metadata fields updated by this instruction are implementation-defined.
-
-- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tset_img2col_rpt` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_set_img2col_rpt(Img2colTileConfig<uint64_t>& cfg) {
-  TSET_IMG2COL_RPT(cfg);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
-```
-
-### PTO Assembly Form
-
-```text
-pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
-# AS Level 2 (DPS)
-pto.tset_img2col_rpt ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Sync And Config](../../sync-and-config.md)
-- Previous op in family: [pto.tsetfmatrix](./tsetfmatrix.md)
-- Next op in family: [pto.tset_img2col_padding](./tset-img2col-padding.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt_zh.md
deleted file mode 100644
index 45f69e41..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tset-img2col-rpt_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# pto.tset_img2col_rpt
-
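-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tset-img2col-rpt.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Existing Chinese instruction reference](../../../TSET_IMG2COL_RPT_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference page.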
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsetfmatrix.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsetfmatrix.md
deleted file mode 100644
index 144b4b1a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsetfmatrix.md
+++ /dev/null
@@ -1,115 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tsetfmatrix.md` -->
-
-# pto.tsetfmatrix
-
-Standalone reference page for `pto.tsetfmatrix`. This page belongs to the [Sync And Config](../../sync-and-config.md) family in the PTO ISA manual.
-
-## Summary
-
-Set FMATRIX register(s) for IMG2COL-like ops.
-
-## Mechanism
-
-Set the FMATRIX register(s) used by IMG2COL-like operations from an `Img2colTileConfig` (target/implementation-defined). It is part of the tile synchronization or configuration shell, so the visible effect is ordering or state setup rather than arithmetic payload transformation.
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-### AS Level 1 (SSA)
-
-```text
-pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsetfmatrix ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsetfmatrix ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
-PTO_INST RecordEvent TSETFMATRIX(ConvTileData &src, WaitEvents&... events);
-```
-
-## Inputs
-
-- `src` is the ConvTileData (IMG2COL configuration tile).
-- `FmatrixMode` (optional): FMATRIX register to target.
-
-## Expected Outputs
-
-This form is defined primarily by its ordering or configuration effect. It does not introduce a new payload tile beyond any explicit state object named by the syntax.
-
-## Side Effects
-
-This operation may establish a synchronization edge, bind or configure architectural tile state, or update implementation-defined configuration that later tile instructions consume.
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tsetfmatrix` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
-```
-
-### PTO Assembly Form
-
-```text
-pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
-# AS Level 2 (DPS)
-pto.tsetfmatrix ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Sync And Config](../../sync-and-config.md)
-- Previous op in family: [pto.tsettf32mode](./tsettf32mode.md)
-- Next op in family: [pto.tset_img2col_rpt](./tset-img2col-rpt.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsetfmatrix_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsetfmatrix_zh.md
deleted file mode 100644
index 40444dc1..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsetfmatrix_zh.md
+++ /dev/null
@@ -1,60 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tsetfmatrix_zh.md` -->
-
-# TSETFMATRIX
-
-## Instruction Diagram
-
-![TSETFMATRIX tile operation](../figures/isa/TSETFMATRIX.svg)
-
-## Introduction
-
-Set FMATRIX register(s) for IMG2COL-like operations.
-
-## Math Semantics
-
-Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
-
-## Assembly Syntax
-
-PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### AS Level 1 (SSA)
-
-```text
-pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsetfmatrix ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsetfmatrix ins(%cfg : !pto.fmatrix_config) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
-PTO_INST RecordEvent TSETFMATRIX(ConvTileData &src, WaitEvents&... events);
-```
-
-## Constraints
-
-Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
-
-## Examples
-
-See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsethf32mode.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsethf32mode.md
deleted file mode 100644
index 2f42669f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsethf32mode.md
+++ /dev/null
@@ -1,97 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tsethf32mode.md` -->
-
-# pto.tsethf32mode
-
-Standalone reference page for `pto.tsethf32mode`. This page belongs to the [Sync And Config](../../sync-and-config.md) family in the PTO ISA manual.
-
-## Summary
-
-Configure HF32 transform mode (implementation-defined).
-
-## Mechanism
-
-Configure HF32 transform mode (implementation-defined).
-
-This instruction controls backend-specific HF32 transformation behavior used by supported compute paths. It is part of the tile synchronization or configuration shell, so the visible effect is ordering or state setup rather than arithmetic payload transformation.
-
-No direct tensor arithmetic is produced by this instruction. It updates target mode state used by subsequent instructions.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Schematic form:
-
-```text
-tsethf32mode {enable = true, mode = ...}
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tsethf32mode {enable = true, mode = ...}
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsethf32mode ins({enable = true, mode = ...}) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <bool isEnable, RoundMode hf32TransMode = RoundMode::CAST_ROUND, typename... WaitEvents>
-PTO_INST RecordEvent TSETHF32MODE(WaitEvents &... events);
-```
-
-## Inputs
-
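-- `isEnable` (bool template parameter; `enable` in the schematic form): enables or disables the HF32 transform mode.
-- `hf32TransMode` (RoundMode template parameter; `mode` in the schematic form): specifies the HF32 rounding mode.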
-
-## Expected Outputs
-
-This form is defined primarily by its ordering or configuration effect. It does not introduce a new payload tile beyond any explicit state object named by the syntax.
-
-## Side Effects
-
-This operation may establish a synchronization edge, bind or configure architectural tile state, or update implementation-defined configuration that later tile instructions consume.
-
-## Constraints
-
-- Available only when the corresponding backend capability macro is enabled.
-
-- Exact mode values and hardware behavior are target-defined.
-
-- This instruction has control-state side effects and should be ordered appropriately relative to dependent compute instructions.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tsethf32mode` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void example_enable_hf32() {
-  TSETHF32MODE<true, RoundMode::CAST_ROUND>();
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Sync And Config](../../sync-and-config.md)
-- Previous op in family: [pto.tassign](./tassign.md)
-- Next op in family: [pto.tsettf32mode](./tsettf32mode.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsethf32mode_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsethf32mode_zh.md
deleted file mode 100644
index d77b1b38..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsethf32mode_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# pto.tsethf32mode
-
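-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tsethf32mode.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Existing Chinese instruction reference](../../../TSETHF32MODE_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference page.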
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsettf32mode.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsettf32mode.md
deleted file mode 100644
index 8adcc8d6..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsettf32mode.md
+++ /dev/null
@@ -1,97 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tsettf32mode.md` -->
-
-# pto.tsettf32mode
-
-Standalone reference page for `pto.tsettf32mode`. This page belongs to the [Sync And Config](../../sync-and-config.md) family in the PTO ISA manual.
-
-## Summary
-
-Configure TF32 transform mode (implementation-defined).
-
-## Mechanism
-
-Configure TF32 transform mode (implementation-defined).
-
-This instruction controls backend-specific TF32 transformation behavior used by supported compute paths. It is part of the tile synchronization or configuration shell, so the visible effect is ordering or state setup rather than arithmetic payload transformation.
-
-No direct tensor arithmetic is produced by this instruction. It updates target mode state used by subsequent instructions.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Schematic form:
-
-```text
-tsettf32mode {enable = true, mode = ...}
-```
-
-### IR Level 1 (SSA)
-
-```text
-pto.tsettf32mode {enable = true, mode = ...}
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsettf32mode ins({enable = true, mode = ...}) outs()
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <bool isEnable, RoundMode tf32TransMode = RoundMode::CAST_ROUND, typename... WaitEvents>
-PTO_INST RecordEvent TSETTF32MODE(WaitEvents &... events);
-```
-
-## Inputs
-
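-- `isEnable` (bool template parameter; `enable` in the schematic form): enables or disables the TF32 transform mode.
-- `tf32TransMode` (RoundMode template parameter; `mode` in the schematic form): specifies the TF32 rounding mode.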
-
-## Expected Outputs
-
-This form is defined primarily by its ordering or configuration effect. It does not introduce a new payload tile beyond any explicit state object named by the syntax.
-
-## Side Effects
-
-This operation may establish a synchronization edge, bind or configure architectural tile state, or update implementation-defined configuration that later tile instructions consume.
-
-## Constraints
-
-- Available only when the corresponding backend capability macro is enabled.
-
-- Exact mode values and hardware behavior are target-defined.
-
-- This instruction has control-state side effects and should be ordered appropriately relative to dependent compute instructions.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tsettf32mode` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void example_enable_tf32() {
-  TSETTF32MODE<true, RoundMode::CAST_ROUND>();
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Sync And Config](../../sync-and-config.md)
-- Previous op in family: [pto.tsethf32mode](./tsethf32mode.md)
-- Next op in family: [pto.tsetfmatrix](./tsetfmatrix.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsettf32mode_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsettf32mode_zh.md
deleted file mode 100644
index f2a35407..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsettf32mode_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# pto.tsettf32mode
-
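-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tsettf32mode.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Existing Chinese instruction reference](../../../TSETTF32MODE_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference page.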
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsubview.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsubview.md
deleted file mode 100644
index b6657104..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsubview.md
+++ /dev/null
@@ -1,139 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tsubview.md` -->
-
-# pto.tsubview
-
-Standalone reference page for `pto.tsubview`. This page belongs to the [Sync And Config](../../sync-and-config.md) family in the PTO ISA manual.
-
-## Summary
-
-Reinterpret a tile as a subtile of another tile.
-
-## Mechanism
-
-Reinterpret a tile as a subtile of another tile. It is part of the tile synchronization or configuration shell, so the visible effect is ordering or state setup rather than arithmetic payload transformation.
-
-- `rowIdx`: in the valid region of `src`, the starting row index of the `dst` subtile.
-- `colIdx`: in the valid region of `src`, the starting column index of the `dst` subtile.
-
-For each element `(i, j)` in the valid region of `dst`:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{rowIdx} + i,\mathrm{colIdx} + j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tsubview %src, %row_idx, %col_idx : !pto.tile<...>, i16, i16
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsubview %src, %row_idx, %col_idx : (!pto.tile<...>, i16, i16) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsubview ins(%src, %row_idx, %col_idx : !pto.tile_buf<...>, i16, i16) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tsubview %src, %row_idx, %col_idx : (!pto.tile<...>, i16, i16) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsubview ins(%src, %row_idx, %col_idx : !pto.tile_buf<...>, i16, i16) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSUBVIEW(TileDataDst &dst, TileDataSrc &src, uint16_t rowIdx, uint16_t colIdx, WaitEvents&... events);
-```
-
-## Inputs
-
-- `src` provides the source tile.
-- `rowIdx` and `colIdx` are zero-based offsets into the valid region of `src` for the top-left corner of `dst`.
-- `dst` names the destination tile that views a sub-rectangle of `src`.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-This operation may establish a synchronization edge, bind or configure architectural tile state, or update implementation-defined configuration that later tile instructions consume.
-
-## Constraints
-
-Enforced by `TSUBVIEW_IMPL`:
-
-- **Tile type must match**: `TileDataSrc::Loc == TileDataDst::Loc`.
-
-- **Both tiles must have the same static capacity**: `TileDataSrc::Rows == TileDataDst::Rows` and `TileDataSrc::Cols == TileDataDst::Cols`.
-
-- **Both tiles must have the same BLayout**: `TileDataSrc::BFractal == TileDataDst::BFractal`.
-
-- **The source tile's valid region must be at least as large as the destination's**: `src`'s validRow and validCol must each be greater than or equal to `dst`'s validRow and validCol.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- `pto.tsubview` preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but concrete support subsets may differ by profile.
-
-- Portable code must rely only on the documented type, layout, shape, and mode combinations that the selected target profile guarantees.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using Src = Tile<TileType::Vec, float, 4, 64, BLayout::RowMajor, 4, 64>;
-  using Dst = Tile<TileType::Vec, float, 4, 64, BLayout::RowMajor, 2, 32>;
-
-  Src src;
-  Dst dst0;
-  Dst dst1;
-  Dst dst2;
-  Dst dst3;
-
-  // e.g. split into four 2x32 subtiles
-  TSUBVIEW(dst0, src, 0, 0);
-  TSUBVIEW(dst1, src, 0, 32);
-  TSUBVIEW(dst2, src, 2, 0);
-  TSUBVIEW(dst3, src, 2, 32);
-}
-```
-
-### Auto And Manual Kernels
-
-The C++ intrinsic is the same whether the surrounding kernel uses Auto scheduling or Manual pipeline control. What changes is how `TLOAD`, `TSTORE`, `TSYNC`, and related edges are scheduled around the view; see [Auto Vs Manual](../../../programming-model/auto-vs-manual.md).
-
-### PTO-AS Form
-
-Concrete mnemonic spelling, attribute order, and register-like operand syntax live in the PTO-AS specification (`docs/assembly/PTO-AS.md` and `docs/assembly/PTO-AS.bnf`). This ISA page names the operation as `pto.tsubview` and the logical operands above.
-
-## Related Ops / Family Links
-
-- Family overview: [Sync And Config](../../sync-and-config.md)
-- Previous op in family: [pto.tset_img2col_padding](./tset-img2col-padding.md)
-- Next op in family: [pto.tget_scale_addr](./tget-scale-addr.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsubview_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsubview_zh.md
deleted file mode 100644
index 80128c56..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsubview_zh.md
+++ /dev/null
@@ -1,89 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tsubview_zh.md` -->
-
-# TSUBVIEW
-
-## Tile操作图例
-
-![TSUBVIEW tile operation](../figures/isa/TSUBVIEW.svg)
-
-## 简介
-
-表达一个Tile是另一个Tile的subview。
-
-## 数学表达
-
-- `rowIdx`: 在`src`的有效区域内的起始行的索引。
-- `colIdx`: 在`src`的有效区域内的起始列的索引。
-
-对于`dst`中有效区域内的每一个元素`(i, j)`：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{rowIdx} + i,\mathrm{colIdx} + j} $$
-
-## 汇编语法
-
-PTO-AS form: 详见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-### IR Level 1 (SSA)
-
-TODO
-
-### IR Level 2 (DPS)
-
-TODO
-
-## C++ Intrinsic
-
-定义在 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSUBVIEW(TileDataDst &dst, TileDataSrc &src, uint16_t rowIdx, uint16_t colIdx, WaitEvents&... events);
-```
-
-## 限制
-
-规定在`TSUBVIEW_IMPL`中:
-
-- **Tile类型必须相同**: `TileDataSrc::Loc == TileDataDst::Loc`.
-- **输入和输出Tile的静态shape必须相同**: `TileDataSrc::Rows == TileDataDst::Rows` and `TileDataSrc::Cols == TileDataDst::Cols`.
-- **输入和输出Tile的BLayout必须相同**: `TileDataSrc::BFractal == TileDataDst::BFractal`.
-- **src的validRow和validCol必须大于等于dst的validRow和validCol**
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using Src = Tile<TileType::Vec, float, 4, 64, BLayout::RowMajor, 4, 64>;
-  using Dst = Tile<TileType::Vec, float, 4, 64, BLayout::RowMajor, 2, 32>;
-
-  Src src;
-  Dst dst0;
-  Dst dst1;
-  Dst dst2;
-  Dst dst3;
-
-  // e.g. split into four 2x32 subtiles
-  TSUBVIEW(dst0, src, 0, 0);
-  TSUBVIEW(dst1, src, 0, 32);
-  TSUBVIEW(dst2, src, 2, 0);
-  TSUBVIEW(dst3, src, 2, 32);
-}
-```
-
-## ASM示例
-
-### Auto模式
-
-TODO
-
-### Manual模式
-
-TODO
-
-### PTO汇编格式
-
-TODO
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsync.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsync.md
deleted file mode 100644
index 89736937..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsync.md
+++ /dev/null
@@ -1,149 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tsync.md` -->
-
-# pto.tsync
-
-Standalone reference page for `pto.tsync`. This page belongs to the [Sync And Config](../../sync-and-config.md) family in the PTO ISA manual.
-
-## Summary
-
-Synchronize PTO execution (wait on events or insert a per-op pipeline barrier).
-
-## Mechanism
-
-`TSYNC(events...)` waits on a set of explicit event tokens. `TSYNC<Op>()` inserts a pipeline barrier for a single operation class.
-
-Many intrinsics in `include/pto/common/pto_instr.hpp` call `TSYNC(events...)` internally before issuing the instruction. `TSYNC` is part of the tile synchronization or configuration shell, so its visible effect is ordering or state setup rather than arithmetic payload transformation.
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Event operand form:
-
-```text
-tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
-```
-
-Single-op barrier form:
-
-```text
-tsync.op #pto.op<TADD>
-```
-
-### AS Level 1 (SSA)
-
-The SSA form for `TSYNC` does not use explicit event SSA values. The ISA-level primitive is the C++ `RecordEvent` chaining via `WaitEvents...` on tile intrinsics. SSA-level representations of event ordering use `record_event` / `wait_event` (see below), which are PTO-DSL internal IR nodes and not part of the portable ISA surface.
-
-### AS Level 2 (DPS)
-
-The AS Level 2 form exposes explicit event ordering primitives:
-
-```text
-pto.record_event[src_op, dst_op, eventID]
-// Supported ops: TLOAD, TSTORE_ACC, TSTORE_VEC, TMOV_M2L, TMOV_M2S,
-//                 TMOV_M2B, TMOV_M2V, TMOV_V2M, TMATMUL, TVEC
-pto.wait_event[src_op, dst_op, eventID]
-// Supported ops: same as record_event
-pto.barrier(op)
-// Supported ops: TVEC, TMATMUL
-```
-
-In the current PTO-DSL front-end flow, `record_event` and `wait_event` should be treated as low-level TSYNC forms. Front-end kernels SHOULD normally stay free of explicit event wiring and rely on `ptoas --enable-insert-sync` instead.
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <Op OpCode>
-PTO_INST void TSYNC();
-
-template <typename... WaitEvents>
-PTO_INST void TSYNC(WaitEvents &... events);
-```
-
-## Inputs
-
-`TSYNC(events...)` takes one or more `RecordEvent` values as operands. Each `RecordEvent` is produced by a prior tile operation (`TLOAD`, `TADD`, `TMATMUL`, etc.). The call waits for all supplied events before proceeding.
-
-`TSYNC<Op>()` takes a compile-time operation tag (`Op::TLOAD`, `Op::TADD`, `Op::TMATMUL`, etc.) and inserts a pipeline barrier for all operations of that class.
-
-## Expected Outputs
-
-This form is defined primarily by its ordering or configuration effect. It does not introduce a new payload tile beyond any explicit state object named by the syntax.
-
-## Side Effects
-
-This operation may establish a synchronization edge, bind or configure architectural tile state, or update implementation-defined configuration that later tile instructions consume.
-
-## Constraints
-
-- **`TSYNC(events...)` semantics**:
-    - `TSYNC(events...)` calls `WaitAllEvents(events...)`, which invokes `events.Wait()` on each event token. In auto mode, this is a no-op.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (`TSYNC<Op>()`)**:
-    - `TSYNC_IMPL<Op>()` only supports vector-pipeline ops (`static_assert(pipe == PIPE_V)` in `include/pto/common/event.hpp`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto(__gm__ float* in) {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<float, 16, 16, Layout::ND>;
-  using GT = GlobalTensor<float, GShape, GStride, Layout::ND>;
-
-  GT gin(in);
-  TileT t;
-  RecordEvent e = TLOAD(t, gin);  // TLOAD returns RecordEvent
-  TSYNC(e);                       // wait for load to complete
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c;
-  RecordEvent e = TADD(c, a, b);  // TADD returns RecordEvent
-  TSYNC<Op::TADD>();              // pipeline barrier for TADD
-  TSYNC(e);                       // explicit wait
-}
-```
-
-### PTO Assembly Form
-
-Event-wait form (bare assembly):
-
-```text
-tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
-```
-
-Barrier form:
-
-```text
-tsync.op #pto.op<TADD>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Sync And Config](../../sync-and-config.md)
-- Next op in family: [pto.tassign](./tassign.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsync_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsync_zh.md
deleted file mode 100644
index 82aa30e1..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/sync-and-config/tsync_zh.md
+++ /dev/null
@@ -1,143 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/sync-and-config/tsync_zh.md` -->
-
-# TSYNC
-
-## 指令示意图
-
-![TSYNC tile operation](../figures/isa/TSYNC.svg)
-
-## 简介
-
-同步 PTO 执行（等待事件或插入每操作流水线屏障）。
-
-- `TSYNC(events...)` 等待一组显式事件令牌。
-- `TSYNC<Op>()` 为单个向量操作类插入流水线屏障。
-
-`include/pto/common/pto_instr.hpp` 中的许多内建函数在发射指令前会在内部调用 `TSYNC(events...)`。
-
-## 数学语义
-
-不适用。
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
-
-Event operand form:
-
-```text
-tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
-```
-
-Single-op barrier form:
-
-```text
-tsync.op #pto.op<TADD>
-```
-
-### AS Level 1（SSA）
-
-```text
-// Level 1 (SSA) does not support explicit synchronization primitives.
-```
-
-### AS Level 2（DPS）
-
-```text
-pto.record_event[src_op, dst_op, eventID]
-// 支持的op：TLOAD， TSTORE_ACC，TSTORE_VEC，TMOV_M2L，TMOV_M2S，TMOV_M2B，TMOV_M2V，TMOV_V2M，TMATMUL，TVEC
-pto.wait_event[src_op, dst_op, eventID]
-// 支持的op：TLOAD， TSTORE_ACC，TSTORE_VEC，TMOV_M2L，TMOV_M2S，TMOV_M2B，TMOV_M2V，TMOV_V2M，TMATMUL，TVEC
-pto.barrier(op)
-// 支持的op：TVEC,TMATMUL
-```
-
-在当前 PTO-DSL 前端流程中，`record_event` 和 `wait_event` 应视为 TSYNC 的低层形式。
-前端 kernel 通常不应手工编写事件连线，而应依赖 `ptoas --enable-insert-sync`
-自动插入同步。
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`：
-
-```cpp
-template <Op OpCode>
-PTO_INST void TSYNC();
-
-template <typename... WaitEvents>
-PTO_INST void TSYNC(WaitEvents &... events);
-```
-
-## 约束
-
-- **实现检查（`TSYNC<Op>()`）**:
-  - `TSYNC_IMPL<Op>()` 仅支持向量流水线操作（`include/pto/common/event.hpp` 中通过 `static_assert(pipe == PIPE_V)` 强制执行）。
-- **`TSYNC(events...)` 语义**:
-  - `TSYNC(events...)` 调用 `WaitAllEvents(events...)`，后者对每个事件令牌调用 `events.Wait()`。在auto模式下是no-op。
-
-## 示例
-
-### 自动（Auto）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto(__gm__ float* in) {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  using GShape = Shape<1, 1, 1, 16, 16>;
-  using GStride = BaseShape2D<float, 16, 16, Layout::ND>;
-  using GT = GlobalTensor<float, GShape, GStride, Layout::ND>;
-
-  GT gin(in);
-  TileT t;
-  Event<Op::TLOAD, Op::TADD> e;
-  e = TLOAD(t, gin);
-  TSYNC(e);
-}
-```
-
-### 手动（Manual）
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, c;
-  Event<Op::TADD, Op::TSTORE_VEC> e;
-  e = TADD(c, a, b);
-  TSYNC<Op::TADD>();
-  TSYNC(e);
-}
-```
-
-## 汇编示例（ASM）
-
-### 自动模式
-
-```text
-# 自动模式：由编译器/运行时负责资源放置与调度。
-%result = pto.tsync ...
-```
-
-### 手动模式
-
-```text
-# 手动模式：先显式绑定资源，再发射指令。
-# 可选（当该指令包含 tile 操作数时）：
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%result = pto.tsync ...
-```
-
-### PTO 汇编形式
-
-```text
-tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
-# AS Level 2 (DPS)
-pto.record_event[src_op, dst_op, eventID]
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tadds.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tadds.md
deleted file mode 100644
index 0ac7f00b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tadds.md
+++ /dev/null
@@ -1,164 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tadds.md` -->
-
-# pto.tadds
-
-Standalone reference page for `pto.tadds`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise add a scalar to a tile.
-
-## Mechanism
-
-Elementwise add a scalar to a tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} + \mathrm{scalar} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tadds %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TADDS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile (`src0` in the C++ intrinsic).
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TADDS(dst, src, 1.0f);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TADDS(dst, src, 1.0f);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tadds %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tmins](./tmins.md)
-- Next op in family: [pto.tsubs](./tsubs.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tadds_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tadds_zh.md
deleted file mode 100644
index bf31b90b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tadds_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tadds
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](tadds.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/taddsc.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/taddsc.md
deleted file mode 100644
index 3e4bfec9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/taddsc.md
+++ /dev/null
@@ -1,147 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/taddsc.md` -->
-
-# pto.taddsc
-
-Standalone reference page for `pto.taddsc`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise fused add with scalar and a second tile: `src0 + scalar + src1`.
-
-## Mechanism
-
-Elementwise fused add with scalar and a second tile: `src0 + scalar + src1`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{scalar} + \mathrm{src1}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = taddsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TADDSC(TileData& dst, TileData& src0, typename TileData::DType scalar, TileData& src1,
-                            WaitEvents&... events);
-```
-
-## Inputs
-
-- `src0` and `src1` are the source tiles.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `dst`, `src0` and `src1` must have the same valid row/col.
-    - Scalar type must match the Tile data type.
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, out;
-  TADDSC(out, a, 2.0f, b);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = taddsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
-# AS Level 2 (DPS)
-pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tlrelu](./tlrelu.md)
-- Next op in family: [pto.tsubsc](./tsubsc.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/taddsc_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/taddsc_zh.md
deleted file mode 100644
index 27eb4e42..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/taddsc_zh.md
+++ /dev/null
@@ -1,91 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/taddsc_zh.md` -->
-
-# TADDSC
-
-## 指令示意图
-
-![TADDSC tile operation](../figures/isa/TADDSC.svg)
-
-## 简介
-
-与标量和第二个 Tile 的融合逐元素加法：`src0 + scalar + src1`。
-
-## 数学语义
-
-对每个元素 `(i, j)` 在有效区域内：
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{scalar} + \mathrm{src1}_{i,j} $$
-
-## 汇编语法
-
-PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
-
-同步形式：
-
-```text
-%dst = taddsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ 内建接口
-
-声明于 `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TADDSC(TileData& dst, TileData& src0, typename TileData::DType scalar, TileData& src1,
-                            WaitEvents&... events);
-```
-
-## 约束
-
-- **实现检查 (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **实现检查 (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `dst`, `src0` and `src1` must have the same valid row/col.
-    - Scalar type must match the Tile data type.
-- **有效区域**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## 示例
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, out;
-  TADDSC(out, a, 2.0f, b);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tands.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tands.md
deleted file mode 100644
index e4ead6d7..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tands.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tands.md` -->
-
-# pto.tands
-
-Standalone reference page for `pto.tands`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise bitwise AND of a tile and a scalar.
-
-## Mechanism
-
-Elementwise bitwise AND of a tile and a scalar. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;\&\; \mathrm{scalar} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tands %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TANDS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Intended for integral element types.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
-
-- **Implementation checks (A5)**:
-    - Intended for integral element types supported by `TEXPANDS` and `TAND`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TANDS(dst, src, 0xffu);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tands %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tmaxs](./tmaxs.md)
-- Next op in family: [pto.tors](./tors.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tands_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tands_zh.md
deleted file mode 100644
index ae69782b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tands_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tands
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](tands.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tcmps.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tcmps.md
deleted file mode 100644
index c99fa028..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tcmps.md
+++ /dev/null
@@ -1,172 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tcmps.md` -->
-
-# pto.tcmps
-
-Standalone reference page for `pto.tcmps`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Compare a tile against a scalar and write per-element comparison results.
-
-## Mechanism
-
-Compare a tile against a scalar and write per-element comparison results. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \left(\mathrm{src}_{i,j}\ \mathrm{cmpMode}\ \mathrm{scalar}\right) $$
-
-The encoding/type of `dst` is implementation-defined (often a mask-like tile).
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tcmps %src, %scalar {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/type.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc0, typename T, typename... WaitEvents>
-PTO_INST RecordEvent TCMPS(TileDataDst& dst, TileDataSrc0& src0, T src1, CmpMode cmpMode, WaitEvents&... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the scalar value broadcast to all lanes; `cmpMode` selects the comparison predicate.
-- `dst` names the destination predicate tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0` and `dst` must have the same valid row/col.
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-- **Comparison modes**:
-    - Supports `CmpMode::EQ`, `CmpMode::NE`, `CmpMode::LT`, `CmpMode::GT`, `CmpMode::LE`, `CmpMode::GE`.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src;
-  DstT dst(16, 2);
-  TCMPS(dst, src, 0.0f, CmpMode::GT);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using SrcT = Tile<TileType::Vec, float, 16, 16>;
-  using DstT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  SrcT src;
-  DstT dst(16, 2);
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TCMPS(dst, src, 0.0f, CmpMode::GT);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tcmps %src, %scalar {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.texpands](./texpands.md)
-- Next op in family: [pto.tsels](./tsels.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tcmps_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tcmps_zh.md
deleted file mode 100644
index da704fb5..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tcmps_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tcmps
-
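-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tcmps.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully brought in line with it. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.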
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tdivs.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tdivs.md
deleted file mode 100644
index 56547c56..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tdivs.md
+++ /dev/null
@@ -1,186 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tdivs.md` -->
-
-# pto.tdivs
-
-Standalone reference page for `pto.tdivs`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise division with a scalar (tile/scalar or scalar/tile).
-
-## Mechanism
-
-Elementwise division with a scalar (tile/scalar or scalar/tile). It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-- Tile/scalar:
-
-  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{src}_{i,j}}{\mathrm{scalar}} $$
-
-- Scalar/tile:
-
-  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{scalar}}{\mathrm{src}_{i,j}} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Tile/scalar form:
-
-```text
-%dst = tdivs %src, %scalar : !pto.tile<...>, f32
-```
-
-Scalar/tile form:
-
-```text
-%dst = tdivs %scalar, %src : f32, !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-%dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TDIVS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar,
-                           WaitEvents &... events);
-
-template <auto PrecisionType = DivAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc,
-          typename... WaitEvents>
-PTO_INST RecordEvent TDIVS(TileDataDst &dst, typename TileDataDst::DType scalar, TileDataSrc &src0,
-                           WaitEvents &... events);
-```
-
-`PrecisionType` has the following values available:
-
-* `DivAlgorithm::DEFAULT`: Normal algorithm, faster but with lower precision.
-* `DivAlgorithm::HIGH_PRECISION`: High precision algorithm, but slower.
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)** (both overloads):
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **Implementation checks (A5)** (both overloads):
-    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **Division-by-zero**:
-    - Behavior is target-defined; on A5 the tile/scalar form maps to multiply-by-reciprocal and uses `1/0 -> +inf` for `scalar == 0`.
-
-- **High-precision algorithm**:
-    - Only available on A5; the `PrecisionType` option is ignored on A3.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TDIVS(dst, src, 2.0f);
-  TDIVS<DivAlgorithm::HIGH_PRECISION>(dst, src, 2.0f);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TDIVS(dst, 2.0f, src);
-  TDIVS<DivAlgorithm::HIGH_PRECISION>(dst, 2.0f, src);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tsubs](./tsubs.md)
-- Next op in family: [pto.tmuls](./tmuls.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tdivs_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tdivs_zh.md
deleted file mode 100644
index 5bb9ee25..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tdivs_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tdivs
-
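-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tdivs.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully brought in line with it. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.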
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/texpands.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/texpands.md
deleted file mode 100644
index 04435fdd..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/texpands.md
+++ /dev/null
@@ -1,160 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/texpands.md` -->
-
-# pto.texpands
-
-Standalone reference page for `pto.texpands`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Broadcast a scalar into a destination tile.
-
-## Mechanism
-
-Broadcast a scalar into a destination tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{scalar} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = texpands %scalar : f32, !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TEXPANDS(TileData &dst, typename TileData::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- There is no source tile operand; `scalar` is the only data input.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - For `TileType::Vec`:
-      - The op fills `dst` over `dst.GetValidRow()` / `dst.GetValidCol()`.
-    - For `TileType::Mat`:
-      - For Tile: the op fills `dst` over `TileData::Rows` / `TileData::Cols`.
-      - For ConvTile: the op fills `dst` over the `ConvTileData` shape.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - For `TileType::Vec`:
-      - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
-      - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - For `TileType::Mat`:
-      - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
-      - Static bounds: `TileData::Rows * TileData::Cols * sizeof(T) / 32` must be in the range `[1, 32767]`.
-
-- **Implementation checks (A5)**:
-    - For `TileType::Vec`:
-      - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
-      - Tile layout must be row-major (`TileData::isRowMajor`).
-      - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - For `TileType::Mat`:
-      - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
-      - For `TileDataDst::layout == pto::Layout::NC1HWC0 || TileDataDst::layout == pto::Layout::FRACTAL_Z`:
-        - The ConvTile's `shape0 * shape1 * shape2 * shape3` must be in the range `[1, 32767]`.
-      - For `TileDataDst::layout == pto::Layout::NDC1HWC0 || TileDataDst::layout == pto::Layout::FRACTAL_Z_3D`:
-        - The ConvTile's `shape0 * shape1 * shape2 * shape3 * shape4` must be in the range `[1, 32767]`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT dst;
-  TEXPANDS(dst, 0.0f);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT dst;
-  TASSIGN(dst, 0x1000);
-  TEXPANDS(dst, 0.0f);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = texpands %scalar : f32, !pto.tile<...>
-# AS Level 2 (DPS)
-pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Next op in family: [pto.tcmps](./tcmps.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md
deleted file mode 100644
index b9977c30..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md
+++ /dev/null
@@ -1,132 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md` -->
-
-# TEXPANDS
-
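-## Instruction Diagram
-
-![TEXPANDS tile operation](../figures/isa/TEXPANDS.svg)
-
-## Overview
-
-Broadcast a scalar into a destination tile.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{scalar} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form: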
-
-```text
-%dst = texpands %scalar : f32, !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TEXPANDS(TileData &dst, typename TileData::DType scalar, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - For vector tiles (`TileData::Loc == TileType::Vec`):
-      - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
-      - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - For Mat tiles (`TileData::Loc == TileType::Mat`):
-      - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
-      - Valid bounds: `TileData::Rows * TileData::Cols * sizeof(T) / 32` must be in the range `[1, 32767]`.
-- **Implementation checks (A5)**:
-    - For vector tiles (`TileData::Loc == TileType::Vec`):
-      - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-      - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
-    - For Mat tiles (`TileData::Loc == TileType::Mat`):
-      - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
-      - For `TileDataDst::layout == pto::Layout::NC1HWC0 || TileDataDst::layout == pto::Layout::FRACTAL_Z`:
-        - `TileData::shape0 * TileData::shape1 * TileData::shape2 * TileData::shape3` must be in the range `[1, 32767]`.
-      - For `TileDataDst::layout == pto::Layout::NDC1HWC0 || TileDataDst::layout == pto::Layout::FRACTAL_Z_3D`:
-        - `TileData::shape0 * TileData::shape1 * TileData::shape2 * TileData::shape3 * TileData::shape4` must be in the range `[1, 32767]`.
-- **Valid region**:
-    - For vector tiles (`TileData::Loc == TileType::Vec`):
-      - The op fills `dst` over `dst.GetValidRow()` / `dst.GetValidCol()`.
-    - For Mat tiles (`TileData::Loc == TileType::Mat`):
-      - For Tile: the op fills `dst` over `TileData::Rows` / `TileData::Cols`.
-      - For ConvTile: the op fills `dst` over the `ConvTileData` shape.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT dst;
-  TEXPANDS(dst, 0.0f);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT dst;
-  TASSIGN(dst, 0x1000);
-  TEXPANDS(dst, 0.0f);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = texpands %scalar : f32, !pto.tile<...>
-# AS Level 2 (DPS)
-pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tfmods.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tfmods.md
deleted file mode 100644
index 472fafee..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tfmods.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tfmods.md` -->
-
-# pto.tfmods
-
-Standalone reference page for `pto.tfmods`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise remainder with a scalar: `fmod(src, scalar)`.
-
-## Mechanism
-
-Elementwise remainder with a scalar: `fmod(src, scalar)`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src}_{i,j}, \mathrm{scalar})$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tfmods %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TFMODS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Division-by-zero**:
-    - Behavior is target-defined; the CPU simulator asserts in debug builds.
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `dst` and `src` must use the same element type.
-    - Supported element types are `float` and `float32_t`.
-    - `dst` and `src` must be vector tiles.
-    - `dst` and `src` must be row-major.
-    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`, with both valid dimensions greater than 0.
-
-- **Implementation checks (A5)**:
-    - `dst` and `src` must use the same element type.
-    - Supported element types are 2-byte or 4-byte types supported by the target implementation (including `half` and `float`).
-    - `dst` and `src` must be vector tiles.
-    - Static valid bounds must satisfy `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
-    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TFMODS(out, x, 3.0f);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tfmods %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tmuls](./tmuls.md)
-- Next op in family: [pto.trems](./trems.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tfmods_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tfmods_zh.md
deleted file mode 100644
index ebb3ce12..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tfmods_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tfmods
-
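-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tfmods.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully brought in line with it. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.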
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu.md
deleted file mode 100644
index ac0d431c..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu.md
+++ /dev/null
@@ -1,134 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu.md` -->
-
-# pto.tlrelu
-
-Standalone reference page for `pto.tlrelu`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Leaky ReLU with a scalar slope.
-
-## Mechanism
-
-Leaky ReLU with a scalar slope. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = (\mathrm{src}_{i,j} > 0) ? \mathrm{src}_{i,j} : (\mathrm{src}_{i,j} \cdot \mathrm{slope}) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tlrelu %src, %slope : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TLRELU(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `slope` is the scalar slope for negative values (broadcast to all lanes).
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `dst` and `src` must have the same valid row/col.
-    - Slope scalar type must match the Tile data type.
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `half`, `float16_t`, `float`, `float32_t` (floating-point types only).
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `half`, `float` (floating-point types only).
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TLRELU(out, x, 0.1f);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tlrelu %src, %slope : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.txors](./txors.md)
-- Next op in family: [pto.taddsc](./taddsc.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md
deleted file mode 100644
index 1b378727..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md
+++ /dev/null
@@ -1,105 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md` -->
-
-# TLRELU
-
-## Instruction Diagram
-
-![TLRELU tile operation](../figures/isa/TLRELU.svg)
-
-## Overview
-
-Leaky ReLU with a scalar slope.
-
-## Mathematical Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = (\mathrm{src}_{i,j} > 0) ? \mathrm{src}_{i,j} : (\mathrm{src}_{i,j} \cdot \mathrm{slope}) $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tlrelu %src, %slope : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TLRELU(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `half`, `float16_t`, `float`, `float32_t` (floating-point types only).
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `half`, `float` (floating-point types only).
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `dst` and `src` must have the same valid row/col.
-    - Slope scalar type must match the Tile data type.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TLRELU(out, x, 0.1f);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tlrelu %src, %slope : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmaxs.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmaxs.md
deleted file mode 100644
index c85d0ba5..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmaxs.md
+++ /dev/null
@@ -1,134 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tmaxs.md` -->
-
-# pto.tmaxs
-
-Standalone reference page for `pto.tmaxs`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise max of a tile and a scalar: `max(src, scalar)`.
-
-## Mechanism
-
-Elementwise max of a tile and a scalar: `max(src, scalar)`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \max(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TMAXS(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `dst` and `src` must have the same valid row/col.
-    - Scalar type must match the Tile data type.
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`, `bfloat16_t`, `uint8_t`, `int8_t`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TMAXS(out, x, 0.0f);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.trems](./trems.md)
-- Next op in family: [pto.tands](./tands.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmaxs_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmaxs_zh.md
deleted file mode 100644
index f6b35e59..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmaxs_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tmaxs
-
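-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tmaxs.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.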
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmins.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmins.md
deleted file mode 100644
index 5d1d3350..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmins.md
+++ /dev/null
@@ -1,151 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tmins.md` -->
-
-# pto.tmins
-
-Standalone reference page for `pto.tmins`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise minimum of a tile and a scalar.
-
-## Mechanism
-
-Elementwise minimum of a tile and a scalar. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \min(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tmins %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TMINS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Common constraints**:
-    - `dst` and `src` must use the same element type.
-    - Scalar type must match the tile data type.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
-    - Runtime: `src.GetValidCol() == dst.GetValidCol()`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TMINS(dst, src, 0.0f);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TMINS(dst, src, 0.0f);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tmins %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tsels](./tsels.md)
-- Next op in family: [pto.tadds](./tadds.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmins_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmins_zh.md
deleted file mode 100644
index dbee78ee..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmins_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tmins
-
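-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tmins.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.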
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmuls.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmuls.md
deleted file mode 100644
index 573567a7..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmuls.md
+++ /dev/null
@@ -1,164 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tmuls.md` -->
-
-# pto.tmuls
-
-Standalone reference page for `pto.tmuls`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise multiply a tile by a scalar.
-
-## Mechanism
-
-Elementwise multiply a tile by a scalar. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \cdot \mathrm{scalar} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tmuls %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TMULS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile (named `src0` in the C++ intrinsic).
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TMULS(dst, src, 2.0f);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT src, dst;
-  TASSIGN(src, 0x1000);
-  TASSIGN(dst, 0x2000);
-  TMULS(dst, src, 2.0f);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tmuls %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tdivs](./tdivs.md)
-- Next op in family: [pto.tfmods](./tfmods.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmuls_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmuls_zh.md
deleted file mode 100644
index fad02d7a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tmuls_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tmuls
-
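-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tmuls.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.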
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tors.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tors.md
deleted file mode 100644
index 301d1895..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tors.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tors.md` -->
-
-# pto.tors
-
-Standalone reference page for `pto.tors`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise bitwise OR of a tile and a scalar.
-
-## Mechanism
-
-Elementwise bitwise OR of a tile and a scalar. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;|\; \mathrm{scalar} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tors %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TORS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Intended for integral element types.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
-
-- **Implementation checks (A5)**:
-    - Intended for integral element types supported by `TEXPANDS` and `TOR`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TORS(dst, src, 0xffu);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tors %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tands](./tands.md)
-- Next op in family: [pto.tshls](./tshls.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tors_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tors_zh.md
deleted file mode 100644
index fff8e391..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tors_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tors
-
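-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tors.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.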
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/trems.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/trems.md
deleted file mode 100644
index b061e234..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/trems.md
+++ /dev/null
@@ -1,146 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/trems.md` -->
-
-# pto.trems
-
-Standalone reference page for `pto.trems`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise remainder with a scalar: `remainder(src, scalar)`.
-
-## Mechanism
-
-Elementwise remainder with a scalar: `remainder(src, scalar)`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$\mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \bmod \mathrm{scalar}$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = trems %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TREMS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar,
-                           TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Division by Zero**:
-    - Behavior is target-defined; the CPU simulator asserts in debug builds.
-
-- **Valid Region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation Checks (A2A3)**:
-    - `dst` and `src` must use the same element type.
-    - Supported element types: `float` and `int32_t`.
-    - `dst` and `src` must be vector tiles.
-    - `dst` and `src` must be row-major.
-    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`, with both valid dimensions greater than 0.
-    - **tmp Buffer Requirements**:
-      - `tmp.GetValidCol() >= dst.GetValidCol()` (at least as many columns as dst)
-      - `tmp.GetValidRow() >= 1` (at least 1 row)
-      - Data type must match `TileDataDst::DType`.
-
-- **Implementation Checks (A5)**:
-    - `dst` and `src` must use the same element type.
-    - Supported element types: `float`, `int32_t`, `uint32_t`, `half`, `int16_t`, and `uint16_t`.
-    - `dst` and `src` must be vector tiles.
-    - Static valid bounds: `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
-    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`.
-    - Note: the `tmp` parameter is accepted but neither validated nor used on A5.
-
-- **For `int32_t` Inputs (A2A3 Only)**: Both `src` elements and `scalar` must be in the range `[-2^24, 2^24]` (i.e., `[-16777216, 16777216]`) to ensure exact conversion to float32 during computation.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  Tile<TileType::Vec, float, 16, 16> tmp;
-  TREMS(out, x, 3.0f, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = trems %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tfmods](./tfmods.md)
-- Next op in family: [pto.tmaxs](./tmaxs.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/trems_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/trems_zh.md
deleted file mode 100644
index 7d7eb7ba..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/trems_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.trems
-
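-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](trems.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you stay under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.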
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsels.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsels.md
deleted file mode 100644
index 1f3f734b..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsels.md
+++ /dev/null
@@ -1,191 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tsels.md` -->
-
-# pto.tsels
-
-Standalone reference page for `pto.tsels`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Select between a source tile and a scalar using a mask tile (per-element selection).
-
-## Mechanism
-
-Select between source tile and scalar using a mask tile (per-element selection for source tile). It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$
-\mathrm{dst}_{i,j} =
-\begin{cases}
-\mathrm{src}_{i,j} & \text{if } \mathrm{mask}_{i,j}\ \text{is true} \\
-\mathrm{scalar} & \text{otherwise}
-\end{cases}
-$$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tsels %mask, %src, %scalar : !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tsels %src0, %src1, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsels ins(%src0, %src1, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataMask, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TSELS(TileDataDst &dst, TileDataMask &mask, TileDataSrc &src, TileDataTmp &tmp, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `mask` is the predicate mask tile; lane `(i,j)` selects from `src` if true, otherwise from `scalar`.
-- `src` is the source tile.
-- `scalar` is the scalar fallback value broadcast to all lanes.
-- `tmp` is a required temporary working tile for predicate unpacking.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-- **Mask encoding**:
-    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `sizeof(TileDataDst::DType)` must be `2` or `4` bytes.
-    - Supported data types are `half`, `float16_t`, `float`, and `float32_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be row-major.
-    - Runtime: `src.GetValidRow()/GetValidCol()` must match `dst.GetValidRow()/GetValidCol()`.
-
-- **Implementation checks (A5)**:
-    - `sizeof(TileDataDst::DType)` may be `1`, `2`, or `4` bytes.
-    - Supported data types are `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, and `float`.
-    - `dst` and `src` must use the same element type.
-    - `dst`, `mask`, and `src` must be row-major.
-    - Runtime: `src.GetValidRow()/GetValidCol()` must match `dst.GetValidRow()/GetValidCol()`.
-
-## Examples
-
-### Auto
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_auto() {
-  using TileDst = Tile<TileType::Vec, float, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, float, 16, 16>;
-  using TileTmp = Tile<TileType::Vec, float, 16, 16>;
-  using TileMask = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  TileDst dst;
-  TileSrc src;
-  TileTmp tmp;
-  TileMask mask(16, 2);
-  float scalar = 0.0f;
-  TSELS(dst, mask, src, tmp, scalar);
-}
-```
-
-### Manual
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example_manual() {
-  using TileDst = Tile<TileType::Vec, float, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, float, 16, 16>;
-  using TileTmp = Tile<TileType::Vec, float, 16, 16>;
-  using TileMask = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
-  TileDst dst;
-  TileSrc src;
-  TileTmp tmp;
-  TileMask mask(16, 2);
-  float scalar = 0.0f;
-  TASSIGN(src, 0x1000);
-  TASSIGN(tmp, 0x2000);
-  TASSIGN(dst, 0x3000);
-  TASSIGN(mask, 0x4000);
-  TSELS(dst, mask, src, tmp, scalar);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsels %mask, %src, %scalar : !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tcmps](./tcmps.md)
-- Next op in family: [pto.tmins](./tmins.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsels_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsels_zh.md
deleted file mode 100644
index 7d5db8a9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsels_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tsels
-
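-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tsels.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or an existing Chinese reference page.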
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshls.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshls.md
deleted file mode 100644
index 9a203ba9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshls.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tshls.md` -->
-
-# pto.tshls
-
-Standalone reference page for `pto.tshls`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise shift-left a tile by a scalar.
-
-## Mechanism
-
-Shift each element of a tile left by the number of bits given by the scalar. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \ll \mathrm{scalar} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tshls %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSHLS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the non-negative integer shift count; its C++ type is the tile element type.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are `int32_t`, `int`, `int16_t`, `uint32_t`, `unsigned int`, and `uint16_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - The scalar must be zero or positive.
-
-- **Implementation checks (A5)**:
-    - Supported element types are `int32_t`, `int16_t`, `int8_t`, `uint32_t`, `uint16_t`, and `uint8_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Static valid bounds must satisfy `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - The scalar must be zero or positive.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TSHLS(dst, src, 0x2);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tshls %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tors](./tors.md)
-- Next op in family: [pto.tshrs](./tshrs.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md
deleted file mode 100644
index 6696955f..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md
+++ /dev/null
@@ -1,109 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md` -->
-
-# TSHLS
-
-## Instruction Diagram
-
-![TSHLS tile operation](../figures/isa/TSHLS.svg)
-
-## Summary
-
-Elementwise shift-left a tile by a scalar.
-
-## Math Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \ll \mathrm{scalar} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tshls %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSHLS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are `int32_t`, `int`, `int16_t`, `uint32_t`, `unsigned int`, and `uint16_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - The scalar must be zero or positive.
-- **Implementation checks (A5)**:
-    - Supported element types are `int32_t`, `int16_t`, `int8_t`, `uint32_t`, `uint16_t`, and `uint8_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Static valid bounds must satisfy `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - The scalar must be zero or positive.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TSHLS(dst, src, 0x2);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tshls %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs.md
deleted file mode 100644
index b8b88d96..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs.md
+++ /dev/null
@@ -1,137 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tshrs.md` -->
-
-# pto.tshrs
-
-Standalone reference page for `pto.tshrs`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise shift-right a tile by a scalar.
-
-## Mechanism
-
-Shift each element of a tile right by the number of bits given by the scalar. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \gg \mathrm{scalar} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tshrs %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSHRS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the non-negative integer shift count; its C++ type is the tile element type.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are `int32_t`, `int`, `int16_t`, `uint32_t`, `unsigned int`, and `uint16_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - The scalar must be zero or positive.
-
-- **Implementation checks (A5)**:
-    - Supported element types are `int32_t`, `int16_t`, `int8_t`, `uint32_t`, `uint16_t`, and `uint8_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Static valid bounds must satisfy `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - The scalar must be zero or positive.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TSHRS(dst, src, 0x2);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tshrs %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tshls](./tshls.md)
-- Next op in family: [pto.txors](./txors.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md
deleted file mode 100644
index e9d674f6..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md
+++ /dev/null
@@ -1,109 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md` -->
-
-# TSHRS
-
-## Instruction Diagram
-
-![TSHRS tile operation](../figures/isa/TSHRS.svg)
-
-## Summary
-
-Elementwise shift-right a tile by a scalar.
-
-## Math Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \gg \mathrm{scalar} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS specification](../assembly/PTO-AS_zh.md).
-
-Synchronous form:
-
-```text
-%dst = tshrs %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSHRS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are `int32_t`, `int`, `int16_t`, `uint32_t`, `unsigned int`, and `uint16_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - The scalar must be zero or positive.
-- **Implementation checks (A5)**:
-    - Supported element types are `int32_t`, `int16_t`, `int8_t`, `uint32_t`, `uint16_t`, and `uint8_t`.
-    - `dst` and `src` must use the same element type.
-    - `dst` and `src` must be vector tiles.
-    - Static valid bounds must satisfy `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
-    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
-    - The scalar must be zero or positive.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TSHRS(dst, src, 0x2);
-}
-```
-
-## Assembly Examples (ASM)
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional (when the instruction has tile operands):
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tshrs %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubs.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubs.md
deleted file mode 100644
index 4ff83ee7..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubs.md
+++ /dev/null
@@ -1,135 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tsubs.md` -->
-
-# pto.tsubs
-
-Standalone reference page for `pto.tsubs`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise subtract a scalar from a tile.
-
-## Mechanism
-
-Elementwise subtract a scalar from a tile. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} - \mathrm{scalar} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tsubs %src, %scalar : !pto.tile<...>, f32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
-PTO_INST RecordEvent TSUBS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile (named `src0` in the C++ intrinsic and in the constraints below).
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Common constraints**:
-    - `dst` and `src0` must use the same element type.
-    - Scalar type must match `TileDataSrc::DType`.
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
-    - Tile location must be vector (`TileDataDst::Loc == TileType::Vec` and `TileDataSrc::Loc == TileType::Vec`).
-    - Static valid bounds: `TileDataDst::ValidRow <= TileDataDst::Rows`, `TileDataDst::ValidCol <= TileDataDst::Cols`, `TileDataSrc::ValidRow <= TileDataSrc::Rows`, and `TileDataSrc::ValidCol <= TileDataSrc::Cols`.
-    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT x, out;
-  TSUBS(out, x, 1.0f);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsubs %src, %scalar : !pto.tile<...>, f32
-# AS Level 2 (DPS)
-pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tadds](./tadds.md)
-- Next op in family: [pto.tdivs](./tdivs.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubs_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubs_zh.md
deleted file mode 100644
index 52ee41cd..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubs_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.tsubs
-
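-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tsubs.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or an existing Chinese reference page.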
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubsc.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubsc.md
deleted file mode 100644
index a6b8f993..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubsc.md
+++ /dev/null
@@ -1,146 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tsubsc.md` -->
-
-# pto.tsubsc
-
-Standalone reference page for `pto.tsubsc`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise fused op: `src0 - scalar + src1`.
-
-## Mechanism
-
-Elementwise fused op: `src0 - scalar + src1`. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{scalar} + \mathrm{src1}_{i,j} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = tsubsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsubsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### IR Level 1 (SSA)
-
-```text
-%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### IR Level 2 (DPS)
-
-```text
-pto.tsubsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSUBSC(TileData& dst, TileData& src0, typename TileData::DType scalar, TileData& src1,
-                            WaitEvents&... events);
-```
-
-## Inputs
-
-- `src0` and `src1` are the source tiles.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `dst`, `src0` and `src1` must have the same valid row/col.
-    - Scalar type must match the Tile data type.
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, out;
-  TSUBSC(out, a, 2.0f, b);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = tsubsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
-# AS Level 2 (DPS)
-pto.tsubsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.taddsc](./taddsc.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubsc_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubsc_zh.md
deleted file mode 100644
index 5f665e3c..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/tsubsc_zh.md
+++ /dev/null
@@ -1,91 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/tsubsc_zh.md` -->
-
-# TSUBSC
-
-## Instruction Diagram
-
-![TSUBSC tile operation](../figures/isa/TSUBSC.svg)
-
-## Summary
-
-Elementwise fused op: `src0 - scalar + src1`.
-
-## Math Semantics
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{scalar} + \mathrm{src1}_{i,j} $$
-
-## Assembly Syntax
-
-PTO-AS form: see the [PTO-AS Specification](../assembly/PTO-AS.md).
-
-Synchronous form:
-
-```text
-%dst = tsubsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsubsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.tsubsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileData, typename... WaitEvents>
-PTO_INST RecordEvent TSUBSC(TileData& dst, TileData& src0, typename TileData::DType scalar, TileData& src1,
-                            WaitEvents&... events);
-```
-
-## Constraints
-
-- **Implementation checks (A2A3)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Implementation checks (A5)**:
-    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
-    - Tile layout must be row-major (`TileData::isRowMajor`).
-- **Common constraints**:
-    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
-    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
-    - Runtime: `dst`, `src0` and `src1` must have the same valid row/col.
-    - Scalar type must match the Tile data type.
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileT = Tile<TileType::Vec, float, 16, 16>;
-  TileT a, b, out;
-  TSUBSC(out, a, 2.0f, b);
-}
-```
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/txors.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/txors.md
deleted file mode 100644
index 8a3e9fa9..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/txors.md
+++ /dev/null
@@ -1,134 +0,0 @@
-<!-- Generated from `docs/isa/tile/ops/tile-scalar-and-immediate/txors.md` -->
-
-# pto.txors
-
-Standalone reference page for `pto.txors`. This page belongs to the [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md) family in the PTO ISA manual.
-
-## Summary
-
-Elementwise bitwise XOR of a tile and a scalar.
-
-## Mechanism
-
-The operation XORs each element of the source tile with a scalar that is broadcast to every lane. It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.
-
-For each element `(i, j)` in the valid region:
-
-$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \oplus \mathrm{scalar} $$
-
-## Syntax
-
-Textual spelling is defined by the PTO ISA syntax-and-operands pages.
-
-Synchronous form:
-
-```text
-%dst = txors %src, %scalar : !pto.tile<...>, i32
-```
-
-### AS Level 1 (SSA)
-
-```text
-%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### AS Level 2 (DPS)
-
-```text
-pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
-PTO_INST RecordEvent TXORS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, TileDataTmp &tmp, WaitEvents &... events);
-```
-
-## Inputs
-
-- `src` is the source tile.
-- `scalar` is the scalar value broadcast to all lanes.
-- `dst` names the destination tile.
-- The operation iterates over `dst`'s valid region.
-
-## Expected Outputs
-
-`dst` carries the result tile or updated tile payload produced by the operation.
-
-## Side Effects
-
-No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
-
-## Constraints
-
-- **Valid region**:
-    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
-
-## Exceptions
-
-- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend surface.
-- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
-
-## Target-Profile Restrictions
-
-- **Implementation checks (A2A3)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, and `int16_t`.
-    - `dst`, `src`, and `tmp` must use the same element type.
-    - In manual mode, source, destination, and temporary storage must not overlap in memory.
-
-- **Implementation checks (A5)**:
-    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
-    - `dst` and `src` element types must match.
-    - `src.GetValidRow()/GetValidCol()` must match `dst`.
-
-## Examples
-
-```cpp
-#include <pto/pto-inst.hpp>
-
-using namespace pto;
-
-void example() {
-  using TileDst = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileSrc = Tile<TileType::Vec, uint32_t, 16, 16>;
-  using TileTmp = Tile<TileType::Vec, uint32_t, 16, 16>;
-  TileDst dst;
-  TileSrc src;
-  TileTmp tmp;
-  TXORS(dst, src, 0x1u, tmp);
-}
-```
-
-### Auto Mode
-
-```text
-# Auto mode: compiler/runtime-managed placement and scheduling.
-%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### Manual Mode
-
-```text
-# Manual mode: bind resources explicitly before issuing the instruction.
-# Optional for tile operands:
-# pto.tassign %arg0, @tile(0x1000)
-# pto.tassign %arg1, @tile(0x2000)
-%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
-```
-
-### PTO Assembly Form
-
-```text
-%dst = txors %src, %scalar : !pto.tile<...>, i32
-# AS Level 2 (DPS)
-pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Tile Scalar And Immediate](../../tile-scalar-and-immediate.md)
-- Previous op in family: [pto.tshrs](./tshrs.md)
-- Next op in family: [pto.tlrelu](./tlrelu.md)
diff --git a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/txors_zh.md b/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/txors_zh.md
deleted file mode 100644
index b2429dcb..00000000
--- a/docs/mkdocs/src/docs/isa/tile/ops/tile-scalar-and-immediate/txors_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.txors
-
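-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](txors.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.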
diff --git a/docs/mkdocs/src/docs/isa/tile/reduce-and-expand.md b/docs/mkdocs/src/docs/isa/tile/reduce-and-expand.md
deleted file mode 100644
index 2c824dcf..00000000
--- a/docs/mkdocs/src/docs/isa/tile/reduce-and-expand.md
+++ /dev/null
@@ -1,216 +0,0 @@
-<!-- Generated from `docs/isa/tile/reduce-and-expand.md` -->
-
-# Reduce And Expand Family
-
-Reduce operations collapse a 2D tile along one axis into a 1D result (or a tile with reduced extent along that axis). Expand operations broadcast a 1D tile along one axis to produce a 2D tile.
-
-## Operations
-
-### Reduce (Row)
-
-| Operation | Description | C++ Intrinsic |
-|-----------|-------------|----------------|
-| [pto.trowsum](./ops/reduce-and-expand/trowsum.md) | Sum reduction along rows | `TROWSUM(dst, src, tmp)` |
-| [pto.trowprod](./ops/reduce-and-expand/trowprod.md) | Product reduction along rows | `TROWPROD(dst, src, tmp)` |
-| [pto.trowmax](./ops/reduce-and-expand/trowmax.md) | Maximum reduction along rows | `TROWMAX(dst, src, tmp)` |
-| [pto.trowmin](./ops/reduce-and-expand/trowmin.md) | Minimum reduction along rows | `TROWMIN(dst, src, tmp)` |
-| [pto.trowargmax](./ops/reduce-and-expand/trowargmax.md) | Index of maximum along rows | `TROWARGMAX(dst, src, tmp)` |
-| [pto.trowargmin](./ops/reduce-and-expand/trowargmin.md) | Index of minimum along rows | `TROWARGMIN(dst, src, tmp)` |
-
-### Reduce (Column)
-
-| Operation | Description | C++ Intrinsic |
-|-----------|-------------|----------------|
-| [pto.tcolsum](./ops/reduce-and-expand/tcolsum.md) | Sum reduction along columns | `TCOLSUM(dst, src)` |
-| [pto.tcolprod](./ops/reduce-and-expand/tcolprod.md) | Product reduction along columns | `TCOLPROD(dst, src)` |
-| [pto.tcolmax](./ops/reduce-and-expand/tcolmax.md) | Maximum reduction along columns | `TCOLMAX(dst, src)` |
-| [pto.tcolmin](./ops/reduce-and-expand/tcolmin.md) | Minimum reduction along columns | `TCOLMIN(dst, src)` |
-| [pto.tcolargmax](./ops/reduce-and-expand/tcolargmax.md) | Index of maximum along columns | `TCOLARGMAX(dst, src, tmp)` |
-| [pto.tcolargmin](./ops/reduce-and-expand/tcolargmin.md) | Index of minimum along columns | `TCOLARGMIN(dst, src, tmp)` |
-
-### Expand (Row)
-
-| Operation | Description | C++ Intrinsic |
-|-----------|-------------|----------------|
-| [pto.trowexpand](./ops/reduce-and-expand/trowexpand.md) | Expand row scalar to full tile | `TROWEXPAND(dst, src)` |
-| [pto.trowexpandadd](./ops/reduce-and-expand/trowexpandadd.md) | Expand row and add | `TROWEXPANDADD(dst, src0, src1)` |
-| [pto.trowexpandsub](./ops/reduce-and-expand/trowexpandsub.md) | Expand row and subtract | `TROWEXPSUB(dst, src0, src1)` |
-| [pto.trowexpandmul](./ops/reduce-and-expand/trowexpandmul.md) | Expand row and multiply | `TROWEXPMUL(dst, src0, src1)` |
-| [pto.trowexpanddiv](./ops/reduce-and-expand/trowexpanddiv.md) | Expand row and divide | `TROWEXPDIV(dst, src0, src1)` |
-| [pto.trowexpandmax](./ops/reduce-and-expand/trowexpandmax.md) | Expand row and max | `TROWEXPANDMAX(dst, src0, src1)` |
-| [pto.trowexpandmin](./ops/reduce-and-expand/trowexpandmin.md) | Expand row and min | `TROWEXPANDMIN(dst, src0, src1)` |
-| [pto.trowexpandexpdif](./ops/reduce-and-expand/trowexpandexpdif.md) | Expand with exponential difference | `TROWEXPDIF(dst, src0, src1)` |
-
-### Expand (Column)
-
-| Operation | Description | C++ Intrinsic |
-|-----------|-------------|----------------|
-| [pto.tcolexpand](./ops/reduce-and-expand/tcolexpand.md) | Expand column scalar to full tile | `TCOLEXPAND(dst, src)` |
-| [pto.tcolexpandadd](./ops/reduce-and-expand/tcolexpandadd.md) | Expand column and add | `TCOLEXPANDADD(dst, src0, src1)` |
-| [pto.tcolexpandsub](./ops/reduce-and-expand/tcolexpandsub.md) | Expand column and subtract | `TCOLEXPSUB(dst, src0, src1)` |
-| [pto.tcolexpandmul](./ops/reduce-and-expand/tcolexpandmul.md) | Expand column and multiply | `TCOLEXPMUL(dst, src0, src1)` |
-| [pto.tcolexpanddiv](./ops/reduce-and-expand/tcolexpanddiv.md) | Expand column and divide | `TCOLEXPDIV(dst, src0, src1)` |
-| [pto.tcolexpandmax](./ops/reduce-and-expand/tcolexpandmax.md) | Expand column and max | `TCOLEXPANDMAX(dst, src0, src1)` |
-| [pto.tcolexpandmin](./ops/reduce-and-expand/tcolexpandmin.md) | Expand column and min | `TCOLEXPANDMIN(dst, src0, src1)` |
-| [pto.tcolexpandexpdif](./ops/reduce-and-expand/tcolexpandexpdif.md) | Expand with exponential difference | `TCOLEXPDIF(dst, src0, src1)` |
-
-## Mechanism
-
-### Reduce
-
-For each row `r`, reduce along the column axis:
-
-$$ \mathrm{dst}_r = \bigoplus_{c=0}^{C-1} \mathrm{src}_{r,c} $$
-
-For each column `c`, reduce along the row axis:
-
-$$ \mathrm{dst}_c = \bigoplus_{r=0}^{R-1} \mathrm{src}_{r,c} $$
-
-where $\bigoplus$ is the reduction operator (sum, max, min, prod).
-
-### Expand
-
-Expand takes a 1D tile of shape `(R)` or `(C)` and broadcasts it to a 2D tile of shape `(R, C)`:
-
-$$ \mathrm{dst}_{r,c} = \mathrm{src}_r \quad \text{(row expand)} $$
-
-$$ \mathrm{dst}_{r,c} = \mathrm{src}_c \quad \text{(column expand)} $$
-
-Expand variants combine the broadcast with an elementwise operation using a second source tile:
-
-$$ \mathrm{dst}_{r,c} = \mathrm{src0}_{r,c} \;\oplus\; \mathrm{src1}_r \quad \text{(row expand with op)} $$
-
-## Output Shape
-
-| Operation | Input Shape | Output Shape |
-|-----------|-------------|-------------|
-| Row reduce | `(R, C)` | `(R, 1)` |
-| Column reduce | `(R, C)` | `(1, C)` |
-| Row expand | `(R, 1)` | `(R, C)` |
-| Column expand | `(1, C)` | `(R, C)` |
-
-## Type Support by Target Profile
-
-| Element Type | CPU Simulator | A2/A3 | A5 |
-|------------|:-------------:|:------:|:--:|
-| f32 (float) | Yes | Yes | Yes |
-| f16 (half) | Yes | Yes | Yes |
-| bf16 (bfloat16_t) | Yes | Yes | Yes |
-| i8 / u8 | Yes | Yes | Yes |
-| i16 / u16 | Yes | Yes | Yes |
-| i32 / u32 | Yes | Yes | Yes |
-| i64 / u64 | Yes | Yes | Yes |
-
-## Constraints
-
-- The source tile's valid region determines the reduction domain.
-- Arg variants (`*argmax`, `*argmin`) produce an **integer index tile**, not a numeric value tile.
-- The destination tile for reduce operations has extent `1` along the reduced axis.
-- Expand variants require a second source tile with shape `(R)` or `(C)` matching the expand axis.
-- Exp-diff variants compute: `dst = exp(src0 - src1)` — used for softmax-style reductions.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** reduce along an axis with zero extent.
-- **MUST NOT** use arg variants with non-numeric element types.
-- **MUST NOT** use expand variants with mismatched expand-axis lengths.
-
-## C++ Intrinsic
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Row reduce (requires temporary tile)
-template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWSUM(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWPROD(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWMAX(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWMIN(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWARGMAX(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
-PTO_INST RecordEvent TROWARGMIN(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
-
-// Column reduce
-template <typename TileDst, typename TileSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCOLSUM(TileDst& dst, TileSrc& src, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCOLPROD(TileDst& dst, TileSrc& src, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCOLMAX(TileDst& dst, TileSrc& src, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCOLMIN(TileDst& dst, TileSrc& src, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLARGMAX(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
-PTO_INST RecordEvent TCOLARGMIN(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
-
-// Row expand
-template <typename TileDst, typename TileSrc, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPAND(TileDst& dst, TileSrc& src, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPSUB(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPMUL(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPDIV(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMAX(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPANDMIN(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TROWEXPDIF(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-// Column expand
-template <typename TileDst, typename TileSrc, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPAND(TileDst& dst, TileSrc& src, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPSUB(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPMUL(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPDIV(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDMAX(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPANDMIN(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-
-template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
-PTO_INST RecordEvent TCOLEXPDIF(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
-```
-
-## See Also
-
-- [Tile families](../instruction-families/tile-families.md) — Family overview
-- [Tile instruction surface](../instruction-surfaces/tile-instructions.md) — Surface description
diff --git a/docs/mkdocs/src/docs/isa/tile/reduce-and-expand_zh.md b/docs/mkdocs/src/docs/isa/tile/reduce-and-expand_zh.md
deleted file mode 100644
index 8c471870..00000000
--- a/docs/mkdocs/src/docs/isa/tile/reduce-and-expand_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Reduce And Expand Family
-
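-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](reduce-and-expand.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.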
diff --git a/docs/mkdocs/src/docs/isa/tile/sync-and-config.md b/docs/mkdocs/src/docs/isa/tile/sync-and-config.md
deleted file mode 100644
index 749bc2a3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/sync-and-config.md
+++ /dev/null
@@ -1,87 +0,0 @@
-<!-- Generated from `docs/isa/tile/sync-and-config.md` -->
-
-# Sync And Config Family
-
-Sync-and-config operations manage tile-visible state: resource binding, event setup, mode control, and synchronization. These operations do not produce arithmetic payload — they change state that later tile instructions consume.
-
-## Operations
-
-| Operation | Description | Category | C++ Intrinsic |
-|-----------|-------------|----------|---------------|
-| [pto.tassign](./ops/sync-and-config/tassign.md) | Bind tile register to a UB address | Resource | `TASSIGN(tile, addr)` |
-| [pto.tsync](./ops/sync-and-config/tsync.md) | Synchronize execution, wait on events, insert barrier | Sync | `TSYNC(events...)` |
-| [pto.tsethf32mode](./ops/sync-and-config/tsethf32mode.md) | Set HF32 computation mode | Config | `TSETHF32MODE(mode)` |
-| [pto.tsettf32mode](./ops/sync-and-config/tsettf32mode.md) | Set TF32 computation mode | Config | `TSETTF32MODE(mode)` |
-| [pto.tsetfmatrix](./ops/sync-and-config/tsetfmatrix.md) | Set FMatrix layout configuration | Config | `TSETFMATRIX(cfg)` |
-| [pto.tset_img2col_rpt](./ops/sync-and-config/tset-img2col-rpt.md) | Set img2col repetition count | Config | `TSET_IMG2COL_RPT(rpt)` |
-| [pto.tset_img2col_padding](./ops/sync-and-config/tset-img2col-padding.md) | Set img2col padding configuration | Config | `TSET_IMG2COL_PADDING(pad)` |
-| [pto.tsubview](./ops/sync-and-config/tsubview.md) | Create a sub-view of a tile | View | `TSUBVIEW(tile, offsets, shape)` |
-| [pto.tget_scale_addr](./ops/sync-and-config/tget-scale-addr.md) | Get scale address for quantized matmul | Config | `TGET_SCALE_ADDR(tile)` |
-
-## Mechanism
-
-Sync-and-config operations change tile-visible state that later tile instructions consume:
-
-- **`TASSIGN`**: binds a physical UB address to a tile register. Without `TASSIGN`, the compiler/runtime auto-assigns addresses. `TASSIGN` enables manual placement for performance tuning.
-- **`TSYNC`**: waits on event tokens (`events...`) or inserts per-op pipeline barriers (`TSYNC<Op>()`). See [Ordering and Synchronization](../machine-model/ordering-and-synchronization.md) for the full event model.
-- **`TSET*`**: configure mode registers that affect how later operations interpret their inputs or produce results. The affected operations are those that consume the configured mode.
-- **`TSUBVIEW`**: creates a logical view of a tile with adjusted offsets and/or reduced shape. The underlying storage is shared with the source tile.
-- **`TGET_SCALE_ADDR`**: retrieves the physical UB address of a scale tensor used in quantized matmul operations.
-
-## Sync Model
-
-`TSYNC` operates at two levels:
-
-1. **Event-wait form**: `TSYNC(%e0, %e1)` blocks until the specified events have been recorded. Events are produced by preceding operations (e.g., `TLOAD` produces an event; `TSYNC` waits on it).
-
-2. **Barrier form**: `TSYNC<Op>()` inserts a pipeline barrier for the specified operation class. All operations of class `Op` that appear before the barrier complete before any operation of class `Op` that appears after the barrier begins.
-
-See [Producer-Consumer Ordering](../memory-model/producer-consumer-ordering.md) for the complete synchronization model.
-
-## Constraints
-
-- `TASSIGN` binds an address; binding the same address to two non-aliasing tiles at the same time is undefined behavior.
-- `TSYNC` with no operands is a no-op.
-- `TSET*` mode configurations affect subsequent operations until the next mode-setting operation of the same kind.
-- `TSUBVIEW` creates a view with reduced shape; accessing elements outside the view's shape but within the underlying tile's shape is undefined behavior.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** use the same physical tile register for two non-aliasing tiles without an intervening `TSYNC`.
-- **MUST NOT** wait on an event that has not been produced by a preceding operation.
-- **MUST NOT** configure mode registers while dependent operations are in-flight.
-
-## C++ Intrinsic
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Assign tile to UB address
-template <typename TileT>
-PTO_INST void TASSIGN(TileT& tile, uint64_t addr);
-
-// Synchronize on events
-template <typename... EventTs>
-PTO_INST RecordEvent TSYNC(EventTs&... events);
-
-// Pipeline barrier for op class
-template <typename OpTag>
-PTO_INST void TSYNC();
-
-// Set computation modes
-PTO_INST void TSETHF32MODE(HF32Mode mode);
-PTO_INST void TSETTF32MODE(TF32Mode mode);
-PTO_INST void TSETFMATRIX(FMatrixConfig cfg);
-
-// Subview creation
-template <typename TileT>
-PTO_INST TileT TSUBVIEW(TileT& src, int rowOffset, int colOffset,
-                         int newRows, int newCols);
-```
-
-## See Also
-
-- [Tile families](../instruction-families/tile-families.md) — Family overview
-- [Ordering and Synchronization](../machine-model/ordering-and-synchronization.md) — Event model
-- [Tile instruction surface](../instruction-surfaces/tile-instructions.md) — Surface description
diff --git a/docs/mkdocs/src/docs/isa/tile/sync-and-config_zh.md b/docs/mkdocs/src/docs/isa/tile/sync-and-config_zh.md
deleted file mode 100644
index eed724e3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/sync-and-config_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Sync And Config Family
-
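-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](sync-and-config.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.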
diff --git a/docs/mkdocs/src/docs/isa/tile/tile-scalar-and-immediate.md b/docs/mkdocs/src/docs/isa/tile/tile-scalar-and-immediate.md
deleted file mode 100644
index c00d5f3a..00000000
--- a/docs/mkdocs/src/docs/isa/tile/tile-scalar-and-immediate.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!-- Generated from `docs/isa/tile/tile-scalar-and-immediate.md` -->
-
-# Tile-Scalar And Immediate Family
-
-Tile-scalar operations combine a tile operand with a scalar value or immediate operand. The scalar is broadcast to match the tile shape. Comparison variants produce a predicate tile.
-
-## Operations
-
-| Operation | Description | Category | C++ Intrinsic |
-|-----------|-------------|----------|----------------|
-| [pto.tadds](./ops/tile-scalar-and-immediate/tadds.md) | Elementwise addition with scalar | Binary | `TADDS(dst, src, scalar)` |
-| [pto.tsubs](./ops/tile-scalar-and-immediate/tsubs.md) | Elementwise subtraction with scalar | Binary | `TSUBS(dst, src, scalar)` |
-| [pto.tmuls](./ops/tile-scalar-and-immediate/tmuls.md) | Elementwise multiplication with scalar | Binary | `TMULS(dst, src, scalar)` |
-| [pto.tdivs](./ops/tile-scalar-and-immediate/tdivs.md) | Elementwise division with scalar | Binary | `TDIVS(dst, src, scalar)` |
-| [pto.tfmods](./ops/tile-scalar-and-immediate/tfmods.md) | Elementwise modulo with scalar | Binary | `TFMODS(dst, src, scalar)` |
-| [pto.trems](./ops/tile-scalar-and-immediate/trems.md) | Elementwise remainder with scalar | Binary | `TREMS(dst, src, scalar)` |
-| [pto.tmins](./ops/tile-scalar-and-immediate/tmins.md) | Elementwise minimum with scalar | Binary | `TMINS(dst, src, scalar)` |
-| [pto.tmaxs](./ops/tile-scalar-and-immediate/tmaxs.md) | Elementwise maximum with scalar | Binary | `TMAXS(dst, src, scalar)` |
-| [pto.tands](./ops/tile-scalar-and-immediate/tands.md) | Elementwise AND with scalar | Binary | `TANDS(dst, src, scalar)` |
-| [pto.tors](./ops/tile-scalar-and-immediate/tors.md) | Elementwise OR with scalar | Binary | `TORS(dst, src, scalar)` |
-| [pto.txors](./ops/tile-scalar-and-immediate/txors.md) | Elementwise XOR with scalar | Binary | `TXORS(dst, src, scalar)` |
-| [pto.tshls](./ops/tile-scalar-and-immediate/tshls.md) | Shift left by scalar | Binary | `TSHLS(dst, src, shift)` |
-| [pto.tshrs](./ops/tile-scalar-and-immediate/tshrs.md) | Shift right by scalar | Binary | `TSHRS(dst, src, shift)` |
-| [pto.tlrelu](./ops/tile-scalar-and-immediate/tlrelu.md) | Leaky ReLU with scalar slope | Binary | `TLRELU(dst, src, slope)` |
-| [pto.taddsc](./ops/tile-scalar-and-immediate/taddsc.md) | Saturating add with scalar | Binary | `TADDSC(dst, src, scalar)` |
-| [pto.tsubsc](./ops/tile-scalar-and-immediate/tsubsc.md) | Saturating subtract with scalar | Binary | `TSUBSC(dst, src, scalar)` |
-| [pto.texpands](./ops/tile-scalar-and-immediate/texpands.md) | Compare tile with scalar, produce predicate | Comparison | `TEXPANDS(dst, src, scalar)` |
-| [pto.tcmps](./ops/tile-scalar-and-immediate/tcmps.md) | Compare tile with scalar, produce predicate | Comparison | `TCMPS(dst, src, scalar, cmp)` |
-| [pto.tsels](./ops/tile-scalar-and-immediate/tsels.md) | Select from two tiles based on scalar predicate | Selection | `TSELS(dst, src0, src1, pred)` |
-
-## Mechanism
-
-For each lane `(r, c)` in the destination's valid region:
-
-$$ \mathrm{dst}_{r,c} = f(\mathrm{src}_{r,c}, \mathrm{scalar}) $$
-
-The scalar operand is broadcast to all lanes. Comparison operations produce a predicate tile: lane `(r, c)` is `1` where the condition holds, `0` otherwise.
-
-## Scalar Operand
-
-The scalar operand may be:
-
-- A scalar register value (`!pto.scalar<T>`)
-- A compile-time immediate constant
-- A runtime scalar value passed as a parameter
-
-The scalar type must be compatible with the tile element type. No implicit type conversion is performed.
-
-## Saturating Variants
-
-`TADDSC` and `TSUBSC` perform saturating arithmetic (clamp to type min/max on overflow/underflow), in contrast to `TADDS`/`TSUBS` which use wrapping semantics.
-
-## Type Support by Target Profile
-
-| Element Type | CPU Simulator | A2/A3 | A5 |
-|------------|:-------------:|:------:|:--:|
-| f32 (float) | Yes | Yes | Yes |
-| f16 (half) | Yes | Yes | Yes |
-| bf16 (bfloat16_t) | Yes | Yes | Yes |
-| i8 / u8 | Yes | Yes | Yes |
-| i16 / u16 | Yes | Yes | Yes |
-| i32 / u32 | Yes | Yes | Yes |
-| i64 / u64 | Yes | Yes | Yes |
-
-## Constraints
-
-- The scalar type MUST be compatible with the tile element type.
-- Shift operations (`TSHLS`, `TSHRS`) interpret the scalar as an unsigned integer shift count.
-- Saturating variants (`TADDSC`, `TSUBSC`) clamp results to type min/max on overflow/underflow.
-- Comparison variants produce a predicate tile, not a numeric tile.
-
-## Cases That Are Not Allowed
-
-- **MUST NOT** use a scalar type that is not compatible with the tile element type.
-- **MUST NOT** use a shift count `>=` element bit-width.
-- **MUST NOT** rely on implicit type promotion between scalar and tile types.
-
-## C++ Intrinsic
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// Arithmetic with scalar
-template <typename TileDst, typename TileSrc, typename ScalarT>
-PTO_INST RecordEvent TADDS(TileDst& dst, TileSrc& src, ScalarT scalar);
-
-template <typename TileDst, typename TileSrc, typename ScalarT>
-PTO_INST RecordEvent TMULS(TileDst& dst, TileSrc& src, ScalarT scalar);
-
-template <typename TileDst, typename TileSrc, typename ScalarT>
-PTO_INST RecordEvent TMAXS(TileDst& dst, TileSrc& src, ScalarT scalar);
-
-// Saturating arithmetic with scalar
-template <typename TileDst, typename TileSrc, typename ScalarT>
-PTO_INST RecordEvent TADDSC(TileDst& dst, TileSrc& src, ScalarT scalar);
-
-// Shift by scalar (shift is unsigned integer)
-template <typename TileDst, typename TileSrc>
-PTO_INST RecordEvent TSHLS(TileDst& dst, TileSrc& src, uint32_t shift);
-
-// Leaky ReLU: dst[i,j] = (src[i,j] > 0) ? src[i,j] : slope * src[i,j]
-template <typename TileDst, typename TileSrc, typename ScalarT>
-PTO_INST RecordEvent TLRELU(TileDst& dst, TileSrc& src, ScalarT slope);
-
-// Comparison with scalar (produces predicate tile)
-template <typename TileDst, typename TileSrc, typename ScalarT>
-PTO_INST RecordEvent TCMPS(TileDst& dst, TileSrc& src, ScalarT scalar, CompareMode cmp);
-```
-
-## See Also
-
-- [Tile families](../instruction-families/tile-families.md) — Family overview
-- [Tile instruction surface](../instruction-surfaces/tile-instructions.md) — Surface description
diff --git a/docs/mkdocs/src/docs/isa/tile/tile-scalar-and-immediate_zh.md b/docs/mkdocs/src/docs/isa/tile/tile-scalar-and-immediate_zh.md
deleted file mode 100644
index b29b7db3..00000000
--- a/docs/mkdocs/src/docs/isa/tile/tile-scalar-and-immediate_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Tile-Scalar And Immediate Family
-
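-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](tile-scalar-and-immediate.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.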
diff --git a/docs/mkdocs/src/docs/isa/vector/README.md b/docs/mkdocs/src/docs/isa/vector/README.md
deleted file mode 100644
index a9884e37..00000000
--- a/docs/mkdocs/src/docs/isa/vector/README.md
+++ /dev/null
@@ -1,44 +0,0 @@
-<!-- Generated from `docs/isa/vector/README.md` -->
-
-# Vector ISA Reference
-
-This section documents the `pto.v*` vector micro-instruction surface of PTO ISA. Pages are organized by family, with standalone per-op pages under `vector/ops/`.
-
-## Families
-
-| Family | Description | Operations |
-|--------|-------------|------------|
-| [Vector Load Store](./vector-load-store.md) | UB↔vector register transfer, gather/scatter | ~25 |
-| [Predicate and Materialization](./predicate-and-materialization.md) | Vector broadcast and duplication | 2 |
-| [Unary Vector Ops](./unary-vector-ops.md) | Single-input elementwise operations | 12 |
-| [Binary Vector Ops](./binary-vector-ops.md) | Two-input elementwise operations | 14 |
-| [Vec-Scalar Ops](./vec-scalar-ops.md) | Vector combined with scalar operand | 14 |
-| [Conversion Ops](./conversion-ops.md) | Type conversion between numeric types | 3 |
-| [Reduction Ops](./reduction-ops.md) | Cross-lane reductions | 6 |
-| [Compare and Select](./compare-select.md) | Comparison and conditional selection | 5 |
-| [Data Rearrangement](./data-rearrangement.md) | Lane permutation and packing | 10 |
-| [SFU and DSA Ops](./sfu-and-dsa-ops.md) | Special function units and DSA ops | 11 |
-
-## Quick Reference
-
-### Common Vector Types
-
-| Type | Description |
-|------|-------------|
-| `!pto.vreg<NxT>` | Vector register with N lanes of type T |
-| `!pto.mask` | Predicate mask (width matches vector length) |
-| `!pto.scalar<T>` | Scalar register |
-
-### Vector Lengths
-
-Vector length `N` is a power of 2. Common values depend on the target profile.
-
-## Navigation
-
-The left sidebar provides standalone per-op pages for all vector surface instructions. Use the family overviews above to understand shared constraints and mechanisms before reading individual opcode pages.
-
-## See Also
-
-- [Vector instruction surface](../instruction-surfaces/vector-instructions.md)
-- [Vector families](../instruction-families/vector-families.md)
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/README_zh.md b/docs/mkdocs/src/docs/isa/vector/README_zh.md
deleted file mode 100644
index eee77d73..00000000
--- a/docs/mkdocs/src/docs/isa/vector/README_zh.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Vector ISA Reference
-
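-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](README.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Existing Chinese instruction descriptions](../README_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.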
diff --git a/docs/mkdocs/src/docs/isa/vector/binary-vector-ops.md b/docs/mkdocs/src/docs/isa/vector/binary-vector-ops.md
deleted file mode 100644
index 3e03274b..00000000
--- a/docs/mkdocs/src/docs/isa/vector/binary-vector-ops.md
+++ /dev/null
@@ -1,279 +0,0 @@
-<!-- Generated from `docs/isa/vector/binary-vector-ops.md` -->
-
-# Vector Families: Binary Vector Ops
-
-This page documents two-input `pto.v*` compute families. The detailed per-op sections below are imported into the PTO ISA manual because vector micro-instruction legality and operand discipline are part of the PTO architecture contract, not external notes.
-
-> **Category:** Two-input vector operations
-> **Pipeline:** PIPE_V (Vector Core)
-
-Element-wise operations that take two vector inputs and produce one vector output.
-
-## Common Operand Model
-
-- `%lhs` and `%rhs` are the two source vector register values.
-- `%mask` is the predicate operand `Pg` that gates which lanes participate.
-- `%result` is the destination vector register value. Unless explicitly noted,
-  it has the same lane count and element type as the inputs.
-- Unless explicitly documented otherwise, `%lhs`, `%rhs`, and `%result` MUST
-  have matching vector shapes and element types.
-
----
-
-## Arithmetic
-
-### `pto.vadd`
-
-- **syntax:** `%result = pto.vadd %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i8-i64, f16, bf16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] + src1[i];
-```
-
-- **inputs:** `%lhs` and `%rhs` are added lane-wise; `%mask` selects active
-  lanes.
-- **outputs:** `%result` is the lane-wise sum.
-- **constraints and limitations:** Input and result types MUST match.
-
----
-
-### `pto.vsub`
-
-- **syntax:** `%result = pto.vsub %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i8-i64, f16, bf16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] - src1[i];
-```
-
-- **inputs:** `%lhs` is the minuend, `%rhs` is the subtrahend, and `%mask`
-  selects active lanes.
-- **outputs:** `%result` is the lane-wise difference.
-- **constraints and limitations:** Input and result types MUST match.
-
----
-
-### `pto.vmul`
-
-- **syntax:** `%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i16-i32, f16, bf16, f32 (**NOT** i8/u8)
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] * src1[i];
-```
-
-- **inputs:** `%lhs` and `%rhs` are multiplied lane-wise; `%mask` selects
-  active lanes.
-- **outputs:** `%result` is the lane-wise product.
-- **constraints and limitations:** The current A5 profile excludes `i8/u8`
-  forms from this surface.
-
----
-
-### `pto.vdiv`
-
-- **syntax:** `%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32 only (no integer division)
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] / src1[i];
-```
-
-- **inputs:** `%lhs` is the numerator, `%rhs` is the denominator, and `%mask`
-  selects active lanes.
-- **outputs:** `%result` is the lane-wise quotient.
-- **constraints and limitations:** Floating-point element types only. Active
-  denominators containing `+0` or `-0` follow the target's exceptional
-  behavior.
-
----
-
-### `pto.vmax`
-
-- **syntax:** `%result = pto.vmax %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i8-i32, f16, bf16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src0[i] > src1[i]) ? src0[i] : src1[i];
-```
-
-- **inputs:** `%lhs`, `%rhs`, and `%mask` as above.
-- **outputs:** `%result` holds the lane-wise maximum.
-- **constraints and limitations:** Input and result types MUST match.
-
----
-
-### `pto.vmin`
-
-- **syntax:** `%result = pto.vmin %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i8-i32, f16, bf16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src0[i] < src1[i]) ? src0[i] : src1[i];
-```
-
-- **inputs:** `%lhs`, `%rhs`, and `%mask` as above.
-- **outputs:** `%result` holds the lane-wise minimum.
-- **constraints and limitations:** Input and result types MUST match.
-
----
-
-## Bitwise
-
-### `pto.vand`
-
-- **syntax:** `%result = pto.vand %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** all integer types
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] & src1[i];
-```
-
-- **inputs:** `%lhs`, `%rhs`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise bitwise AND.
-- **constraints and limitations:** Integer element types only.
-
----
-
-### `pto.vor`
-
-- **syntax:** `%result = pto.vor %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** all integer types
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] | src1[i];
-```
-
-- **inputs:** `%lhs`, `%rhs`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise bitwise OR.
-- **constraints and limitations:** Integer element types only.
-
----
-
-### `pto.vxor`
-
-- **syntax:** `%result = pto.vxor %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** all integer types
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] ^ src1[i];
-```
-
-- **inputs:** `%lhs`, `%rhs`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise bitwise XOR.
-- **constraints and limitations:** Integer element types only.
-
----
-
-## Shift
-
-### `pto.vshl`
-
-- **syntax:** `%result = pto.vshl %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** all integer types
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] << src1[i];
-```
-
-- **inputs:** `%lhs` supplies the shifted value, `%rhs` supplies the per-lane
-  shift amount, and `%mask` selects active lanes.
-- **outputs:** `%result` is the shifted vector.
-- **constraints and limitations:** Integer element types only. Shift counts
-  SHOULD stay within `[0, bitwidth(T) - 1]`; out-of-range behavior is
-  target-defined unless the verifier narrows it further.
-
----
-
-### `pto.vshr`
-
-- **syntax:** `%result = pto.vshr %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** all integer types
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] >> src1[i];  // arithmetic for signed, logical for unsigned
-```
-
-- **inputs:** `%lhs` supplies the shifted value, `%rhs` supplies the per-lane
-  shift amount, and `%mask` selects active lanes.
-- **outputs:** `%result` is the shifted vector.
-- **constraints and limitations:** Integer element types only. Signedness of the
-  element type determines arithmetic vs logical behavior.
-
----
-
-## Carry Operations
-
-### `pto.vaddc`
-
-- **syntax:** `%result, %carry = pto.vaddc %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>, !pto.mask`
-- **semantics:** Add with carry output.
-
-```c
-for (int i = 0; i < N; i++) {
-    uint64_t r = (uint64_t)src0[i] + src1[i];
-    dst[i] = (T)r;
-    carry[i] = (r >> bitwidth);
-}
-```
-
-- **inputs:** `%lhs` and `%rhs` are added lane-wise and `%mask` selects active
-  lanes.
-- **outputs:** `%result` is the truncated arithmetic result and `%carry` is the
-  carry/overflow predicate per lane.
-- **constraints and limitations:** This is a carry-chain integer add family. On
-  the current A5 surface, it SHOULD be treated as an unsigned integer
-  operation.
-
----
-
-### `pto.vsubc`
-
-- **syntax:** `%result, %borrow = pto.vsubc %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>, !pto.mask`
-- **semantics:** Subtract with borrow output.
-
-```c
-for (int i = 0; i < N; i++) {
-    dst[i] = src0[i] - src1[i];
-    borrow[i] = (src0[i] < src1[i]);
-}
-```
-
-- **inputs:** `%lhs` and `%rhs` are subtracted lane-wise and `%mask` selects
-  active lanes.
-- **outputs:** `%result` is the arithmetic difference and `%borrow` marks lanes
-  that borrowed.
-- **constraints and limitations:** This operation SHOULD be treated as an
-  unsigned 32-bit carry-chain family unless and until the verifier states
-  otherwise.
-
----
-
-## Typical Usage
-
-```mlir
-// Vector addition
-%sum = pto.vadd %a, %b, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Element-wise multiply
-%prod = pto.vmul %x, %y, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Clamp to range [min, max]
-%clamped_low = pto.vmax %input, %min_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-%clamped = pto.vmin %clamped_low, %max_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Bit manipulation
-%masked = pto.vand %data, %bitmask, %mask : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask -> !pto.vreg<64xi32>
-```
diff --git a/docs/mkdocs/src/docs/isa/vector/binary-vector-ops_zh.md b/docs/mkdocs/src/docs/isa/vector/binary-vector-ops_zh.md
deleted file mode 100644
index a7c1b832..00000000
--- a/docs/mkdocs/src/docs/isa/vector/binary-vector-ops_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Binary Vector Ops
-
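-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](binary-vector-ops.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.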
diff --git a/docs/mkdocs/src/docs/isa/vector/compare-select.md b/docs/mkdocs/src/docs/isa/vector/compare-select.md
deleted file mode 100644
index ae607181..00000000
--- a/docs/mkdocs/src/docs/isa/vector/compare-select.md
+++ /dev/null
@@ -1,183 +0,0 @@
-<!-- Generated from `docs/isa/vector/compare-select.md` -->
-
-# Vector Families: Compare And Select
-
-This page documents `pto.v*` compare and select families. Predicate production and selection behavior are specified here because later vector and control operations depend on those exact mask semantics.
-
-> **Category:** Comparison and conditional selection operations
-> **Pipeline:** PIPE_V (Vector Core)
-
-Operations that compare vectors and conditionally select elements.
-
-## Common Operand Model
-
-- `%src0` and `%src1` are source vector operands.
-- `%scalar` is the scalar operand for scalar-comparison families.
-- `%seed` is the incoming predicate that limits which lanes participate in the
-  compare.
-- `%result` is either a predicate mask (`vcmp`, `vcmps`) or a vector register
-  (`vsel`, `vselr`, `vselrv2`).
-
----
-
-## Comparison Operations
-
-### `pto.vcmp`
-
-- **syntax:** `%result = pto.vcmp %src0, %src1, %seed, "CMP_MODE" : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.mask`
-- **semantics:** Element-wise comparison, output predicate mask.
-
-```c
-for (int i = 0; i < N; i++)
-    if (seed[i])
-        dst[i] = (src0[i] CMP src1[i]) ? 1 : 0;
-```
-
-**Compare modes:**
-
-| Mode | Operation |
-|------|-----------|
-| `eq` | Equal (==) |
-| `ne` | Not equal (!=) |
-| `lt` | Less than (<) |
-| `le` | Less than or equal (<=) |
-| `gt` | Greater than (>) |
-| `ge` | Greater than or equal (>=) |
-
-**Example:**
-```mlir
-%all_active = pto.pset_b32 "PAT_ALL" : !pto.mask
-%lt_mask = pto.vcmp %a, %b, %all_active, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-// lt_mask[i] = 1 if a[i] < b[i]
-```
-
-- **inputs:** `%src0`, `%src1`, and `%seed`; `CMP_MODE` selects the comparison
-  predicate.
-- **outputs:** `%result` is the generated predicate mask.
-- **constraints and limitations:** Only lanes enabled by `%seed` participate.
-  Integer and floating-point comparisons follow their own element-type-specific
-  comparison rules.
-
----
-
-### `pto.vcmps`
-
-- **syntax:** `%result = pto.vcmps %src, %scalar, %seed, "CMP_MODE" : !pto.vreg<NxT>, T, !pto.mask -> !pto.mask`
-- **semantics:** Compare vector against scalar.
-
-```c
-for (int i = 0; i < N; i++)
-    if (seed[i])
-        dst[i] = (src[i] CMP scalar) ? 1 : 0;
-```
-
-**Example:**
-```mlir
-%positive_mask = pto.vcmps %values, %c0_f32, %all_active, "gt"
-    : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-// positive_mask[i] = 1 if values[i] > 0
-```
-
-- **inputs:** `%src` is the vector source, `%scalar` is the scalar comparison
-  value, and `%seed` is the incoming predicate.
-- **outputs:** `%result` is the generated predicate mask.
-- **constraints and limitations:** For 32-bit scalar forms, the scalar source
-  MUST satisfy the backend's legal scalar-source constraints for this family.
-
----
-
-## Selection Operations
-
-### `pto.vsel`
-
-- **syntax:** `%result = pto.vsel %src0, %src1, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **semantics:** Per-lane select based on mask.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = mask[i] ? src0[i] : src1[i];
-```
-
-**Example — Conditional assignment:**
-```mlir
-// dst = mask ? true_vals : false_vals
-%result = pto.vsel %true_vals, %false_vals, %condition
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-- **inputs:** `%src0` is the true-path vector, `%src1` is the false-path vector,
-  and `%mask` selects between them.
-- **outputs:** `%result` is the selected vector.
-- **constraints and limitations:** Source vectors and result MUST have matching
-  vector shapes and element types.
-
----
-
-### `pto.vselr`
-
-- **syntax:** `%result = pto.vselr %src0, %src1 : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **semantics:** Select with reversed mask semantics.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = mask[i] ? src1[i] : src0[i];  // reversed from vsel
-```
-
-- **inputs:** `%src0` and `%src1` are the source vectors.
-- **outputs:** `%result` is the selected vector.
-- **constraints and limitations:** This family preserves reversed-select
-  semantics. If the concrete lowering uses an implicit predicate source, that
-  predicate source MUST be documented by the surrounding IR pattern.
-
----
-
-### `pto.vselrv2`
-
-- **syntax:** `%result = pto.vselrv2 %src0, %src1 : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **semantics:** Variant select form that currently uses the same two-vector operand shape as `pto.vselr`.
-- **inputs:** `%src0` and `%src1` are the source vectors.
-- **outputs:** `%result` is the selected vector.
-- **constraints and limitations:** This page records the surface shape only.
-  Lowering MUST preserve the exact A5 variant semantics selected for this form.
-
----
-
-## Typical Usage
-
-```mlir
-// Clamp negative values to zero (manual ReLU)
-%all = pto.pset_b32 "PAT_ALL" : !pto.mask
-%zero = pto.vbr %c0_f32 : f32 -> !pto.vreg<64xf32>
-%neg_mask = pto.vcmps %input, %c0_f32, %all, "lt" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-%clamped = pto.vsel %zero, %input, %neg_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Element-wise max via compare+select
-%gt_mask = pto.vcmp %a, %b, %all, "gt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-%max_ab = pto.vsel %a, %b, %gt_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Threshold filter
-%above_thresh = pto.vcmps %scores, %threshold, %all, "ge" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-%filtered = pto.vsel %scores, %zero, %above_thresh : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
----
-
-## Compare + Select Pattern
-
-```mlir
-// Softmax-safe exp: x - max is <= 0, so exp(x - max) cannot overflow,
-// but very negative arguments can underflow, so clamp them first
-
-%all = pto.pset_b32 "PAT_ALL" : !pto.mask
-
-// 1. Compare against threshold
-%too_small = pto.vcmps %x_minus_max, %min_exp_arg, %all, "lt"
-    : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-
-// 2. Clamp values below threshold
-%clamped = pto.vsel %min_exp_arg_vec, %x_minus_max, %too_small
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// 3. Safe exp
-%exp_result = pto.vexp %clamped, %all : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
diff --git a/docs/mkdocs/src/docs/isa/vector/compare-select_zh.md b/docs/mkdocs/src/docs/isa/vector/compare-select_zh.md
deleted file mode 100644
index 0f2434c2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/compare-select_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Compare And Select
-
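-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](compare-select.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.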
diff --git a/docs/mkdocs/src/docs/isa/vector/conversion-ops.md b/docs/mkdocs/src/docs/isa/vector/conversion-ops.md
deleted file mode 100644
index 0ac4a10b..00000000
--- a/docs/mkdocs/src/docs/isa/vector/conversion-ops.md
+++ /dev/null
@@ -1,170 +0,0 @@
-<!-- Generated from `docs/isa/vector/conversion-ops.md` -->
-
-# Vector Families: Conversion Ops
-
-This page documents `pto.v*` conversion and index-generation families. Width changes, rounding rules, saturation, and part selection are target-visible constraints and must stay aligned with the PTO ISA verifier and lowering contracts.
-
-> **Category:** Type conversion operations
-> **Pipeline:** PIPE_V (Vector Core)
-
-Operations that convert between data types (float/int, narrowing/widening).
-
-## Common Operand Model
-
-- `%input` is the source vector register value.
-- `%result` is the destination vector register value.
-- `round_mode`, `sat`, and `part` control rounding, saturation, and lane-part
-  selection in attribute form.
-- The single `pto.vcvt` surface covers float-int, float-float, int-float, and
-  int-int conversion families.
-
----
-
-## `pto.vci`
-
-- **syntax:** `%result = pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>`
-- **semantics:** Generate a lane-index vector from a scalar seed/index value.
-- **inputs:**
-  `%index` is the scalar seed or base index.
-- **outputs:**
-  `%result` is the generated index vector.
-- **constraints and limitations:**
-  This is an index-generation family, not a numeric conversion. `ORDER` and the
-  result element type together determine how indices are generated.
-
----
-
-## `pto.vcvt`
-
-- **syntax:** `%result = pto.vcvt %input {round_mode = "ROUND_MODE", sat = "SAT_MODE", part = "PART_MODE"} : !pto.vreg<NxT0> -> !pto.vreg<MxT1>`
-- **semantics:** Type conversion between float/int types with rounding control.
-
-```c
-for (int i = 0; i < min(N, M); i++)
-    dst[i] = convert(src[i], T0, T1, round_mode);
-```
-
-- **inputs:**
-  `%input` is the source vector; attributes select rounding, saturation, and
-  even/odd placement when the conversion changes width.
-- **outputs:**
-  `%result` is the converted vector.
-- **constraints and limitations:**
-  Only documented source/destination type pairs are legal. `PART_EVEN` and
-  `PART_ODD` are only meaningful for width-changing forms that pack two source
-  streams into one destination register.
-
----
-
-### Rounding Modes
-
-| Mode | Description |
-|------|-------------|
-| `ROUND_R` | Round to nearest, ties to even (default) |
-| `ROUND_A` | Round away from zero |
-| `ROUND_F` | Round toward negative infinity (floor) |
-| `ROUND_C` | Round toward positive infinity (ceil) |
-| `ROUND_Z` | Round toward zero (truncate) |
-| `ROUND_O` | Round to odd |
-
----
-
-### Saturation Modes
-
-| Mode | Description |
-|------|-------------|
-| `RS_ENABLE` | Saturate on overflow |
-| `RS_DISABLE` | No saturation (wrap/undefined on overflow) |
-
----
-
-### Part Modes (for width-changing conversions)
-
-| Mode | Description |
-|------|-------------|
-| `PART_EVEN` | Output to even-indexed lanes |
-| `PART_ODD` | Output to odd-indexed lanes |
-
----
-
-### A5 Supported Conversions
-
-**Float-Float (vcvtff):**
-- f32 ↔ f16
-- f32 ↔ bf16
-- f16 ↔ bf16
-
-**Float-Int (vcvtfi):**
-- f16 → i16, f16 → i32
-- f32 → i16, f32 → i32
-- bf16 → i32
-
-**Int-Float (vcvtif):**
-- i16 → f16
-- i32 → f32
-
----
-
-### Width-Changing Conversion Pattern
-
-For conversions that change width (e.g., f32→f16), use even/odd parts and combine:
-
-```mlir
-// Convert two f32 vectors to one f16 vector
-%even = pto.vcvt %in0 {round_mode = "ROUND_R", sat = "RS_ENABLE", part = "PART_EVEN"}
-    : !pto.vreg<64xf32> -> !pto.vreg<128xf16>
-%odd  = pto.vcvt %in1 {round_mode = "ROUND_R", sat = "RS_ENABLE", part = "PART_ODD"}
-    : !pto.vreg<64xf32> -> !pto.vreg<128xf16>
-%result = pto.vor %even, %odd, %mask : !pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask -> !pto.vreg<128xf16>
-```
-
----
-
-## `pto.vtrc`
-
-- **syntax:** `%result = pto.vtrc %input, "ROUND_MODE" : !pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **semantics:** Truncate/round float to integer-valued float (stays in float type).
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = round_to_int_valued_float(src[i], round_mode);
-```
-
-- **inputs:**
-  `%input` is the floating-point source vector and `ROUND_MODE` selects the
-  truncation/rounding rule.
-- **outputs:**
-  `%result` is still a floating-point vector, but each active lane now carries
-  an integer-valued floating-point result.
-- **constraints and limitations:**
-  This op does not change the element type. `ROUND_O` is supported for avoiding
-  double-rounding errors during staged conversions.
-
-**Example:**
-```mlir
-// Round to nearest integer, keep as float
-%rounded = pto.vtrc %input, "ROUND_R" : !pto.vreg<64xf32> -> !pto.vreg<64xf32>
-// input:  [1.4, 2.6, -1.5, 3.0]
-// output: [1.0, 3.0, -2.0, 3.0]
-```
-
----
-
-## Typical Usage
-
-```mlir
-// Quantization: f32 → i8 with saturation
-%scaled = pto.vmuls %input, %scale, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-%quantized = pto.vcvt %scaled {round_mode = "ROUND_R", sat = "RS_ENABLE"}
-    : !pto.vreg<64xf32> -> !pto.vreg<64xi32>
-// Then narrow i32 → i8 via pack ops
-
-// Mixed precision: bf16 → f32 for accumulation
-%f32_vec = pto.vcvt %bf16_input {round_mode = "ROUND_R"}
-    : !pto.vreg<128xbf16> -> !pto.vreg<64xf32>
-
-// Floor for integer division
-%floored = pto.vtrc %ratio, "ROUND_F" : !pto.vreg<64xf32> -> !pto.vreg<64xf32>
-%int_div = pto.vcvt %floored {round_mode = "ROUND_Z"}
-    : !pto.vreg<64xf32> -> !pto.vreg<64xi32>
-```
diff --git a/docs/mkdocs/src/docs/isa/vector/conversion-ops_zh.md b/docs/mkdocs/src/docs/isa/vector/conversion-ops_zh.md
deleted file mode 100644
index cfcd8632..00000000
--- a/docs/mkdocs/src/docs/isa/vector/conversion-ops_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Conversion Ops
-
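-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](conversion-ops.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page linked above or the existing Chinese reference pages.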
diff --git a/docs/mkdocs/src/docs/isa/vector/data-rearrangement.md b/docs/mkdocs/src/docs/isa/vector/data-rearrangement.md
deleted file mode 100644
index 03b000c2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/data-rearrangement.md
+++ /dev/null
@@ -1,295 +0,0 @@
-<!-- Generated from `docs/isa/vector/data-rearrangement.md` -->
-
-# Vector Families: Data Rearrangement
-
-This page documents `pto.v*` rearrangement families. These operations permute or repack vector-visible data without turning into tile movement or DMA, so they stay in the vector surface.
-
-> **Category:** In-register data movement and permutation
-> **Pipeline:** PIPE_V (Vector Core)
-
-Operations that rearrange data within or between vector registers without memory access.
-
-## Common Operand Model
-
-- `%lhs` / `%rhs` are source vector register values.
-- `%src` is a single source vector register value.
-- `%result` is the destination vector register value unless an op explicitly
-  returns multiple vectors.
-- These families do not access UB directly; they only rearrange register
-  contents.
-
----
-
-## Interleave / Deinterleave
-
-### `pto.vintlv`
-
-- **syntax:** `%low, %high = pto.vintlv %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>, !pto.vreg<NxT>`
-- **semantics:** Interleave elements from two sources.
-
-```c
-// Interleave: merge even/odd elements from two sources
-// low  = {src0[0], src1[0], src0[1], src1[1], ...}
-// high = {src0[N/2], src1[N/2], src0[N/2+1], src1[N/2+1], ...}
-```
-
-- **inputs:** `%lhs` and `%rhs` are the two source vectors.
-- **outputs:** `%low` and `%high` are the two destination vectors.
-- **constraints and limitations:** The two outputs form a paired interleave
-  result. The PTO ISA vector surface representation exposes that pair as two SSA results, and the pair ordering MUST
-  be preserved.
-
----
-
-### `pto.vdintlv`
-
-- **syntax:** `%low, %high = pto.vdintlv %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>, !pto.vreg<NxT>`
-- **semantics:** Deinterleave elements into even/odd.
-
-```c
-// Deinterleave: separate even/odd elements
-// low  = {src0[0], src0[2], src0[4], ...}  // even
-// high = {src0[1], src0[3], src0[5], ...}  // odd
-```
-
-- **inputs:** `%lhs` and `%rhs` represent the interleaved source stream in the
-  current PTO ISA vector surface representation.
-- **outputs:** `%low` and `%high` are the separated destination vectors.
-- **constraints and limitations:** The two outputs form the even/odd
-  deinterleave result pair, and their ordering MUST be preserved.
-
----
-
-## Slide / Shift
-
-### `pto.vslide`
-
-- **syntax:** `%result = pto.vslide %src0, %src1, %amt : !pto.vreg<NxT>, !pto.vreg<NxT>, i16 -> !pto.vreg<NxT>`
-- **semantics:** Concatenate two vectors and extract N-element window at offset.
-
-```c
-// Conceptually: tmp[0..2N-1] = {src1, src0}
-// dst[i] = tmp[N - amt + i]
-if (amt >= 0)
-    for (int i = 0; i < N; i++)
-        dst[i] = (i >= amt) ? src0[i - amt] : src1[N - amt + i];
-```
-
-**Use case:** Sliding window operations, shift register patterns.
-
-- **inputs:** `%src0` and `%src1` provide the concatenated source window and
-  `%amt` selects the extraction offset.
-- **outputs:** `%result` is the extracted destination window.
-- **constraints and limitations:** `pto.vslide` operates on the logical
-  concatenation of `%src1` and `%src0`. The source order and extraction offset
-  MUST be preserved exactly.
-
----
-
-### `pto.vshift`
-
-- **syntax:** `%result = pto.vshift %src, %amt : !pto.vreg<NxT>, i16 -> !pto.vreg<NxT>`
-- **semantics:** Single-source slide (shift with zero fill).
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (i >= amt) ? src[i - amt] : 0;
-```
-
-- **inputs:** `%src` is the source vector and `%amt` is the slide amount.
-- **outputs:** `%result` is the shifted vector.
-- **constraints and limitations:** This surface represents the single-source
-  slide/shift family. Zero-fill versus other fill behavior MUST match the
-  selected form.
-
----
-
-## Compress / Expand
-
-### `pto.vsqz`
-
-- **syntax:** `%result = pto.vsqz %src, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **semantics:** Compress — pack active lanes to front.
-
-```c
-int j = 0;
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[j++] = src[i];
-while (j < N) dst[j++] = 0;
-```
-
-**Use case:** Sparse data compaction, filtering.
-
-- **inputs:** `%src` is the source vector and `%mask` selects which elements are
-  kept.
-- **outputs:** `%result` is the compacted vector.
-- **constraints and limitations:** This is a reduction-style compaction family.
-  Preserved element order MUST match source lane order.
-
----
-
-### `pto.vusqz`
-
-- **syntax:** `%result = pto.vusqz %mask : !pto.mask -> !pto.vreg<NxT>`
-- **semantics:** Expand — scatter front elements to active positions.
-
-```c
-int j = 0;
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[i] = src_front[j++];
-    else dst[i] = 0;
-```
-
-- **inputs:** `%mask` is the expansion/placement predicate.
-- **outputs:** `%result` is the expanded vector image.
-- **constraints and limitations:** The source-front stream is implicit in the
-  current surface. Lane placement for active and inactive positions MUST be
-  preserved exactly.
-
----
-
-## Permutation
-
-### `pto.vperm`
-
-- **syntax:** `%result = pto.vperm %src, %index : !pto.vreg<NxT>, !pto.vreg<NxI> -> !pto.vreg<NxT>`
-- **semantics:** In-register permute (table lookup). **Not** memory gather.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[index[i] % N];
-```
-
-**Note:** This operates on register contents, unlike `pto.vgather2` which reads from UB memory.
-
-- **inputs:** `%src` is the source vector and `%index` supplies per-lane source
-  indices.
-- **outputs:** `%result` is the permuted vector.
-- **constraints and limitations:** This is an in-register permutation family.
-  `%index` values outside the legal range follow the wrap/clamp behavior of the
-  selected form.
-
----
-
-### `pto.vselr`
-
-- **syntax:** `%result = pto.vselr %src0, %src1 : !pto.vreg<NxT>, !pto.vreg<NxI> -> !pto.vreg<NxT>`
-- **semantics:** Register select with reversed mask semantics.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = mask[i] ? src1[i] : src0[i];
-```
-
-- **inputs:** `%src0` and `%src1` are source vectors.
-- **outputs:** `%result` is the selected vector.
-- **constraints and limitations:** This page records the rearrangement use of
-  the family; the compare/select page documents the same name from the predicate
-  selection perspective.
-
----
-
-## Pack / Unpack
-
-### `pto.vpack`
-
-- **syntax:** `%result = pto.vpack %src0, %src1, %part : !pto.vreg<NxT_wide>, !pto.vreg<NxT_wide>, index -> !pto.vreg<2NxT_narrow>`
-- **semantics:** Narrowing pack — two wide vectors to one narrow vector.
-
-```c
-// e.g., two vreg<64xi32> → one vreg<128xi16>
-for (int i = 0; i < N; i++) {
-    dst[i]     = truncate(src0[i]);
-    dst[N + i] = truncate(src1[i]);
-}
-```
-
-- **inputs:** `%src0` and `%src1` are wide source vectors and `%part` selects
-  the packing submode.
-- **outputs:** `%result` is the packed narrow vector.
-- **constraints and limitations:** Packing is a narrowing conversion. Source
-  values that do not fit the destination width follow the truncation semantics
-  of the selected packing mode.
-
----
-
-### `pto.vsunpack`
-
-- **syntax:** `%result = pto.vsunpack %src, %part : !pto.vreg<NxT_narrow>, index -> !pto.vreg<N/2xT_wide>`
-- **semantics:** Sign-extending unpack — narrow to wide (half).
-
-```c
-// e.g., vreg<128xi16> → vreg<64xi32> (one half)
-for (int i = 0; i < N/2; i++)
-    dst[i] = sign_extend(src[part_offset + i]);
-```
-
-- **inputs:** `%src` is the packed narrow vector and `%part` selects which half
-  is unpacked.
-- **outputs:** `%result` is the widened vector.
-- **constraints and limitations:** This is the sign-extending unpack family.
-
----
-
-### `pto.vzunpack`
-
-- **syntax:** `%result = pto.vzunpack %src, %part : !pto.vreg<NxT_narrow>, index -> !pto.vreg<N/2xT_wide>`
-- **semantics:** Zero-extending unpack — narrow to wide (half).
-
-```c
-for (int i = 0; i < N/2; i++)
-    dst[i] = zero_extend(src[part_offset + i]);
-```
-
-- **inputs:** `%src` is the packed narrow vector and `%part` selects which half
-  is unpacked.
-- **outputs:** `%result` is the widened vector.
-- **constraints and limitations:** This is the zero-extending unpack family.
-
----
-
-## Typical Usage
-
-```mlir
-// AoS → SoA conversion using deinterleave
-%even, %odd = pto.vdintlv %interleaved0, %interleaved1
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>, !pto.vreg<64xf32>
-
-// Filter: keep only elements passing condition
-%pass_mask = pto.vcmps %values, %threshold, %all, "gt"
-    : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-%compacted = pto.vsqz %values, %pass_mask
-    : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Sliding window sum
-%prev_window = pto.vslide %curr, %prev, %c1
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, i16 -> !pto.vreg<64xf32>
-%window_sum = pto.vadd %curr, %prev_window, %all
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Type narrowing via pack
-%packed_i16 = pto.vpack %wide0_i32, %wide1_i32, %c0
-    : !pto.vreg<64xi32>, !pto.vreg<64xi32>, index -> !pto.vreg<128xi16>
-```
-
----
-
-## V2 Interleave Forms
-
-### `pto.vintlvv2`
-
-- **syntax:** `%result = pto.vintlvv2 %lhs, %rhs, "PART" : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **inputs:** `%lhs` and `%rhs` are source vectors and `PART` selects the
-  returned half of the V2 interleave result.
-- **outputs:** `%result` is the selected interleave half.
-- **constraints and limitations:** This op exposes only one half of the V2
-  result in SSA form.
-
-### `pto.vdintlvv2`
-
-- **syntax:** `%result = pto.vdintlvv2 %lhs, %rhs, "PART" : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **inputs:** `%lhs` and `%rhs` are source vectors and `PART` selects the
-  returned half of the V2 deinterleave result.
-- **outputs:** `%result` is the selected deinterleave half.
-- **constraints and limitations:** This op exposes only one half of the V2
-  result in SSA form.
diff --git a/docs/mkdocs/src/docs/isa/vector/data-rearrangement_zh.md b/docs/mkdocs/src/docs/isa/vector/data-rearrangement_zh.md
deleted file mode 100644
index ebd85ad4..00000000
--- a/docs/mkdocs/src/docs/isa/vector/data-rearrangement_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Data Rearrangement
-
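-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](data-rearrangement.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been laid out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.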
diff --git a/docs/mkdocs/src/docs/isa/vector/dma-copy.md b/docs/mkdocs/src/docs/isa/vector/dma-copy.md
deleted file mode 100644
index d50599ce..00000000
--- a/docs/mkdocs/src/docs/isa/vector/dma-copy.md
+++ /dev/null
@@ -1,606 +0,0 @@
-<!-- Generated from `docs/isa/vector/dma-copy.md` -->
-
-# Vector Families: DMA Copy
-
-This page documents the vector-surface DMA families inside PTO ISA. These operations stage data between GM and vector-visible UB state and are part of the architecture-visible setup required before `pto.v*` compute can execute.
-
-> **Category:** DMA transfer configuration and execution
-> **Pipelines:** MTE2 (GM→UB), MTE3 (UB→GM)
-
-DMA transfers move data between Global Memory (GM) and Unified Buffer (UB). The MTE engines operate asynchronously from the Vector core, requiring explicit sync (see [Pipeline Sync](./pipeline-sync.md)).
-
-The MTE2/MTE3 DMA engine executes a **multi-level nested loop** transfer. Before issuing the copy instruction, stride and loop-size registers must be configured.
-
----
-
-## Loop Stride Configuration (GM→UB)
-
-These ops configure the MTE2 DMA engine's hardware loops for GM→UB transfers. They must be set **before** calling `pto.copy_gm_to_ubuf`.
-
-### `pto.set_loop_size_outtoub`
-
-- **syntax:** `pto.set_loop_size_outtoub %loop1_count, %loop2_count : i64, i64`
-- **semantics:** Configure HW loop iteration counts for GM→UB DMA.
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%loop1_count` | 21 bits | Inner HW loop iteration count |
-| `%loop2_count` | 21 bits | Outer HW loop iteration count |
-
-When not using multi-level looping, set both to 1.
-
----
-
-### `pto.set_loop2_stride_outtoub`
-
-- **syntax:** `pto.set_loop2_stride_outtoub %src_stride, %dst_stride : i64, i64`
-- **semantics:** Configure outer loop (loop2) pointer advance for GM→UB DMA.
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%src_stride` | 40 bits | GM source pointer advance per loop2 iteration (bytes) |
-| `%dst_stride` | 21 bits | UB destination pointer advance per loop2 iteration (bytes) |
-
-After each loop2 iteration, the DMA engine advances the GM read pointer by `%src_stride` and UB write pointer by `%dst_stride`.
-
----
-
-### `pto.set_loop1_stride_outtoub`
-
-- **syntax:** `pto.set_loop1_stride_outtoub %src_stride, %dst_stride : i64, i64`
-- **semantics:** Configure inner loop (loop1) pointer advance for GM→UB DMA.
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%src_stride` | 40 bits | GM source pointer advance per loop1 iteration (bytes) |
-| `%dst_stride` | 21 bits | UB destination pointer advance per loop1 iteration (bytes) |
-
----
-
-## Loop Stride Configuration (UB→GM)
-
-These ops configure the MTE3 DMA engine's hardware loops for UB→GM transfers. They must be set **before** calling `pto.copy_ubuf_to_gm`.
-
-Note: UB stride fields are 21 bits (sufficient for the 256KB UB address space); GM stride fields are 40 bits (full GM address range).
-
-### `pto.set_loop_size_ubtoout`
-
-- **syntax:** `pto.set_loop_size_ubtoout %loop1_count, %loop2_count : i64, i64`
-- **semantics:** Configure HW loop iteration counts for UB→GM DMA.
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%loop1_count` | 21 bits | Inner HW loop iteration count |
-| `%loop2_count` | 21 bits | Outer HW loop iteration count |
-
----
-
-### `pto.set_loop2_stride_ubtoout`
-
-- **syntax:** `pto.set_loop2_stride_ubtoout %src_stride, %dst_stride : i64, i64`
-- **semantics:** Configure outer loop (loop2) pointer advance for UB→GM DMA.
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%src_stride` | 21 bits | UB source pointer advance per loop2 iteration (bytes) |
-| `%dst_stride` | 40 bits | GM destination pointer advance per loop2 iteration (bytes) |
-
----
-
-### `pto.set_loop1_stride_ubtoout`
-
-- **syntax:** `pto.set_loop1_stride_ubtoout %src_stride, %dst_stride : i64, i64`
-- **semantics:** Configure inner loop (loop1) pointer advance for UB→GM DMA.
-
-**Parameter Table:**
-
-| Parameter | Width | Description |
-|-----------|-------|-------------|
-| `%src_stride` | 21 bits | UB source pointer advance per loop1 iteration (bytes) |
-| `%dst_stride` | 40 bits | GM destination pointer advance per loop1 iteration (bytes) |
-
----
-
-## DMA Transfer Execution
-
-### `pto.copy_gm_to_ubuf`
-
-- **syntax:**
-```mlir
-pto.copy_gm_to_ubuf %gm_src, %ub_dst,
-    %sid, %n_burst, %len_burst, %left_padding, %right_padding,
-    %data_select_bit, %l2_cache_ctl, %src_stride, %dst_stride
-    : !pto.ptr<T, gm>, !pto.ptr<T, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-- **semantics:** DMA transfer from Global Memory (`!pto.ptr<T, gm>`) to Unified Buffer (`!pto.ptr<T, ub>`).
-
-**Parameters:**
-
-| Parameter | Description |
-|-----------|-------------|
-| `%gm_src` | GM source pointer (`!pto.ptr<T, gm>`) |
-| `%ub_dst` | UB destination pointer (`!pto.ptr<T, ub>`, 32B-aligned) |
-| `%sid` | Stream ID (usually 0) |
-| `%n_burst` | Number of burst rows (innermost loop count) |
-| `%len_burst` | Contiguous bytes transferred per burst row |
-| `%left_padding` | Left padding count (bytes) |
-| `%right_padding` | Right padding count (bytes) |
-| `%data_select_bit` | Padding / data-select control bit (`i1`) |
-| `%l2_cache_ctl` | L2 cache allocate control (TBD — controls whether DMA allocates in L2 cache) |
-| `%src_stride` | GM source stride: start-to-start distance between consecutive burst rows (bytes) |
-| `%dst_stride` | UB destination stride: start-to-start distance between consecutive burst rows (bytes, 32B-aligned) |
-
----
-
-### `pto.copy_ubuf_to_gm`
-
-- **syntax:**
-```mlir
-pto.copy_ubuf_to_gm %ub_src, %gm_dst,
-    %sid, %n_burst, %len_burst, %reserved, %dst_stride, %src_stride
-    : !pto.ptr<T, ub>, !pto.ptr<T, gm>, i64, i64, i64, i64, i64, i64
-```
-- **semantics:** DMA transfer from Unified Buffer (`!pto.ptr<T, ub>`) to Global Memory (`!pto.ptr<T, gm>`). MTE3 reads only `len_burst` bytes from each UB row (de-padding).
-
-**Parameters:**
-
-| Parameter | Description |
-|-----------|-------------|
-| `%ub_src` | UB source pointer (`!pto.ptr<T, ub>`, 32B-aligned) |
-| `%gm_dst` | GM destination pointer (`!pto.ptr<T, gm>`) |
-| `%sid` | Stream ID (usually 0) |
-| `%n_burst` | Number of burst rows |
-| `%len_burst` | Contiguous bytes transferred per burst row |
-| `%reserved` | Reserved field (set to 0) |
-| `%dst_stride` | GM destination stride: start-to-start distance between consecutive burst rows (bytes) |
-| `%src_stride` | UB source stride: start-to-start distance between consecutive burst rows (bytes, 32B-aligned) |
-
----
-
-### `pto.copy_ubuf_to_ubuf`
-
-- **syntax:**
-```mlir
-pto.copy_ubuf_to_ubuf %source, %dest, %sid, %n_burst, %len_burst, %src_stride, %dst_stride
-    : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64, i64, i64, i64, i64
-```
-- **semantics:** Copy within Unified Buffer.
-
-**Parameters:**
-
-| Parameter | Description |
-|-----------|-------------|
-| `%source` | UB source pointer |
-| `%dest` | UB destination pointer |
-| `%sid` | Stream ID |
-| `%n_burst` | Number of bursts |
-| `%len_burst` | Length per burst |
-| `%src_stride` | Source stride |
-| `%dst_stride` | Destination stride |
-
----
-
-## Burst / Stride / Pad Model
-
-All A5 DMA addresses are **stride-based**: stride is the distance from the start of one row to the start of the next row (`stride >= lenBurst`). There is no separate "gap" parameter.
-
-### Key Terms
-
-```
-burst    = lenBurst contiguous bytes transferred per row
-stride   = distance (bytes) from start of row[r] to start of row[r+1]
-pad      = ub_stride - lenBurst, padded to the 32B alignment boundary
-```
-
-### Alignment Constraints
-
-- **UB addresses** (both source and destination) must be **32-byte aligned**.
-- **GM→UB padding**: When `data_select_bit = true`, each UB row is padded from `lenBurst` up to the **32B-aligned boundary** of `ub_stride` with `pad_val` (set via `set_mov_pad_val`). This ensures every UB row starts at a 32B-aligned offset.
-- **UB→GM de-padding**: MTE3 reads `lenBurst` bytes from each 32B-aligned UB row (skipping any padding that was added during load), writing only valid data to GM. This effectively strips padding on store.
-
-### 2D Diagram: GM→UB (pto.copy_gm_to_ubuf)
-
-```
-GM (source, `!pto.ptr<T, gm>`):
-
-          |<--- src_stride (start-to-start) --->|
-          |<- len_burst ->|                     |
-Row 0:    [##DATA########]......................|
-Row 1:    [##DATA########]......................|
-Row 2:    [##DATA########]......................|
-          ...
-Row N-1:  [##DATA########]
-
-UB (destination, `!pto.ptr<T, ub>`, 32B-aligned):
-
-          |<---------- dst_stride (32B-aligned) ---------->|
-          |<- len_burst ->|<- pad (to 32B boundary) ->|    |
-Row 0:    [##DATA########][000000 PAD 000000000000000]
-Row 1:    [##DATA########][000000 PAD 000000000000000]
-Row 2:    [##DATA########][000000 PAD 000000000000000]
-          ...
-Row N-1:  [##DATA########][000000 PAD 000000000000000]
-
-N = n_burst
-stride = start of row[r] to start of row[r+1]
-pad    = filled with pad_val to 32B boundary (data_select_bit=true)
-[DATA] = valid data transferred by DMA
-[PAD]  = pad_val fill (set via set_mov_pad_val)
-```
-
-### 2D Diagram: UB→GM (pto.copy_ubuf_to_gm)
-
-```
-UB (source, `!pto.ptr<T, ub>`, 32B-aligned start addr):
-
-          |<---------- src_stride (32B-aligned) --------->|
-          |<- len_burst ->|<-- pad (ignored on read) -->| |
-Row 0:    [##DATA########][000 pad 000000000000000000]
-Row 1:    [##DATA########][000 pad 000000000000000000]
-Row 2:    [##DATA########][000 pad 000000000000000000]
-          ...
-Row N-1:  [##DATA########][000 pad 000000000000000000]
-
-GM (destination, `!pto.ptr<T, gm>`):
-
-          |<--- dst_stride (start-to-start) --->|
-          |<- len_burst ->|                     |
-Row 0:    [##DATA########]......................|
-Row 1:    [##DATA########]......................|
-Row 2:    [##DATA########]......................|
-          ...
-Row N-1:  [##DATA########]
-
-N = n_burst
-MTE3 reads only len_burst bytes from each UB row (de-padding).
-Only len_burst bytes are written to each GM row.
-```
-
----
-
-## Multi-Level Loop Semantics (C Code)
-
-The full DMA transfer is a nested loop. The HW loop registers (set before the copy) control the outer levels, and the copy instruction parameters control the innermost burst level.
-
-### GM→UB Full Loop
-
-```c
-// C equivalent of what the HW executes:
-for (int j = 0; j < loop2_count; j++) {                // HW outer loop
-    uint8_t *gm1 = gm_src + j * loop2_src_stride;
-    uint8_t *ub1 = ub_dst + j * loop2_dst_stride;
-
-    for (int k = 0; k < loop1_count; k++) {            // HW inner loop
-        uint8_t *gm2 = gm1 + k * loop1_src_stride;
-        uint8_t *ub2 = ub1 + k * loop1_dst_stride;
-
-        for (int r = 0; r < n_burst; r++) {            // burst engine
-            memcpy(ub2 + r * dst_stride,               //   UB dest row
-                   gm2 + r * src_stride,               //   GM src row
-                   len_burst);                          //   contiguous bytes
-            if (data_select_bit)
-                memset(ub2 + r * dst_stride + len_burst,
-                       pad_val, dst_stride - len_burst);
-        }
-    }
-}
-```
-
-### UB→GM Full Loop
-
-```c
-// C equivalent:
-for (int j = 0; j < loop2_count; j++) {
-    uint8_t *ub1 = ub_src + j * loop2_src_stride;
-    uint8_t *gm1 = gm_dst + j * loop2_dst_stride;
-
-    for (int k = 0; k < loop1_count; k++) {
-        uint8_t *ub2 = ub1 + k * loop1_src_stride;
-        uint8_t *gm2 = gm1 + k * loop1_dst_stride;
-
-        for (int r = 0; r < n_burst; r++) {
-            memcpy(gm2 + r * dst_stride,               //   GM dest row
-                   ub2 + r * src_stride,               //   UB src row
-                   len_burst);                          //   contiguous bytes
-        }
-    }
-}
-```
-
----
-
-## Example 1: GM→UB — Load a 32×32 f32 Tile (Simple Case)
-
-Load a 32×32 f32 tile from GM into UB. This matches the `abs_kernel_2d` test case.
-
-```
-GM layout (32 × 32 f32, contiguous):
-
-    |<- len_burst = 128B (32 × 4) ->|
-    |<- src_stride = 128B --------->|
-    +--[#######TILE#######]--+  row 0
-    +--[#######TILE#######]--+  row 1
-    ...
-    +--[#######TILE#######]--+  row 31
-
-UB layout (32 × 32 f32, 32B-aligned, contiguous):
-
-    |<- dst_stride = 128B (32B-aligned) ->|
-    +--[#######TILE#######]--+  row 0
-    +--[#######TILE#######]--+  row 1
-    ...
-    +--[#######TILE#######]--+  row 31
-
-    len_burst   = 32 × 4 = 128 bytes
-    src_stride  = 128 bytes (contiguous rows)
-    dst_stride  = 128 bytes (already 32B-aligned, no padding)
-```
-
-```mlir
-// Simple 2D load — no multi-level loops needed
-pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %arg0, %ub_in,
-    %c0_i64,       // sid = 0
-    %c32_i64,      // n_burst = 32 (32 rows)
-    %c128_i64,     // len_burst = 128 bytes per row
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %false,        // data_select_bit = false
-    %c0_i64,       // l2_cache_ctl = 0
-    %c128_i64,     // src_stride = 128 bytes
-    %c128_i64      // dst_stride = 128 bytes
-    : !pto.ptr<f32, gm>, !pto.ptr<f32, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
----
-
-## Example 2: GM→UB — Load a 2D Tile from a Larger Matrix
-
-Load a 64×128 tile (f16) from a 1024×512 matrix in GM into UB.
-
-```
-GM layout (1024 × 512 f16):
-
-    col 0          col 128               col 512
-    |              |                     |
-    +--[###TILE###]+.....................+  row R
-    +--[###TILE###]+.....................+  row R+1
-    ...
-    +--[###TILE###]+.....................+  row R+63
-
-    |<--------- src_stride = 1024B ----------->|
-    |<-len_burst=256B->|
-
-    len_burst   = 128 × 2 = 256 bytes (128 f16 elements)
-    src_stride  = 512 × 2 = 1024 bytes (start-to-start, full GM row)
-
-UB layout (64 × 128 f16, 32B-aligned, contiguous):
-
-    +--[###TILE###]--+  row 0  (256 bytes, 32B-aligned, no pad)
-    +--[###TILE###]--+  row 1
-    ...
-    +--[###TILE###]--+  row 63
-
-    dst_stride = 256 bytes (= len_burst, already 32B-aligned, no padding)
-```
-
-```mlir
-// Simple 2D load — no multi-level loops needed
-pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
-pto.set_loop1_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-pto.set_loop2_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr,
-    %c0_i64,       // sid = 0
-    %c64_i64,      // n_burst = 64 (64 rows)
-    %c256_i64,     // len_burst = 256 bytes per row
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %false,        // data_select_bit = false
-    %c0_i64,       // l2_cache_ctl = 0
-    %c1024_i64,    // src_stride = 1024 bytes (full matrix row)
-    %c256_i64      // dst_stride = 256 bytes (tile row)
-    : !pto.ptr<f16, gm>, !pto.ptr<f16, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
----
-
-## Example 3: GM→UB — Load with Padding
-
-Load 100 valid columns from GM into a 128-wide UB tile (f16). The remaining 28 columns are filled with `pad_val` (zero in this example).
-
-```
-GM (100 cols valid, contiguous):
-
-    |<-len_burst=200B->|
-    |<- src_stride=200B (start-to-start) ->|
-    +--[####DATA####]-+  row 0
-    +--[####DATA####]-+  row 1
-    ...
-    +--[####DATA####]-+  row 63
-
-UB (128 cols wide, 32B-aligned, padded):
-
-    |<--------- dst_stride = 256B (32B-aligned) --------->|
-    |<-len_burst=200B->|<---- pad = 56B to 32B boundary ->|
-    +--[####DATA####]-+[0000000 PAD 0000000000000000000000]+  row 0
-    +--[####DATA####]-+[0000000 PAD 0000000000000000000000]+  row 1
-    ...
-    +--[####DATA####]-+[0000000 PAD 0000000000000000000000]+  row 63
-
-    len_burst   = 100 × 2 = 200 bytes
-    src_stride  = 200 bytes (start-to-start, contiguous in GM)
-    dst_stride  = 128 × 2 = 256 bytes (32B-aligned tile width in UB)
-    pad         = 256 - 200 = 56 bytes (padded to 32B boundary with pad_val)
-```
-
-```mlir
-pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
-pto.set_loop1_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-pto.set_loop2_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr,
-    %c0_i64,       // sid = 0
-    %c64_i64,      // n_burst = 64
-    %c200_i64,     // len_burst = 200 bytes
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %true,         // data_select_bit = true (enable padding)
-    %c0_i64,       // l2_cache_ctl = 0
-    %c200_i64,     // src_stride = 200 bytes
-    %c256_i64      // dst_stride = 256 bytes (32B-aligned)
-    : !pto.ptr<f16, gm>, !pto.ptr<f16, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
----
-
-## Example 4: UB→GM — Store a 32×32 f32 Tile (Simple Case)
-
-Store a 32×32 f32 tile from UB back to GM. This matches the `abs_kernel_2d` test case.
-
-```
-UB (source, 32B-aligned, 32 × 32 f32):
-
-    |<- src_stride = 128B (32B-aligned) ->|
-    |<- len_burst = 128B ->|
-    +--[#######TILE#######]---+  row 0
-    +--[#######TILE#######]---+  row 1
-    ...
-    +--[#######TILE#######]---+  row 31
-
-    (no padding here — len_burst == src_stride)
-
-GM (dest, 32 × 32 f32):
-
-    |<- dst_stride = 128B ->|
-    |<- len_burst = 128B -->|
-    +--[#######TILE#######]---+  row 0
-    +--[#######TILE#######]---+  row 1
-    ...
-    +--[#######TILE#######]---+  row 31
-```
-
-```mlir
-// Configure MTE3 loop counts (no multi-level loops needed)
-pto.set_loop_size_ubtoout %c1_i64, %c1_i64 : i64, i64
-
-pto.copy_ubuf_to_gm %ub_out, %arg1,
-    %c0_i64,       // sid = 0
-    %c32_i64,      // n_burst = 32
-    %c128_i64,     // len_burst = 128 bytes
-    %c0_i64,       // reserved = 0
-    %c128_i64,     // dst_stride = 128 bytes
-    %c128_i64      // src_stride = 128 bytes
-    : !pto.ptr<f32, ub>, !pto.ptr<f32, gm>, i64, i64, i64, i64, i64, i64
-```
-
----
-
-## Example 5: UB→GM — Store a 2D Tile Back to a Larger Matrix
-
-Store a 64×128 tile (f16) from UB back to a 1024×512 GM matrix at an offset.
-
-```
-UB (source, 32B-aligned, 64 × 128 f16):
-
-    |<- src_stride = 256B (32B-aligned) ->|
-    |<- len_burst = 256B ->|
-    +--[#####TILE#####]---+  row 0
-    +--[#####TILE#####]---+  row 1
-    ...
-    +--[#####TILE#####]---+  row 63
-
-    (no padding here — len_burst == src_stride)
-
-GM (dest, into 1024 × 512 matrix):
-
-    |<----------- dst_stride = 1024B (start-to-start) --------->|
-    |<- len_burst = 256B ->|                                    |
-    col 0          col 128                              col 512
-    +--[#####TILE#####]---+.............................+  row R
-    +--[#####TILE#####]---+.............................+  row R+1
-    ...
-    +--[#####TILE#####]---+.............................+  row R+63
-
-    MTE3 reads len_burst bytes from each 32B-aligned UB row,
-    writes only len_burst bytes per GM row (stride controls row spacing).
-```
-
-```mlir
-// Configure MTE3 strides
-pto.set_loop_size_ubtoout %c1_i64, %c1_i64 : i64, i64
-pto.set_loop1_stride_ubtoout %c0_i64, %c0_i64 : i64, i64
-pto.set_loop2_stride_ubtoout %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_ubuf_to_gm %ub_ptr, %gm_ptr,
-    %c0_i64,       // sid = 0
-    %c64_i64,      // n_burst = 64
-    %c256_i64,     // len_burst = 256 bytes
-    %c0_i64,       // reserved = 0
-    %c1024_i64,    // dst_stride = 1024 bytes (GM row)
-    %c256_i64      // src_stride = 256 bytes (UB row)
-    : !pto.ptr<f16, ub>, !pto.ptr<f16, gm>, i64, i64, i64, i64, i64, i64
-```
-
----
-
-## Example 6: GM→UB with Multi-Level Loop (Batch of Tiles)
-
-Load 4 batches, each an 8×128 tile, from a [4, 8, 128] f16 tensor using loop1.
-
-```
-GM [4, 8, 128] f16 (contiguous):        UB (4 tiles laid out sequentially):
-
-    batch 0: 8 rows × 256 bytes          [batch 0: 8×128][batch 1: 8×128]
-    batch 1: 8 rows × 256 bytes          [batch 2: 8×128][batch 3: 8×128]
-    batch 2: 8 rows × 256 bytes
-    batch 3: 8 rows × 256 bytes          loop1 src_stride = 2048 bytes (8 × 256)
-                                          loop1 dst_stride = 2048 bytes (8 × 256)
-    Each batch = 8 × 256 = 2048 bytes     loop1_count = 4 (iterate over batches)
-```
-
-```mlir
-// loop1_count = 4 batches, loop2_count = 1 (not used)
-pto.set_loop_size_outtoub %c4_i64, %c1_i64 : i64, i64
-
-// loop1 stride: advance by one batch (2048 bytes) in both GM and UB
-pto.set_loop1_stride_outtoub %c2048_i64, %c2048_i64 : i64, i64
-pto.set_loop2_stride_outtoub %c0_i64, %c0_i64 : i64, i64
-
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr,
-    %c0_i64,       // sid = 0
-    %c8_i64,       // n_burst = 8 rows per batch
-    %c256_i64,     // len_burst = 256 bytes per row
-    %c0_i64,       // left_padding = 0
-    %c0_i64,       // right_padding = 0
-    %false,        // data_select_bit = false
-    %c0_i64,       // l2_cache_ctl = 0
-    %c256_i64,     // src_stride = 256 (contiguous rows)
-    %c256_i64      // dst_stride = 256 (contiguous rows)
-    : !pto.ptr<f16, gm>, !pto.ptr<f16, ub>, i64, i64, i64,
-      i64, i64, i1, i64, i64, i64
-```
-
-Execution trace:
-
-```
-loop1 iter 0: gm_ptr + 0×2048 → ub_ptr + 0×2048, DMA 8 rows × 256B
-loop1 iter 1: gm_ptr + 1×2048 → ub_ptr + 1×2048, DMA 8 rows × 256B
-loop1 iter 2: gm_ptr + 2×2048 → ub_ptr + 2×2048, DMA 8 rows × 256B
-loop1 iter 3: gm_ptr + 3×2048 → ub_ptr + 3×2048, DMA 8 rows × 256B
-```
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vadd.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vadd.md
deleted file mode 100644
index b0fab823..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vadd.md
+++ /dev/null
@@ -1,143 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vadd.md` -->
-
-# pto.vadd
-
-Standalone reference page for `pto.vadd`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Lane-wise addition of two vector registers, producing a result vector register. Only lanes selected by the predicate mask are active; inactive lanes do not participate in the computation and their destination elements are left unchanged.
-
-## Mechanism
-
-`pto.vadd` is a `pto.v*` compute operation. It reads two source vector registers lane-by-lane, adds the corresponding elements, and writes the result to the destination vector register. The iteration domain covers all N lanes; the predicate mask determines which lanes are active.
-
-For each lane `i` where the predicate is true:
-
-$$ \mathrm{dst}_i = \mathrm{lhs}_i + \mathrm{rhs}_i $$
-
-Lanes where the predicate is false are **inactive**: the destination register element at that lane is unmodified.
-
-## Syntax
-
-### PTO Assembly Form
-
-```text
-vadd %dst, %lhs, %rhs, %mask : !pto.vreg<NxT>
-```
-
-### AS Level 1 (SSA)
-
-```mlir
-%result = pto.vadd %lhs, %rhs, %mask : (!pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask) -> !pto.vreg<NxT>
-```
-
-### AS Level 2 (DPS)
-
-```mlir
-pto.vadd ins(%lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask)
-          outs(%result : !pto.vreg<NxT>)
-```
-
-## C++ Intrinsic
-
-Declared in `include/pto/common/pto_instr.hpp`:
-
-```cpp
-template <typename VecDst, typename VecLhs, typename VecRhs, typename MaskT, typename... WaitEvents>
-PTO_INST RecordEvent VADD(VecDst& dst, const VecLhs& lhs, const VecRhs& rhs,
-                          const MaskT& mask, WaitEvents&... events);
-```
-
-## Inputs
-
-| Operand | Type | Description |
-|---------|------|-------------|
-| `%lhs` | `!pto.vreg<NxT>` | Left-hand source vector register |
-| `%rhs` | `!pto.vreg<NxT>` | Right-hand source vector register |
-| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active |
-
-Both source registers MUST have the same element type and the same vector width `N`. The mask width MUST match `N`.
-
-## Expected Outputs
-
-| Result | Type | Description |
-|--------|------|-------------|
-| `%dst` | `!pto.vreg<NxT>` | Lane-wise sum on active lanes; inactive lanes are unmodified |
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its destination vector register. It does not implicitly reserve buffers, signal events, or establish memory fences.
-
-## Constraints
-
-- **Type match**: `%lhs`, `%rhs`, and `%dst` MUST have identical element types.
-- **Width match**: All three registers MUST have the same vector width `N`.
-- **Mask width**: `%mask` MUST have width equal to `N`.
-- **Active lanes**: Only lanes where the mask bit is 1 (true) participate in the addition.
-- **Inactive lanes**: Destination elements at inactive lanes are unmodified.
-
-## Exceptions
-
-- The verifier rejects illegal operand type mismatches, width mismatches, or mask width mismatches.
-- Any additional illegality stated in the [Binary Vector Ops](../../binary-vector-ops.md) family page is also part of the contract.
-
-## Target-Profile Restrictions
-
-| Element Type | CPU Simulator | A2/A3 | A5 |
-|------------|:-------------:|:------:|:--:|
-| `f32` | Simulated | Simulated | Supported |
-| `f16` / `bf16` | Simulated | Simulated | Supported |
-| `i8`–`i64`, `u8`–`u64` | Simulated | Simulated | Supported |
-
-A5 is the primary concrete profile for the vector surface. CPU simulation and A2/A3-class targets emulate `pto.v*` operations using scalar loops while preserving the visible PTO contract. Code that depends on specific performance characteristics or latency should treat those dependencies as target-profile-specific.
-
-## Examples
-
-### Full-vector addition (all lanes active)
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-// All lanes active: mask set to all-ones
-Mask<64> mask;
-mask.set_all(true);  // predicate all-true
-
-VADD(vdst, va, vb, mask);
-```
-
-### Partial predication
-
-```mlir
-// Only lanes where %cond is true participate in addition
-%result = pto.vadd %va, %vb, %cond : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xf16>
-```
-
-### Complete vector-load / compute / vector-store pipeline
-
-```cpp
-#include <pto/pto-inst.hpp>
-using namespace pto;
-
-void vector_add(Ptr<ub_space_t, ub_t> ub_a, Ptr<ub_space_t, ub_t> ub_b,
-                Ptr<ub_space_t, ub_t> ub_out, size_t count) {
-    VReg<64, float> va, vb, vdst;
-    Mask<64> mask;
-    mask.set_all(true);
-
-    VLDS(va, ub_a, "NORM");
-    VLDS(vb, ub_b, "NORM");
-
-    VADD(vdst, va, vb, mask);
-
-    VSTS(vdst, ub_out);
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Next op in family: [pto.vsub](./vsub.md)
-- Vector surface overview: [Vector Instructions](../../instruction-surfaces/vector-instructions.md)
-- Type system: [Type System](../../state-and-types/type-system.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vadd_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vadd_zh.md
deleted file mode 100644
index 748fcd7f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vadd_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vadd
-
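-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vadd.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been laid out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.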
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vaddc.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vaddc.md
deleted file mode 100644
index fd4d3e49..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vaddc.md
+++ /dev/null
@@ -1,75 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vaddc.md` -->
-
-# pto.vaddc
-
-Standalone reference page for `pto.vaddc`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Add with carry output.
-
-## Mechanism
-
-`pto.vaddc` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result, %carry = pto.vaddc %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>, !pto.mask
-```
-
-## Inputs
-
-`%lhs` and `%rhs` are added lane-wise and `%mask` selects active
-  lanes.
-
-## Expected Outputs
-
-`%result` is the truncated arithmetic result and `%carry` is the
-  carry/overflow predicate per lane.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is a carry-chain integer add family. On
-  the current A5 surface, it SHOULD be treated as an unsigned integer
-  operation.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++) {
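-    uint64_t r = (uint64_t)src0[i] + src1[i];  // widen so the carry-out bit survives
-    dst[i] = (T)r;                             // sum truncated to the lane type T
-    carry[i] = (r >> bitwidth);                // carry out of the top bit of T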
-}
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++) {
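-    uint64_t r = (uint64_t)src0[i] + src1[i];  // widen so the carry-out bit survives
-    dst[i] = (T)r;                             // sum truncated to the lane type T
-    carry[i] = (r >> bitwidth);                // carry out of the top bit of T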
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vshr](./vshr.md)
-- Next op in family: [pto.vsubc](./vsubc.md)
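Note on the removed `pto.vaddc` page: its scalar loop leaves `T`, `bitwidth`, and the mask implicit. The following is a minimal sketch of the same semantics, assuming 32-bit unsigned lanes (the page only says the op SHOULD be treated as unsigned on A5) and assuming that inactive lanes leave `dst` and `carry` untouched, which the page does not specify. `vaddc_ref` is a hypothetical helper name, not part of PTO.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference model of pto.vaddc for u32 lanes.
 * Assumption: lanes with mask[i] == false are left unmodified. */
static void vaddc_ref(const uint32_t *src0, const uint32_t *src1,
                      const bool *mask, uint32_t *dst, bool *carry,
                      size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!mask[i]) {
            continue;
        }
        /* Widen to 64 bits so the carry out of bit 31 is preserved. */
        uint64_t r = (uint64_t)src0[i] + (uint64_t)src1[i];
        dst[i] = (uint32_t)r;       /* truncated sum */
        carry[i] = (r >> 32) != 0;  /* per-lane carry predicate */
    }
}
```

Under these assumptions, adding `0xFFFFFFFF` and `1` in an active lane yields a result of `0` with the carry predicate set.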
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vaddc_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vaddc_zh.md
deleted file mode 100644
index 193224f8..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vaddc_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vaddc
-
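-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vaddc.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.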
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vand.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vand.md
deleted file mode 100644
index 5c4769bb..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vand.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vand.md` -->
-
-# pto.vand
-
-Standalone reference page for `pto.vand`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise bitwise AND.
-
-## Mechanism
-
-`pto.vand` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vand %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `all integer types`.
-
-## Inputs
-
-`%lhs`, `%rhs`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` is the lane-wise bitwise AND.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `all integer types`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] & src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] & src1[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vmin](./vmin.md)
-- Next op in family: [pto.vor](./vor.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vand_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vand_zh.md
deleted file mode 100644
index 85e04b0c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vand_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vand
-
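-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vand.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.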
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vdiv.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vdiv.md
deleted file mode 100644
index 6bc9618c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vdiv.md
+++ /dev/null
@@ -1,72 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vdiv.md` -->
-
-# pto.vdiv
-
-Standalone reference page for `pto.vdiv`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise quotient.
-
-## Mechanism
-
-`pto.vdiv` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32 only (no integer division)`.
-
-## Inputs
-
-`%lhs` is the numerator, `%rhs` is the denominator, and `%mask`
-  selects active lanes.
-
-## Expected Outputs
-
-`%result` is the lane-wise quotient.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Floating-point element types only. Active
-  lanes whose denominator is `+0` or `-0` follow the target's exceptional
-  behavior.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Target-defined numeric exceptional behavior, such as divide-by-zero or out-of-domain inputs, remains subject to the selected backend profile unless this page narrows it further.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32 only (no integer division)`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] / src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] / src1[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vmul](./vmul.md)
-- Next op in family: [pto.vmax](./vmax.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vdiv_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vdiv_zh.md
deleted file mode 100644
index 5b720ddc..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vdiv_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vdiv
-
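-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vdiv.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.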
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmax.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmax.md
deleted file mode 100644
index 417e393d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmax.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vmax.md` -->
-
-# pto.vmax
-
-Standalone reference page for `pto.vmax`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds the lane-wise maximum.
-
-## Mechanism
-
-`pto.vmax` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vmax %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i8-i32, f16, bf16, f32`.
-
-## Inputs
-
-`%lhs`, `%rhs`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` holds the lane-wise maximum.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Input and result types MUST match.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i8-i32, f16, bf16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src0[i] > src1[i]) ? src0[i] : src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src0[i] > src1[i]) ? src0[i] : src1[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vdiv](./vdiv.md)
-- Next op in family: [pto.vmin](./vmin.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmax_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmax_zh.md
deleted file mode 100644
index d446a6dd..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmax_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmax
-
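-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmax.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.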
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmin.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmin.md
deleted file mode 100644
index aa404fc9..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmin.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vmin.md` -->
-
-# pto.vmin
-
-Standalone reference page for `pto.vmin`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds the lane-wise minimum.
-
-## Mechanism
-
-`pto.vmin` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vmin %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i8-i32, f16, bf16, f32`.
-
-## Inputs
-
-`%lhs`, `%rhs`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` holds the lane-wise minimum.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Input and result types MUST match.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i8-i32, f16, bf16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src0[i] < src1[i]) ? src0[i] : src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src0[i] < src1[i]) ? src0[i] : src1[i];
-```
-
-## Bitwise
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vmax](./vmax.md)
-- Next op in family: [pto.vand](./vand.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmin_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmin_zh.md
deleted file mode 100644
index d6a71a4b..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmin_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmin
-
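-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmin.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.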
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmul.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmul.md
deleted file mode 100644
index 0f3da84d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmul.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vmul.md` -->
-
-# pto.vmul
-
-Standalone reference page for `pto.vmul`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise product.
-
-## Mechanism
-
-`pto.vmul` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i16-i32, f16, bf16, f32` (**not** `i8`/`u8`).
-
-## Inputs
-
-`%lhs` and `%rhs` are multiplied lane-wise; `%mask` selects
-  active lanes.
-
-## Expected Outputs
-
-`%result` is the lane-wise product.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-The current A5 profile excludes `i8/u8`
-  forms from this surface.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i16-i32, f16, bf16, f32` (**not** `i8`/`u8`).
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] * src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] * src1[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vsub](./vsub.md)
-- Next op in family: [pto.vdiv](./vdiv.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmul_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmul_zh.md
deleted file mode 100644
index b7fcacaa..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vmul_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmul
-
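-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmul.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.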
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vor.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vor.md
deleted file mode 100644
index 6122b675..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vor.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vor.md` -->
-
-# pto.vor
-
-Standalone reference page for `pto.vor`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise bitwise OR.
-
-## Mechanism
-
-`pto.vor` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vor %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `all integer types`.
-
-## Inputs
-
-`%lhs`, `%rhs`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` is the lane-wise bitwise OR.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `all integer types`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] | src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] | src1[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vand](./vand.md)
-- Next op in family: [pto.vxor](./vxor.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vor_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vor_zh.md
deleted file mode 100644
index b6428f36..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vor_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vor
-
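-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vor.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.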
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshl.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshl.md
deleted file mode 100644
index fea8b19d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshl.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vshl.md` -->
-
-# pto.vshl
-
-Standalone reference page for `pto.vshl`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the shifted vector.
-
-## Mechanism
-
-`pto.vshl` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vshl %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `all integer types`.
-
-## Inputs
-
-`%lhs` supplies the shifted value, `%rhs` supplies the per-lane
-  shift amount, and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the shifted vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only. Shift counts
-  SHOULD stay within `[0, bitwidth(T) - 1]`; out-of-range behavior is
-  target-defined unless the verifier narrows it further.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `all integer types`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] << src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] << src1[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vxor](./vxor.md)
-- Next op in family: [pto.vshr](./vshr.md)
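Note on the removed `pto.vshl` page: its scalar loop is only a valid C model while every shift count stays within `[0, bitwidth(T) - 1]`, because shifting by the full width or more is undefined behavior in C. A minimal sketch that makes the range check explicit, assuming 32-bit unsigned lanes and omitting the mask for brevity. `vshl_ref` is a hypothetical helper name, and skipping out-of-range lanes is a choice made here for illustration only, since the page leaves their result target-defined.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of pto.vshl for u32 lanes (mask omitted).
 * Returns false if any shift count falls outside [0, 31]; such lanes are
 * skipped because the page leaves their result target-defined. */
static bool vshl_ref(const uint32_t *src0, const uint32_t *src1,
                     uint32_t *dst, size_t n) {
    bool all_in_range = true;
    for (size_t i = 0; i < n; i++) {
        if (src1[i] > 31u) {
            all_in_range = false;  /* shifting by >= 32 is undefined in C */
            continue;
        }
        dst[i] = src0[i] << src1[i];  /* bits shifted past bit 31 are dropped */
    }
    return all_in_range;
}
```

A caller can treat a `false` return as a signal that the input relied on target-defined behavior and should not be compared across backends.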
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshl_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshl_zh.md
deleted file mode 100644
index e99d9677..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshl_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vshl
-
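-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vshl.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.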
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshr.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshr.md
deleted file mode 100644
index 7ffba623..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshr.md
+++ /dev/null
@@ -1,72 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vshr.md` -->
-
-# pto.vshr
-
-Standalone reference page for `pto.vshr`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the shifted vector.
-
-## Mechanism
-
-`pto.vshr` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vshr %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `all integer types`.
-
-## Inputs
-
-`%lhs` supplies the shifted value, `%rhs` supplies the per-lane
-  shift amount, and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the shifted vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only. Signedness of the
-  element type determines arithmetic vs logical behavior.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `all integer types`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] >> src1[i];  // arithmetic for signed, logical for unsigned
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] >> src1[i];  // arithmetic for signed, logical for unsigned
-```
-
-## Carry Operations
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vshl](./vshl.md)
-- Next op in family: [pto.vaddc](./vaddc.md)
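Note on the removed `pto.vshr` page: its example relies on C's `>>` being arithmetic for signed operands, but C leaves right shifts of negative values implementation-defined. A minimal sketch of both behaviors for 32-bit lanes, written so that neither depends on the compiler's choice. The helper names `shr_logical_u32` and `shr_arith_i32` are hypothetical, and callers are assumed to keep the shift count in `[0, 31]`.

```c
#include <stdint.h>

/* Logical right shift: vacated high bits are filled with zeros. */
static uint32_t shr_logical_u32(uint32_t x, unsigned s) {
    return x >> s;
}

/* Arithmetic right shift: vacated high bits copy the sign bit.
 * For negative x, ~x is non-negative, so shifting it is well defined;
 * complementing again restores the sign-extended result. */
static int32_t shr_arith_i32(int32_t x, unsigned s) {
    return (x < 0) ? ~(~x >> s) : (x >> s);
}
```

The same bit pattern gives different results under the two shifts: `-8` shifted right by one is `-4` arithmetically, while its unsigned reinterpretation `0xFFFFFFF8` becomes `0x7FFFFFFC` logically.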
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshr_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshr_zh.md
deleted file mode 100644
index 35e5a4ac..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vshr_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vshr
-
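-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vshr.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.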
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsub.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsub.md
deleted file mode 100644
index 66fc8ce4..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsub.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vsub.md` -->
-
-# pto.vsub
-
-Standalone reference page for `pto.vsub`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise difference.
-
-## Mechanism
-
-`pto.vsub` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vsub %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i8-i64, f16, bf16, f32`.
-
-## Inputs
-
-`%lhs` is the minuend, `%rhs` is the subtrahend, and `%mask`
-  selects active lanes.
-
-## Expected Outputs
-
-`%result` is the lane-wise difference.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Input and result types MUST match.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i8-i64, f16, bf16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] - src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] - src1[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vadd](./vadd.md)
-- Next op in family: [pto.vmul](./vmul.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsub_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsub_zh.md
deleted file mode 100644
index 7db4c73a..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsub_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsub
-
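-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vsub.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you click this page in the Chinese navigation, you stay under the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.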
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsubc.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsubc.md
deleted file mode 100644
index 44d95917..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsubc.md
+++ /dev/null
@@ -1,104 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vsubc.md` -->
-
-# pto.vsubc
-
-Standalone reference page for `pto.vsubc`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Subtract with borrow output.
-
-## Mechanism
-
-`pto.vsubc` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result, %borrow = pto.vsubc %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>, !pto.mask
-```
-
-## Inputs
-
-`%rhs` is subtracted from `%lhs` lane-wise, and `%mask` selects
-  the active lanes.
-
-## Expected Outputs
-
-`%result` is the arithmetic difference and `%borrow` marks lanes
-  that borrowed.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This operation SHOULD be treated as an
-  unsigned 32-bit carry-chain family unless and until the verifier states
-  otherwise.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++) {
-    dst[i] = src0[i] - src1[i];
-    borrow[i] = (src0[i] < src1[i]);
-}
-```
-
-```mlir
-// Vector addition
-%sum = pto.vadd %a, %b, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Element-wise multiply
-%prod = pto.vmul %x, %y, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Clamp to range [min, max]
-%clamped_low = pto.vmax %input, %min_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-%clamped = pto.vmin %clamped_low, %max_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Bit manipulation
-%masked = pto.vand %data, %bitmask, %mask : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask -> !pto.vreg<64xi32>
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++) {
-    dst[i] = src0[i] - src1[i];
-    borrow[i] = (src0[i] < src1[i]);
-}
-```
-
-## Typical Usage
-
-```mlir
-// Vector addition
-%sum = pto.vadd %a, %b, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Element-wise multiply
-%prod = pto.vmul %x, %y, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Clamp to range [min, max]
-%clamped_low = pto.vmax %input, %min_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-%clamped = pto.vmin %clamped_low, %max_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Bit manipulation
-%masked = pto.vand %data, %bitmask, %mask : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask -> !pto.vreg<64xi32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vaddc](./vaddc.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsubc_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsubc_zh.md
deleted file mode 100644
index 318a1b1c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vsubc_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsubc
-
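-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vsubc.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.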
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vxor.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vxor.md
deleted file mode 100644
index 3950f1db..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vxor.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/binary-vector-ops/vxor.md` -->
-
-# pto.vxor
-
-Standalone reference page for `pto.vxor`. This page belongs to the [Binary Vector Ops](../../binary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Lane-wise bitwise XOR of two vectors.
-
-## Mechanism
-
-`pto.vxor` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vxor %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `all integer types`.
-
-## Inputs
-
-`%lhs` and `%rhs` are the source vectors, and `%mask` selects the active lanes.
-
-## Expected Outputs
-
-`%result` is the lane-wise bitwise XOR.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `all integer types`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] ^ src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src0[i] ^ src1[i];
-```
-
-## Shift
-
-## Related Ops / Family Links
-
-- Family overview: [Binary Vector Ops](../../binary-vector-ops.md)
-- Previous op in family: [pto.vor](./vor.md)
-- Next op in family: [pto.vshl](./vshl.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vxor_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vxor_zh.md
deleted file mode 100644
index d8f084ff..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/binary-vector-ops/vxor_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vxor
-
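-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vxor.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.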
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmp.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmp.md
deleted file mode 100644
index 023b6b89..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmp.md
+++ /dev/null
@@ -1,93 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/compare-select/vcmp.md` -->
-
-# pto.vcmp
-
-Standalone reference page for `pto.vcmp`. This page belongs to the [Compare And Select](../../compare-select.md) family in the PTO ISA manual.
-
-## Summary
-
-Element-wise comparison, output predicate mask.
-
-## Mechanism
-
-`pto.vcmp` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcmp %src0, %src1, %seed, "CMP_MODE" : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.mask
-```
-
-## Inputs
-
-`%src0` and `%src1` are the vectors to compare, `%seed` is the incoming
-  predicate, and `CMP_MODE` selects the comparison predicate.
-
-## Expected Outputs
-
-`%result` is the generated predicate mask.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only lanes enabled by `%seed` participate.
-  Integer and floating-point comparisons follow their own element-type-specific
-  comparison rules.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    if (seed[i])
-        dst[i] = (src0[i] CMP src1[i]) ? 1 : 0;
-```
-
-```mlir
-%all_active = pto.pset_b32 "PAT_ALL" : !pto.mask
-%lt_mask = pto.vcmp %a, %b, %all_active, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-// lt_mask[i] = 1 if a[i] < b[i]
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    if (seed[i])
-        dst[i] = (src0[i] CMP src1[i]) ? 1 : 0;
-```
-
-**Compare modes:**
-
-| Mode | Operation |
-|------|-----------|
-| `eq` | Equal (==) |
-| `ne` | Not equal (!=) |
-| `lt` | Less than (<) |
-| `le` | Less than or equal (<=) |
-| `gt` | Greater than (>) |
-| `ge` | Greater than or equal (>=) |
-
-**Example:**
-```mlir
-%all_active = pto.pset_b32 "PAT_ALL" : !pto.mask
-%lt_mask = pto.vcmp %a, %b, %all_active, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-// lt_mask[i] = 1 if a[i] < b[i]
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Compare And Select](../../compare-select.md)
-- Next op in family: [pto.vcmps](./vcmps.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmp_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmp_zh.md
deleted file mode 100644
index b5bbb6f2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmp_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcmp
-
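-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vcmp.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.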
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmps.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmps.md
deleted file mode 100644
index 6e624290..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmps.md
+++ /dev/null
@@ -1,84 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/compare-select/vcmps.md` -->
-
-# pto.vcmps
-
-Standalone reference page for `pto.vcmps`. This page belongs to the [Compare And Select](../../compare-select.md) family in the PTO ISA manual.
-
-## Summary
-
-Compare vector against scalar.
-
-## Mechanism
-
-`pto.vcmps` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcmps %src, %scalar, %seed, "CMP_MODE" : !pto.vreg<NxT>, T, !pto.mask -> !pto.mask
-```
-
-## Inputs
-
-`%src` is the vector source, `%scalar` is the scalar comparison
-  value, and `%seed` is the incoming predicate.
-
-## Expected Outputs
-
-`%result` is the generated predicate mask.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-For 32-bit scalar forms, the scalar source
-  MUST satisfy the backend's legal scalar-source constraints for this family.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    if (seed[i])
-        dst[i] = (src[i] CMP scalar) ? 1 : 0;
-```
-
-```mlir
-%positive_mask = pto.vcmps %values, %c0_f32, %all_active, "gt"
-    : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-// positive_mask[i] = 1 if values[i] > 0
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    if (seed[i])
-        dst[i] = (src[i] CMP scalar) ? 1 : 0;
-```
-
-**Example:**
-```mlir
-%positive_mask = pto.vcmps %values, %c0_f32, %all_active, "gt"
-    : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-// positive_mask[i] = 1 if values[i] > 0
-```
-
-## Selection Operations
-
-## Related Ops / Family Links
-
-- Family overview: [Compare And Select](../../compare-select.md)
-- Previous op in family: [pto.vcmp](./vcmp.md)
-- Next op in family: [pto.vsel](./vsel.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmps_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmps_zh.md
deleted file mode 100644
index 743d43e8..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vcmps_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcmps
-
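-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vcmps.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.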
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vsel.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vsel.md
deleted file mode 100644
index fd4ae1e8..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vsel.md
+++ /dev/null
@@ -1,80 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/compare-select/vsel.md` -->
-
-# pto.vsel
-
-Standalone reference page for `pto.vsel`. This page belongs to the [Compare And Select](../../compare-select.md) family in the PTO ISA manual.
-
-## Summary
-
-Per-lane select based on mask.
-
-## Mechanism
-
-`pto.vsel` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vsel %src0, %src1, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%src0` is the true-path vector, `%src1` is the false-path vector,
-  and `%mask` selects between them.
-
-## Expected Outputs
-
-`%result` is the selected vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Source vectors and result MUST have matching
-  vector shapes and element types.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = mask[i] ? src0[i] : src1[i];
-```
-
-```mlir
-// dst = mask ? true_vals : false_vals
-%result = pto.vsel %true_vals, %false_vals, %condition
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = mask[i] ? src0[i] : src1[i];
-```
-
-**Example — Conditional assignment:**
-```mlir
-// dst = mask ? true_vals : false_vals
-%result = pto.vsel %true_vals, %false_vals, %condition
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Compare And Select](../../compare-select.md)
-- Previous op in family: [pto.vcmps](./vcmps.md)
-- Next op in family: [pto.vselr](./vselr.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vsel_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vsel_zh.md
deleted file mode 100644
index f818802d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vsel_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsel
-
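-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vsel.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.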
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselr.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselr.md
deleted file mode 100644
index 4b54e515..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselr.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/compare-select/vselr.md` -->
-
-# pto.vselr
-
-Standalone reference page for `pto.vselr`. This page belongs to the [Compare And Select](../../compare-select.md) family in the PTO ISA manual.
-
-## Summary
-
-Select with reversed mask semantics.
-
-## Mechanism
-
-`pto.vselr` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vselr %src0, %src1 : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%src0` and `%src1` are the source vectors.
-
-## Expected Outputs
-
-`%result` is the selected vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This family preserves reversed-select
-  semantics. If the concrete lowering uses an implicit predicate source, that
-  predicate source MUST be documented by the surrounding IR pattern.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = mask[i] ? src1[i] : src0[i];  // reversed from vsel
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = mask[i] ? src1[i] : src0[i];  // reversed from vsel
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Compare And Select](../../compare-select.md)
-- Previous op in family: [pto.vsel](./vsel.md)
-- Next op in family: [pto.vselrv2](./vselrv2.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselr_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselr_zh.md
deleted file mode 100644
index 18021f84..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselr_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vselr
-
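-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vselr.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.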
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselrv2.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselrv2.md
deleted file mode 100644
index 662e654f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselrv2.md
+++ /dev/null
@@ -1,127 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/compare-select/vselrv2.md` -->
-
-# pto.vselrv2
-
-Standalone reference page for `pto.vselrv2`. This page belongs to the [Compare And Select](../../compare-select.md) family in the PTO ISA manual.
-
-## Summary
-
-Variant select form that currently shares the two-vector operand shape of `pto.vselr`.
-
-## Mechanism
-
-`pto.vselrv2` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vselrv2 %src0, %src1 : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%src0` and `%src1` are the source vectors.
-
-## Expected Outputs
-
-`%result` is the selected vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This page records the surface shape only.
-  Lowering MUST preserve the exact A5 variant semantics selected for this form.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-// Clamp negative values to zero (manual ReLU)
-%all = pto.pset_b32 "PAT_ALL" : !pto.mask
-%zero = pto.vbr %c0_f32 : f32 -> !pto.vreg<64xf32>
-%neg_mask = pto.vcmps %input, %c0_f32, %all, "lt" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-%clamped = pto.vsel %zero, %input, %neg_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Element-wise max via compare+select
-%gt_mask = pto.vcmp %a, %b, %all, "gt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-%max_ab = pto.vsel %a, %b, %gt_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Threshold filter
-%above_thresh = pto.vcmps %scores, %threshold, %all, "ge" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-%filtered = pto.vsel %scores, %zero, %above_thresh : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-```mlir
-// Softmax safe exp: exp(x - max) where x < max returns exp of negative
-// but we want to clamp to avoid underflow
-
-%all = pto.pset_b32 "PAT_ALL" : !pto.mask
-
-// 1. Compare against threshold
-%too_small = pto.vcmps %x_minus_max, %min_exp_arg, %all, "lt"
-    : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-
-// 2. Clamp values below threshold
-%clamped = pto.vsel %min_exp_arg_vec, %x_minus_max, %too_small
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// 3. Safe exp
-%exp_result = pto.vexp %clamped, %all : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Detailed Notes
-
-## Typical Usage
-
-```mlir
-// Clamp negative values to zero (manual ReLU)
-%all = pto.pset_b32 "PAT_ALL" : !pto.mask
-%zero = pto.vbr %c0_f32 : f32 -> !pto.vreg<64xf32>
-%neg_mask = pto.vcmps %input, %c0_f32, %all, "lt" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-%clamped = pto.vsel %zero, %input, %neg_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Element-wise max via compare+select
-%gt_mask = pto.vcmp %a, %b, %all, "gt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-%max_ab = pto.vsel %a, %b, %gt_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Threshold filter
-%above_thresh = pto.vcmps %scores, %threshold, %all, "ge" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-%filtered = pto.vsel %scores, %zero, %above_thresh : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Compare + Select Pattern
-
-```mlir
-// Softmax safe exp: exp(x - max) where x < max returns exp of negative
-// but we want to clamp to avoid underflow
-
-%all = pto.pset_b32 "PAT_ALL" : !pto.mask
-
-// 1. Compare against threshold
-%too_small = pto.vcmps %x_minus_max, %min_exp_arg, %all, "lt"
-    : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-
-// 2. Clamp values below threshold
-%clamped = pto.vsel %min_exp_arg_vec, %x_minus_max, %too_small
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// 3. Safe exp
-%exp_result = pto.vexp %clamped, %all : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Compare And Select](../../compare-select.md)
-- Previous op in family: [pto.vselr](./vselr.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselrv2_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselrv2_zh.md
deleted file mode 100644
index 03abcd94..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/compare-select/vselrv2_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vselrv2
-
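-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vselrv2.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.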
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vci.md b/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vci.md
deleted file mode 100644
index 9f975d75..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vci.md
+++ /dev/null
@@ -1,56 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/conversion-ops/vci.md` -->
-
-# pto.vci
-
-Standalone reference page for `pto.vci`. This page belongs to the [Conversion Ops](../../conversion-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Standalone contract page for `pto.vci`.
-
-## Mechanism
-
-`pto.vci` belongs to the `pto.v*` conversion surface. It changes vector element interpretation, width, rounding, saturation, or index-generation state without leaving the vector-register model.
-
-## Syntax
-
-
-## Inputs
-
-This operation follows the operand model of the [Conversion Ops](../../conversion-ops.md) family: SSA vector values carry payloads, masks gate active lanes when present, and family-specific attributes select rounding, selection, distribution, or fused-mode behavior.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vci
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Conversion Ops](../../conversion-ops.md)
-- Next op in family: [pto.vcvt](./vcvt.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vci_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vci_zh.md
deleted file mode 100644
index 022994a0..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vci_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vci
-
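-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vci.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.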
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vcvt.md b/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vcvt.md
deleted file mode 100644
index 1f139427..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vcvt.md
+++ /dev/null
@@ -1,57 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/conversion-ops/vcvt.md` -->
-
-# pto.vcvt
-
-Standalone reference page for `pto.vcvt`. This page belongs to the [Conversion Ops](../../conversion-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Standalone contract page for `pto.vcvt`.
-
-## Mechanism
-
-`pto.vcvt` belongs to the `pto.v*` conversion surface. It changes vector element interpretation, width, rounding, saturation, or index-generation state without leaving the vector-register model.
-
-## Syntax
-
-
-## Inputs
-
-This operation follows the operand model of the [Conversion Ops](../../conversion-ops.md) family: SSA vector values carry payloads, masks gate active lanes when present, and family-specific attributes select rounding, selection, distribution, or fused-mode behavior.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vcvt
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Conversion Ops](../../conversion-ops.md)
-- Previous op in family: [pto.vci](./vci.md)
-- Next op in family: [pto.vtrc](./vtrc.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vcvt_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vcvt_zh.md
deleted file mode 100644
index 6f8f0461..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vcvt_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcvt
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vcvt.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vtrc.md b/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vtrc.md
deleted file mode 100644
index ba96779d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vtrc.md
+++ /dev/null
@@ -1,56 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/conversion-ops/vtrc.md` -->
-
-# pto.vtrc
-
-Standalone reference page for `pto.vtrc`. This page belongs to the [Conversion Ops](../../conversion-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Standalone contract page for `pto.vtrc`.
-
-## Mechanism
-
-`pto.vtrc` belongs to the `pto.v*` conversion surface. It changes vector element interpretation, width, rounding, saturation, or index-generation state without leaving the vector-register model.
-
-## Syntax
-
-
-## Inputs
-
-This operation follows the operand model of the [Conversion Ops](../../conversion-ops.md) family: SSA vector values carry payloads, masks gate active lanes when present, and family-specific attributes select rounding, selection, distribution, or fused-mode behavior.
-
-## Expected Outputs
-
-This form is primarily defined by the side effect it has on control state, predicate state, or memory. It does not publish a new payload SSA result beyond any explicit state outputs shown in the syntax.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This operation inherits the legality and operand-shape rules of its family overview. Any target-specific narrowing of element types, distributions, pipe/event spaces, or configuration tuples must be stated by the selected target profile.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vtrc
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Conversion Ops](../../conversion-ops.md)
-- Previous op in family: [pto.vcvt](./vcvt.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vtrc_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vtrc_zh.md
deleted file mode 100644
index 1e32293b..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/conversion-ops/vtrc_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vtrc
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vtrc.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlv.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlv.md
deleted file mode 100644
index 7855a6ed..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlv.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vdintlv.md` -->
-
-# pto.vdintlv
-
-Standalone reference page for `pto.vdintlv`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Deinterleave elements into even/odd.
-
-## Mechanism
-
-`pto.vdintlv` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%low, %high = pto.vdintlv %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>, !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%lhs` and `%rhs` represent the interleaved source stream in the
-  current PTO ISA vector surface representation.
-
-## Expected Outputs
-
-`%low` and `%high` are the separated destination vectors.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-The two outputs form the even/odd
-  deinterleave result pair, and their ordering MUST be preserved.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-// Deinterleave: separate even/odd elements
-// low  = {src0[0], src0[2], src0[4], ...}  // even
-// high = {src0[1], src0[3], src0[5], ...}  // odd
-```
-
-## Detailed Notes
-
-```c
-// Deinterleave: separate even/odd elements
-// low  = {src0[0], src0[2], src0[4], ...}  // even
-// high = {src0[1], src0[3], src0[5], ...}  // odd
-```
-
-## Slide / Shift
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vintlv](./vintlv.md)
-- Next op in family: [pto.vslide](./vslide.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlv_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlv_zh.md
deleted file mode 100644
index 2b6eb047..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlv_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vdintlv
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vdintlv.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlvv2.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlvv2.md
deleted file mode 100644
index c8c4e990..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlvv2.md
+++ /dev/null
@@ -1,62 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vdintlvv2.md` -->
-
-# pto.vdintlvv2
-
-Standalone reference page for `pto.vdintlvv2`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Deinterleave two source vectors and return only the half selected by `PART`.
-
-## Mechanism
-
-`pto.vdintlvv2` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vdintlvv2 %lhs, %rhs, "PART" : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%lhs` and `%rhs` are source vectors and `PART` selects the
-  returned half of the V2 deinterleave result.
-
-## Expected Outputs
-
-`%result` is the selected deinterleave half.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This op exposes only one half of the V2
-  result in SSA form.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%result = pto.vdintlvv2 %lhs, %rhs, "PART" : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vintlvv2](./vintlvv2.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlvv2_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlvv2_zh.md
deleted file mode 100644
index 3c293109..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vdintlvv2_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vdintlvv2
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vdintlvv2.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlv.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlv.md
deleted file mode 100644
index 8e9e0bd0..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlv.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vintlv.md` -->
-
-# pto.vintlv
-
-Standalone reference page for `pto.vintlv`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Interleave elements from two sources.
-
-## Mechanism
-
-`pto.vintlv` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%low, %high = pto.vintlv %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>, !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%lhs` and `%rhs` are the two source vectors.
-
-## Expected Outputs
-
-`%low` and `%high` are the two destination vectors.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-The two outputs form a paired interleave
-  result. The PTO ISA vector surface representation exposes that pair as two SSA results, and the pair ordering MUST
-  be preserved.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-// Interleave: alternate elements of src0 and src1 (src0 lands on even result positions, src1 on odd)
-// low  = {src0[0], src1[0], src0[1], src1[1], ...}
-// high = {src0[N/2], src1[N/2], src0[N/2+1], src1[N/2+1], ...}
-```
-
-## Detailed Notes
-
-```c
-// Interleave: alternate elements of src0 and src1 (src0 lands on even result positions, src1 on odd)
-// low  = {src0[0], src1[0], src0[1], src1[1], ...}
-// high = {src0[N/2], src1[N/2], src0[N/2+1], src1[N/2+1], ...}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Next op in family: [pto.vdintlv](./vdintlv.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlv_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlv_zh.md
deleted file mode 100644
index c8858852..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlv_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vintlv
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vintlv.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlvv2.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlvv2.md
deleted file mode 100644
index ae2329ad..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlvv2.md
+++ /dev/null
@@ -1,63 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vintlvv2.md` -->
-
-# pto.vintlvv2
-
-Standalone reference page for `pto.vintlvv2`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Interleave two source vectors and return only the half selected by `PART`.
-
-## Mechanism
-
-`pto.vintlvv2` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vintlvv2 %lhs, %rhs, "PART" : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%lhs` and `%rhs` are source vectors and `PART` selects the
-  returned half of the V2 interleave result.
-
-## Expected Outputs
-
-`%result` is the selected interleave half.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This op exposes only one half of the V2
-  result in SSA form.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%result = pto.vintlvv2 %lhs, %rhs, "PART" : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vzunpack](./vzunpack.md)
-- Next op in family: [pto.vdintlvv2](./vdintlvv2.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlvv2_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlvv2_zh.md
deleted file mode 100644
index 26567d49..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vintlvv2_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vintlvv2
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vintlvv2.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vpack.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vpack.md
deleted file mode 100644
index b811b50f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vpack.md
+++ /dev/null
@@ -1,74 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vpack.md` -->
-
-# pto.vpack
-
-Standalone reference page for `pto.vpack`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Narrowing pack — two wide vectors to one narrow vector.
-
-## Mechanism
-
-`pto.vpack` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vpack %src0, %src1, %part : !pto.vreg<NxT_wide>, !pto.vreg<NxT_wide>, index -> !pto.vreg<2NxT_narrow>
-```
-
-## Inputs
-
-`%src0` and `%src1` are wide source vectors and `%part` selects
-  the packing submode.
-
-## Expected Outputs
-
-`%result` is the packed narrow vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Packing is a narrowing conversion. Source
-  values that do not fit the destination width follow the truncation semantics
-  of the selected packing mode.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-// e.g., two vreg<64xi32> → one vreg<128xi16>
-for (int i = 0; i < N; i++) {
-    dst[i]     = truncate(src0[i]);
-    dst[N + i] = truncate(src1[i]);
-}
-```
-
-## Detailed Notes
-
-```c
-// e.g., two vreg<64xi32> → one vreg<128xi16>
-for (int i = 0; i < N; i++) {
-    dst[i]     = truncate(src0[i]);
-    dst[N + i] = truncate(src1[i]);
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vperm](./vperm.md)
-- Next op in family: [pto.vsunpack](./vsunpack.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vpack_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vpack_zh.md
deleted file mode 100644
index 2c43846c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vpack_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vpack
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vpack.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vperm.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vperm.md
deleted file mode 100644
index d14f2b59..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vperm.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vperm.md` -->
-
-# pto.vperm
-
-Standalone reference page for `pto.vperm`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-In-register permute (table lookup). **Not** memory gather.
-
-## Mechanism
-
-`pto.vperm` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vperm %src, %index : !pto.vreg<NxT>, !pto.vreg<NxI> -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%src` is the source vector and `%index` supplies per-lane source
-  indices.
-
-## Expected Outputs
-
-`%result` is the permuted vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is an in-register permutation family.
-  `%index` values outside the legal range follow the wrap/clamp behavior of the
-  selected form.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[index[i] % N];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[index[i] % N];
-```
-
-**Note:** This operates on register contents, unlike `pto.vgather2` which reads from UB memory.
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vusqz](./vusqz.md)
-- Next op in family: [pto.vpack](./vpack.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vperm_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vperm_zh.md
deleted file mode 100644
index d21e1771..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vperm_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vperm
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vperm.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vshift.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vshift.md
deleted file mode 100644
index aa510d83..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vshift.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vshift.md` -->
-
-# pto.vshift
-
-Standalone reference page for `pto.vshift`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Single-source slide (shift with zero fill).
-
-## Mechanism
-
-`pto.vshift` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vshift %src, %amt : !pto.vreg<NxT>, i16 -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%src` is the source vector and `%amt` is the slide amount.
-
-## Expected Outputs
-
-`%result` is the shifted vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This surface represents the single-source
-  slide/shift family. Zero-fill versus other fill behavior MUST match the
-  selected form.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (i >= amt) ? src[i - amt] : 0;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (i >= amt) ? src[i - amt] : 0;
-```
-
-## Compress / Expand
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vslide](./vslide.md)
-- Next op in family: [pto.vsqz](./vsqz.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vshift_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vshift_zh.md
deleted file mode 100644
index 3d5a4789..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vshift_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vshift
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vshift.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vslide.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vslide.md
deleted file mode 100644
index 7f059ce6..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vslide.md
+++ /dev/null
@@ -1,76 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vslide.md` -->
-
-# pto.vslide
-
-Standalone reference page for `pto.vslide`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Concatenate two vectors and extract N-element window at offset.
-
-## Mechanism
-
-`pto.vslide` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vslide %src0, %src1, %amt : !pto.vreg<NxT>, !pto.vreg<NxT>, i16 -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%src0` and `%src1` provide the concatenated source window and
-  `%amt` selects the extraction offset.
-
-## Expected Outputs
-
-`%result` is the extracted destination window.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-`pto.vslide` operates on the logical
-  concatenation of `%src1` and `%src0`. The source order and extraction offset
-  MUST be preserved exactly.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-// Conceptually: tmp[0..2N-1] = {src1, src0}
-// dst[i] = tmp[N - amt + i]   (for 0 <= amt <= N; matches the loop below)
-if (amt >= 0)
-    for (int i = 0; i < N; i++)
-        dst[i] = (i >= amt) ? src0[i - amt] : src1[N - amt + i];
-```
-
-## Detailed Notes
-
-```c
-// Conceptually: tmp[0..2N-1] = {src1, src0}
-// dst[i] = tmp[N - amt + i]   (for 0 <= amt <= N; matches the loop below)
-if (amt >= 0)
-    for (int i = 0; i < N; i++)
-        dst[i] = (i >= amt) ? src0[i - amt] : src1[N - amt + i];
-```
-
-**Use case:** Sliding window operations, shift register patterns.
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vdintlv](./vdintlv.md)
-- Next op in family: [pto.vshift](./vshift.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vslide_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vslide_zh.md
deleted file mode 100644
index 6b7219f2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vslide_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vslide
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vslide.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsqz.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsqz.md
deleted file mode 100644
index b03a085f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsqz.md
+++ /dev/null
@@ -1,73 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vsqz.md` -->
-
-# pto.vsqz
-
-Standalone reference page for `pto.vsqz`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Compress — pack active lanes to front.
-
-## Mechanism
-
-`pto.vsqz` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vsqz %src, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%src` is the source vector and `%mask` selects which elements are
-  kept.
-
-## Expected Outputs
-
-`%result` is the compacted vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is a compaction operation: kept elements are packed toward lane 0 and the remaining lanes are zero-filled.
-  Preserved element order MUST match source lane order.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-int j = 0;
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[j++] = src[i];
-while (j < N) dst[j++] = 0;
-```
-
-## Detailed Notes
-
-```c
-int j = 0;
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[j++] = src[i];
-while (j < N) dst[j++] = 0;
-```
-
-**Use case:** Sparse data compaction, filtering.
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vshift](./vshift.md)
-- Next op in family: [pto.vusqz](./vusqz.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsqz_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsqz_zh.md
deleted file mode 100644
index a8a7824b..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsqz_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsqz
-
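-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vsqz.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.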
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsunpack.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsunpack.md
deleted file mode 100644
index cb915cbf..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsunpack.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vsunpack.md` -->
-
-# pto.vsunpack
-
-Standalone reference page for `pto.vsunpack`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Sign-extending unpack — narrow to wide (half).
-
-## Mechanism
-
-`pto.vsunpack` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vsunpack %src, %part : !pto.vreg<NxT_narrow>, index -> !pto.vreg<N/2xT_wide>
-```
-
-## Inputs
-
-`%src` is the packed narrow vector and `%part` selects which half
-  is unpacked.
-
-## Expected Outputs
-
-`%result` is the widened vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is the sign-extending unpack family.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-// e.g., vreg<128xi16> → vreg<64xi32> (one half); part_offset is the start of the half selected by %part
-for (int i = 0; i < N/2; i++)
-    dst[i] = sign_extend(src[part_offset + i]);
-```
-
-## Detailed Notes
-
-```c
-// e.g., vreg<128xi16> → vreg<64xi32> (one half); part_offset is the start of the half selected by %part
-for (int i = 0; i < N/2; i++)
-    dst[i] = sign_extend(src[part_offset + i]);
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vpack](./vpack.md)
-- Next op in family: [pto.vzunpack](./vzunpack.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsunpack_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsunpack_zh.md
deleted file mode 100644
index 5b39f027..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vsunpack_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsunpack
-
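-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vsunpack.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.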
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vusqz.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vusqz.md
deleted file mode 100644
index 75774f44..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vusqz.md
+++ /dev/null
@@ -1,73 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vusqz.md` -->
-
-# pto.vusqz
-
-Standalone reference page for `pto.vusqz`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Expand — scatter front elements to active positions.
-
-## Mechanism
-
-`pto.vusqz` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vusqz %mask : !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%mask` is the expansion/placement predicate.
-
-## Expected Outputs
-
-`%result` is the expanded vector image.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-The source-front stream (`src_front` in the example) is implicit: the
-  current surface takes no source operand. Lane placement for active and inactive positions MUST be
-  preserved exactly.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-int j = 0;
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[i] = src_front[j++];
-    else dst[i] = 0;
-```
-
-## Detailed Notes
-
-```c
-int j = 0;
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[i] = src_front[j++];
-    else dst[i] = 0;
-```
-
-## Permutation
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vsqz](./vsqz.md)
-- Next op in family: [pto.vperm](./vperm.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vusqz_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vusqz_zh.md
deleted file mode 100644
index f87afb43..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vusqz_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vusqz
-
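-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vusqz.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.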
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vzunpack.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vzunpack.md
deleted file mode 100644
index cb801e2f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vzunpack.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/data-rearrangement/vzunpack.md` -->
-
-# pto.vzunpack
-
-Standalone reference page for `pto.vzunpack`. This page belongs to the [Data Rearrangement](../../data-rearrangement.md) family in the PTO ISA manual.
-
-## Summary
-
-Zero-extending unpack — narrow to wide (half).
-
-## Mechanism
-
-`pto.vzunpack` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vzunpack %src, %part : !pto.vreg<NxT_narrow>, index -> !pto.vreg<N/2xT_wide>
-```
-
-## Inputs
-
-`%src` is the packed narrow vector and `%part` selects which half
-  is unpacked.
-
-## Expected Outputs
-
-`%result` is the widened vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is the zero-extending unpack family.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N/2; i++)
-    dst[i] = zero_extend(src[part_offset + i]);  // part_offset: start of the half selected by %part
-```
-
-```mlir
-// AoS → SoA conversion using deinterleave
-%even, %odd = pto.vdintlv %interleaved0, %interleaved1
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>, !pto.vreg<64xf32>
-
-// Filter: keep only elements passing condition
-%pass_mask = pto.vcmps %values, %threshold, %all, "gt"
-    : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-%compacted = pto.vsqz %values, %pass_mask
-    : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Sliding window sum
-%prev_window = pto.vslide %curr, %prev, %c1
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, i16 -> !pto.vreg<64xf32>
-%window_sum = pto.vadd %curr, %prev_window, %all
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Type narrowing via pack
-%packed_i16 = pto.vpack %wide0_i32, %wide1_i32, %c0
-    : !pto.vreg<64xi32>, !pto.vreg<64xi32>, index -> !pto.vreg<128xi16>
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N/2; i++)
-    dst[i] = zero_extend(src[part_offset + i]);  // part_offset: start of the half selected by %part
-```
-
-## Typical Usage
-
-```mlir
-// AoS → SoA conversion using deinterleave
-%even, %odd = pto.vdintlv %interleaved0, %interleaved1
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>, !pto.vreg<64xf32>
-
-// Filter: keep only elements passing condition
-%pass_mask = pto.vcmps %values, %threshold, %all, "gt"
-    : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask
-%compacted = pto.vsqz %values, %pass_mask
-    : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Sliding window sum
-%prev_window = pto.vslide %curr, %prev, %c1
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, i16 -> !pto.vreg<64xf32>
-%window_sum = pto.vadd %curr, %prev_window, %all
-    : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Type narrowing via pack
-%packed_i16 = pto.vpack %wide0_i32, %wide1_i32, %c0
-    : !pto.vreg<64xi32>, !pto.vreg<64xi32>, index -> !pto.vreg<128xi16>
-```
-
-## V2 Interleave Forms
-
-## Related Ops / Family Links
-
-- Family overview: [Data Rearrangement](../../data-rearrangement.md)
-- Previous op in family: [pto.vsunpack](./vsunpack.md)
-- Next op in family: [pto.vintlvv2](./vintlvv2.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vzunpack_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vzunpack_zh.md
deleted file mode 100644
index 50af28af..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/data-rearrangement/vzunpack_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vzunpack
-
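-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vzunpack.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.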
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vbr.md b/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vbr.md
deleted file mode 100644
index 69a86e35..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vbr.md
+++ /dev/null
@@ -1,74 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/predicate-and-materialization/vbr.md` -->
-
-# pto.vbr
-
-Standalone reference page for `pto.vbr`. This page belongs to the [Predicate And Materialization](../../predicate-and-materialization.md) family in the PTO ISA manual.
-
-## Summary
-
-Broadcast scalar to all vector lanes.
-
-## Mechanism
-
-`pto.vbr` materializes scalar or selected-lane state into a vector register. The architectural result is a new vector-register value, so the operation stays in the `pto.v*` surface even when its input is scalar.
-
-## Syntax
-
-```mlir
-%result = pto.vbr %value : T -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%value` is the scalar source.
-
-## Expected Outputs
-
-`%result` is a vector whose active lanes all carry `%value`.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Supported forms are `b8`, `b16`, and `b32`. For `b8`, only the low 8 bits of
-  the scalar source are consumed.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = value;
-```
-
-```mlir
-%one = pto.vbr %c1_f32 : f32 -> !pto.vreg<64xf32>
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = value;
-```
-
-**Example:**
-```mlir
-%one = pto.vbr %c1_f32 : f32 -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate And Materialization](../../predicate-and-materialization.md)
-- Next op in family: [pto.vdup](./vdup.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vbr_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vbr_zh.md
deleted file mode 100644
index eb8520e2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vbr_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vbr
-
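-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vbr.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.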
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vdup.md b/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vdup.md
deleted file mode 100644
index 194ba078..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vdup.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/predicate-and-materialization/vdup.md` -->
-
-# pto.vdup
-
-Standalone reference page for `pto.vdup`. This page belongs to the [Predicate And Materialization](../../predicate-and-materialization.md) family in the PTO ISA manual.
-
-## Summary
-
-Duplicate scalar or vector element to all lanes.
-
-## Mechanism
-
-`pto.vdup` materializes scalar or selected-lane state into a vector register. The architectural result is a new vector-register value, so the operation stays in the `pto.v*` surface even when its input is scalar.
-
-## Syntax
-
-```mlir
-%result = pto.vdup %input {position = "POSITION"} : T|!pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input` supplies the scalar or source-lane value selected by `position`.
-
-## Expected Outputs
-
-`%result` is the duplicated vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-`position` selects which source element or scalar position is duplicated. The
-  current PTO ISA vector surface representation models that selector as an attribute rather than a
-  separate operand.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = input_scalar_or_element;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = input_scalar_or_element;
-```
-
-## Predicate Generation
-
-## Related Ops / Family Links
-
-- Family overview: [Predicate And Materialization](../../predicate-and-materialization.md)
-- Previous op in family: [pto.vbr](./vbr.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vdup_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vdup_zh.md
deleted file mode 100644
index b93d1313..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/predicate-and-materialization/vdup_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vdup
-
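-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vdup.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.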
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcadd.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcadd.md
deleted file mode 100644
index bbdd75a7..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcadd.md
+++ /dev/null
@@ -1,78 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/reduction-ops/vcadd.md` -->
-
-# pto.vcadd
-
-Standalone reference page for `pto.vcadd`. This page belongs to the [Reduction Ops](../../reduction-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Sum all elements. Result in lane 0, others zeroed.
-
-## Mechanism
-
-`pto.vcadd` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcadd %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i16-i64, f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects participating
-  lanes.
-
-## Expected Outputs
-
-`%result` contains the reduction result in its low element(s).
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Some narrow integer forms may widen the
-  internal accumulation or result placement. If all predicate bits are zero, the
-  result is zero.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i16-i64, f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-T sum = 0;
-for (int i = 0; i < N; i++)
-    if (mask[i]) sum += src[i];  // only lanes selected by %mask participate
-dst[0] = sum;
-for (int i = 1; i < N; i++)
-    dst[i] = 0;
-```
-
-## Detailed Notes
-
-```c
-T sum = 0;
-for (int i = 0; i < N; i++)
-    if (mask[i]) sum += src[i];  // only lanes selected by %mask participate
-dst[0] = sum;
-for (int i = 1; i < N; i++)
-    dst[i] = 0;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduction Ops](../../reduction-ops.md)
-- Next op in family: [pto.vcmax](./vcmax.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcadd_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcadd_zh.md
deleted file mode 100644
index acca67d6..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcadd_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcadd
-
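-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vcadd.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.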
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgadd.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgadd.md
deleted file mode 100644
index 6d9fbf4f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgadd.md
+++ /dev/null
@@ -1,87 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/reduction-ops/vcgadd.md` -->
-
-# pto.vcgadd
-
-Standalone reference page for `pto.vcgadd`. This page belongs to the [Reduction Ops](../../reduction-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Sum within each VLane. 8 results at indices 0, 8, 16, 24, 32, 40, 48, 56 (for f32).
-
-## Mechanism
-
-`pto.vcgadd` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcgadd %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i16-i32, f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects participating
-  lanes.
-
-## Expected Outputs
-
-`%result` contains one sum per 32-byte VLane group, written
-  into the low slot of each group; the remaining slots of each group are zeroed.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is a per-32-byte VLane-group reduction.
-  Inactive lanes are treated as zero.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i16-i32, f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-int K = N / 8;  // elements per VLane
-for (int g = 0; g < 8; g++) {
-    T sum = 0;
-    for (int i = 0; i < K; i++)
-        if (mask[g*K + i]) sum += src[g*K + i];  // inactive lanes are treated as zero
-    dst[g*K] = sum;
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;
-}
-// For f32: results at dst[0], dst[8], dst[16], dst[24], dst[32], dst[40], dst[48], dst[56]
-```
-
-## Detailed Notes
-
-```c
-int K = N / 8;  // elements per VLane
-for (int g = 0; g < 8; g++) {
-    T sum = 0;
-    for (int i = 0; i < K; i++)
-        sum += src[g*K + i];
-    dst[g*K] = sum;
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;
-}
-// For f32: results at dst[0], dst[8], dst[16], dst[24], dst[32], dst[40], dst[48], dst[56]
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduction Ops](../../reduction-ops.md)
-- Previous op in family: [pto.vcmin](./vcmin.md)
-- Next op in family: [pto.vcgmax](./vcgmax.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgadd_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgadd_zh.md
deleted file mode 100644
index 9eb817e1..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgadd_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcgadd
-
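-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vcgadd.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.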
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmax.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmax.md
deleted file mode 100644
index 4c06fbaa..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmax.md
+++ /dev/null
@@ -1,84 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/reduction-ops/vcgmax.md` -->
-
-# pto.vcgmax
-
-Standalone reference page for `pto.vcgmax`. This page belongs to the [Reduction Ops](../../reduction-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Max within each VLane.
-
-## Mechanism
-
-`pto.vcgmax` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcgmax %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i16-i32, f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects participating
-  lanes.
-
-## Expected Outputs
-
-`%result` contains one maximum per 32-byte VLane group.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Grouping is by hardware 32-byte VLane, not by
-  arbitrary software subvector.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i16-i32, f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-int K = N / 8;
-for (int g = 0; g < 8; g++) {
-    T mx = -INF;
-    for (int i = 0; i < K; i++)
-        if (src[g*K + i] > mx) mx = src[g*K + i];
-    dst[g*K] = mx;
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;
-}
-```
-
-## Detailed Notes
-
-```c
-int K = N / 8;
-for (int g = 0; g < 8; g++) {
-    T mx = -INF;
-    for (int i = 0; i < K; i++)
-        if (src[g*K + i] > mx) mx = src[g*K + i];
-    dst[g*K] = mx;
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduction Ops](../../reduction-ops.md)
-- Previous op in family: [pto.vcgadd](./vcgadd.md)
-- Next op in family: [pto.vcgmin](./vcgmin.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmax_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmax_zh.md
deleted file mode 100644
index 1df14f84..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmax_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcgmax
-
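-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vcgmax.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you stay on the Chinese path; for complete details, use the English page above or the existing Chinese reference pages.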
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmin.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmin.md
deleted file mode 100644
index 4cc8e88e..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmin.md
+++ /dev/null
@@ -1,86 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/reduction-ops/vcgmin.md` -->
-
-# pto.vcgmin
-
-Standalone reference page for `pto.vcgmin`. This page belongs to the [Reduction Ops](../../reduction-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Min within each VLane.
-
-## Mechanism
-
-`pto.vcgmin` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcgmin %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i16-i32, f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects participating
-  lanes.
-
-## Expected Outputs
-
-`%result` contains one minimum per 32-byte VLane group.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Grouping is by hardware 32-byte VLane, not by
-  arbitrary software subvector.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i16-i32, f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-int K = N / 8;
-for (int g = 0; g < 8; g++) {
-    T mn = INF;
-    for (int i = 0; i < K; i++)
-        if (src[g*K + i] < mn) mn = src[g*K + i];
-    dst[g*K] = mn;
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;
-}
-```
-
-## Detailed Notes
-
-```c
-int K = N / 8;
-for (int g = 0; g < 8; g++) {
-    T mn = INF;
-    for (int i = 0; i < K; i++)
-        if (src[g*K + i] < mn) mn = src[g*K + i];
-    dst[g*K] = mn;
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;
-}
-```
-
-## Prefix Operations
-
-## Related Ops / Family Links
-
-- Family overview: [Reduction Ops](../../reduction-ops.md)
-- Previous op in family: [pto.vcgmax](./vcgmax.md)
-- Next op in family: [pto.vcpadd](./vcpadd.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmin_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmin_zh.md
deleted file mode 100644
index f857ce03..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcgmin_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcgmin
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vcgmin.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmax.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmax.md
deleted file mode 100644
index aedbae86..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmax.md
+++ /dev/null
@@ -1,79 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/reduction-ops/vcmax.md` -->
-
-# pto.vcmax
-
-Standalone reference page for `pto.vcmax`. This page belongs to the [Reduction Ops](../../reduction-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Finds the maximum element together with its index (argmax). The resulting value and index are placed in lane 0.
-
-## Mechanism
-
-`pto.vcmax` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcmax %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i16-i32, f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects participating
-  lanes.
-
-## Expected Outputs
-
-`%result` carries the reduction result in the low destination
-  positions.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This family computes both the extremum and
-  location information, but the exact packing of that information into the
-  destination vector depends on the chosen form. If all predicate bits are zero,
-  the result follows the zero-filled convention.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i16-i32, f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-T mx = -INF; int idx = 0;
-for (int i = 0; i < N; i++)
-    if (src[i] > mx) { mx = src[i]; idx = i; }
-dst_val[0] = mx;
-dst_idx[0] = idx;
-```
-
-## Detailed Notes
-
-```c
-T mx = -INF; int idx = 0;
-for (int i = 0; i < N; i++)
-    if (src[i] > mx) { mx = src[i]; idx = i; }
-dst_val[0] = mx;
-dst_idx[0] = idx;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduction Ops](../../reduction-ops.md)
-- Previous op in family: [pto.vcadd](./vcadd.md)
-- Next op in family: [pto.vcmin](./vcmin.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmax_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmax_zh.md
deleted file mode 100644
index 80f1aedc..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmax_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcmax
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vcmax.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmin.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmin.md
deleted file mode 100644
index 65231315..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmin.md
+++ /dev/null
@@ -1,93 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/reduction-ops/vcmin.md` -->
-
-# pto.vcmin
-
-Standalone reference page for `pto.vcmin`. This page belongs to the [Reduction Ops](../../reduction-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Finds the minimum element together with its index (argmin). The resulting value and index are placed in lane 0.
-
-## Mechanism
-
-`pto.vcmin` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcmin %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i16-i32, f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects participating
-  lanes.
-
-## Expected Outputs
-
-`%result` carries the reduction result in the low destination
-  positions.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-As with `pto.vcmax`, the exact value/index
-  packing depends on the chosen form and MUST be preserved consistently.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i16-i32, f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-T mn = INF; int idx = 0;
-for (int i = 0; i < N; i++)
-    if (src[i] < mn) { mn = src[i]; idx = i; }
-dst_val[0] = mn;
-dst_idx[0] = idx;
-```
-
-```
-vreg layout (f32 example, 64 elements total):
-VLane 0: [0..7]   VLane 1: [8..15]  VLane 2: [16..23] VLane 3: [24..31]
-VLane 4: [32..39] VLane 5: [40..47] VLane 6: [48..55] VLane 7: [56..63]
-```
-
-## Detailed Notes
-
-```c
-T mn = INF; int idx = 0;
-for (int i = 0; i < N; i++)
-    if (src[i] < mn) { mn = src[i]; idx = i; }
-dst_val[0] = mn;
-dst_idx[0] = idx;
-```
-
-## Per-VLane (Group) Reductions
-
-The vector register is organized as **8 VLanes** of 32 bytes each. Group reductions operate within each VLane independently.
-
-```
-vreg layout (f32 example, 64 elements total):
-VLane 0: [0..7]   VLane 1: [8..15]  VLane 2: [16..23] VLane 3: [24..31]
-VLane 4: [32..39] VLane 5: [40..47] VLane 6: [48..55] VLane 7: [56..63]
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduction Ops](../../reduction-ops.md)
-- Previous op in family: [pto.vcmax](./vcmax.md)
-- Next op in family: [pto.vcgadd](./vcgadd.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmin_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmin_zh.md
deleted file mode 100644
index 59689e9a..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcmin_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcmin
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vcmin.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcpadd.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcpadd.md
deleted file mode 100644
index 521a6630..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcpadd.md
+++ /dev/null
@@ -1,120 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/reduction-ops/vcpadd.md` -->
-
-# pto.vcpadd
-
-Standalone reference page for `pto.vcpadd`. This page belongs to the [Reduction Ops](../../reduction-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Inclusive prefix sum (scan).
-
-## Mechanism
-
-`pto.vcpadd` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcpadd %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects participating
-  lanes.
-
-## Expected Outputs
-
-`%result` is the inclusive prefix-sum vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only floating-point element types are
-  documented on the current A5 surface here.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-dst[0] = src[0];
-for (int i = 1; i < N; i++)
-    dst[i] = dst[i-1] + src[i];
-```
-
-```c
-// input:  [1, 2, 3, 4, 5, ...]
-// output: [1, 3, 6, 10, 15, ...]
-```
-
-```mlir
-// Softmax: find max for numerical stability
-%max_vec = pto.vcmax %logits, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-// max is in lane 0, broadcast it
-%max_broadcast = pto.vlds %ub_tmp[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-
-// Row-wise sum using vcgadd (for 8-row tile)
-%row_sums = pto.vcgadd %tile, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-// Results at indices 0, 8, 16, 24, 32, 40, 48, 56
-
-// Full vector sum for normalization
-%total = pto.vcadd %values, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-// total[0] contains the sum
-
-// Prefix sum for cumulative distribution
-%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Detailed Notes
-
-```c
-dst[0] = src[0];
-for (int i = 1; i < N; i++)
-    dst[i] = dst[i-1] + src[i];
-```
-
-**Example:**
-```c
-// input:  [1, 2, 3, 4, 5, ...]
-// output: [1, 3, 6, 10, 15, ...]
-```
-
-## Typical Usage
-
-```mlir
-// Softmax: find max for numerical stability
-%max_vec = pto.vcmax %logits, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-// max is in lane 0, broadcast it
-%max_broadcast = pto.vlds %ub_tmp[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-
-// Row-wise sum using vcgadd (for 8-row tile)
-%row_sums = pto.vcgadd %tile, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-// Results at indices 0, 8, 16, 24, 32, 40, 48, 56
-
-// Full vector sum for normalization
-%total = pto.vcadd %values, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-// total[0] contains the sum
-
-// Prefix sum for cumulative distribution
-%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Reduction Ops](../../reduction-ops.md)
-- Previous op in family: [pto.vcgmin](./vcgmin.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcpadd_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcpadd_zh.md
deleted file mode 100644
index 6735b060..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/reduction-ops/vcpadd_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcpadd
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vcpadd.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md
deleted file mode 100644
index 23fe7e90..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md` -->
-
-# pto.vaddrelu
-
-Standalone reference page for `pto.vaddrelu`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Fused add + ReLU.
-
-## Mechanism
-
-`pto.vaddrelu` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-%result = pto.vaddrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%lhs` and `%rhs` are the two addends.
-
-## Expected Outputs
-
-`%result` is the fused add-then-ReLU result.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Floating-point element types only on the
-  current documented surface.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = max(src0[i] + src1[i], 0);
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = max(src0[i] + src1[i], 0);
-```
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vexpdiff](./vexpdiff.md)
-- Next op in family: [pto.vsubrelu](./vsubrelu.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu_zh.md
deleted file mode 100644
index 1bbe0978..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vaddrelu
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vaddrelu.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md
deleted file mode 100644
index 3797932d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md
+++ /dev/null
@@ -1,78 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md` -->
-
-# pto.vaddreluconv
-
-Standalone reference page for `pto.vaddreluconv`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Fused add + ReLU + type conversion (HW fusion).
-
-## Mechanism
-
-`pto.vaddreluconv` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-%result = pto.vaddreluconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>
-```
-
-## Inputs
-
-`%lhs` and `%rhs` are the source vectors.
-
-## Expected Outputs
-
-`%result` is the fused add/ReLU/convert result.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only backend-supported source/destination
-  type pairs are legal. Rounding, saturation, and packing rules follow the
-  semantics of this fused operation, not an arbitrary sequence of standalone
-  ops.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-// f32→f16 variant:
-for (int i = 0; i < 64; i++)
-    dst_f16[i] = f32_to_f16(max(src0_f32[i] + src1_f32[i], 0));
-
-// f16→i8 variant:
-for (int i = 0; i < 128; i++)
-    dst_i8[i] = f16_to_i8(max(src0_f16[i] + src1_f16[i], 0));
-```
-
-## Detailed Notes
-
-```c
-// f32→f16 variant:
-for (int i = 0; i < 64; i++)
-    dst_f16[i] = f32_to_f16(max(src0_f32[i] + src1_f32[i], 0));
-
-// f16→i8 variant:
-for (int i = 0; i < 128; i++)
-    dst_i8[i] = f16_to_i8(max(src0_f16[i] + src1_f16[i], 0));
-```
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vaxpy](./vaxpy.md)
-- Next op in family: [pto.vmulconv](./vmulconv.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv_zh.md
deleted file mode 100644
index 1309e771..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vaddreluconv
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vaddreluconv.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md
deleted file mode 100644
index 13617440..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md` -->
-
-# pto.vaxpy
-
-Standalone reference page for `pto.vaxpy`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-AXPY — scalar-vector multiply-add.
-
-## Mechanism
-
-`pto.vaxpy` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-%result = pto.vaxpy %src0, %src1, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT>, T -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%src0` is the scaled vector, `%src1` is the addend vector, and
-  `%alpha` is the scalar multiplier.
-
-## Expected Outputs
-
-`%result` is the fused AXPY result.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Floating-point element types only on the
-  current documented surface.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = alpha * src0[i] + src1[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = alpha * src0[i] + src1[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vsubrelu](./vsubrelu.md)
-- Next op in family: [pto.vaddreluconv](./vaddreluconv.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy_zh.md
deleted file mode 100644
index f36fc851..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vaxpy
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vaxpy.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md
deleted file mode 100644
index 6ef2c186..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md
+++ /dev/null
@@ -1,74 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md` -->
-
-# pto.vexpdiff
-
-Standalone reference page for `pto.vexpdiff`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Fused exp(x - max) for numerically stable softmax.
-
-## Mechanism
-
-`pto.vexpdiff` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-%result = pto.vexpdiff %input, %max : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%max` is the broadcasted
-  subtraction term.
-
-## Expected Outputs
-
-`%result` is the fused `exp(input - max)` vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Floating-point element types only.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Target-defined numeric exceptional behavior, such as divide-by-zero or out-of-domain inputs, remains subject to the selected backend profile unless this page narrows it further.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = expf(src[i] - max[i]);
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = expf(src[i] - max[i]);
-```
-
-**Use case:** Softmax numerator computation with numerical stability.
-
-## Fused Compute+Convert Ops
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vprelu](./vprelu.md)
-- Next op in family: [pto.vaddrelu](./vaddrelu.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff_zh.md
deleted file mode 100644
index 940ae817..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vexpdiff
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vexpdiff.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmrgsort.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmrgsort.md
deleted file mode 100644
index ffcdc9b9..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmrgsort.md
+++ /dev/null
@@ -1,99 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vmrgsort.md` -->
-
-# pto.vmrgsort
-
-Standalone reference page for `pto.vmrgsort`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Merge-sort 4 pre-sorted input vectors.
-
-## Mechanism
-
-`pto.vmrgsort` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64, i64
-```
-
-## Inputs
-
-`%dest` is the UB destination, `%src0..%src3` are the four
-  pre-sorted UB inputs, `%count` is the number of valid elements, and `%config`
-  is the operation control word.
-
-## Expected Outputs
-
-This op writes UB memory and returns no SSA value.
-
-## Side Effects
-
-This operation writes its merged output to the UB destination `%dest` and produces no SSA result. Beyond that write, it does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Inputs MUST already be sorted according to
-  the sort order encoded by `%config`. This page uses the shorter mnemonic
-  `pto.vmrgsort`, while the current implementation summary still refers to
-  `pto.vmrgsort4`.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-// Softmax with fused expdiff
-%max_broadcast = pto.vlds %ub_max[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-%exp_stable = pto.vexpdiff %logits, %max_broadcast : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>
-
-// Leaky ReLU activation
-%activated = pto.vlrelu %linear_out, %alpha_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Fused residual add + ReLU
-%residual = pto.vaddrelu %conv_out, %skip_connection : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>
-
-// Generate indices for argsort
-%indices = pto.vci %c0 {order = "ASC"} : i32 -> !pto.vreg<64xi32>
-```
-
-## Detailed Notes
-
-## Current Implementation Surface Summary
-
-- `pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>, !pto.vreg<NxT>`
-- `pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- `pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>`
-- `pto.vbitsort %dest, %src, %indices, %repeat_times : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, index`
-- `pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, i64, i64`
-
-## Typical Usage
-
-```mlir
-// Softmax with fused expdiff
-%max_broadcast = pto.vlds %ub_max[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-%exp_stable = pto.vexpdiff %logits, %max_broadcast : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>
-
-// Leaky ReLU activation
-%activated = pto.vlrelu %linear_out, %alpha_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Fused residual add + ReLU
-%residual = pto.vaddrelu %conv_out, %skip_connection : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>
-
-// Generate indices for argsort
-%indices = pto.vci %c0 {order = "ASC"} : i32 -> !pto.vreg<64xi32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vsort32](./vsort32.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmrgsort_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmrgsort_zh.md
deleted file mode 100644
index 4d006cbb..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmrgsort_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmrgsort
-
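-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmrgsort.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.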
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmula.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmula.md
deleted file mode 100644
index a01271c4..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmula.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vmula.md` -->
-
-# pto.vmula
-
-Standalone reference page for `pto.vmula`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Multiply-accumulate.
-
-## Mechanism
-
-`pto.vmula` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-%result = pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%acc` is the accumulator input, `%lhs` and `%rhs` are the
-  multiplicands, and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the multiply-accumulate result.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-`pto.vmula` is a fused multiply-accumulate
-  operation and is not always interchangeable with separate `vmul` plus `vadd`.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    if (mask[i])
-        dst[i] = acc[i] + lhs[i] * rhs[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    if (mask[i])
-        dst[i] = acc[i] + lhs[i] * rhs[i];
-```
-
-## Index Generation
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vmull](./vmull.md)
-- Next op in family: [pto.vtranspose](./vtranspose.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmula_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmula_zh.md
deleted file mode 100644
index 996a63ef..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmula_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmula
-
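-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmula.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.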
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv.md
deleted file mode 100644
index 1df2d8e6..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv.md` -->
-
-# pto.vmulconv
-
-Standalone reference page for `pto.vmulconv`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Fused mul + type conversion (HW fusion).
-
-## Mechanism
-
-`pto.vmulconv` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-%result = pto.vmulconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>
-```
-
-## Inputs
-
-`%lhs` and `%rhs` are the source vectors.
-
-## Expected Outputs
-
-`%result` is the fused mul/convert result.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only backend-supported source/destination
-  type pairs are legal.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-// f16→i8 variant:
-for (int i = 0; i < 128; i++)
-    dst_i8[i] = f16_to_i8(src0_f16[i] * src1_f16[i]);
-```
-
-## Detailed Notes
-
-```c
-// f16→i8 variant:
-for (int i = 0; i < 128; i++)
-    dst_i8[i] = f16_to_i8(src0_f16[i] * src1_f16[i]);
-```
-
-## Extended Arithmetic
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vaddreluconv](./vaddreluconv.md)
-- Next op in family: [pto.vmull](./vmull.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv_zh.md
deleted file mode 100644
index 04478e09..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmulconv
-
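-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmulconv.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.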
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmull.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmull.md
deleted file mode 100644
index aa97c7c3..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmull.md
+++ /dev/null
@@ -1,76 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vmull.md` -->
-
-# pto.vmull
-
-Standalone reference page for `pto.vmull`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Widening multiply with high/low results.
-
-## Mechanism
-
-`pto.vmull` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-%low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>, !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i32/u32 (native 32×32→64 widening multiply)`.
-
-## Inputs
-
-`%lhs` and `%rhs` are the source vectors and `%mask` selects
-  active lanes.
-
-## Expected Outputs
-
-`%low` and `%high` expose the widened-product low/high parts.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-The current documented A5 form is the native
-  widening 32x32->64 integer multiply family.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i32/u32 (native 32×32→64 widening multiply)`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < 64; i++) {
-    int64_t r = (int64_t)src0_i32[i] * (int64_t)src1_i32[i];
-    dst_lo[i] = (int32_t)(r & 0xFFFFFFFF);
-    dst_hi[i] = (int32_t)(r >> 32);
-}
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < 64; i++) {
-    int64_t r = (int64_t)src0_i32[i] * (int64_t)src1_i32[i];
-    dst_lo[i] = (int32_t)(r & 0xFFFFFFFF);
-    dst_hi[i] = (int32_t)(r >> 32);
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vmulconv](./vmulconv.md)
-- Next op in family: [pto.vmula](./vmula.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmull_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmull_zh.md
deleted file mode 100644
index 54e1678c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vmull_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmull
-
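-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmull.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.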
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md
deleted file mode 100644
index 6bf87457..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md` -->
-
-# pto.vprelu
-
-Standalone reference page for `pto.vprelu`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Parametric ReLU with per-element alpha vector.
-
-## Mechanism
-
-`pto.vprelu` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-%result = pto.vprelu %input, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%input` is the activation vector and `%alpha` is the per-element
-  slope vector.
-
-## Expected Outputs
-
-`%result` is the parametric-ReLU vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Floating-point element types only on the
-  current A5 surface.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] >= 0) ? src[i] : alpha[i] * src[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] >= 0) ? src[i] : alpha[i] * src[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Next op in family: [pto.vexpdiff](./vexpdiff.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu_zh.md
deleted file mode 100644
index 6ed1b524..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vprelu
-
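-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vprelu.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.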
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsort32.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsort32.md
deleted file mode 100644
index 6d2a733c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsort32.md
+++ /dev/null
@@ -1,63 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vsort32.md` -->
-
-# pto.vsort32
-
-Standalone reference page for `pto.vsort32`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Sort 32 elements in UB.
-
-## Mechanism
-
-`pto.vsort32` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-pto.vsort32 %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
-```
-
-## Inputs
-
-`%dest` and `%src` are UB pointers and `%config` is the ISA
-  control/config word.
-
-## Expected Outputs
-
-This op writes UB memory and returns no SSA value.
-
-## Side Effects
-
-This operation writes the destination UB buffer and produces no SSA result; beyond that write it has no architectural side effect. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is a UB-to-UB accelerator helper, not a
-  pure vector-register op.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vsort32 %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vtranspose](./vtranspose.md)
-- Next op in family: [pto.vmrgsort](./vmrgsort.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsort32_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsort32_zh.md
deleted file mode 100644
index 99236047..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsort32_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsort32
-
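-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vsort32.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.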
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md
deleted file mode 100644
index f957b60e..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md` -->
-
-# pto.vsubrelu
-
-Standalone reference page for `pto.vsubrelu`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Fused sub + ReLU.
-
-## Mechanism
-
-`pto.vsubrelu` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-%result = pto.vsubrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%lhs` is the minuend and `%rhs` is the subtrahend.
-
-## Expected Outputs
-
-`%result` is the fused sub-then-ReLU result.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Floating-point element types only on the
-  current documented surface.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = max(src0[i] - src1[i], 0);
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = max(src0[i] - src1[i], 0);
-```
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vaddrelu](./vaddrelu.md)
-- Next op in family: [pto.vaxpy](./vaxpy.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu_zh.md
deleted file mode 100644
index 211d4de6..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsubrelu
-
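-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vsubrelu.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.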
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vtranspose.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vtranspose.md
deleted file mode 100644
index 0016213d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vtranspose.md
+++ /dev/null
@@ -1,66 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/sfu-and-dsa-ops/vtranspose.md` -->
-
-# pto.vtranspose
-
-Standalone reference page for `pto.vtranspose`. This page belongs to the [SFU And DSA Ops](../../sfu-and-dsa-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-UB-to-UB transpose operation (not vreg-to-vreg).
-
-## Mechanism
-
-`pto.vtranspose` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the family can be reasoned about at the ISA level.
-
-## Syntax
-
-```mlir
-pto.vtranspose %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
-```
-
-## Inputs
-
-`%dest` and `%src` are UB pointers and `%config` is the ISA
-  control/config word.
-
-## Expected Outputs
-
-This op writes UB memory and returns no SSA value.
-
-## Side Effects
-
-This operation writes the destination UB buffer and produces no SSA result; beyond that write it has no architectural side effect. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is not a `vreg -> vreg` op even though
-  it lives in the `pto.v*` namespace. Its correctness depends on the control
-  word and UB layout contract.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vtranspose %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
-```
-
-## Detailed Notes
-
-**Note:** This operates on UB memory directly, not on vector registers.
-
-## Sorting Operations
-
-## Related Ops / Family Links
-
-- Family overview: [SFU And DSA Ops](../../sfu-and-dsa-ops.md)
-- Previous op in family: [pto.vmula](./vmula.md)
-- Next op in family: [pto.vsort32](./vsort32.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vtranspose_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vtranspose_zh.md
deleted file mode 100644
index 84e141a2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/sfu-and-dsa-ops/vtranspose_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vtranspose
-
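-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vtranspose.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.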
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vabs.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vabs.md
deleted file mode 100644
index 485c0ba6..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vabs.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vabs.md` -->
-
-# pto.vabs
-
-Standalone reference page for `pto.vabs`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` receives the lane-wise absolute values.
-
-## Mechanism
-
-`pto.vabs` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vabs %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i8-i32, f16, f32`.
-
-## Inputs
-
-`%input` supplies the source lanes and `%mask` selects which lanes
-  participate.
-
-## Expected Outputs
-
-`%result` receives the lane-wise absolute values.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Source and result types MUST match. Integer
-  overflow on the most-negative signed value follows the target-defined
-  behavior.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i8-i32, f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] < 0) ? -src[i] : src[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] < 0) ? -src[i] : src[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Next op in family: [pto.vneg](./vneg.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vabs_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vabs_zh.md
deleted file mode 100644
index 745ee458..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vabs_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vabs
-
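-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vabs.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.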
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vbcnt.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vbcnt.md
deleted file mode 100644
index 7ac1a538..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vbcnt.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vbcnt.md` -->
-
-# pto.vbcnt
-
-Standalone reference page for `pto.vbcnt`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds the population count for each active lane.
-
-## Mechanism
-
-`pto.vbcnt` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vbcnt %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `all integer types`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` holds the population count for each active lane.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only. The count is
-  over the source element width, not over the full vector register.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `all integer types`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = __builtin_popcount(src[i]);
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = __builtin_popcount(src[i]);
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vnot](./vnot.md)
-- Next op in family: [pto.vcls](./vcls.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vbcnt_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vbcnt_zh.md
deleted file mode 100644
index 4cf46d54..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vbcnt_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vbcnt
-
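-This is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vbcnt.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. Clicking this page from the Chinese navigation keeps you on the Chinese path; for full details, use the English page linked above or an existing Chinese reference page.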
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vcls.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vcls.md
deleted file mode 100644
index 7b5d840c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vcls.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vcls.md` -->
-
-# pto.vcls
-
-Standalone reference page for `pto.vcls`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds the leading-sign-bit count per active lane.
-
-## Mechanism
-
-`pto.vcls` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vcls %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `all integer types`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` holds the leading-sign-bit count per active lane.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only. This operation is
-  sign-aware, so signed interpretation matters.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `all integer types`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = count_leading_sign_bits(src[i]);
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = count_leading_sign_bits(src[i]);
-```
-
-## Movement
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vbcnt](./vbcnt.md)
-- Next op in family: [pto.vmov](./vmov.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vcls_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vcls_zh.md
deleted file mode 100644
index c3606202..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vcls_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vcls
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vcls.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vexp.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vexp.md
deleted file mode 100644
index e801d397..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vexp.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vexp.md` -->
-
-# pto.vexp
-
-Standalone reference page for `pto.vexp`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds `exp(input[i])` per active lane.
-
-## Mechanism
-
-`pto.vexp` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vexp %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` holds `exp(input[i])` per active lane.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only floating-point element types are legal.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = expf(src[i]);
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = expf(src[i]);
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vneg](./vneg.md)
-- Next op in family: [pto.vln](./vln.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vexp_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vexp_zh.md
deleted file mode 100644
index 613e989c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vexp_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vexp
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vexp.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vln.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vln.md
deleted file mode 100644
index d46e88b0..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vln.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vln.md` -->
-
-# pto.vln
-
-Standalone reference page for `pto.vln`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds the natural logarithm per active lane.
-
-## Mechanism
-
-`pto.vln` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vln %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` holds the natural logarithm per active lane.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only floating-point element types are legal.
-  For real-number semantics, active inputs SHOULD be strictly positive; non-
-  positive inputs follow the target's exception/NaN rules.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Target-defined numeric exceptional behavior, such as divide-by-zero or out-of-domain inputs, remains subject to the selected backend profile unless this page narrows it further.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = logf(src[i]);
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = logf(src[i]);
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vexp](./vexp.md)
-- Next op in family: [pto.vsqrt](./vsqrt.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vln_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vln_zh.md
deleted file mode 100644
index 85cc55a4..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vln_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vln
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vln.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vmov.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vmov.md
deleted file mode 100644
index c1c0ef65..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vmov.md
+++ /dev/null
@@ -1,91 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vmov.md` -->
-
-# pto.vmov
-
-Standalone reference page for `pto.vmov`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Vector register copy.
-
-## Mechanism
-
-`pto.vmov` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vmov %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is a copy of the source vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Predicated `pto.vmov` behaves like a masked
-  copy, while the unpredicated form behaves like a full-register copy.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i];
-```
-
-```mlir
-// Softmax numerator: exp(x - max)
-%sub = pto.vsub %x, %max_broadcast, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Reciprocal for division
-%sum_rcp = pto.vrec %sum, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// ReLU activation
-%activated = pto.vrelu %linear_out, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i];
-```
-
-## Typical Usage
-
-```mlir
-// Softmax numerator: exp(x - max)
-%sub = pto.vsub %x, %max_broadcast, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Reciprocal for division
-%sum_rcp = pto.vrec %sum, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// ReLU activation
-%activated = pto.vrelu %linear_out, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vcls](./vcls.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vmov_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vmov_zh.md
deleted file mode 100644
index 442f892a..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vmov_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmov
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vmov.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vneg.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vneg.md
deleted file mode 100644
index 214c50a2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vneg.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vneg.md` -->
-
-# pto.vneg
-
-Standalone reference page for `pto.vneg`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise arithmetic negation.
-
-## Mechanism
-
-`pto.vneg` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vneg %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `i8-i32, f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the lane-wise arithmetic negation.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Source and result types MUST match.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `i8-i32, f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = -src[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = -src[i];
-```
-
-## Transcendental
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vabs](./vabs.md)
-- Next op in family: [pto.vexp](./vexp.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vneg_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vneg_zh.md
deleted file mode 100644
index 2928ce5d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vneg_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vneg
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vneg.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vnot.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vnot.md
deleted file mode 100644
index 2b889f07..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vnot.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vnot.md` -->
-
-# pto.vnot
-
-Standalone reference page for `pto.vnot`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds the lane-wise bitwise inversion.
-
-## Mechanism
-
-`pto.vnot` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vnot %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `all integer types`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` holds the lane-wise bitwise inversion.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `all integer types`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = ~src[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = ~src[i];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vrelu](./vrelu.md)
-- Next op in family: [pto.vbcnt](./vbcnt.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vnot_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vnot_zh.md
deleted file mode 100644
index 6c2de47d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vnot_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vnot
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vnot.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrec.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrec.md
deleted file mode 100644
index 9ad9f51e..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrec.md
+++ /dev/null
@@ -1,73 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vrec.md` -->
-
-# pto.vrec
-
-Standalone reference page for `pto.vrec`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds the reciprocal per active lane.
-
-## Mechanism
-
-`pto.vrec` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vrec %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` holds the reciprocal per active lane.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only floating-point element types are legal.
-  Active inputs containing `+0` or `-0` follow the target's divide-style
-  exceptional behavior.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Target-defined numeric exceptional behavior, such as divide-by-zero or out-of-domain inputs, remains subject to the selected backend profile unless this page narrows it further.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = 1.0f / src[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = 1.0f / src[i];
-```
-
-## Activation
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vrsqrt](./vrsqrt.md)
-- Next op in family: [pto.vrelu](./vrelu.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrec_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrec_zh.md
deleted file mode 100644
index 292fadd2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrec_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vrec
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vrec.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrelu.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrelu.md
deleted file mode 100644
index f28cc74f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrelu.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vrelu.md` -->
-
-# pto.vrelu
-
-Standalone reference page for `pto.vrelu`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds `max(input[i], 0)` per active lane.
-
-## Mechanism
-
-`pto.vrelu` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vrelu %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` holds `max(input[i], 0)` per active lane.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only floating-point element types are legal
-  on the current A5 surface described here.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] > 0) ? src[i] : 0;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] > 0) ? src[i] : 0;
-```
-
-## Bitwise
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vrec](./vrec.md)
-- Next op in family: [pto.vnot](./vnot.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrelu_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrelu_zh.md
deleted file mode 100644
index d578012a..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrelu_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vrelu
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vrelu.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrsqrt.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrsqrt.md
deleted file mode 100644
index b3cb474d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrsqrt.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vrsqrt.md` -->
-
-# pto.vrsqrt
-
-Standalone reference page for `pto.vrsqrt`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds reciprocal-square-root values per active lane.
-
-## Mechanism
-
-`pto.vrsqrt` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vrsqrt %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` holds reciprocal-square-root values per active lane.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only floating-point element types are legal.
-  Active inputs containing `+0` or `-0` follow the target's divide-style
-  exceptional behavior.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Target-defined numeric exceptional behavior, such as divide-by-zero or out-of-domain inputs, remains subject to the selected backend profile unless this page narrows it further.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = 1.0f / sqrtf(src[i]);
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = 1.0f / sqrtf(src[i]);
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vsqrt](./vsqrt.md)
-- Next op in family: [pto.vrec](./vrec.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrsqrt_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrsqrt_zh.md
deleted file mode 100644
index 00bb9b87..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vrsqrt_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vrsqrt
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vrsqrt.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vsqrt.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vsqrt.md
deleted file mode 100644
index 825e2594..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vsqrt.md
+++ /dev/null
@@ -1,70 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/unary-vector-ops/vsqrt.md` -->
-
-# pto.vsqrt
-
-Standalone reference page for `pto.vsqrt`. This page belongs to the [Unary Vector Ops](../../unary-vector-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` holds the square root per active lane.
-
-## Mechanism
-
-`pto.vsqrt` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vsqrt %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>
-```
-
-Documented A5 types or forms: `f16, f32`.
-
-## Inputs
-
-`%input` is the source vector and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` holds the square root per active lane.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only floating-point element types are legal.
-  Negative active inputs follow the target's exception/NaN rules.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Target-defined numeric exceptional behavior, such as divide-by-zero or out-of-domain inputs, remains subject to the selected backend profile unless this page narrows it further.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- Documented A5 coverage: `f16, f32`.
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = sqrtf(src[i]);
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = sqrtf(src[i]);
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Unary Vector Ops](../../unary-vector-ops.md)
-- Previous op in family: [pto.vln](./vln.md)
-- Next op in family: [pto.vrsqrt](./vrsqrt.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vsqrt_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vsqrt_zh.md
deleted file mode 100644
index 1e7ebf1a..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/unary-vector-ops/vsqrt_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsqrt
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vsqrt.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vaddcs.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vaddcs.md
deleted file mode 100644
index f1268848..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vaddcs.md
+++ /dev/null
@@ -1,75 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vaddcs.md` -->
-
-# pto.vaddcs
-
-Standalone reference page for `pto.vaddcs`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Add with carry-in and carry-out.
-
-## Mechanism
-
-`pto.vaddcs` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask, !pto.mask -> !pto.vreg<NxT>, !pto.mask
-```
-
-## Inputs
-
-`%lhs` and `%rhs` are the value vectors, `%carry_in` is the
-  incoming carry predicate, and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the arithmetic result and `%carry` is the carry-out
-  predicate.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is the scalar-extended carry-chain
-  family. Treat it as an unsigned integer operation unless the verifier states a
-  wider legal domain.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++) {
-    uint64_t r = (uint64_t)src0[i] + src1[i] + carry_in[i];
-    dst[i] = (T)r;
-    carry_out[i] = (r >> bitwidth);
-}
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++) {
-    uint64_t r = (uint64_t)src0[i] + src1[i] + carry_in[i];
-    dst[i] = (T)r;
-    carry_out[i] = (r >> bitwidth);
-}
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vlrelu](./vlrelu.md)
-- Next op in family: [pto.vsubcs](./vsubcs.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vaddcs_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vaddcs_zh.md
deleted file mode 100644
index 2cc026c2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vaddcs_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vaddcs
-
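-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vaddcs.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.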
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vadds.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vadds.md
deleted file mode 100644
index 58ddec33..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vadds.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vadds.md` -->
-
-# pto.vadds
-
-Standalone reference page for `pto.vadds`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise sum.
-
-## Mechanism
-
-`pto.vadds` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vadds %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input` is the source vector, `%scalar` is broadcast logically to
-  each active lane, and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the lane-wise sum.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Inactive lanes follow the predication
-  behavior defined for this family. On the current surface, inactive lanes are
-  treated as zeroing lanes.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] + scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] + scalar;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Next op in family: [pto.vsubs](./vsubs.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vadds_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vadds_zh.md
deleted file mode 100644
index 6b8db6ee..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vadds_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vadds
-
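-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vadds.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.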
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vands.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vands.md
deleted file mode 100644
index 9f792fc0..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vands.md
+++ /dev/null
@@ -1,65 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vands.md` -->
-
-# pto.vands
-
-Standalone reference page for `pto.vands`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise bitwise AND.
-
-## Mechanism
-
-`pto.vands` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vands %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input`, `%scalar`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` is the lane-wise bitwise AND.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] & scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] & scalar;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vmins](./vmins.md)
-- Next op in family: [pto.vors](./vors.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vands_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vands_zh.md
deleted file mode 100644
index ffe50397..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vands_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vands
-
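-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vands.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.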
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vlrelu.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vlrelu.md
deleted file mode 100644
index 48dfbc72..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vlrelu.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vlrelu.md` -->
-
-# pto.vlrelu
-
-Standalone reference page for `pto.vlrelu`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise leaky-ReLU result.
-
-## Mechanism
-
-`pto.vlrelu` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input` is the activation vector, `%scalar` is the leaky slope,
-  and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the lane-wise leaky-ReLU result.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Only `f16` and `f32` forms are currently
-  documented for `pto.vlrelu`.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] >= 0) ? src[i] : scalar * src[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] >= 0) ? src[i] : scalar * src[i];
-```
-
-## Carry Operations
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vshrs](./vshrs.md)
-- Next op in family: [pto.vaddcs](./vaddcs.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vlrelu_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vlrelu_zh.md
deleted file mode 100644
index 374a3c07..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vlrelu_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vlrelu
-
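-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vlrelu.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.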
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmaxs.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmaxs.md
deleted file mode 100644
index ccd6be5c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmaxs.md
+++ /dev/null
@@ -1,65 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vmaxs.md` -->
-
-# pto.vmaxs
-
-Standalone reference page for `pto.vmaxs`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise maximum.
-
-## Mechanism
-
-`pto.vmaxs` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input`, `%scalar`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` is the lane-wise maximum.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Input and result types MUST match.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] > scalar) ? src[i] : scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] > scalar) ? src[i] : scalar;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vmuls](./vmuls.md)
-- Next op in family: [pto.vmins](./vmins.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmaxs_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmaxs_zh.md
deleted file mode 100644
index 01680ded..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmaxs_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmaxs
-
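-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmaxs.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.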
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmins.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmins.md
deleted file mode 100644
index 3f855430..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmins.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vmins.md` -->
-
-# pto.vmins
-
-Standalone reference page for `pto.vmins`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise minimum.
-
-## Mechanism
-
-`pto.vmins` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vmins %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input`, `%scalar`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` is the lane-wise minimum.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Input and result types MUST match.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] < scalar) ? src[i] : scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] < scalar) ? src[i] : scalar;
-```
-
-## Bitwise
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vmaxs](./vmaxs.md)
-- Next op in family: [pto.vands](./vands.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmins_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmins_zh.md
deleted file mode 100644
index 5924392b..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmins_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmins
-
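-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmins.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.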
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmuls.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmuls.md
deleted file mode 100644
index 3de710a4..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmuls.md
+++ /dev/null
@@ -1,66 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vmuls.md` -->
-
-# pto.vmuls
-
-Standalone reference page for `pto.vmuls`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise product.
-
-## Mechanism
-
-`pto.vmuls` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vmuls %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input`, `%scalar`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` is the lane-wise product.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Supported element types are hardware-family
-  specific; the current PTO ISA vector surface documentation covers the common numeric cases.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] * scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] * scalar;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vsubs](./vsubs.md)
-- Next op in family: [pto.vmaxs](./vmaxs.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmuls_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmuls_zh.md
deleted file mode 100644
index 2442e8c8..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vmuls_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vmuls
-
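-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vmuls.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.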
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vors.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vors.md
deleted file mode 100644
index 9bfcfd15..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vors.md
+++ /dev/null
@@ -1,65 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vors.md` -->
-
-# pto.vors
-
-Standalone reference page for `pto.vors`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise bitwise OR.
-
-## Mechanism
-
-`pto.vors` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vors %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input`, `%scalar`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` is the lane-wise bitwise OR.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] | scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] | scalar;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vands](./vands.md)
-- Next op in family: [pto.vxors](./vxors.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vors_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vors_zh.md
deleted file mode 100644
index 98c1d036..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vors_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vors
-
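-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vors.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.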
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshls.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshls.md
deleted file mode 100644
index 94d99c15..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshls.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vshls.md` -->
-
-# pto.vshls
-
-Standalone reference page for `pto.vshls`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the shifted vector.
-
-## Mechanism
-
-`pto.vshls` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vshls %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input` is the value vector, `%scalar` is the uniform shift
-  amount, and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the shifted vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only. The shift amount
-  SHOULD stay within the source element width.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] << scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] << scalar;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vxors](./vxors.md)
-- Next op in family: [pto.vshrs](./vshrs.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshls_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshls_zh.md
deleted file mode 100644
index b5078ada..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshls_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vshls
-
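-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vshls.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.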
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshrs.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshrs.md
deleted file mode 100644
index a66bd0c9..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshrs.md
+++ /dev/null
@@ -1,66 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vshrs.md` -->
-
-# pto.vshrs
-
-Standalone reference page for `pto.vshrs`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the shifted vector.
-
-## Mechanism
-
-`pto.vshrs` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vshrs %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input` is the value vector, `%scalar` is the uniform shift
-  amount, and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the shifted vector.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] >> scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] >> scalar;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vshls](./vshls.md)
-- Next op in family: [pto.vlrelu](./vlrelu.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshrs_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshrs_zh.md
deleted file mode 100644
index 6de342f6..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vshrs_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vshrs
-
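-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vshrs.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for the PTO ISA has been expanded, but the corresponding Chinese content has not yet been fully filled in under that structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.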
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubcs.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubcs.md
deleted file mode 100644
index 379d0d88..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubcs.md
+++ /dev/null
@@ -1,103 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vsubcs.md` -->
-
-# pto.vsubcs
-
-Standalone reference page for `pto.vsubcs`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-Subtract with borrow-in and borrow-out.
-
-## Mechanism
-
-`pto.vsubcs` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask, !pto.mask -> !pto.vreg<NxT>, !pto.mask
-```
-
-## Inputs
-
-`%lhs` and `%rhs` are the value vectors, `%borrow_in` is the
-  incoming borrow predicate, and `%mask` selects active lanes.
-
-## Expected Outputs
-
-`%result` is the arithmetic result and `%borrow` is the
-  borrow-out predicate.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-This is the scalar-extended borrow-chain
-  family and SHOULD be treated as an unsigned integer operation.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++) {
-    dst[i] = src0[i] - src1[i] - borrow_in[i];
-    borrow_out[i] = (src0[i] < src1[i] + borrow_in[i]);
-}
-```
-
-```mlir
-// Add bias to all elements
-%biased = pto.vadds %activation, %bias_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Scale by constant
-%scaled = pto.vmuls %input, %scale, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Clamp to [0, 255] for uint8 quantization
-%clamped_low = pto.vmaxs %input, %c0, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-%clamped = pto.vmins %clamped_low, %c255, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Shift right by fixed amount
-%shifted = pto.vshrs %data, %c4, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32>
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++) {
-    dst[i] = src0[i] - src1[i] - borrow_in[i];
-    borrow_out[i] = (src0[i] < src1[i] + borrow_in[i]);
-}
-```
-
-## Typical Usage
-
-```mlir
-// Add bias to all elements
-%biased = pto.vadds %activation, %bias_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Scale by constant
-%scaled = pto.vmuls %input, %scale, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Clamp to [0, 255] for uint8 quantization
-%clamped_low = pto.vmaxs %input, %c0, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-%clamped = pto.vmins %clamped_low, %c255, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Shift right by fixed amount
-%shifted = pto.vshrs %data, %c4, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vaddcs](./vaddcs.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubcs_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubcs_zh.md
deleted file mode 100644
index abf5e99c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubcs_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsubcs
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vsubcs.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubs.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubs.md
deleted file mode 100644
index c28fba50..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubs.md
+++ /dev/null
@@ -1,66 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vsubs.md` -->
-
-# pto.vsubs
-
-Standalone reference page for `pto.vsubs`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise difference.
-
-## Mechanism
-
-`pto.vsubs` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vsubs %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input`, `%scalar`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` is the lane-wise difference.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer or floating-point legality depends on
-  the selected type family in lowering.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] - scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] - scalar;
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vadds](./vadds.md)
-- Next op in family: [pto.vmuls](./vmuls.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubs_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubs_zh.md
deleted file mode 100644
index 2ef3bb24..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vsubs_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsubs
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vsubs.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vxors.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vxors.md
deleted file mode 100644
index 51b043b9..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vxors.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vec-scalar-ops/vxors.md` -->
-
-# pto.vxors
-
-Standalone reference page for `pto.vxors`. This page belongs to the [Vec Scalar Ops](../../vec-scalar-ops.md) family in the PTO ISA manual.
-
-## Summary
-
-`%result` is the lane-wise bitwise XOR.
-
-## Mechanism
-
-`pto.vxors` is a `pto.v*` compute operation. It applies its semantics to active lanes, obeys the family operand model, and returns its results in vector-register or mask form.
-
-## Syntax
-
-```mlir
-%result = pto.vxors %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%input`, `%scalar`, and `%mask` as above.
-
-## Expected Outputs
-
-`%result` is the lane-wise bitwise XOR.
-
-## Side Effects
-
-This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences unless the form says so.
-
-## Constraints
-
-Integer element types only.
-
-## Exceptions
-
-- The verifier rejects illegal operand shapes, unsupported element types, and attribute combinations that are not valid for the selected family or target profile.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] ^ scalar;
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] ^ scalar;
-```
-
-## Shift
-
-## Related Ops / Family Links
-
-- Family overview: [Vec Scalar Ops](../../vec-scalar-ops.md)
-- Previous op in family: [pto.vors](./vors.md)
-- Next op in family: [pto.vshls](./vshls.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vxors_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vxors_zh.md
deleted file mode 100644
index 0796ebe5..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vec-scalar-ops/vxors_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vxors
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vxors.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2-bc.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2-bc.md
deleted file mode 100644
index d12d8aac..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2-bc.md
+++ /dev/null
@@ -1,65 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vgather2-bc.md` -->
-
-# pto.vgather2_bc
-
-Standalone reference page for `pto.vgather2_bc`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Gather with broadcast, conditioned by mask.
-
-## Mechanism
-
-`pto.vgather2_bc` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr<T, ub>, !pto.vreg<NxI>, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%source` is the UB base pointer, `%offsets` contains gather indices, and
-  `%mask` gates which lanes participate.
-
-## Expected Outputs
-
-`%result` is the gathered vector.
-
-## Side Effects
-
-This operation reads UB-visible storage and returns SSA results. It does not by itself allocate buffers, signal events, or establish a fence.
-
-## Constraints
-
-This is a backward-compatible family. Masked-off lanes do not participate in
-  address coalescing and do not trigger address overflow exceptions; their
-  destination lanes are zero-filled.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr<T, ub>, !pto.vreg<NxI>, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Detailed Notes
-
-## Contiguous Stores
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vgatherb](./vgatherb.md)
-- Next op in family: [pto.vsts](./vsts.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2-bc_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2-bc_zh.md
deleted file mode 100644
index af9d2ed4..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2-bc_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vgather2_bc
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vgather2-bc.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2.md
deleted file mode 100644
index 9f201b6b..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vgather2.md` -->
-
-# pto.vgather2
-
-Standalone reference page for `pto.vgather2`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Indexed gather from UB.
-
-## Mechanism
-
-`pto.vgather2` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%result = pto.vgather2 %source, %offsets, %active_lanes : !pto.ptr<T, ub>, !pto.vreg<NxI>, index -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%source` is the UB base pointer, `%offsets` provides per-lane element
-  offsets, and `%active_lanes` bounds how many lanes participate.
-
-## Expected Outputs
-
-`%result` is the gathered vector.
-
-## Side Effects
-
-This operation reads UB-visible storage and returns SSA results. It does not by itself allocate buffers, signal events, or establish a fence.
-
-## Constraints
-
-Only the first `%active_lanes` indices participate. The index element width
-  and interpretation MUST match the selected gather form, and each effective
-  address must satisfy that form's alignment rules.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < active_lanes; i++)
-    dst[i] = UB[base + offsets[i] * sizeof(T)];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < active_lanes; i++)
-    dst[i] = UB[base + offsets[i] * sizeof(T)];
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vsldb](./vsldb.md)
-- Next op in family: [pto.vgatherb](./vgatherb.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2_zh.md
deleted file mode 100644
index 4e9209ea..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgather2_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vgather2
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vgather2.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgatherb.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgatherb.md
deleted file mode 100644
index 51841ee2..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgatherb.md
+++ /dev/null
@@ -1,69 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vgatherb.md` -->
-
-# pto.vgatherb
-
-Standalone reference page for `pto.vgatherb`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Byte-granularity indexed gather from UB.
-
-## Mechanism
-
-`pto.vgatherb` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%result = pto.vgatherb %source, %offsets, %active_lanes : !pto.ptr<T, ub>, !pto.vreg<NxI>, index -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%source` is the UB base pointer, `%offsets` contains per-block byte offsets,
-  and `%active_lanes` bounds the number of active gathered blocks.
-
-## Expected Outputs
-
-`%result` is the gathered vector.
-
-## Side Effects
-
-This operation reads UB-visible storage and returns SSA results. It does not by itself allocate buffers, signal events, or establish a fence.
-
-## Constraints
-
-This is a block gather, not a byte-per-lane gather. `%source` MUST be 32-byte
-  aligned, each participating offset MUST describe a 32-byte-aligned block, and
-  inactive blocks are zero-filled.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < active_lanes; i++)
-    dst[i] = UB[base + offsets[i]];  // byte-addressed
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < active_lanes; i++)
-    dst[i] = UB[base + offsets[i]];  // byte-addressed
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vgather2](./vgather2.md)
-- Next op in family: [pto.vgather2_bc](./vgather2-bc.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgatherb_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgatherb_zh.md
deleted file mode 100644
index 56e3201e..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vgatherb_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vgatherb
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vgatherb.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldas.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldas.md
deleted file mode 100644
index 587b37e7..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldas.md
+++ /dev/null
@@ -1,65 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vldas.md` -->
-
-# pto.vldas
-
-Standalone reference page for `pto.vldas`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Prime alignment buffer for subsequent unaligned load.
-
-## Mechanism
-
-`pto.vldas` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%result = pto.vldas %source : !pto.ptr<T, ub> -> !pto.align
-```
-
-## Inputs
-
-`%source` is the UB address whose surrounding aligned block seeds the load
-  alignment state.
-
-## Expected Outputs
-
-`%result` is the initialized load-alignment state.
-
-## Side Effects
-
-This operation reads UB-visible storage and returns SSA results. It does not by itself allocate buffers, signal events, or establish a fence.
-
-## Constraints
-
-This op is the required leading operation for a `pto.vldus` stream using the
-  same alignment state. The source address itself need not be 32-byte aligned;
-  hardware truncates it to the aligned block boundary for the priming load.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%result = pto.vldas %source : !pto.ptr<T, ub> -> !pto.align
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vlds](./vlds.md)
-- Next op in family: [pto.vldus](./vldus.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldas_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldas_zh.md
deleted file mode 100644
index 25737ffa..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldas_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vldas
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vldas.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vlds.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vlds.md
deleted file mode 100644
index 68ba313b..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vlds.md
+++ /dev/null
@@ -1,92 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vlds.md` -->
-
-# pto.vlds
-
-Standalone reference page for `pto.vlds`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Vector load with distribution mode.
-
-## Mechanism
-
-`pto.vlds` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%result = pto.vlds %source[%offset] {dist = "DIST"} : !pto.ptr<T, ub> -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%source` is the UB base address, `%offset` is the load displacement, and
-  `DIST` selects the distribution mode.
-
-## Expected Outputs
-
-`%result` is the loaded vector register value.
-
-## Side Effects
-
-This operation reads UB-visible storage and returns SSA results. It does not by itself allocate buffers, signal events, or establish a fence.
-
-## Constraints
-
-The effective address MUST satisfy the alignment rule of the selected
-  distribution mode. `NORM` reads one full vector footprint. Broadcast,
-  upsample, downsample, unpack, split-channel, and deinterleave modes change
-  how memory bytes are mapped into destination lanes, but they do not change the
-  fact that the source is UB memory.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%v = pto.vlds %ub[%offset] {dist = "NORM"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-```
-
-```mlir
-%v = pto.vlds %ub[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-```
-
-## Detailed Notes
-
-**Distribution modes:**
-
-| Mode | Description | C Semantics |
-|------|-------------|-------------|
-| `NORM` | Contiguous 256B load | `dst[i] = UB[base + i * sizeof(T)]` |
-| `BRC_B8/B16/B32` | Broadcast single element | `dst[i] = UB[base]` for all i |
-| `US_B8/B16` | Upsample (duplicate each element) | `dst[2*i] = dst[2*i+1] = UB[base + i]` |
-| `DS_B8/B16` | Downsample (every 2nd element) | `dst[i] = UB[base + 2*i]` |
-| `UNPK_B8/B16/B32` | Unpack (zero-extend to wider type) | `dst_i32[i] = (uint32_t)UB_i16[base + 2*i]` |
-| `SPLT4CHN_B8` | Split 4-channel (RGBA → R plane) | Extract every 4th byte |
-| `SPLT2CHN_B8/B16` | Split 2-channel | Extract every 2nd element |
-| `DINTLV_B32` | Deinterleave 32-bit | Even elements only |
-| `BLK` | Block load | Blocked access pattern |
-
-**Example — Contiguous load:**
-```mlir
-%v = pto.vlds %ub[%offset] {dist = "NORM"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-```
-
-**Example — Broadcast scalar to all lanes:**
-```mlir
-%v = pto.vlds %ub[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-```
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Next op in family: [pto.vldas](./vldas.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vlds_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vlds_zh.md
deleted file mode 100644
index a02337d4..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vlds_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vlds
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vlds.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldus.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldus.md
deleted file mode 100644
index 6fbc0e64..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldus.md
+++ /dev/null
@@ -1,74 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vldus.md` -->
-
-# pto.vldus
-
-Standalone reference page for `pto.vldus`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Unaligned load using primed align state.
-
-## Mechanism
-
-`pto.vldus` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%result, %align_out, %base_out = pto.vldus %source, %align : !pto.ptr<T, ub>, !pto.align -> !pto.vreg<NxT>, !pto.align, !pto.ptr<T, ub>
-```
-
-## Inputs
-
-`%source` is the current UB address and `%align` is the incoming load
-  alignment state primed by `pto.vldas` or a prior `pto.vldus`.
-
-## Expected Outputs
-
-`%result` is the assembled vector value, `%align_out` is the updated alignment
-  state, and `%base_out` is the post-update base pointer state exposed in SSA
-  form.
-
-## Side Effects
-
-This operation reads UB-visible storage and returns SSA results. It does not by itself allocate buffers, signal events, or establish a fence.
-
-## Constraints
-
-A matching `pto.vldas` MUST appear before the first dependent `pto.vldus`
-  stream in the same vector loop. Both the alignment state and the base address
-  advance across the stream, and the PTO ISA vector surface representation exposes those updates as SSA results.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%align = pto.vldas %ub : !pto.ptr<f32, ub> -> !pto.align
-%vec, %align2, %ub2 = pto.vldus %ub, %align : !pto.ptr<f32, ub>, !pto.align -> !pto.vreg<64xf32>, !pto.align, !pto.ptr<f32, ub>
-```
-
-## Detailed Notes
-
-**Unaligned load pattern:**
-```mlir
-%align = pto.vldas %ub : !pto.ptr<f32, ub> -> !pto.align
-%vec, %align2, %ub2 = pto.vldus %ub, %align : !pto.ptr<f32, ub>, !pto.align -> !pto.vreg<64xf32>, !pto.align, !pto.ptr<f32, ub>
-```
-
-## Dual Loads (Deinterleave)
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vldas](./vldas.md)
-- Next op in family: [pto.vldx2](./vldx2.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldus_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldus_zh.md
deleted file mode 100644
index 78f9345f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldus_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vldus
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vldus.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldx2.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldx2.md
deleted file mode 100644
index 1a299d02..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldx2.md
+++ /dev/null
@@ -1,87 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vldx2.md` -->
-
-# pto.vldx2
-
-Standalone reference page for `pto.vldx2`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Dual load with deinterleave (AoS → SoA conversion).
-
-## Mechanism
-
-`pto.vldx2` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%low, %high = pto.vldx2 %source[%offset], "DIST" : !pto.ptr<T, ub>, index -> !pto.vreg<NxT>, !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%source` is the UB base pointer, `%offset` is the displacement, and `DIST`
-  selects a dual-load/deinterleave layout.
-
-## Expected Outputs
-
-`%low` and `%high` are the two destination vectors.
-
-## Side Effects
-
-This operation reads UB-visible storage and returns SSA results. It does not by itself allocate buffers, signal events, or establish a fence.
-
-## Constraints
-
-This family is only legal for interleave/deinterleave style distributions.
-  The two outputs form an ordered pair, and that pairing MUST be preserved.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-// DINTLV_B32: deinterleave 32-bit elements
-for (int i = 0; i < 64; i++) {
-    low[i]  = UB[base + 8*i];       // even elements
-    high[i] = UB[base + 8*i + 4];   // odd elements
-}
-```
-
-```mlir
-%x, %y = pto.vldx2 %ub[%offset], "DINTLV_B32" : !pto.ptr<f32, ub>, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32>
-```
-
-## Detailed Notes
-
-**Distribution modes:** `DINTLV_B8`, `DINTLV_B16`, `DINTLV_B32`, `BDINTLV`
-
-```c
-// DINTLV_B32: deinterleave 32-bit elements
-for (int i = 0; i < 64; i++) {
-    low[i]  = UB[base + 8*i];       // even elements
-    high[i] = UB[base + 8*i + 4];   // odd elements
-}
-```
-
-**Example — Load interleaved XY pairs into separate X/Y vectors:**
-```mlir
-%x, %y = pto.vldx2 %ub[%offset], "DINTLV_B32" : !pto.ptr<f32, ub>, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32>
-```
-
-## Strided Loads
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vldus](./vldus.md)
-- Next op in family: [pto.vsld](./vsld.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldx2_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldx2_zh.md
deleted file mode 100644
index 36390dbd..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vldx2_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vldx2
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vldx2.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vscatter.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vscatter.md
deleted file mode 100644
index 81dfab39..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vscatter.md
+++ /dev/null
@@ -1,73 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vscatter.md` -->
-
-# pto.vscatter
-
-Standalone reference page for `pto.vscatter`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Indexed scatter to UB.
-
-## Mechanism
-
-`pto.vscatter` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-pto.vscatter %value, %dest, %offsets, %active_lanes : !pto.vreg<NxT>, !pto.ptr<T, ub>, !pto.vreg<NxI>, index
-```
-
-## Inputs
-
-`%value` is the source vector, `%dest` is the UB base pointer, `%offsets`
-  provides per-lane or per-block indices, and `%active_lanes` bounds the active
-  requests.
-
-## Expected Outputs
-
-This op writes UB memory and returns no SSA value.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-Only `b8`, `b16`, and `b32` element sizes are supported. The index vector
-  must use a supported integer element type and layout for this family.
-  Each computed address MUST be element-aligned. If two or more indices alias,
-  only one write is guaranteed and the winning lane is implementation-defined.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-for (int i = 0; i < active_lanes; i++)
-    UB[base + offsets[i] * sizeof(T)] = src[i];
-```
-
-## Detailed Notes
-
-```c
-for (int i = 0; i < active_lanes; i++)
-    UB[base + offsets[i] * sizeof(T)] = src[i];
-```
-
-## Alignment State Stores
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vsstb](./vsstb.md)
-- Next op in family: [pto.vsta](./vsta.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vscatter_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vscatter_zh.md
deleted file mode 100644
index 1b0f15ee..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vscatter_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vscatter
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vscatter.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsld.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsld.md
deleted file mode 100644
index 02f7f26f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsld.md
+++ /dev/null
@@ -1,64 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vsld.md` -->
-
-# pto.vsld
-
-Standalone reference page for `pto.vsld`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Strided load with fixed stride pattern.
-
-## Mechanism
-
-`pto.vsld` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%result = pto.vsld %source[%offset], "STRIDE" : !pto.ptr<T, ub> -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%source` is the UB base pointer and `%offset` is the displacement encoded
-  with the selected fixed stride mode.
-
-## Expected Outputs
-
-`%result` is the loaded vector.
-
-## Side Effects
-
-This operation reads UB-visible storage and returns SSA results. It does not by itself allocate buffers, signal events, or establish a fence.
-
-## Constraints
-
-This is a deprecated compatibility family. The selected stride token
-  determines which sub-elements are read from each source block.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%result = pto.vsld %source[%offset], "STRIDE" : !pto.ptr<T, ub> -> !pto.vreg<NxT>
-```
-
-## Detailed Notes
-
-**Stride modes:** `STRIDE_S3_B16`, `STRIDE_S4_B64`, `STRIDE_S8_B32`, `STRIDE_S2_B64`
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vldx2](./vldx2.md)
-- Next op in family: [pto.vsldb](./vsldb.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsld_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsld_zh.md
deleted file mode 100644
index 874be287..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsld_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsld
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vsld.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsldb.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsldb.md
deleted file mode 100644
index 3cca958c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsldb.md
+++ /dev/null
@@ -1,65 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vsldb.md` -->
-
-# pto.vsldb
-
-Standalone reference page for `pto.vsldb`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Block-strided load for 2D tile access.
-
-## Mechanism
-
-`pto.vsldb` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%result = pto.vsldb %source, %offset, %mask : !pto.ptr<T, ub>, i32, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Inputs
-
-`%source` is the UB base pointer, `%offset` is the packed stride/control word,
-  and `%mask` controls which blocks participate.
-
-## Expected Outputs
-
-`%result` is the loaded vector.
-
-## Side Effects
-
-This operation reads UB-visible storage and returns SSA results. It does not by itself allocate buffers, signal events, or establish a fence.
-
-## Constraints
-
-`%offset` is not a plain byte displacement; it encodes the block stride and
-  repeat pattern. If a block is masked off, the corresponding destination block
-  is zeroed and MUST NOT raise an address overflow exception for that block.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%result = pto.vsldb %source, %offset, %mask : !pto.ptr<T, ub>, i32, !pto.mask -> !pto.vreg<NxT>
-```
-
-## Detailed Notes
-
-## Gather (Indexed) Loads
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vsld](./vsld.md)
-- Next op in family: [pto.vgather2](./vgather2.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsldb_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsldb_zh.md
deleted file mode 100644
index 387002d3..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsldb_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsldb
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vsldb.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsst.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsst.md
deleted file mode 100644
index 3387b2df..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsst.md
+++ /dev/null
@@ -1,64 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vsst.md` -->
-
-# pto.vsst
-
-Standalone reference page for `pto.vsst`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Strided store with fixed stride pattern.
-
-## Mechanism
-
-`pto.vsst` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-pto.vsst %value, %dest[%offset], "STRIDE" : !pto.vreg<NxT>, !pto.ptr<T, ub>
-```
-
-## Inputs
-
-`%value` is the source vector, `%dest` is the UB base pointer, and `%offset`
-  / `STRIDE` select the fixed strided layout.
-
-## Expected Outputs
-
-This op writes UB memory and returns no SSA value.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-This is a deprecated compatibility family. The stride token, not the vector
-  lane number alone, determines which destination elements are written.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vsst %value, %dest[%offset], "STRIDE" : !pto.vreg<NxT>, !pto.ptr<T, ub>
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vstx2](./vstx2.md)
-- Next op in family: [pto.vsstb](./vsstb.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsst_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsst_zh.md
deleted file mode 100644
index 3ead8037..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsst_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsst
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vsst.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsstb.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsstb.md
deleted file mode 100644
index c3061025..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsstb.md
+++ /dev/null
@@ -1,64 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vsstb.md` -->
-
-# pto.vsstb
-
-Standalone reference page for `pto.vsstb`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Block-strided store for 2D tile access.
-
-## Mechanism
-
-`pto.vsstb` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-pto.vsstb %value, %dest, %offset, %mask : !pto.vreg<NxT>, !pto.ptr<T, ub>, i32, !pto.mask
-```
-
-## Inputs
-
-`%value` is the source vector, `%dest` is the UB base pointer, `%offset` is
-  the packed stride/control word, and `%mask` controls block participation.
-
-## Expected Outputs
-
-This op writes UB memory and returns no SSA value.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-`%offset` is a control word, not a plain byte displacement. This is a
-  deprecated compatibility family kept for surface coverage.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vsstb %value, %dest, %offset, %mask : !pto.vreg<NxT>, !pto.ptr<T, ub>, i32, !pto.mask
-```
-
-## Detailed Notes
-
-## Scatter (Indexed) Stores
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vsst](./vsst.md)
-- Next op in family: [pto.vscatter](./vscatter.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsstb_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsstb_zh.md
deleted file mode 100644
index 191e0e00..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsstb_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsstb
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vsstb.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsta.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsta.md
deleted file mode 100644
index ddcdbd44..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsta.md
+++ /dev/null
@@ -1,65 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vsta.md` -->
-
-# pto.vsta
-
-Standalone reference page for `pto.vsta`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Flush alignment state to memory.
-
-## Mechanism
-
-`pto.vsta` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-pto.vsta %value, %dest[%offset] : !pto.align, !pto.ptr<T, ub>, index
-```
-
-## Inputs
-
-`%value` is the pending store-alignment state, `%dest` is the UB base pointer,
-  and `%offset` is the flush displacement.
-
-## Expected Outputs
-
-This op writes buffered tail bytes to UB and returns no SSA value.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-The flush address MUST match the post-updated address expected by the
-  preceding unaligned-store stream. After the flush, the corresponding store
-  alignment state is consumed.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vsta %value, %dest[%offset] : !pto.align, !pto.ptr<T, ub>, index
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vscatter](./vscatter.md)
-- Next op in family: [pto.vstas](./vstas.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsta_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsta_zh.md
deleted file mode 100644
index 306ce600..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsta_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsta
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vsta.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstar.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstar.md
deleted file mode 100644
index 4b87e3cb..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstar.md
+++ /dev/null
@@ -1,68 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vstar.md` -->
-
-# pto.vstar
-
-Standalone reference page for `pto.vstar`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Flush remaining alignment state.
-
-## Mechanism
-
-`pto.vstar` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-pto.vstar %value, %dest : !pto.align, !pto.ptr<T, ub>
-```
-
-## Inputs
-
-`%value` is the pending alignment/buffer state that still needs to be emitted,
-  and `%dest` is the UB destination base pointer.
-
-## Expected Outputs
-
-No SSA result. The effect is a memory-side flush that writes the remaining
-  buffered bytes to memory.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-This op terminates an unaligned-store sequence. It MUST be paired with a
-  compatible prior state-producing store sequence so that the pending tail state
-  is well-defined.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vstar %value, %dest : !pto.align, !pto.ptr<T, ub>
-```
-
-## Detailed Notes
-
-## Stateful Store Ops
-
-These ops make reference-updated state explicit as SSA results.
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vstas](./vstas.md)
-- Next op in family: [pto.vstu](./vstu.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstar_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstar_zh.md
deleted file mode 100644
index e2fa56ab..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstar_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vstar
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vstar.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstas.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstas.md
deleted file mode 100644
index 800240c8..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstas.md
+++ /dev/null
@@ -1,64 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vstas.md` -->
-
-# pto.vstas
-
-Standalone reference page for `pto.vstas`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Scalar-register-offset form of alignment-state flush.
-
-## Mechanism
-
-`pto.vstas` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-pto.vstas %value, %dest, %offset : !pto.align, !pto.ptr<T, ub>, i32
-```
-
-## Inputs
-
-`%value` is the pending store-alignment state, `%dest` is the UB base
-  pointer, and `%offset` is the scalar-register style displacement.
-
-## Expected Outputs
-
-This op writes buffered tail bytes to UB and returns no SSA value.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-This family uses the same buffered-tail semantics as `pto.vsta` but keeps the
-  scalar-offset form explicit.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vstas %value, %dest, %offset : !pto.align, !pto.ptr<T, ub>, i32
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vsta](./vsta.md)
-- Next op in family: [pto.vstar](./vstar.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstas_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstas_zh.md
deleted file mode 100644
index 9a88f791..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstas_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vstas
-
-本页为自动生成的中文入口页，用于保证中文导航保持在中文路径下。
-
-## 当前状态
-
-- [对应英文页面](vstas.md)
-- [中文手册入口](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [中文 ISA 指令参考入口](../../../README_zh.md)
-
-## 说明
-
-当前 PTO ISA 的新英文手册结构已经展开，但对应的中文正文尚未完全按新结构补齐。在中文导航中点击本页时，你仍然停留在中文路径下；若需要完整细节，请使用上面的英文页面或已有中文参考页。
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsts.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsts.md
deleted file mode 100644
index 64a0c8cb..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsts.md
+++ /dev/null
@@ -1,81 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vsts.md` -->
-
-# pto.vsts
-
-Standalone reference page for `pto.vsts`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Vector store with distribution mode.
-
-## Mechanism
-
-`pto.vsts` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg<NxT>, !pto.ptr<T, ub>, !pto.mask
-```
-
-## Inputs
-
-`%value` is the source vector, `%dest` is the UB base pointer, `%offset` is
-  the displacement, `%mask` selects the active lanes or sub-elements, and
-  `DIST` selects the store distribution.
-
-## Expected Outputs
-
-This op has no SSA result; it writes to UB memory.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-The effective destination address MUST satisfy the alignment rule of the
-  selected store mode. Narrowing/packing modes may only preserve a subset of the
-  source bits. Merge-channel modes reinterpret the source vector as channel
-  planes and interleave them on store.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-```
-
-## Detailed Notes
-
-**Distribution modes:**
-
-| Mode | Description | C Semantics |
-|------|-------------|-------------|
-| `NORM_B8/B16/B32` | Contiguous store | `UB[base + i] = src[i]` |
-| `PK_B16/B32` | Pack/narrowing store | `UB_i16[base + 2*i] = truncate_16(src_i32[i])` |
-| `MRG4CHN_B8` | Merge 4 channels (R,G,B,A → RGBA) | Interleave 4 planes |
-| `MRG2CHN_B8/B16` | Merge 2 channels | Interleave 2 planes |
-
-**Example — Contiguous store:**
-```mlir
-pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-```
-
-## Dual Stores (Interleave)
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vgather2_bc](./vgather2-bc.md)
-- Next op in family: [pto.vstx2](./vstx2.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsts_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsts_zh.md
deleted file mode 100644
index 0c64a914..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vsts_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vsts
-
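-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vsts.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.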
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstu.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstu.md
deleted file mode 100644
index 8d2c9114..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstu.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vstu.md` -->
-
-# pto.vstu
-
-Standalone reference page for `pto.vstu`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Unaligned store with align + offset state update.
-
-## Mechanism
-
-`pto.vstu` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%align_out, %offset_out = pto.vstu %align_in, %offset_in, %value, %base, "MODE" : !pto.align, index, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align, index
-```
-
-## Inputs
-
-`%align_in` is the incoming store-alignment state, `%offset_in` is the current
-  logical byte/element displacement, `%value` is the vector being stored, and
-  `%base` is the UB base pointer.
-
-## Expected Outputs
-
-`%align_out` is the updated alignment/tail state and `%offset_out` is the
-  next offset after applying the selected post-update rule.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-The alignment state MUST be threaded in program order. A terminating flush
-  form such as `pto.vstar`/`pto.vstas` is still required to commit the buffered
-  tail bytes.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%align_out, %offset_out = pto.vstu %align_in, %offset_in, %value, %base, "MODE" : !pto.align, index, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align, index
-```
-
-## Detailed Notes
-
-**Mode tokens:** `POST_UPDATE`, `NO_POST_UPDATE`
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vstar](./vstar.md)
-- Next op in family: [pto.vstus](./vstus.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstu_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstu_zh.md
deleted file mode 100644
index ce6f818d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstu_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vstu
-
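-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vstu.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.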
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstur.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstur.md
deleted file mode 100644
index bfac74a9..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstur.md
+++ /dev/null
@@ -1,64 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vstur.md` -->
-
-# pto.vstur
-
-Standalone reference page for `pto.vstur`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Unaligned store with residual flush and state update.
-
-## Mechanism
-
-`pto.vstur` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%align_out = pto.vstur %align_in, %value, %base, "MODE" : !pto.align, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align
-```
-
-## Inputs
-
-`%align_in` is the incoming store-alignment state, `%value` is the vector to
-  store, and `%base` is the UB base pointer.
-
-## Expected Outputs
-
-`%align_out` is the updated residual state after the current partial store.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-This form exposes only the evolving state; it does not by itself guarantee
-  that all buffered bytes have reached memory. A compatible final flush is still
-  required unless the surrounding sequence is known to be complete.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%align_out = pto.vstur %align_in, %value, %base, "MODE" : !pto.align, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vstus](./vstus.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstur_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstur_zh.md
deleted file mode 100644
index 2d0ede02..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstur_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vstur
-
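-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vstur.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.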
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstus.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstus.md
deleted file mode 100644
index 0dd01697..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstus.md
+++ /dev/null
@@ -1,67 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vstus.md` -->
-
-# pto.vstus
-
-Standalone reference page for `pto.vstus`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Unaligned store with scalar offset and state update.
-
-## Mechanism
-
-`pto.vstus` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-%align_out, %base_out = pto.vstus %align_in, %offset, %value, %base, "MODE" : !pto.align, i32, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align, !pto.ptr<T, ub>
-```
-
-## Inputs
-
-`%align_in` is the incoming store-alignment state, `%offset` is the scalar
-  displacement, `%value` is the vector being stored, and `%base` is the UB base
-  pointer.
-
-## Expected Outputs
-
-`%align_out` is the updated buffered-tail state and `%base_out` is the next
-  base pointer when the lowering chooses a post-update form.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-This is the scalar-offset stateful form of the unaligned store family. The
-  scalar offset width and update mode MUST match the selected form, and a later
-  flush op is still required.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```mlir
-%align_out, %base_out = pto.vstus %align_in, %offset, %value, %base, "MODE" : !pto.align, i32, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align, !pto.ptr<T, ub>
-```
-
-## Detailed Notes
-
-The family overview carries the remaining shared rules for this operation.
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vstu](./vstu.md)
-- Next op in family: [pto.vstur](./vstur.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstus_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstus_zh.md
deleted file mode 100644
index 38c8e50b..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstus_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vstus
-
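-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vstus.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.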
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstx2.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstx2.md
deleted file mode 100644
index 46b827ec..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstx2.md
+++ /dev/null
@@ -1,80 +0,0 @@
-<!-- Generated from `docs/isa/vector/ops/vector-load-store/vstx2.md` -->
-
-# pto.vstx2
-
-Standalone reference page for `pto.vstx2`. This page belongs to the [Vector Load Store](../../vector-load-store.md) family in the PTO ISA manual.
-
-## Summary
-
-Dual interleaved store (SoA → AoS conversion).
-
-## Mechanism
-
-`pto.vstx2` is part of the PTO vector memory/data-movement surface. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering.
-
-## Syntax
-
-```mlir
-pto.vstx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.ptr<T, ub>, index, !pto.mask
-```
-
-## Inputs
-
-`%low` and `%high` are the two source vectors, `%dest` is the UB base pointer,
-  `%offset` is the displacement, `DIST` selects the interleave layout, and
-  `%mask` gates the participating elements.
-
-## Expected Outputs
-
-This op has no SSA result; it writes an interleaved stream to UB.
-
-## Side Effects
-
-This operation writes UB-visible memory and/or updates streamed alignment state. Stateful unaligned forms expose their evolving state in SSA form, but a trailing flush form may still be required to complete the stream.
-
-## Constraints
-
-This family is only legal for interleave distributions. The two source
-  vectors form an ordered pair, and the interleave semantics of that pair MUST
-  be preserved.
-
-## Exceptions
-
-- It is illegal to use addresses outside the required UB-visible space or to violate the alignment/distribution contract of the selected form.
-- Masked-off lanes or inactive blocks do not make an otherwise-illegal address valid unless the operation text explicitly says so.
-- Any additional illegality stated in the constraints section is also part of the contract.
-
-## Target-Profile Restrictions
-
-- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
-- Code that depends on a family-specific type list, distribution mode, or fused form should treat that dependency as target-profile-specific unless the PTO manual states cross-target portability explicitly.
-
-## Examples
-
-```c
-// INTLV_B32:
-for (int i = 0; i < 64; i++) {
-    UB[base + 8*i]     = low[i];
-    UB[base + 8*i + 4] = high[i];
-}
-```
-
-## Detailed Notes
-
-**Distribution modes:** `INTLV_B8`, `INTLV_B16`, `INTLV_B32`
-
-```c
-// INTLV_B32:
-for (int i = 0; i < 64; i++) {
-    UB[base + 8*i]     = low[i];
-    UB[base + 8*i + 4] = high[i];
-}
-```
-
-## Strided Stores
-
-## Related Ops / Family Links
-
-- Family overview: [Vector Load Store](../../vector-load-store.md)
-- Previous op in family: [pto.vsts](./vsts.md)
-- Next op in family: [pto.vsst](./vsst.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstx2_zh.md b/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstx2_zh.md
deleted file mode 100644
index 5fb73c29..00000000
--- a/docs/mkdocs/src/docs/isa/vector/ops/vector-load-store/vstx2_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# pto.vstx2
-
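-This page is an auto-generated Chinese entry page that keeps Chinese navigation under the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vstx2.md)
-- [Chinese manual entry](../../../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../../../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation, you remain under the Chinese path; for full details, use the English page above or the existing Chinese reference pages.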
diff --git a/docs/mkdocs/src/docs/isa/vector/pipeline-sync.md b/docs/mkdocs/src/docs/isa/vector/pipeline-sync.md
deleted file mode 100644
index ad70d17a..00000000
--- a/docs/mkdocs/src/docs/isa/vector/pipeline-sync.md
+++ /dev/null
@@ -1,466 +0,0 @@
-<!-- Generated from `docs/isa/vector/pipeline-sync.md` -->
-
-# Vector Families: Pipeline Sync
-
-This page documents the `pto.v*` synchronization families inside PTO ISA. The operation forms below describe the vector-pipe contract and the current A5-oriented target-profile details that backends must preserve when lowering legal PTO programs.
-
-> **Category:** Synchronization primitives for coordinating pipeline execution
-> **Pipelines:** MTE2 (GM→UB), PIPE_V (Vector), MTE3 (UB→GM)
-
-The PTO ISA vector surface model operates on the A5's **Decoupled Access-Execute** architecture. The MTE and Vector pipelines run asynchronously, requiring explicit synchronization to prevent data hazards.
-
----
-
-## Intra-Core Pipeline Sync
-
-These ops coordinate data flow between pipelines within a single vector core.
-
-### `pto.set_flag`
-
-- **syntax:** `pto.set_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]`
-- **semantics:** Signal event from source pipe to destination pipe.
-
-```c
-set_flag(src_pipe, dst_pipe, event_id);
-```
-
-**Example:** After MTE2 completes GM→UB transfer, signal Vector pipe:
-```mlir
-pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-```
-
----
-
-### `pto.wait_flag`
-
-- **syntax:** `pto.wait_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]`
-- **semantics:** Block destination pipe until source pipe signals event.
-
-```c
-wait_flag(src_pipe, dst_pipe, event_id);
-```
-
-**Example:** Vector pipe waits for MTE2 data to arrive:
-```mlir
-pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-```
-
----
-
-### `pto.pipe_barrier`
-
-- **syntax:** `pto.pipe_barrier "PIPE_*"`
-- **semantics:** Drain all pending ops in the specified pipe. All previously issued operations on that pipe complete before any subsequent operation begins.
-
-```c
-pipe_barrier(pipe);
-```
-
-**Pipe identifiers:** `PIPE_MTE2`, `PIPE_V`, `PIPE_MTE3`
-
-**Example:** Two back-to-back `copy_ubuf_to_gm` calls writing to the same GM address. Without a barrier, MTE3 may reorder them and the final GM value is non-deterministic:
-
-```mlir
-// Both stores target the same GM address — order matters!
-pto.copy_ubuf_to_gm %ub_partial_0, %gm_result, ...
-// Without pipe_barrier, MTE3 could execute the second copy before the first
-// completes, producing a non-deterministic result at %gm_result.
-pto.pipe_barrier "PIPE_MTE3"
-// After barrier: first copy is guaranteed complete. Second copy overwrites deterministically.
-pto.copy_ubuf_to_gm %ub_partial_1, %gm_result, ...
-```
-
----
-
-### `pto.get_buf`
-
-- **syntax:** `pto.get_buf "PIPE_*", %buf_id, %mode : i64, i64`
-- **semantics:** Acquire buffer slot for inter-pipeline double-buffering coordination.
-
-```c
-get_buf(pipe, buf_id, mode);
-```
-
----
-
-### `pto.rls_buf`
-
-- **syntax:** `pto.rls_buf "PIPE_*", %buf_id, %mode : i64, i64`
-- **semantics:** Release buffer slot to allow other pipeline to proceed.
-
-```c
-rls_buf(pipe, buf_id, mode);
-```
-
----
-
-### `pto.mem_bar`
-
-- **syntax:** `pto.mem_bar "BARRIER_TYPE"`
-- **semantics:** Intra-vector-pipe memory fence within `__VEC_SCOPE__`. Required when UB addresses alias between vector load/store operations.
-
-```c
-mem_bar(barrier_type);
-```
-
-**Barrier types:**
-
-| Type | Semantics |
-|------|-----------|
-| `VV_ALL` | All prior vector ops complete before any subsequent vector op begins |
-| `VST_VLD` | All prior vector stores visible before subsequent loads |
-| `VLD_VST` | All prior vector loads complete before subsequent stores |
-
-**Example:** Ensure stores are visible before loads to same UB region:
-```mlir
-pto.vsts %v0, %ub[%c0], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-pto.mem_bar "VST_VLD"
-%v1 = pto.vlds %ub[%c0] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-```
-
----
-
-## Intra-Core Sync Patterns & Examples
-
-### Example 1: `set_flag` / `wait_flag` (Explicit Events)
-
-Each cross-pipeline data dependency requires an explicit signal/wait pair. The programmer must manually insert `set_flag` after the producer and `wait_flag` before the consumer.
-
-```mlir
-// ─── Stage 1: MTE2 loads data from GM into UB ───
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
-
-// MTE2 signals: "UB data is ready for Vector pipe"
-pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-
-// ─── Stage 2: Vector pipe consumes UB data ───
-// Vector waits until MTE2's signal arrives
-pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
-
-scf.for %dummy = %c0 to %c1 step %c1 {
-  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-} {llvm.loop.aivector_scope}
-
-// Vector signals: "UB output is ready for MTE3"
-pto.set_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
-
-// ─── Stage 3: MTE3 stores result from UB back to GM ───
-// MTE3 waits until Vector's signal arrives
-pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
-
-pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
-```
-
-**Key property:** Every cross-pipeline edge is an explicit `(set_flag, wait_flag)` pair. Simple for straight-line code, but gets verbose in loops (see Example 3).
-
----
-
-### Example 2: `get_buf` / `rls_buf` (Resource-Based)
-
-Instead of naming events, each pipeline declares when it **acquires** (`get_buf`) and **releases** (`rls_buf`) a shared UB buffer. Cross-pipeline RAW/WAR dependencies are resolved implicitly by program order — if MTE2 releases `buf_A` and Vector later acquires `buf_A`, the hardware ensures the acquire cannot proceed until the release completes.
-
-```mlir
-// ─── Stage 1: MTE2 loads data into UB ───
-// MTE2 acquires ub_ptr — blocks if Vector hasn't released it from a prior iteration
-pto.get_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64
-pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
-// MTE2 done writing ub_ptr — release it so Vector can consume
-pto.rls_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64
-
-// ─── Stage 2: Vector computation ───
-// Vector acquires ub_ptr (input) — blocks until MTE2 releases it (RAW: MTE2 write → V read)
-pto.get_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
-// Vector acquires ub_out (output) — blocks until MTE3 releases it from a prior iteration (WAR: MTE3 read → V write)
-pto.get_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64
-
-scf.for %dummy = %c0 to %c1 step %c1 {
-  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-} {llvm.loop.aivector_scope}
-
-// Vector done reading ub_ptr — release so MTE2 can reuse it in next iteration
-pto.rls_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
-// Vector done writing ub_out — release so MTE3 can consume
-pto.rls_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64
-
-// ─── Stage 3: MTE3 stores result to GM ───
-// MTE3 acquires ub_out — blocks until Vector releases it (RAW: V write → MTE3 read)
-pto.get_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64
-pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
-// MTE3 done reading ub_out — release so Vector can reuse it in next iteration
-pto.rls_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64
-```
-
-**Key property:** No event IDs needed. Dependencies are implicit from program order of `get_buf`/`rls_buf` on the same buffer ID. This becomes much more convenient in multi-iteration loops (see Example 3).
-
----
-
-### Example 3: Ping/Pong Double-Buffering Loop
-
-Double-buffering overlaps DMA and compute by using two UB buffers alternately. All three stages (MTE2, Vector, MTE3) appear in the **same iteration** — the hardware pipelines them across iterations because different iterations operate on different buffers (`buf[i%2]`).
-
-#### Event ID scheme (`set_flag` / `wait_flag`)
-
-With 2 ping/pong buffers and 2 pipeline pairs (MTE2↔V, V↔MTE3), `set_flag`/`wait_flag` needs **8 event IDs** = 2 pipe-pairs × 2 buffers × (forward + reverse):
-
-**MTE2 ↔ Vector (input buffers):**
-
-| Event ID | Direction | Purpose |
-|----------|-----------|---------|
-| `EVT_IN_FWD_0` | MTE2 → V | RAW: buf_in[0] data ready |
-| `EVT_IN_FWD_1` | MTE2 → V | RAW: buf_in[1] data ready |
-| `EVT_IN_REV_0` | V → MTE2 | WAR: Vector done reading buf_in[0] |
-| `EVT_IN_REV_1` | V → MTE2 | WAR: Vector done reading buf_in[1] |
-
-**Vector ↔ MTE3 (output buffers):**
-
-| Event ID | Direction | Purpose |
-|----------|-----------|---------|
-| `EVT_OUT_FWD_0` | V → MTE3 | RAW: buf_out[0] result ready |
-| `EVT_OUT_FWD_1` | V → MTE3 | RAW: buf_out[1] result ready |
-| `EVT_OUT_REV_0` | MTE3 → V | WAR: MTE3 done reading buf_out[0] |
-| `EVT_OUT_REV_1` | MTE3 → V | WAR: MTE3 done reading buf_out[1] |
-
-#### 3a. `set_flag` / `wait_flag` version
-
-```mlir
-// ═══ Pre-loop: prime ALL reverse-dependency signals ═══
-// Both input and output buffers start unused. We must pre-send
-// reverse-dep signals so the first iteration's wait_flags don't deadlock.
-pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_0"]   // ◀ PRIME: buf_in[0] "free"
-pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_1"]   // ◀ PRIME: buf_in[1] "free"
-pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_0"]  // ◀ PRIME: buf_out[0] "free"
-pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_1"]  // ◀ PRIME: buf_out[1] "free"
-
-scf.for %i = %c0 to %N step %c1 {
-  // ── All 3 stages in same iteration, indexed by i%2 ──
-  // %pp = i % 2  (ping/pong selector for buffer & event IDs)
-
-  // ── MTE2: load tile[i] into buf_in[i%2] ──
-  // WAR: wait until Vector has released buf_in[i%2] from iteration i-2
-  pto.wait_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
-  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_in[%pp], ...
-  // RAW: signal Vector that buf_in[i%2] data is ready
-  pto.set_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
-
-  // ── Vector: compute buf_in[i%2] → buf_out[i%2] ──
-  // RAW: wait for MTE2 to finish loading buf_in[i%2]
-  pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
-  // WAR: wait for MTE3 to finish reading buf_out[i%2] from iteration i-2
-  pto.wait_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
-  scf.for %dummy = %c0 to %c1 step %c1 {
-    %v   = pto.vlds %ub_in[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-  } {llvm.loop.aivector_scope}
-  // WAR: tell MTE2 "done reading buf_in[i%2]"
-  pto.set_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
-  // RAW: tell MTE3 "buf_out[i%2] result ready"
-  pto.set_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
-
-  // ── MTE3: store result from buf_out[i%2] to GM ──
-  // RAW: wait for Vector to finish writing buf_out[i%2]
-  pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
-  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
-  // WAR: tell Vector "done reading buf_out[i%2]"
-  pto.set_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
-}
-
-// ═══ Post-loop: drain — match every pre-loop prime with a wait ═══
-// Each priming set_flag must be paired with a wait_flag. The reverse-dep
-// set_flags issued in the last two iterations have no matching wait_flag
-// inside the loop (iterations N and N+1 never run). Drain them here.
-pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-1)%2}"]  // ◀ DRAIN
-pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-2)%2}"]  // ◀ DRAIN
-pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-1)%2}"] // ◀ DRAIN
-pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-2)%2}"] // ◀ DRAIN
-```
-
-**What `set_flag`/`wait_flag` requires outside the loop:**
-- **Before the loop (4 × `set_flag`):** Prime every reverse-dependency event ID — one per buffer per pipe-pair. Without this, the first iteration's `wait_flag` for reverse deps would deadlock (no signal was ever sent).
-- **After the loop (4 × `wait_flag`):** Drain the matching reverse-dep signals from the last iterations. Every `set_flag` must be paired with a `wait_flag` — the last loop iterations produce signals that no subsequent iteration consumes, so they must be drained explicitly.
-
-#### 3b. `get_buf` / `rls_buf` version
-
-Same ping/pong double-buffering, but **no pre-loop priming or post-loop draining needed.** Buffer acquire/release semantics handle everything.
-
-```mlir
-scf.for %i = %c0 to %N step %c1 {
-  // %pp = i % 2  (ping/pong selector)
-
-  // ── MTE2: load tile[i] into buf[i%2] ──
-  // Acquires buf[i%2] — on first iteration, buffer is free so proceeds immediately.
-  // On later iterations, blocks until Vector releases buf[i%2] (WAR: automatic).
-  pto.get_buf "PIPE_MTE2", %bufid_buf[%pp], %mode : i64, i64
-  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_buf[%pp], ...
-  pto.rls_buf "PIPE_MTE2", %bufid_buf[%pp], %mode : i64, i64
-
-  // ── Vector: compute on buf[i%2] ──
-  // Acquires buf[i%2] — blocks until MTE2 releases it (RAW: automatic)
-  pto.get_buf "PIPE_V", %bufid_buf[%pp], %mode : i64, i64
-  pto.get_buf "PIPE_V", %bufid_out[%pp], %mode : i64, i64
-  scf.for %dummy = %c0 to %c1 step %c1 {
-    %v   = pto.vlds %ub_buf[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask
-    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-  } {llvm.loop.aivector_scope}
-  // Release buf[i%2] — MTE2 can reuse in iteration i+2 (WAR resolved)
-  pto.rls_buf "PIPE_V", %bufid_buf[%pp], %mode : i64, i64
-  pto.rls_buf "PIPE_V", %bufid_out[%pp], %mode : i64, i64
-
-  // ── MTE3: store result ──
-  // Acquires out[i%2] — blocks until Vector releases it (RAW: automatic)
-  pto.get_buf "PIPE_MTE3", %bufid_out[%pp], %mode : i64, i64
-  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
-  pto.rls_buf "PIPE_MTE3", %bufid_out[%pp], %mode : i64, i64
-}
-// No post-loop drain needed — last rls_buf completes the pipeline.
-```
-
-**No priming, no draining, no event IDs.** The acquire/release protocol on buffer IDs indexed by `i%2` implicitly resolves all cross-pipeline dependencies:
-- **RAW** (MTE2→V): Vector's `get_buf` blocks until MTE2's `rls_buf` on `buf[i%2]`
-- **WAR** (V→MTE2): MTE2's `get_buf` in iteration `i+2` blocks until Vector's `rls_buf` in iteration `i` (same buffer)
-- **First iteration:** Buffer is initially free, so `get_buf` proceeds without blocking — no priming needed
-
----
-
-## Comparison Summary
-
-| Aspect | `set_flag` / `wait_flag` | `get_buf` / `rls_buf` |
-|--------|--------------------------|------------------------|
-| Dependency model | Explicit event signals | Implicit via buffer acquire/release |
-| IDs per pipe-pair | **4** = 2 buffers × 2 directions (fwd + rev) | 1 fwd + 1 rev per buffer (shared global pool) |
-| Total HW IDs | 8 for 2 pipe-pairs, grows with buffers | **32 global** across all pipes |
-| Reverse (WAR) deps | Extra `set_flag`/`wait_flag` pair per buffer | Handled automatically |
-| Pre-loop setup | `set_flag` to prime each reverse dep | None |
-| Post-loop teardown | `wait_flag` to drain all primed signals | None |
-| Straight-line code | Simple, clear | Slightly more verbose (bracket each stage) |
-| Ping/pong loops | 8 event IDs + 4 prime + 4 drain | Same pattern, no overhead |
-| Best used for | Simple pipelines, fine-grained control | Double/multi-buffering, complex loops |
-
----
-
-## Inter-Core Sync
-
-> **Note:** Inter-core sync is only needed for **mixed Cube+Vector tasks** where Cube produces data that Vector consumes (or vice versa). **Vec-only tasks can ignore this section entirely.**
-
-These ops coordinate execution across the Cube block and Vector subblocks within a cluster. Each core cluster consists of **1 Cube block : 2 Vector subblocks**, each with its own **SU (Sequencer Unit)** running independent instruction streams.
-
-```
-Core Cluster (1:2 ratio)
-┌─────────────────────────────────────────────┐
-│  ┌──────────────┐    ┌──────────────┐       │
-│  │  AIC (Cube)  │    │  AIV0 (Vec)  │       │
-│  │  ┌────────┐  │    │  ┌────────┐  │       │
-│  │  │   SU   │──┼────┼──│   SU   │  │       │
-│  │  └────────┘  │    │  └────────┘  │       │
-│  │  CUBE pipe   │    │  MTE2/V/MTE3 │       │
-│  │  L0C buffer  │    │  UB (256KB)  │       │
-│  └──────────────┘    └──────────────┘       │
-│                      ┌──────────────┐       │
-│                      │  AIV1 (Vec)  │       │
-│                      │  ┌────────┐  │       │
-│                      │  │   SU   │  │       │
-│                      │  └────────┘  │       │
-│                      │  MTE2/V/MTE3 │       │
-│                      │  UB (256KB)  │       │
-│                      └──────────────┘       │
-└─────────────────────────────────────────────┘
-```
-
-### Platform Comparison
-
-| Aspect | A2A3 (Ascend 910) | A5 (A5) |
-|--------|-------------------|-----------------|
-| **Signal op** | `set_cross_core` (mode2) | `set_intra_block` |
-| **Wait op** | `wait_flag_dev` | `wait_intra_core` |
-| **Wait behavior** | SU-level blocking (entire core stalls) | Per-pipeline (only named pipe stalls) |
-| **Semaphore pool** | 16 IDs per cluster, 4-bit counter | 16 IDs, but 32-ID address space (see below) |
-| **C→V** | **Broadcast**: one `set` reaches both AIV0+AIV1 | **1:1**: separate `set` per subblock required |
-| **V→C** | **Reduce**: Cube waits for both subblocks in one `wait` | **1:1**: Cube needs separate `wait` per subblock |
-
-### A2A3: `set_cross_core` / `wait_flag_dev`
-
-```c
-// mode2 broadcast/reduce semantics for 1:2 cluster
-set_cross_core(pipe, semaphore_id);   // pipe: VEC/MTE2/CUBE/FIX
-wait_flag_dev(semaphore_id);          // SU-level blocking
-```
-
-```
-C→V Broadcast (one set reaches both):
-    AIC ──set_cross_core──┬──> AIV0 sema++
-                          └──> AIV1 sema++
-
-V→C Reduce (one wait for both):
-    AIV0 ──set_cross_core──┐
-                           ├──> AIC wait_flag_dev (blocks until BOTH)
-    AIV1 ──set_cross_core──┘
-```
-
-### `pto.set_cross_core`
-
-- **syntax:** `pto.set_cross_core %core_id, %event_id : i64, i64`
-- **semantics:** Signal event to another core. Uses **mode2** for 1:2 cluster on A2A3.
-
-### `pto.wait_flag_dev`
-
-- **syntax:** `pto.wait_flag_dev %core_id, %event_id : i64, i64`
-- **semantics:** Wait for event from another core. **SU-level blocking** — entire core stalls.
-
-### A5: `set_intra_block` / `wait_intra_core`
-
-```c
-set_intra_block(trigger_pipe, semaphore_id);
-wait_intra_core(wait_pipe, semaphore_id);   // only named pipe stalls
-```
-
-**A5 semaphore address space:** The hardware has **16 physical semaphore IDs** but exposes a **32-ID address space** to support 1:1 signaling to each subblock:
-
-| ID Range | Target |
-|----------|--------|
-| 0–15 | AIV0 (subblock 0) |
-| 16–31 (+16 offset) | AIV1 (subblock 1) |
-
-This means C→V requires **separate `set_intra_block` calls** per subblock (no broadcast), and V→C requires **separate `wait_intra_core` calls** per subblock (no hardware reduce).
-
-```
-C→V on A5 (1:1, no broadcast; need two sets):
-    AIC ──set_intra_block(pipe, sema_id)────> AIV0
-    AIC ──set_intra_block(pipe, sema_id+16)──> AIV1
-
-V→C on A5 (1:1, no reduce; need two waits):
-    AIV0 ──set_intra_block──> AIC wait_intra_core(pipe, sema_id)
-    AIV1 ──set_intra_block──> AIC wait_intra_core(pipe, sema_id+16)  // extra wait
-```
-
-### `pto.set_intra_block`
-
-- **syntax:** `pto.set_intra_block %block_id, %event_id : i64, i64`
-- **semantics:** Signal event within a block (A5). Specifies **trigger pipe**. 1:1 per subblock.
-
-### `pto.wait_intra_core`
-
-- **syntax:** `pto.wait_intra_core %block_id, %event_id : i64, i64`
-- **semantics:** Wait for event within block (A5). Specifies **which pipeline should wait** — only that pipe stalls, SU and other pipes continue.
-
-### Wait Granularity Comparison
-
-```
-A2A3 wait_flag_dev (SU-level stall):
-    SU ──┬── PIPE_MTE2 ───╳ ALL STALLED
-         ├── PIPE_V    ───╳ ALL STALLED
-         └── PIPE_MTE3 ───╳ ALL STALLED
-
-A5 wait_intra_core "PIPE_MTE2" (per-pipe stall):
-    SU ──┬── PIPE_MTE2 ───╳ STALLED (waiting for Cube)
-         ├── PIPE_V    ─── ✓ RUNNING
-         └── PIPE_MTE3 ─── ✓ RUNNING
-```
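Note on the removed inter-core sync section: the A5 table maps IDs 0–15 to AIV0 and 16–31 to AIV1. The sketch below shows that mapping as plain C, assuming the second range starts exactly one physical-pool width (16) above the first. The helper names `a5_sema_addr` and `a5_sema_subblock` are hypothetical and are not part of PTO or CANN.

```c
enum {
    A5_PHYS_SEMA_IDS = 16,  /* physical semaphore IDs, per the removed page */
    A5_ADDR_SEMA_IDS = 32   /* exposed address space, per the removed page */
};

/* Hypothetical helper: address-space ID that targets `subblock`
 * (0 = AIV0, 1 = AIV1) for physical semaphore `phys_id`.
 * Returns -1 when either argument is out of range. */
int a5_sema_addr(int subblock, int phys_id)
{
    if (subblock < 0 || subblock > 1)
        return -1;
    if (phys_id < 0 || phys_id >= A5_PHYS_SEMA_IDS)
        return -1;
    return subblock * A5_PHYS_SEMA_IDS + phys_id;
}

/* Inverse lookup: which subblock an address-space ID targets,
 * or -1 when the ID lies outside the 32-ID space. */
int a5_sema_subblock(int addr_id)
{
    if (addr_id < 0 || addr_id >= A5_ADDR_SEMA_IDS)
        return -1;
    return addr_id / A5_PHYS_SEMA_IDS;
}
```

Under this model, a C→V signal on A5 would use `a5_sema_addr(0, id)` for AIV0 and `a5_sema_addr(1, id)` for AIV1, which is the "two sets" pattern the page describes.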
diff --git a/docs/mkdocs/src/docs/isa/vector/predicate-and-materialization.md b/docs/mkdocs/src/docs/isa/vector/predicate-and-materialization.md
deleted file mode 100644
index a2a0ce34..00000000
--- a/docs/mkdocs/src/docs/isa/vector/predicate-and-materialization.md
+++ /dev/null
@@ -1,348 +0,0 @@
-<!-- Generated from `docs/isa/vector/predicate-and-materialization.md` -->
-
-# Vector Families: Predicate And Materialization
-
-This page documents the predicate-register and materialization families used by `pto.v*` code. Predicate load/store, mask generation, and predicate algebra are architecture-visible because they control which lanes participate in later vector operations.
-
-> **Category:** UB ↔ Predicate Register data movement
-> **Pipeline:** PIPE_V (Vector Core)
-
-Predicate registers (`!pto.mask`) are 256-bit registers that enable per-lane conditional execution. These ops move predicate values between UB and predicate registers.
-
----
-
-## Predicate Loads
-
-### `pto.plds`
-
-- **syntax:** `%result = pto.plds %source[%offset] {dist = "DIST"} : !pto.ptr<T, ub> -> !pto.mask`
-- **semantics:** Load predicate register with scalar offset.
-
-**Distribution modes:** `NORM`, `US`, `DS`
-
-**Example:**
-```mlir
-%mask = pto.plds %ub[%c0] {dist = "NORM"} : !pto.ptr<T, ub> -> !pto.mask
-```
-
----
-
-### `pto.pld`
-
-- **syntax:** `%result = pto.pld %source[%offset], "DIST" : !pto.ptr<T, ub>, index -> !pto.mask`
-- **semantics:** Load predicate register with areg offset.
-
----
-
-### `pto.pldi`
-
-- **syntax:** `%result = pto.pldi %source, %offset, "DIST" : !pto.ptr<T, ub>, i32 -> !pto.mask`
-- **semantics:** Load predicate register with immediate offset.
-
----
-
-## Predicate Stores
-
-### `pto.psts`
-
-- **syntax:** `pto.psts %value, %dest[%offset] : !pto.mask, !pto.ptr<T, ub>`
-- **semantics:** Store predicate register with scalar offset.
-
-**Example:**
-```mlir
-pto.psts %mask, %ub[%c0] : !pto.mask, !pto.ptr<T, ub>
-```
-
----
-
-### `pto.pst`
-
-- **syntax:** `pto.pst %value, %dest[%offset], "DIST" : !pto.mask, !pto.ptr<T, ub>, index`
-- **semantics:** Store predicate register with areg offset.
-
-**Distribution modes:** `NORM`, `PK`
-
----
-
-### `pto.psti`
-
-- **syntax:** `pto.psti %value, %dest, %offset, "DIST" : !pto.mask, !pto.ptr<T, ub>, i32`
-- **semantics:** Store predicate register with immediate offset.
-
----
-
-### `pto.pstu`
-
-- **syntax:** `%align_out, %base_out = pto.pstu %align_in, %value, %base : !pto.align, !pto.mask, !pto.ptr<T, ub> -> !pto.align, !pto.ptr<T, ub>`
-- **semantics:** Predicate unaligned store with align state update.
-
----
-
-## Typical Usage Pattern
-
-```mlir
-// Generate comparison mask
-%mask = pto.vcmp %v0, %v1, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-
-// Store mask to UB for later use
-pto.psts %mask, %ub_mask[%c0] : !pto.mask, !pto.ptr<T, ub>
-
-// ... later in another kernel ...
-
-// Load mask from UB
-%saved_mask = pto.plds %ub_mask[%c0] {dist = "NORM"} : !pto.ptr<T, ub> -> !pto.mask
-
-// Use for predicated select
-%result = pto.vsel %v_true, %v_false, %saved_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
-
----
-
-> **Category:** Scalar broadcast, predicate generation and manipulation
-> **Pipeline:** PIPE_V (Vector Core)
-
-These ops create vectors from scalar values and manipulate predicate registers.
-
-## Common Operand Model
-
-- `%value` is the scalar source value in SSA form.
-- `%input` is either a source scalar or a source vector depending on the op.
-- `%result` is the destination vector register value.
-- For 32-bit scalar inputs, the scalar source MUST satisfy the backend's legal
-  scalar-source constraints for this family.
-
----
-
-## Scalar Materialization
-
-### `pto.vbr`
-
-- **syntax:** `%result = pto.vbr %value : T -> !pto.vreg<NxT>`
-- **semantics:** Broadcast scalar to all vector lanes.
-- **inputs:**
-  `%value` is the scalar source.
-- **outputs:**
-  `%result` is a vector whose active lanes all carry `%value`.
-- **constraints and limitations:**
-  Supported forms are `b8`, `b16`, and `b32`. For `b8`, only the low 8 bits of
-  the scalar source are consumed.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = value;
-```
-
-**Example:**
-```mlir
-%one = pto.vbr %c1_f32 : f32 -> !pto.vreg<64xf32>
-```
-
----
-
-### `pto.vdup`
-
-- **syntax:** `%result = pto.vdup %input {position = "POSITION"} : T|!pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **semantics:** Duplicate scalar or vector element to all lanes.
-- **inputs:**
-  `%input` supplies the scalar or source-lane value selected by `position`.
-- **outputs:**
-  `%result` is the duplicated vector.
-- **constraints and limitations:**
-  `position` selects which source element or scalar position is duplicated. The
-  current PTO ISA vector surface representation models that selector as an attribute rather than a
-  separate operand.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = input_scalar_or_element;
-```
-
----
-
-## Predicate Generation
-
-### `pto.pset_b8` / `pto.pset_b16` / `pto.pset_b32`
-
-- **syntax:** `%result = pto.pset_b32 "PAT_*" : !pto.mask`
-- **semantics:** Generate predicate from pattern.
-
-**Patterns:**
-
-| Pattern | Description |
-|---------|-------------|
-| `PAT_ALL` | All lanes active |
-| `PAT_ALLF` | All lanes inactive |
-| `PAT_H` | High half active |
-| `PAT_Q` | Upper quarter active |
-| `PAT_VL1`...`PAT_VL128` | First N lanes active |
-| `PAT_M3`, `PAT_M4` | Modular patterns |
-
-**Example — All 64 f32 lanes active:**
-```mlir
-%all_active = pto.pset_b32 "PAT_ALL" : !pto.mask
-```
-
-**Example — First 16 lanes active:**
-```mlir
-%first_16 = pto.pset_b32 "PAT_VL16" : !pto.mask
-```
-
----
-
-### `pto.pge_b8` / `pto.pge_b16` / `pto.pge_b32`
-
-- **syntax:** `%result = pto.pge_b32 "PAT_*" : !pto.mask`
-- **semantics:** Generate tail mask — first N lanes active.
-
-```c
-for (int i = 0; i < TOTAL_LANES; i++)
-    mask[i] = (i < len);
-```
-
-**Example — Tail mask for remainder loop:**
-```mlir
-%tail_mask = pto.pge_b32 "PAT_VL8" : !pto.mask
-```
-
----
-
-### `pto.plt_b8` / `pto.plt_b16` / `pto.plt_b32`
-
-- **syntax:** `%mask, %scalar_out = pto.plt_b32 %scalar : i32 -> !pto.mask, i32`
-- **semantics:** Generate predicate state together with updated scalar state.
-
----
-
-## Predicate Pack/Unpack
-
-### `pto.ppack`
-
-- **syntax:** `%result = pto.ppack %input, "PART" : !pto.mask -> !pto.mask`
-- **semantics:** Narrowing pack of predicate register.
-
-**Part tokens:** `LOWER`, `HIGHER`
-
----
-
-### `pto.punpack`
-
-- **syntax:** `%result = pto.punpack %input, "PART" : !pto.mask -> !pto.mask`
-- **semantics:** Widening unpack of predicate register.
-
----
-
-## Predicate Logical Ops
-
-### `pto.pand`
-
-- **syntax:** `%result = pto.pand %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask`
-- **semantics:** Predicate bitwise AND.
-
-```c
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[i] = src0[i] & src1[i];
-```
-
----
-
-### `pto.por`
-
-- **syntax:** `%result = pto.por %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask`
-- **semantics:** Predicate bitwise OR.
-
-```c
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[i] = src0[i] | src1[i];
-```
-
----
-
-### `pto.pxor`
-
-- **syntax:** `%result = pto.pxor %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask`
-- **semantics:** Predicate bitwise XOR.
-
-```c
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[i] = src0[i] ^ src1[i];
-```
-
----
-
-### `pto.pnot`
-
-- **syntax:** `%result = pto.pnot %input, %mask : !pto.mask, !pto.mask -> !pto.mask`
-- **semantics:** Predicate bitwise NOT.
-
-```c
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[i] = ~src[i];
-```
-
----
-
-### `pto.psel`
-
-- **syntax:** `%result = pto.psel %src0, %src1, %sel : !pto.mask, !pto.mask, !pto.mask -> !pto.mask`
-- **semantics:** Predicate select (mux).
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = sel[i] ? src0[i] : src1[i];
-```
-
----
-
-## Predicate Movement
-
-### `pto.ppack`
-
-- **syntax:** `%result = pto.ppack %input, "PART" : !pto.mask -> !pto.mask`
-- **semantics:** Narrowing pack of predicate register.
-
-```c
-for (int i = 0; i < N; i++)
-    if (mask[i]) dst[i] = src[i];
-```
-
----
-
-### `pto.punpack`
-
-- **syntax:** `%result = pto.punpack %input, "PART" : !pto.mask -> !pto.mask`
-- **semantics:** Widening unpack of predicate register.
-
----
-
-### `pto.pdintlv_b8`
-
-- **syntax:** `%low, %high = pto.pdintlv_b8 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask`
-- **semantics:** Predicate deinterleave.
-
----
-
-### `pto.pintlv_b16`
-
-- **syntax:** `%low, %high = pto.pintlv_b16 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask`
-- **semantics:** Predicate interleave.
-
----
-
-## Typical Usage
-
-```mlir
-// Generate all-active mask for f32 (64 lanes)
-%all = pto.pset_b32 "PAT_ALL" : !pto.mask
-
-// Generate tail mask for a 12-element remainder (first 12 lanes active)
-%tail = pto.pge_b32 "PAT_VL12" : !pto.mask
-
-// Compare and generate mask
-%cmp_mask = pto.vcmp %a, %b, %all, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask
-
-// Combine masks: only process tail elements that passed comparison
-%combined = pto.pand %cmp_mask, %tail, %all : !pto.mask, !pto.mask, !pto.mask -> !pto.mask
-
-// Use for predicated operation
-%result = pto.vsel %true_vals, %false_vals, %combined : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
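Note on the removed predicate page: its "Typical Usage" flow (tail mask, compare, AND, select) can be mirrored by a scalar reference model. The sketch below is only that, a model under stated assumptions: one `bool` per f32 lane, 64 lanes, and writes performed only on lanes enabled by the governing mask, as in the page's pseudocode. All function names are hypothetical, and the behaviour of inactive lanes on real hardware is not specified here.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 64  /* f32 lanes per vector register, as in the page's examples */

/* Model of a "first n lanes active" pattern (PAT_VLn style). */
void mask_first_n(bool *mask, size_t n)
{
    for (size_t i = 0; i < LANES; i++)
        mask[i] = (i < n);
}

/* Model of a less-than compare; assumption: inactive lanes yield false. */
void mask_cmp_lt(bool *dst, const float *a, const float *b, const bool *gov)
{
    for (size_t i = 0; i < LANES; i++)
        dst[i] = gov[i] && (a[i] < b[i]);
}

/* Model of predicate AND: only lanes enabled by `gov` are written. */
void mask_and(bool *dst, const bool *s0, const bool *s1, const bool *gov)
{
    for (size_t i = 0; i < LANES; i++)
        if (gov[i])
            dst[i] = s0[i] && s1[i];
}

/* Model of a predicated select between two f32 vectors. */
void select_f32(float *dst, const float *t, const float *f, const bool *mask)
{
    for (size_t i = 0; i < LANES; i++)
        dst[i] = mask[i] ? t[i] : f[i];
}

/* Count active lanes, handy when checking a combined mask. */
size_t count_active(const bool *mask)
{
    size_t n = 0;
    for (size_t i = 0; i < LANES; i++)
        n += mask[i] ? 1u : 0u;
    return n;
}
```

Chaining `mask_first_n`, `mask_cmp_lt`, `mask_and`, and `select_f32` in that order reproduces the shape of the MLIR sequence above on host memory, which can help when checking expected lane counts by hand.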
diff --git a/docs/mkdocs/src/docs/isa/vector/predicate-and-materialization_zh.md b/docs/mkdocs/src/docs/isa/vector/predicate-and-materialization_zh.md
deleted file mode 100644
index 122f01c9..00000000
--- a/docs/mkdocs/src/docs/isa/vector/predicate-and-materialization_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Predicate And Materialization
-
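-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](predicate-and-materialization.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.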
diff --git a/docs/mkdocs/src/docs/isa/vector/reduction-ops.md b/docs/mkdocs/src/docs/isa/vector/reduction-ops.md
deleted file mode 100644
index f0a85fa7..00000000
--- a/docs/mkdocs/src/docs/isa/vector/reduction-ops.md
+++ /dev/null
@@ -1,232 +0,0 @@
-<!-- Generated from `docs/isa/vector/reduction-ops.md` -->
-
-# Vector Families: Reduction Ops
-
-This page documents `pto.v*` reduction families. Lane grouping, result placement, and inactive-lane rules are part of the visible vector contract and are not left to backend folklore.
-
-> **Category:** Vector reduction operations
-> **Pipeline:** PIPE_V (Vector Core)
-
-Operations that reduce a vector to a scalar or per-group result.
-
-## Common Operand Model
-
-- `%input` is the source vector register value.
-- `%mask` is the predicate operand `Pg`; inactive lanes do not participate.
-- `%result` is the destination vector register value.
-- Reduction results are written into the low-significance portion of the
-  destination vector and the remaining destination bits are zero-filled.
-
----
-
-## Full Vector Reductions
-
-### `pto.vcadd`
-
-- **syntax:** `%result = pto.vcadd %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i16-i64, f16, f32
-- **semantics:** Sum all elements. Result in lane 0, others zeroed.
-
-```c
-T sum = 0;
-for (int i = 0; i < N; i++)
-    sum += src[i];
-dst[0] = sum;
-for (int i = 1; i < N; i++)
-    dst[i] = 0;
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects participating
-  lanes.
-- **outputs:** `%result` contains the reduction result in its low element(s).
-- **constraints and limitations:** Some narrow integer forms may widen the
-  internal accumulation or result placement. If all predicate bits are zero, the
-  result is zero.
-
----
-
-### `pto.vcmax`
-
-- **syntax:** `%result = pto.vcmax %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i16-i32, f16, f32
-- **semantics:** Find max element with argmax. Result value + index in lane 0.
-
-```c
-T mx = -INF; int idx = 0;
-for (int i = 0; i < N; i++)
-    if (src[i] > mx) { mx = src[i]; idx = i; }
-dst_val[0] = mx;
-dst_idx[0] = idx;
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects participating
-  lanes.
-- **outputs:** `%result` carries the reduction result in the low destination
-  positions.
-- **constraints and limitations:** This family computes both the extremum and
-  location information, but the exact packing of that information into the
-  destination vector depends on the chosen form. If all predicate bits are zero,
-  the result follows the zero-filled convention.
-
----
-
-### `pto.vcmin`
-
-- **syntax:** `%result = pto.vcmin %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i16-i32, f16, f32
-- **semantics:** Find min element with argmin. Result value + index in lane 0.
-
-```c
-T mn = INF; int idx = 0;
-for (int i = 0; i < N; i++)
-    if (src[i] < mn) { mn = src[i]; idx = i; }
-dst_val[0] = mn;
-dst_idx[0] = idx;
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects participating
-  lanes.
-- **outputs:** `%result` carries the reduction result in the low destination
-  positions.
-- **constraints and limitations:** As with `pto.vcmax`, the exact value/index
-  packing depends on the chosen form and MUST be preserved consistently.
-
----
-
-## Per-VLane (Group) Reductions
-
-The vector register is organized as **8 VLanes** of 32 bytes each. Group reductions operate within each VLane independently.
-
-```
-vreg layout (f32 example, 64 elements total):
-VLane 0: [0..7]   VLane 1: [8..15]  VLane 2: [16..23] VLane 3: [24..31]
-VLane 4: [32..39] VLane 5: [40..47] VLane 6: [48..55] VLane 7: [56..63]
-```
-
-### `pto.vcgadd`
-
-- **syntax:** `%result = pto.vcgadd %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i16-i32, f16, f32
-- **semantics:** Sum within each VLane. 8 results at indices 0, 8, 16, 24, 32, 40, 48, 56 (for f32).
-
-```c
-int K = N / 8;  // elements per VLane
-for (int g = 0; g < 8; g++) {
-    T sum = 0;
-    for (int i = 0; i < K; i++)
-        sum += src[g*K + i];
-    dst[g*K] = sum;
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;
-}
-// For f32: results at dst[0], dst[8], dst[16], dst[24], dst[32], dst[40], dst[48], dst[56]
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects participating
-  lanes.
-- **outputs:** `%result` contains one sum per 32-byte VLane group, written
-  contiguously into the low slot of each group.
-- **constraints and limitations:** This is a per-32-byte VLane-group reduction.
-  Inactive lanes are treated as zero.
-
----
-
-### `pto.vcgmax`
-
-- **syntax:** `%result = pto.vcgmax %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i16-i32, f16, f32
-- **semantics:** Max within each VLane.
-
-```c
-int K = N / 8;
-for (int g = 0; g < 8; g++) {
-    T mx = -INF;
-    for (int i = 0; i < K; i++)
-        if (src[g*K + i] > mx) mx = src[g*K + i];
-    dst[g*K] = mx;
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;
-}
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects participating
-  lanes.
-- **outputs:** `%result` contains one maximum per 32-byte VLane group.
-- **constraints and limitations:** Grouping is by hardware 32-byte VLane, not by
-  arbitrary software subvector.
-
----
-
-### `pto.vcgmin`
-
-- **syntax:** `%result = pto.vcgmin %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i16-i32, f16, f32
-- **semantics:** Min within each VLane.
-
-```c
-int K = N / 8;
-for (int g = 0; g < 8; g++) {
-    T mn = INF;
-    for (int i = 0; i < K; i++)
-        if (src[g*K + i] < mn) mn = src[g*K + i];
-    dst[g*K] = mn;
-    for (int i = 1; i < K; i++)
-        dst[g*K + i] = 0;
-}
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects participating
-  lanes.
-- **outputs:** `%result` contains one minimum per 32-byte VLane group.
-- **constraints and limitations:** Grouping is by hardware 32-byte VLane, not by
-  arbitrary software subvector.
-
----
-
-## Prefix Operations
-
-### `pto.vcpadd`
-
-- **syntax:** `%result = pto.vcpadd %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-- **semantics:** Inclusive prefix sum (scan).
-
-```c
-dst[0] = src[0];
-for (int i = 1; i < N; i++)
-    dst[i] = dst[i-1] + src[i];
-```
-
-**Example:**
-```c
-// input:  [1, 2, 3, 4, 5, ...]
-// output: [1, 3, 6, 10, 15, ...]
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects participating
-  lanes.
-- **outputs:** `%result` is the inclusive prefix-sum vector.
-- **constraints and limitations:** Only floating-point element types are
-  documented on the current A5 surface here.
-
----
-
-## Typical Usage
-
-```mlir
-// Softmax: find max for numerical stability
-%max_vec = pto.vcmax %logits, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-// max is in lane 0; store it to %ub_tmp (store not shown), then reload it with a broadcast load
-%max_broadcast = pto.vlds %ub_tmp[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-
-// Row-wise sum using vcgadd (for 8-row tile)
-%row_sums = pto.vcgadd %tile, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-// Results at indices 0, 8, 16, 24, 32, 40, 48, 56
-
-// Full vector sum for normalization
-%total = pto.vcadd %values, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-// total[0] contains the sum
-
-// Prefix sum for cumulative distribution
-%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
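Note on the removed reduction page: its C pseudocode omits the predicate, while the prose says inactive lanes do not participate and are treated as zero. The sketch below adds the mask to the full-vector and per-VLane sums for the 64-lane f32 case. It follows the page's pseudocode in placing each group result in the first lane of its group. The names `ref_vcadd_f32` and `ref_vcgadd_f32` are hypothetical, and the sequential accumulation order is an assumption; hardware may sum in a different order and round differently.

```c
#include <stdbool.h>

enum {
    LANES = 64,                   /* f32 lanes per vector register */
    VLANES = 8,                   /* 32-byte VLane groups */
    PER_VLANE = LANES / VLANES    /* f32 elements per VLane */
};

/* Masked full-vector sum: result in lane 0, all other lanes zeroed.
 * With an all-false mask the result is zero, matching the page. */
void ref_vcadd_f32(float *dst, const float *src, const bool *mask)
{
    float sum = 0.0f;
    for (int i = 0; i < LANES; i++)
        if (mask[i])
            sum += src[i];
    dst[0] = sum;
    for (int i = 1; i < LANES; i++)
        dst[i] = 0.0f;
}

/* Masked per-VLane sum: one result in the first lane of each group,
 * remaining lanes of the group zeroed; inactive lanes count as zero. */
void ref_vcgadd_f32(float *dst, const float *src, const bool *mask)
{
    for (int g = 0; g < VLANES; g++) {
        float sum = 0.0f;
        for (int i = 0; i < PER_VLANE; i++)
            if (mask[g * PER_VLANE + i])
                sum += src[g * PER_VLANE + i];
        dst[g * PER_VLANE] = sum;
        for (int i = 1; i < PER_VLANE; i++)
            dst[g * PER_VLANE + i] = 0.0f;
    }
}
```

A model like this is mainly useful as a host-side oracle when comparing simulator output for tail iterations, where only part of the mask is active.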
diff --git a/docs/mkdocs/src/docs/isa/vector/reduction-ops_zh.md b/docs/mkdocs/src/docs/isa/vector/reduction-ops_zh.md
deleted file mode 100644
index 6baf001d..00000000
--- a/docs/mkdocs/src/docs/isa/vector/reduction-ops_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Reduction Ops
-
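-This page is an auto-generated Chinese entry page; it keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](reduction-ops.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese text has not yet been fully filled in under the new structure. When you open this page from the Chinese navigation, you remain on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.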
diff --git a/docs/mkdocs/src/docs/isa/vector/sfu-and-dsa-ops.md b/docs/mkdocs/src/docs/isa/vector/sfu-and-dsa-ops.md
deleted file mode 100644
index b712222f..00000000
--- a/docs/mkdocs/src/docs/isa/vector/sfu-and-dsa-ops.md
+++ /dev/null
@@ -1,320 +0,0 @@
-<!-- Generated from `docs/isa/vector/sfu-and-dsa-ops.md` -->
-
-# Vector Families: SFU And DSA Ops
-
-This page documents special-function, fused, and domain-specific `pto.v*` families. These forms are narrower than generic arithmetic and must carry explicit target-profile restrictions.
-
-> **Category:** Domain-specific accelerator and special function unit operations
-> **Pipeline:** PIPE_V (Vector Core) / SFU
-
-Fused operations, special functions, and UB-to-UB operations that leverage hardware acceleration.
-
-## Common Operand Model
-
-- `%input`, `%lhs`, `%rhs`, `%acc`, and `%alpha` are source SSA values whose
-  roles are called out per instruction.
-- `%mask` is the predicate operand `Pg` when present.
-- `%result` is the destination SSA value.
-- This page mixes three different backend shapes: pure `vreg -> vreg` ops,
-  conversion/fusion ops, and UB-to-UB helpers. Each instruction section calls
-  out which storage model it uses.
-
----
-
-## Fused Activation Ops (vreg→vreg)
-
-### `pto.vlrelu`
-
-- **syntax:** `%result = pto.vlrelu %input, %alpha, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-- **semantics:** Leaky ReLU with scalar alpha.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] >= 0) ? src[i] : alpha * src[i];
-```
-
-- **inputs:** `%input` is the activation vector, `%alpha` is the scalar slope,
-  and `%mask` selects active lanes.
-- **outputs:** `%result` is the leaky-ReLU vector.
-- **constraints and limitations:** Only `f16` and `f32` forms are currently
-  documented for `pto.vlrelu`.
-
----
-
-### `pto.vprelu`
-
-- **syntax:** `%result = pto.vprelu %input, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-- **semantics:** Parametric ReLU with per-element alpha vector.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] >= 0) ? src[i] : alpha[i] * src[i];
-```
-
-- **inputs:** `%input` is the activation vector and `%alpha` is the per-element
-  slope vector.
-- **outputs:** `%result` is the parametric-ReLU vector.
-- **constraints and limitations:** Floating-point element types only on the
-  current A5 surface.
-
----
-
-### `pto.vexpdiff`
-
-- **syntax:** `%result = pto.vexpdiff %input, %max : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-- **semantics:** Fused exp(x - max) for numerically stable softmax.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = expf(src[i] - max[i]);
-```
-
-**Use case:** Softmax numerator computation with numerical stability.
-
-- **inputs:** `%input` is the source vector and `%max` is the broadcasted
-  subtraction term.
-- **outputs:** `%result` is the fused `exp(input - max)` vector.
-- **constraints and limitations:** Floating-point element types only.
-
----
-
-## Fused Compute+Convert Ops
-
-### `pto.vaddrelu`
-
-- **syntax:** `%result = pto.vaddrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-- **semantics:** Fused add + ReLU.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = max(src0[i] + src1[i], 0);
-```
-
-- **inputs:** `%lhs` and `%rhs` are the two addends.
-- **outputs:** `%result` is the fused add-then-ReLU result.
-- **constraints and limitations:** Floating-point element types only on the
-  current documented surface.
-
----
-
-### `pto.vsubrelu`
-
-- **syntax:** `%result = pto.vsubrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-- **semantics:** Fused sub + ReLU.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = max(src0[i] - src1[i], 0);
-```
-
-- **inputs:** `%lhs` is the minuend and `%rhs` is the subtrahend.
-- **outputs:** `%result` is the fused sub-then-ReLU result.
-- **constraints and limitations:** Floating-point element types only on the
-  current documented surface.
-
----
-
-### `pto.vaxpy`
-
-- **syntax:** `%result = pto.vaxpy %src0, %src1, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT>, T -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-- **semantics:** AXPY — scalar-vector multiply-add.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = alpha * src0[i] + src1[i];
-```
-
-- **inputs:** `%src0` is the scaled vector, `%src1` is the addend vector, and
-  `%alpha` is the scalar multiplier.
-- **outputs:** `%result` is the fused AXPY result.
-- **constraints and limitations:** Floating-point element types only on the
-  current documented surface.
-
----
-
-### `pto.vaddreluconv`
-
-- **syntax:** `%result = pto.vaddreluconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>`
-- **semantics:** Fused add + ReLU + type conversion (HW fusion).
-
-```c
-// f32→f16 variant:
-for (int i = 0; i < 64; i++)
-    dst_f16[i] = f32_to_f16(max(src0_f32[i] + src1_f32[i], 0));
-
-// f16→i8 variant:
-for (int i = 0; i < 128; i++)
-    dst_i8[i] = f16_to_i8(max(src0_f16[i] + src1_f16[i], 0));
-```
-
-- **inputs:** `%lhs` and `%rhs` are the source vectors.
-- **outputs:** `%result` is the fused add/ReLU/convert result.
-- **constraints and limitations:** Only backend-supported source/destination
-  type pairs are legal. Rounding, saturation, and packing rules follow the
-  semantics of this fused operation, not an arbitrary sequence of standalone
-  ops.
-
----
-
-### `pto.vmulconv`
-
-- **syntax:** `%result = pto.vmulconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>`
-- **semantics:** Fused mul + type conversion (HW fusion).
-
-```c
-// f16→i8 variant:
-for (int i = 0; i < 128; i++)
-    dst_i8[i] = f16_to_i8(src0_f16[i] * src1_f16[i]);
-```
-
-- **inputs:** `%lhs` and `%rhs` are the source vectors.
-- **outputs:** `%result` is the fused mul/convert result.
-- **constraints and limitations:** Only backend-supported source/destination
-  type pairs are legal.
-
----
-
-## Extended Arithmetic
-
-### `pto.vmull`
-
-- **syntax:** `%low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>, !pto.vreg<NxT>`
-- **A5 types:** i32/u32 (native 32×32→64 widening multiply)
-- **semantics:** Widening multiply with high/low results.
-
-```c
-for (int i = 0; i < 64; i++) {
-    int64_t r = (int64_t)src0_i32[i] * (int64_t)src1_i32[i];
-    dst_lo[i] = (int32_t)(r & 0xFFFFFFFF);
-    dst_hi[i] = (int32_t)(r >> 32);
-}
-```
-
-- **inputs:** `%lhs` and `%rhs` are the source vectors and `%mask` selects
-  active lanes.
-- **outputs:** `%low` and `%high` expose the widened-product low/high parts.
-- **constraints and limitations:** The current documented A5 form is the native
-  widening 32x32->64 integer multiply family.
-
----
-
-### `pto.vmula`
-
-- **syntax:** `%result = pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **semantics:** Multiply-accumulate.
-
-```c
-for (int i = 0; i < N; i++)
-    if (mask[i])
-        dst[i] = acc[i] + lhs[i] * rhs[i];
-```
-
-- **inputs:** `%acc` is the accumulator input, `%lhs` and `%rhs` are the
-  multiplicands, and `%mask` selects active lanes.
-- **outputs:** `%result` is the multiply-accumulate result.
-- **constraints and limitations:** `pto.vmula` is a fused multiply-accumulate
-  operation and is not always interchangeable with separate `vmul` plus `vadd`.
-
----
-
-## Index Generation
-
-### `pto.vci`
-
-- **syntax:** `%result = pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>`
-- **semantics:** Generate lane index vector.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = base_index + i;
-```
-
-**Use case:** Generate indices for gather/scatter, argsort, etc.
-
-- **inputs:** `%index` is the scalar seed/base index.
-- **outputs:** `%result` is the generated index vector.
-- **constraints and limitations:** This page documents the arithmetic/indexing
-  use of the family; the conversion page also records the same opcode for
-  completeness.
-
----
-
-## UB-to-UB Operations
-
-### `pto.vtranspose`
-
-- **syntax:** `pto.vtranspose %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64`
-- **semantics:** UB-to-UB transpose operation (not vreg-to-vreg).
-
-**Note:** This operates on UB memory directly, not on vector registers.
-
-- **inputs:** `%dest` and `%src` are UB pointers and `%config` is the ISA
-  control/config word.
-- **outputs:** This op writes UB memory and returns no SSA value.
-- **constraints and limitations:** This is not a `vreg -> vreg` op even though
-  it lives in the `pto.v*` namespace. Its correctness depends on the control
-  word and UB layout contract.
-
----
-
-## Sorting Operations
-
-### `pto.vsort32`
-
-- **syntax:** `pto.vsort32 %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64`
-- **semantics:** Sort 32 elements in UB.
-- **inputs:** `%dest` and `%src` are UB pointers and `%config` is the ISA
-  control/config word.
-- **outputs:** This op writes UB memory and returns no SSA value.
-- **constraints and limitations:** This is a UB-to-UB accelerator helper, not a
-  pure vector-register op.
-
----
-
-### `pto.vmrgsort`
-
-- **syntax:** `pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub> x4, i64, i64`
-- **semantics:** Merge-sort 4 pre-sorted input vectors.
-- **inputs:** `%dest` is the UB destination, `%src0..%src3` are the four
-  pre-sorted UB inputs, `%count` is the number of valid elements, and `%config`
-  is the operation control word.
-- **outputs:** This op writes UB memory and returns no SSA value.
-- **constraints and limitations:** Inputs MUST already be sorted according to
-  the sort order encoded by `%config`. This page uses the shorter mnemonic
-  `pto.vmrgsort`, while the current implementation summary still refers to
-  `pto.vmrgsort4`.
-
----
-
-## Current Implementation Surface Summary
-
-- `pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>, !pto.vreg<NxT>`
-- `pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- `pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>`
-- `pto.vbitsort %dest, %src, %indices, %repeat_times : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, index`
-- `pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, i64, i64`
-
----
-
-## Typical Usage
-
-```mlir
-// Softmax with fused expdiff
-%max_broadcast = pto.vlds %ub_max[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-%exp_stable = pto.vexpdiff %logits, %max_broadcast : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>
-
-// Leaky ReLU activation
-%activated = pto.vlrelu %linear_out, %alpha_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Fused residual add + ReLU
-%residual = pto.vaddrelu %conv_out, %skip_connection : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>
-
-// Generate indices for argsort
-%indices = pto.vci %c0 {order = "ASC"} : i32 -> !pto.vreg<64xi32>
-```
diff --git a/docs/mkdocs/src/docs/isa/vector/sfu-and-dsa-ops_zh.md b/docs/mkdocs/src/docs/isa/vector/sfu-and-dsa-ops_zh.md
deleted file mode 100644
index 7729494c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/sfu-and-dsa-ops_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: SFU And DSA Ops
-
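-This is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](sfu-and-dsa-ops.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.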
diff --git a/docs/mkdocs/src/docs/isa/vector/shared-arith.md b/docs/mkdocs/src/docs/isa/vector/shared-arith.md
deleted file mode 100644
index b79e28f8..00000000
--- a/docs/mkdocs/src/docs/isa/vector/shared-arith.md
+++ /dev/null
@@ -1,51 +0,0 @@
-<!-- Generated from `docs/isa/vector/shared-arith.md` -->
-
-# Vector Families: Shared Scalar Arithmetic
-
-Vector programs in PTO rely on the shared MLIR `arith` surface for scalar setup around `pto.v*` regions. This page keeps that relationship explicit without pretending that scalar bookkeeping is itself a vector payload family.
-
-## Summary
-
-Shared scalar arithmetic is part of the documented PTO source surface. It feeds vector regions with constants, offsets, loop bounds, and scalar predicates, but it does not replace `pto.v*` compute.
-
-## Mechanism
-
-Around vector code, `arith` is used to:
-
-- compute UB offsets and loop counters
-- derive tail counts and active-lane conditions
-- build scalar values broadcast or materialized into vector state
-- compare scalar loop/control values that guard vector regions
-
-The canonical scalar-side explanation lives in [Scalar And Control Families: Shared Scalar Arithmetic](../scalar/shared-arith.md). This page exists so the vector reference stays self-contained about what surrounds `pto.v*` execution.
-
-## Inputs
-
-- scalar integers
-- scalar floating-point values
-- `index` values
-- boolean-like results from scalar comparisons
-
-## Expected Outputs
-
-- scalar values consumed by vector configuration or control
-- branch predicates for structured control around vector scopes
-- scalar operands later materialized into vector state
-
-## Constraints
-
-- `arith` MUST remain scalar; vector payload math belongs to `pto.v*`.
-- Width changes, `index` conversions, and scalar comparisons that affect vector legality SHOULD be spelled explicitly.
-- This shared surface MUST be documented as supporting source syntax, not as hidden compiler-only machinery.
-
-## Cases That Are Not Allowed
-
-- documenting scalar setup as if it were a vector ALU family
-- using `arith` to stand in for vector-register semantics
-- leaving scalar-to-vector boundary assumptions implicit
-
-## Related Ops And Family Links
-
-- [Scalar And Control Families: Shared Scalar Arithmetic](../scalar/shared-arith.md)
-- [Vector Families: Predicate And Materialization](./predicate-and-materialization.md)
-- [Vector Families: Shared Structured Control Flow](./shared-scf.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/shared-arith_zh.md b/docs/mkdocs/src/docs/isa/vector/shared-arith_zh.md
deleted file mode 100644
index b4817c0c..00000000
--- a/docs/mkdocs/src/docs/isa/vector/shared-arith_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Shared Scalar Arithmetic
-
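-This is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](shared-arith.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.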
diff --git a/docs/mkdocs/src/docs/isa/vector/shared-scf.md b/docs/mkdocs/src/docs/isa/vector/shared-scf.md
deleted file mode 100644
index 28cd0ab0..00000000
--- a/docs/mkdocs/src/docs/isa/vector/shared-scf.md
+++ /dev/null
@@ -1,51 +0,0 @@
-<!-- Generated from `docs/isa/vector/shared-scf.md` -->
-
-# Vector Families: Shared Structured Control Flow
-
-Vector code in PTO is surrounded by structured control, not by hidden launch magic. This page explains the control shell that wraps `pto.v*` execution without claiming that `scf` is itself a vector mnemonic family.
-
-## Summary
-
-Shared `scf` operations provide the loop and branch structure around vector regions. They keep vector execution analyzable, explicit, and compatible with the rest of the PTO manual.
-
-## Mechanism
-
-Around vector regions, `scf` is used to:
-
-- iterate over repeated vector work
-- carry scalar state across iterations
-- branch around target-specific vector paths
-- model vector execution scopes using structured regions instead of opaque launch syntax
-
-The canonical scalar-side explanation lives in [Scalar And Control Families: Shared Structured Control Flow](../scalar/shared-scf.md). This vector page keeps the relationship visible for readers following the `pto.v*` path.
-
-## Inputs
-
-- scalar loop bounds
-- scalar predicates
-- loop-carried SSA values
-- yielded state from vector-adjacent branches or loops
-
-## Expected Outputs
-
-- explicit structured regions around vector work
-- loop-carried scalar or stateful results
-- analyzable control boundaries for vector lowering
-
-## Constraints
-
-- Vector-side control MUST keep carried values and branch results explicit through `scf.yield`.
-- Structured control SHOULD remain in `scf` form unless a truly architecture-visible PTO synchronization mechanism is required.
-- The manual MUST distinguish between vector payload effects and the shared control shell that surrounds them.
-
-## Cases That Are Not Allowed
-
-- treating structured control as backend-only hidden behavior
-- collapsing vector loop state into vague prose instead of explicit carried SSA values
-- documenting `scf` as though it were a `pto.v*` opcode family
-
-## Related Ops And Family Links
-
-- [Scalar And Control Families: Shared Structured Control Flow](../scalar/shared-scf.md)
-- [Vector Families: Pipeline Sync](./pipeline-sync.md)
-- [Vector Families: Shared Scalar Arithmetic](./shared-arith.md)
diff --git a/docs/mkdocs/src/docs/isa/vector/shared-scf_zh.md b/docs/mkdocs/src/docs/isa/vector/shared-scf_zh.md
deleted file mode 100644
index 0e0eda8e..00000000
--- a/docs/mkdocs/src/docs/isa/vector/shared-scf_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Shared Structured Control Flow
-
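-This is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](shared-scf.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.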
diff --git a/docs/mkdocs/src/docs/isa/vector/unary-vector-ops.md b/docs/mkdocs/src/docs/isa/vector/unary-vector-ops.md
deleted file mode 100644
index 8b22f9ec..00000000
--- a/docs/mkdocs/src/docs/isa/vector/unary-vector-ops.md
+++ /dev/null
@@ -1,251 +0,0 @@
-<!-- Generated from `docs/isa/vector/unary-vector-ops.md` -->
-
-# Vector Families: Unary Vector Ops
-
-This page documents single-input `pto.v*` compute families. Unless a form states otherwise, the vector-register shape, active-lane mask semantics, and target-profile restrictions below define the portable contract.
-
-> **Category:** Single-input vector operations
-> **Pipeline:** PIPE_V (Vector Core)
-
-Element-wise operations that take one vector input and produce one vector output.
-
-## Common Operand Model
-
-- `%input` is the source vector register value.
-- `%mask` is the predicate operand. For this family, inactive lanes follow the
-  predication behavior of the selected instruction form: zeroing forms
-  zero-fill inactive lanes, while merging forms preserve the destination value.
-- `%result` is the destination vector register value. Unless stated otherwise,
-  `%result` has the same lane count and element type as `%input`.
-
----
-
-## Arithmetic
-
-### `pto.vabs`
-
-- **syntax:** `%result = pto.vabs %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i8-i32, f16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] < 0) ? -src[i] : src[i];
-```
-
-- **inputs:** `%input` supplies the source lanes and `%mask` selects which lanes
-  participate.
-- **outputs:** `%result` receives the lane-wise absolute values.
-- **constraints and limitations:** Source and result types MUST match. Integer
-  overflow on the most-negative signed value follows the target-defined
-  behavior.
-
----
-
-### `pto.vneg`
-
-- **syntax:** `%result = pto.vneg %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** i8-i32, f16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = -src[i];
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` is the lane-wise arithmetic negation.
-- **constraints and limitations:** Source and result types MUST match.
-
----
-
-## Transcendental
-
-### `pto.vexp`
-
-- **syntax:** `%result = pto.vexp %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = expf(src[i]);
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` holds `exp(input[i])` per active lane.
-- **constraints and limitations:** Only floating-point element types are legal.
-
----
-
-### `pto.vln`
-
-- **syntax:** `%result = pto.vln %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = logf(src[i]);
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` holds the natural logarithm per active lane.
-- **constraints and limitations:** Only floating-point element types are legal.
-  For real-number semantics, active inputs SHOULD be strictly positive;
-  non-positive inputs follow the target's exception/NaN rules.
-
----
-
-### `pto.vsqrt`
-
-- **syntax:** `%result = pto.vsqrt %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = sqrtf(src[i]);
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` holds the square root per active lane.
-- **constraints and limitations:** Only floating-point element types are legal.
-  Negative active inputs follow the target's exception/NaN rules.
-
----
-
-### `pto.vrsqrt`
-
-- **syntax:** `%result = pto.vrsqrt %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = 1.0f / sqrtf(src[i]);
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` holds reciprocal-square-root values per active lane.
-- **constraints and limitations:** Only floating-point element types are legal.
-  Active inputs containing `+0` or `-0` follow the target's divide-style
-  exceptional behavior.
-
----
-
-### `pto.vrec`
-
-- **syntax:** `%result = pto.vrec %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = 1.0f / src[i];
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` holds the reciprocal per active lane.
-- **constraints and limitations:** Only floating-point element types are legal.
-  Active inputs containing `+0` or `-0` follow the target's divide-style
-  exceptional behavior.
-
----
-
-## Activation
-
-### `pto.vrelu`
-
-- **syntax:** `%result = pto.vrelu %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** f16, f32
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] > 0) ? src[i] : 0;
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` holds `max(input[i], 0)` per active lane.
-- **constraints and limitations:** Only floating-point element types are legal
-  on the current A5 surface described here.
-
----
-
-## Bitwise
-
-### `pto.vnot`
-
-- **syntax:** `%result = pto.vnot %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** all integer types
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = ~src[i];
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` holds the lane-wise bitwise inversion.
-- **constraints and limitations:** Integer element types only.
-
----
-
-### `pto.vbcnt`
-
-- **syntax:** `%result = pto.vbcnt %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** all integer types
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = __builtin_popcount(src[i]);
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` holds the population count for each active lane.
-- **constraints and limitations:** Integer element types only. The count is
-  over the source element width, not over the full vector register.
-
----
-
-### `pto.vcls`
-
-- **syntax:** `%result = pto.vcls %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **A5 types:** all integer types
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = count_leading_sign_bits(src[i]);
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` holds the leading-sign-bit count per active lane.
-- **constraints and limitations:** Integer element types only. This operation is
-  sign-aware, so signed interpretation matters.
-
----
-
-## Movement
-
-### `pto.vmov`
-
-- **syntax:** `%result = pto.vmov %input, %mask : !pto.vreg<NxT>, !pto.mask -> !pto.vreg<NxT>`
-- **semantics:** Vector register copy.
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i];
-```
-
-- **inputs:** `%input` is the source vector and `%mask` selects active lanes.
-- **outputs:** `%result` is a copy of the source vector.
-- **constraints and limitations:** Predicated `pto.vmov` behaves like a masked
-  copy, while the unpredicated form behaves like a full-register copy.
-
----
-
-## Typical Usage
-
-```mlir
-// Softmax numerator: exp(x - max)
-%sub = pto.vsub %x, %max_broadcast, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// Reciprocal for division
-%sum_rcp = pto.vrec %sum, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-
-// ReLU activation
-%activated = pto.vrelu %linear_out, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
-```
diff --git a/docs/mkdocs/src/docs/isa/vector/unary-vector-ops_zh.md b/docs/mkdocs/src/docs/isa/vector/unary-vector-ops_zh.md
deleted file mode 100644
index 5a96fa20..00000000
--- a/docs/mkdocs/src/docs/isa/vector/unary-vector-ops_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Unary Vector Ops
-
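-This is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](unary-vector-ops.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.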
diff --git a/docs/mkdocs/src/docs/isa/vector/vec-scalar-ops.md b/docs/mkdocs/src/docs/isa/vector/vec-scalar-ops.md
deleted file mode 100644
index 961733d9..00000000
--- a/docs/mkdocs/src/docs/isa/vector/vec-scalar-ops.md
+++ /dev/null
@@ -1,265 +0,0 @@
-<!-- Generated from `docs/isa/vector/vec-scalar-ops.md` -->
-
-# Vector Families: Vector-Scalar Ops
-
-This page documents the `pto.v*` families that combine one vector register with one scalar operand. Scalar broadcasting, carry-chain rules, and active-lane behavior are architecture-visible and therefore documented here.
-
-> **Category:** Vector-scalar operations
-> **Pipeline:** PIPE_V (Vector Core)
-
-Operations that combine a vector with a scalar value, applying the scalar to every lane.
-
-## Common Operand Model
-
-- `%input` is the source vector register value.
-- `%scalar` is the scalar operand in SSA form.
-- `%mask` is the predicate operand.
-- `%result` is the destination vector register value.
-- For 32-bit scalar forms, the scalar source MUST satisfy the backend's legal
-  scalar-source constraints for this family.
-
----
-
-## Arithmetic
-
-### `pto.vadds`
-
-- **syntax:** `%result = pto.vadds %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] + scalar;
-```
-
-- **inputs:** `%input` is the source vector, `%scalar` is broadcast logically to
-  each active lane, and `%mask` selects active lanes.
-- **outputs:** `%result` is the lane-wise sum.
-- **constraints and limitations:** Inactive lanes follow the predication
-  behavior defined for this family. On the current surface, inactive lanes are
-  treated as zeroing lanes.
-
----
-
-### `pto.vsubs`
-
-- **syntax:** `%result = pto.vsubs %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] - scalar;
-```
-
-- **inputs:** `%input`, `%scalar`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise difference.
-- **constraints and limitations:** Integer or floating-point legality depends on
-  the selected type family in lowering.
-
----
-
-### `pto.vmuls`
-
-- **syntax:** `%result = pto.vmuls %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] * scalar;
-```
-
-- **inputs:** `%input`, `%scalar`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise product.
-- **constraints and limitations:** Supported element types are hardware-family
-  specific; the current PTO ISA vector surface documentation covers the common numeric cases.
-
----
-
-### `pto.vmaxs`
-
-- **syntax:** `%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] > scalar) ? src[i] : scalar;
-```
-
-- **inputs:** `%input`, `%scalar`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise maximum.
-- **constraints and limitations:** Input and result types MUST match.
-
----
-
-### `pto.vmins`
-
-- **syntax:** `%result = pto.vmins %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] < scalar) ? src[i] : scalar;
-```
-
-- **inputs:** `%input`, `%scalar`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise minimum.
-- **constraints and limitations:** Input and result types MUST match.
-
----
-
-## Bitwise
-
-### `pto.vands`
-
-- **syntax:** `%result = pto.vands %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] & scalar;
-```
-
-- **inputs:** `%input`, `%scalar`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise bitwise AND.
-- **constraints and limitations:** Integer element types only.
-
----
-
-### `pto.vors`
-
-- **syntax:** `%result = pto.vors %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] | scalar;
-```
-
-- **inputs:** `%input`, `%scalar`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise bitwise OR.
-- **constraints and limitations:** Integer element types only.
-
----
-
-### `pto.vxors`
-
-- **syntax:** `%result = pto.vxors %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] ^ scalar;
-```
-
-- **inputs:** `%input`, `%scalar`, and `%mask` as above.
-- **outputs:** `%result` is the lane-wise bitwise XOR.
-- **constraints and limitations:** Integer element types only.
-
----
-
-## Shift
-
-### `pto.vshls`
-
-- **syntax:** `%result = pto.vshls %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] << scalar;
-```
-
-- **inputs:** `%input` is the value vector, `%scalar` is the uniform shift
-  amount, and `%mask` selects active lanes.
-- **outputs:** `%result` is the shifted vector.
-- **constraints and limitations:** Integer element types only. The shift amount
-  SHOULD stay within the source element width.
-
----
-
-### `pto.vshrs`
-
-- **syntax:** `%result = pto.vshrs %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = src[i] >> scalar;
-```
-
-- **inputs:** `%input` is the value vector, `%scalar` is the uniform shift
-  amount, and `%mask` selects active lanes.
-- **outputs:** `%result` is the shifted vector.
-- **constraints and limitations:** Integer element types only.
-
----
-
-### `pto.vlrelu`
-
-- **syntax:** `%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg<NxT>, T, !pto.mask -> !pto.vreg<NxT>`
-
-```c
-for (int i = 0; i < N; i++)
-    dst[i] = (src[i] >= 0) ? src[i] : scalar * src[i];
-```
-
-- **inputs:** `%input` is the activation vector, `%scalar` is the leaky slope,
-  and `%mask` selects active lanes.
-- **outputs:** `%result` is the lane-wise leaky-ReLU result.
-- **constraints and limitations:** Only `f16` and `f32` forms are currently
-  documented for `pto.vlrelu`.
-
----
-
-## Carry Operations
-
-### `pto.vaddcs`
-
-- **syntax:** `%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask, !pto.mask -> !pto.vreg<NxT>, !pto.mask`
-- **semantics:** Add with carry-in and carry-out.
-
-```c
-for (int i = 0; i < N; i++) {
-    uint64_t r = (uint64_t)src0[i] + src1[i] + carry_in[i];
-    dst[i] = (T)r;
-    carry_out[i] = (r >> bitwidth);
-}
-```
-
-- **inputs:** `%lhs` and `%rhs` are the value vectors, `%carry_in` is the
-  incoming carry predicate, and `%mask` selects active lanes.
-- **outputs:** `%result` is the arithmetic result and `%carry` is the carry-out
-  predicate.
-- **constraints and limitations:** This is the scalar-extended carry-chain
-  family. Treat it as an unsigned integer operation unless the verifier states a
-  wider legal domain.
-
----
-
-### `pto.vsubcs`
-
-- **syntax:** `%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask, !pto.mask -> !pto.vreg<NxT>, !pto.mask`
-- **semantics:** Subtract with borrow-in and borrow-out.
-
-```c
-for (int i = 0; i < N; i++) {
-    dst[i] = src0[i] - src1[i] - borrow_in[i];
-    borrow_out[i] = ((uint64_t)src0[i] < (uint64_t)src1[i] + borrow_in[i]);
-}
-```
-
-- **inputs:** `%lhs` and `%rhs` are the value vectors, `%borrow_in` is the
-  incoming borrow predicate, and `%mask` selects active lanes.
-- **outputs:** `%result` is the arithmetic result and `%borrow` is the
-  borrow-out predicate.
-- **constraints and limitations:** This is the scalar-extended borrow-chain
-  family and SHOULD be treated as an unsigned integer operation.
-
----
-
-## Typical Usage
-
-```mlir
-// Add bias to all elements
-%biased = pto.vadds %activation, %bias_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Scale by constant
-%scaled = pto.vmuls %input, %scale, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Clamp to [0, 255] for uint8 quantization
-%clamped_low = pto.vmaxs %input, %c0, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-%clamped = pto.vmins %clamped_low, %c255, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32>
-
-// Shift right by fixed amount
-%shifted = pto.vshrs %data, %c4, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32>
-```
diff --git a/docs/mkdocs/src/docs/isa/vector/vec-scalar-ops_zh.md b/docs/mkdocs/src/docs/isa/vector/vec-scalar-ops_zh.md
deleted file mode 100644
index 4053f69e..00000000
--- a/docs/mkdocs/src/docs/isa/vector/vec-scalar-ops_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Vector-Scalar Ops
-
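-This is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vec-scalar-ops.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA has been rolled out, but the corresponding Chinese content has not yet been fully filled in under the new structure. Clicking this page in the Chinese navigation keeps you on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.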
diff --git a/docs/mkdocs/src/docs/isa/vector/vector-families.md b/docs/mkdocs/src/docs/isa/vector/vector-families.md
deleted file mode 100644
index 37466634..00000000
--- a/docs/mkdocs/src/docs/isa/vector/vector-families.md
+++ /dev/null
@@ -1,50 +0,0 @@
-<!-- Generated from `docs/isa/vector/vector-families.md` -->
-
-# Vector Families
-
-Vector-family documentation explains how `pto.v*` groups behave. Each family describes the shared mechanism, operand model, constraints, and target-profile narrowing before the reader drops into the standalone per-op pages under `vector/ops/`.
-
-## Overview
-
-| Family | Prefix | Description |
-|--------|--------|-------------|
-| [Vector Load/Store](./vector-load-store.md) | `pto.vlds`, `pto.vsts`, `pto.vgather2` | UB↔vector register transfer, gather/scatter |
-| [Predicate and Materialization](./predicate-and-materialization.md) | `pto.vbr`, `pto.vdup` | Vector broadcast and duplication |
-| [Unary Vector Ops](./unary-vector-ops.md) | `pto.vabs`, `pto.vneg`, `pto.vexp`, `pto.vsqrt` | Single-input elementwise operations |
-| [Binary Vector Ops](./binary-vector-ops.md) | `pto.vadd`, `pto.vsub`, `pto.vmul`, `pto.vcmp` | Two-input elementwise operations |
-| [Vec-Scalar Ops](./vec-scalar-ops.md) | `pto.vadds`, `pto.vmuls`, `pto.vshls` | Vector combined with scalar operand |
-| [Conversion Ops](./conversion-ops.md) | `pto.vci`, `pto.vcvt`, `pto.vtrc` | Type conversion between numeric types |
-| [Reduction Ops](./reduction-ops.md) | `pto.vcadd`, `pto.vcmax`, `pto.vcgadd` | Cross-lane reductions |
-| [Compare and Select](./compare-select.md) | `pto.vcmp`, `pto.vsel`, `pto.vselr` | Comparison and conditional selection |
-| [Data Rearrangement](./data-rearrangement.md) | `pto.vintlv`, `pto.vslide`, `pto.vpack` | Lane permutation and packing |
-| [SFU and DSA Ops](./sfu-and-dsa-ops.md) | `pto.vprelu`, `pto.vaxpy`, `pto.vtranspose` | Special function units and DSA ops |
-
-## Shared Constraints
-
-All vector families must state:
-
-1. **Vector length** — The lane count `N` for vector registers in this family
-2. **Predication model** — How inactive lanes are treated (zeroed, preserved, or undefined)
-3. **Type support** — Which element types are legal (varies by A2/A3 vs A5)
-4. **Target-profile narrowing** — Where profiles differ from each other and from the portable ISA contract
-
-## Common Operand Model
-
-All vector operations share a common operand model:
-
-- **`%input` / `%src0` / `%src1`** — Source vector register operands (`!pto.vreg<NxT>`)
-- **`%mask`** — Predicate operand for masking inactive lanes (`!pto.mask`)
-- **`%result` / `%dst`** — Destination vector register operand
-- **Scalar operands** — Immediate values, rounding modes, or scalar register operands
-
-Vector length `N` is a power of 2. The predicate mask width must match `N`.
-
-## Navigation
-
-See the [Vector ISA reference](./README.md) for the full per-op reference under `vector/ops/`.
-
-## See Also
-
-- [Vector instruction surface](../instruction-surfaces/vector-instructions.md) — High-level surface description
-- [Instruction families](./README.md) — All family groups
-- [Format of instruction descriptions](../reference/format-of-instruction-descriptions.md) — Per-op page standard
diff --git a/docs/mkdocs/src/docs/isa/vector/vector-load-store.md b/docs/mkdocs/src/docs/isa/vector/vector-load-store.md
deleted file mode 100644
index 92991fcd..00000000
--- a/docs/mkdocs/src/docs/isa/vector/vector-load-store.md
+++ /dev/null
@@ -1,525 +0,0 @@
-<!-- Generated from `docs/isa/vector/vector-load-store.md` -->
-
-# Vector Families: Vector Load/Store
-
-This page documents UB-to-vector-register data movement inside PTO ISA. The detailed forms below describe how `pto.v*` kernels move payloads between vector-visible UB storage and vector registers without crossing back into the tile surface.
-
-> **Category:** UB ↔ Vector Register data movement
-> **Pipeline:** PIPE_V (Vector Core)
-
-Vector loads move data from Unified Buffer (UB) to vector registers (`vreg`). Vector stores move data from `vreg` back to UB. All vector compute operates only on `vreg` — UB is the staging area between DMA and compute.
-
-## Common Operand Model
-
-- `%source` / `%dest` is the base address operand in SSA form. The base pointer
-  MUST address the Vector tile buffer / UB space.
-- `%offset` is the displacement operand in SSA form. The exact encoding is
-  instruction-specific, but the effective address and any post-update behavior
-  MUST match the selected instruction form.
-- `%mask` is the predicate operand for predicated memory families. For these
-  families, inactive lanes or inactive blocks MUST NOT issue memory
-  requests unless the instruction explicitly documents a different
-  behavior.
-- `%result` is the destination vector register value in SSA form.
-- `!pto.align` is the SSA carrier for alignment-buffer state used by unaligned
-  load/store families. The PTO ISA vector surface representation makes that state explicit rather than implicit.
-
----
-
-## Contiguous Loads
-
-### `pto.vlds`
-
-- **syntax:** `%result = pto.vlds %source[%offset] {dist = "DIST"} : !pto.ptr<T, ub> -> !pto.vreg<NxT>`
-- **semantics:** Vector load with distribution mode.
-- **inputs:**
-  `%source` is the UB base address, `%offset` is the load displacement, and
-  `DIST` selects the distribution mode.
-- **outputs:**
-  `%result` is the loaded vector register value.
-- **constraints and limitations:**
-  The effective address MUST satisfy the alignment rule of the selected
-  distribution mode. `NORM` reads one full vector footprint. Broadcast,
-  upsample, downsample, unpack, split-channel, and deinterleave modes change
-  how memory bytes are mapped into destination lanes, but they do not change the
-  fact that the source is UB memory.
-
-**Distribution modes:**
-
-| Mode | Description | C Semantics |
-|------|-------------|-------------|
-| `NORM` | Contiguous 256B load | `dst[i] = UB[base + i * sizeof(T)]` |
-| `BRC_B8/B16/B32` | Broadcast single element | `dst[i] = UB[base]` for all i |
-| `US_B8/B16` | Upsample (duplicate each element) | `dst[2*i] = dst[2*i+1] = UB[base + i]` |
-| `DS_B8/B16` | Downsample (every 2nd element) | `dst[i] = UB[base + 2*i]` |
-| `UNPK_B8/B16/B32` | Unpack (zero-extend to wider type) | `dst_i32[i] = (uint32_t)UB_i16[base + 2*i]` |
-| `SPLT4CHN_B8` | Split 4-channel (RGBA → R plane) | Extract every 4th byte |
-| `SPLT2CHN_B8/B16` | Split 2-channel | Extract every 2nd element |
-| `DINTLV_B32` | Deinterleave 32-bit | Even elements only |
-| `BLK` | Block load | Blocked access pattern |
-
-**Example — Contiguous load:**
-```mlir
-%v = pto.vlds %ub[%offset] {dist = "NORM"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-```
-
-**Example — Broadcast scalar to all lanes:**
-```mlir
-%v = pto.vlds %ub[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
-```
-
----
-
-### `pto.vldas`
-
-- **syntax:** `%result = pto.vldas %source : !pto.ptr<T, ub> -> !pto.align`
-- **semantics:** Prime alignment buffer for subsequent unaligned load.
-- **inputs:**
-  `%source` is the UB address whose surrounding aligned block seeds the load
-  alignment state.
-- **outputs:**
-  `%result` is the initialized load-alignment state.
-- **constraints and limitations:**
-  This op is the required leading operation for a `pto.vldus` stream using the
-  same alignment state. The source address itself need not be 32-byte aligned;
-  hardware truncates it to the aligned block boundary for the priming load.
-
----
-
-### `pto.vldus`
-
-- **syntax:** `%result, %align_out, %base_out = pto.vldus %source, %align : !pto.ptr<T, ub>, !pto.align -> !pto.vreg<NxT>, !pto.align, !pto.ptr<T, ub>`
-- **semantics:** Unaligned load using primed align state.
-- **inputs:**
-  `%source` is the current UB address and `%align` is the incoming load
-  alignment state primed by `pto.vldas` or a prior `pto.vldus`.
-- **outputs:**
-  `%result` is the assembled vector value, `%align_out` is the updated alignment
-  state, and `%base_out` is the post-update base pointer state exposed in SSA
-  form.
-- **constraints and limitations:**
-  A matching `pto.vldas` MUST appear before the first dependent `pto.vldus`
-  stream in the same vector loop. Both the alignment state and the base address
-  advance across the stream, and the PTO ISA vector surface representation exposes those updates as SSA results.
-
-**Unaligned load pattern:**
-```mlir
-%align = pto.vldas %ub : !pto.ptr<f32, ub> -> !pto.align
-%vec, %align2, %ub2 = pto.vldus %ub, %align : !pto.ptr<f32, ub>, !pto.align -> !pto.vreg<64xf32>, !pto.align, !pto.ptr<f32, ub>
-```
-
----
-
-## Dual Loads (Deinterleave)
-
-### `pto.vldx2`
-
-- **syntax:** `%low, %high = pto.vldx2 %source[%offset], "DIST" : !pto.ptr<T, ub>, index -> !pto.vreg<NxT>, !pto.vreg<NxT>`
-- **semantics:** Dual load with deinterleave (AoS → SoA conversion).
-- **inputs:**
-  `%source` is the UB base pointer, `%offset` is the displacement, and `DIST`
-  selects a dual-load/deinterleave layout.
-- **outputs:**
-  `%low` and `%high` are the two destination vectors.
-- **constraints and limitations:**
-  This family is only legal for interleave/deinterleave style distributions.
-  The two outputs form an ordered pair, and that pairing MUST be preserved.
-
-**Distribution modes:** `DINTLV_B8`, `DINTLV_B16`, `DINTLV_B32`, `BDINTLV`
-
-```c
-// DINTLV_B32: deinterleave 32-bit elements
-for (int i = 0; i < 64; i++) {
-    low[i]  = UB[base + 8*i];       // even elements
-    high[i] = UB[base + 8*i + 4];   // odd elements
-}
-```
-
-**Example — Load interleaved XY pairs into separate X/Y vectors:**
-```mlir
-%x, %y = pto.vldx2 %ub[%offset], "DINTLV_B32" : !pto.ptr<f32, ub>, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32>
-```
-
----
-
-## Strided Loads
-
-### `pto.vsld`
-
-- **syntax:** `%result = pto.vsld %source[%offset], "STRIDE" : !pto.ptr<T, ub> -> !pto.vreg<NxT>`
-- **semantics:** Strided load with fixed stride pattern.
-- **inputs:**
-  `%source` is the UB base pointer and `%offset` is the displacement encoded
-  with the selected fixed stride mode.
-- **outputs:**
-  `%result` is the loaded vector.
-- **constraints and limitations:**
-  This is a deprecated compatibility family. The selected stride token
-  determines which sub-elements are read from each source block.
-
-**Stride modes:** `STRIDE_S3_B16`, `STRIDE_S4_B64`, `STRIDE_S8_B32`, `STRIDE_S2_B64`
-
----
-
-### `pto.vsldb`
-
-- **syntax:** `%result = pto.vsldb %source, %offset, %mask : !pto.ptr<T, ub>, i32, !pto.mask -> !pto.vreg<NxT>`
-- **semantics:** Block-strided load for 2D tile access.
-- **inputs:**
-  `%source` is the UB base pointer, `%offset` is the packed stride/control word,
-  and `%mask` controls which blocks participate.
-- **outputs:**
-  `%result` is the loaded vector.
-- **constraints and limitations:**
-  `%offset` is not a plain byte displacement; it encodes the block stride and
-  repeat pattern. If a block is masked off, the corresponding destination block
-  is zeroed and MUST NOT raise an address overflow exception for that block.
-
----
-
-## Gather (Indexed) Loads
-
-### `pto.vgather2`
-
-- **syntax:** `%result = pto.vgather2 %source, %offsets, %active_lanes : !pto.ptr<T, ub>, !pto.vreg<NxI>, index -> !pto.vreg<NxT>`
-- **semantics:** Indexed gather from UB.
-- **inputs:**
-  `%source` is the UB base pointer, `%offsets` provides per-lane element
-  offsets, and `%active_lanes` bounds how many lanes participate.
-- **outputs:**
-  `%result` is the gathered vector.
-- **constraints and limitations:**
-  Only the first `%active_lanes` indices participate. The index element width
-  and interpretation MUST match the selected gather form, and each effective
-  address MUST satisfy that form's alignment rules.
-
-```c
-for (int i = 0; i < active_lanes; i++)
-    dst[i] = UB[base + offsets[i] * sizeof(T)];
-```
-
----
-
-### `pto.vgatherb`
-
-- **syntax:** `%result = pto.vgatherb %source, %offsets, %active_lanes : !pto.ptr<T, ub>, !pto.vreg<NxI>, index -> !pto.vreg<NxT>`
-- **semantics:** Byte-granularity indexed gather from UB.
-- **inputs:**
-  `%source` is the UB base pointer, `%offsets` contains per-block byte offsets,
-  and `%active_lanes` bounds the number of active gathered blocks.
-- **outputs:**
-  `%result` is the gathered vector.
-- **constraints and limitations:**
-  This is a block gather, not a byte-per-lane gather. `%source` MUST be 32-byte
-  aligned, each participating offset MUST describe a 32-byte-aligned block, and
-  inactive blocks are zero-filled.
-
-```c
-for (int i = 0; i < active_lanes; i++)  // one 32-byte block per offset, not one lane
-    memcpy((uint8_t *)dst + 32 * i, &UB[base + offsets[i]], 32);  // offsets[i] is a byte offset
-```
-
----
-
-### `pto.vgather2_bc`
-
-- **syntax:** `%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr<T, ub>, !pto.vreg<NxI>, !pto.mask -> !pto.vreg<NxT>`
-- **semantics:** Gather with broadcast, conditioned by mask.
-- **inputs:**
-  `%source` is the UB base pointer, `%offsets` contains gather indices, and
-  `%mask` gates which lanes participate.
-- **outputs:**
-  `%result` is the gathered vector.
-- **constraints and limitations:**
-  This is a backward-compatible family. Masked-off lanes do not participate in
-  address coalescing and do not trigger address overflow exceptions; their
-  destination lanes are zero-filled.
-
----
-
-## Contiguous Stores
-
-### `pto.vsts`
-
-- **syntax:** `pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg<NxT>, !pto.ptr<T, ub>, !pto.mask`
-- **semantics:** Vector store with distribution mode.
-- **inputs:**
-  `%value` is the source vector, `%dest` is the UB base pointer, `%offset` is
-  the displacement, `%mask` selects the active lanes or sub-elements, and
-  `DIST` selects the store distribution.
-- **outputs:**
-  This op has no SSA result; it writes to UB memory.
-- **constraints and limitations:**
-  The effective destination address MUST satisfy the alignment rule of the
-  selected store mode. Narrowing/packing modes may only preserve a subset of the
-  source bits. Merge-channel modes reinterpret the source vector as channel
-  planes and interleave them on store.
-
-**Distribution modes:**
-
-| Mode | Description | C Semantics |
-|------|-------------|-------------|
-| `NORM_B8/B16/B32` | Contiguous store | `UB[base + i] = src[i]` |
-| `PK_B16/B32` | Pack/narrowing store | `UB_i16[base + 2*i] = truncate_16(src_i32[i])` |
-| `MRG4CHN_B8` | Merge 4 channels (R,G,B,A → RGBA) | Interleave 4 planes |
-| `MRG2CHN_B8/B16` | Merge 2 channels | Interleave 2 planes |
-
-**Example — Contiguous store:**
-```mlir
-pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask
-```
-
----
-
-## Dual Stores (Interleave)
-
-### `pto.vstx2`
-
-- **syntax:** `pto.vstx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.ptr<T, ub>, index, !pto.mask`
-- **semantics:** Dual interleaved store (SoA → AoS conversion).
-- **inputs:**
-  `%low` and `%high` are the two source vectors, `%dest` is the UB base pointer,
-  `%offset` is the displacement, `DIST` selects the interleave layout, and
-  `%mask` gates the participating elements.
-- **outputs:**
-  This op has no SSA result; it writes an interleaved stream to UB.
-- **constraints and limitations:**
-  This family is only legal for interleave distributions. The two source
-  vectors form an ordered pair, and the interleave semantics of that pair MUST
-  be preserved.
-
-**Distribution modes:** `INTLV_B8`, `INTLV_B16`, `INTLV_B32`
-
-```c
-// INTLV_B32:
-for (int i = 0; i < 64; i++) {
-    UB[base + 8*i]     = low[i];
-    UB[base + 8*i + 4] = high[i];
-}
-```
-
----
-
-## Strided Stores
-
-### `pto.vsst`
-
-- **syntax:** `pto.vsst %value, %dest[%offset], "STRIDE" : !pto.vreg<NxT>, !pto.ptr<T, ub>`
-- **semantics:** Strided store with fixed stride pattern.
-- **inputs:**
-  `%value` is the source vector, `%dest` is the UB base pointer, and `%offset`
-  / `STRIDE` select the fixed strided layout.
-- **outputs:**
-  This op writes UB memory and returns no SSA value.
-- **constraints and limitations:**
-  This is a deprecated compatibility family. The stride token, not the vector
-  lane number alone, determines which destination elements are written.
-
----
-
-### `pto.vsstb`
-
-- **syntax:** `pto.vsstb %value, %dest, %offset, %mask : !pto.vreg<NxT>, !pto.ptr<T, ub>, i32, !pto.mask`
-- **semantics:** Block-strided store for 2D tile access.
-- **inputs:**
-  `%value` is the source vector, `%dest` is the UB base pointer, `%offset` is
-  the packed stride/control word, and `%mask` controls block participation.
-- **outputs:**
-  This op writes UB memory and returns no SSA value.
-- **constraints and limitations:**
-  `%offset` is a control word, not a plain byte displacement. This is a
-  deprecated compatibility family kept for surface coverage.
-
----
-
-## Scatter (Indexed) Stores
-
-### `pto.vscatter`
-
-- **syntax:** `pto.vscatter %value, %dest, %offsets, %active_lanes : !pto.vreg<NxT>, !pto.ptr<T, ub>, !pto.vreg<NxI>, index`
-- **semantics:** Indexed scatter to UB.
-- **inputs:**
-  `%value` is the source vector, `%dest` is the UB base pointer, `%offsets`
-  provides per-lane or per-block indices, and `%active_lanes` bounds the active
-  requests.
-- **outputs:**
-  This op writes UB memory and returns no SSA value.
-- **constraints and limitations:**
-  Only `b8`, `b16`, and `b32` element sizes are supported. The index vector
-  MUST use a supported integer element type and layout for this family.
-  Each computed address MUST be element-aligned. If two or more indices alias,
-  only one write is guaranteed and the winning lane is implementation-defined.
-
-```c
-for (int i = 0; i < active_lanes; i++)
-    UB[base + offsets[i] * sizeof(T)] = src[i];
-```
-
----
-
-## Alignment State Stores
-
-### `pto.vsta`
-
-- **syntax:** `pto.vsta %value, %dest[%offset] : !pto.align, !pto.ptr<T, ub>, index`
-- **semantics:** Flush alignment state to memory.
-- **inputs:**
-  `%value` is the pending store-alignment state, `%dest` is the UB base pointer,
-  and `%offset` is the flush displacement.
-- **outputs:**
-  This op writes buffered tail bytes to UB and returns no SSA value.
-- **constraints and limitations:**
-  The flush address MUST match the post-updated address expected by the
-  preceding unaligned-store stream. After the flush, the corresponding store
-  alignment state is consumed.
-
----
-
-### `pto.vstas`
-- **syntax:** `pto.vstas %value, %dest, %offset : !pto.align, !pto.ptr<T, ub>, i32`
-- **semantics:** Scalar-register-offset form of alignment-state flush.
-- **inputs:**
-  `%value` is the pending store-alignment state, `%dest` is the UB base
-  pointer, and `%offset` is the scalar-register style displacement.
-- **outputs:**
-  This op writes buffered tail bytes to UB and returns no SSA value.
-- **constraints and limitations:**
-  This family uses the same buffered-tail semantics as `pto.vsta` but keeps the
-  scalar-offset form explicit.
-
----
-
-### `pto.vstar`
-- **syntax:** `pto.vstar %value, %dest : !pto.align, !pto.ptr<T, ub>`
-- **semantics:** Flush alignment state using the register-update form.
-- **inputs:**
-  `%value` is the pending store-alignment state and `%dest` is the UB base
-  pointer.
-- **outputs:**
-  This op writes buffered tail bytes to UB and returns no SSA value.
-- **constraints and limitations:**
-  The implicit update state consumed by this flush MUST correspond to the same
-  store stream that produced `%value`.
-
----
-
-### `pto.vstu`
-- **syntax:** `%align_out, %base_out = pto.vstu %align_in, %base_in, %value, %dest, %mode : !pto.align, !pto.ptr<T, ub>, !pto.vreg<NxT>, !pto.ptr<T, ub>, index -> !pto.align, !pto.ptr<T, ub>`
-- **semantics:** Unaligned store with explicit threaded alignment/base state.
-- **inputs:**
-  `%align_in` is the incoming store-alignment state, `%base_in` is the current
-  stream base, `%value` is the vector to store, `%dest` is the UB base pointer,
-  and `%mode` selects the post-update behavior.
-- **outputs:**
-  `%align_out` is the updated buffered-tail state and `%base_out` is the
-  post-update base pointer state.
-- **constraints and limitations:**
-  This op models a stateful unaligned-store sequence in SSA form. A final
-  `pto.vsta` / `pto.vstas` / `pto.vstar` is still required to flush the trailing
-  buffered bytes.
-
----
-
-### `pto.vstus`
-- **syntax:** `%align_out, %base_out = pto.vstus %align_in, %base_in, %value, %dest, %offset : !pto.align, !pto.ptr<T, ub>, !pto.vreg<NxT>, !pto.ptr<T, ub>, i32 -> !pto.align, !pto.ptr<T, ub>`
-- **semantics:** Scalar-offset unaligned store with threaded state.
-- **inputs:**
-  Same roles as `pto.vstu`, but `%offset` is provided explicitly as the scalar
-  displacement.
-- **outputs:**
-  Updated alignment state and base state.
-- **constraints and limitations:**
-  The same final flush requirement and state-threading constraints as
-  `pto.vstu` apply.
-
----
-
-### `pto.vstur`
-- **syntax:** `%align_out = pto.vstur %align_in, %value, %dest : !pto.align, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align`
-- **semantics:** Register-update unaligned store form.
-- **inputs:**
-  `%align_in` is the incoming store-alignment state, `%value` is the vector to
-  store, and `%dest` is the UB base pointer.
-- **outputs:**
-  `%align_out` is the updated buffered-tail state.
-- **constraints and limitations:**
-  This op updates only the residual alignment state. A matching flush op is
-  still required to emit the trailing bytes.
-
-- **syntax:** `pto.vstas %value, %dest, %offset : !pto.align, !pto.ptr<T, ub>, i32`
-- **semantics:** Flush alignment state with scalar offset.
-
----
-
-### `pto.vstar`
-
-- **syntax:** `pto.vstar %value, %dest : !pto.align, !pto.ptr<T, ub>`
-- **semantics:** Flush remaining alignment state.
-- **inputs:**
-  `%value` is the pending alignment/buffer state that still needs to be emitted,
-  and `%dest` is the UB destination base pointer.
-- **outputs:**
-  No SSA result. The effect is a memory-side flush that writes the remaining
-  buffered bytes to memory.
-- **constraints and limitations:**
-  This op terminates an unaligned-store sequence. It MUST be paired with a
-  compatible prior state-producing store sequence so that the pending tail state
-  is well-defined.
-
----
-
-## Stateful Store Ops
-
-These ops make reference-updated state explicit as SSA results.
-
-### `pto.vstu`
-
-- **syntax:** `%align_out, %offset_out = pto.vstu %align_in, %offset_in, %value, %base, "MODE" : !pto.align, index, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align, index`
-- **semantics:** Unaligned store with align + offset state update.
-- **inputs:**
-  `%align_in` is the incoming store-alignment state, `%offset_in` is the current
-  logical byte/element displacement, `%value` is the vector being stored, and
-  `%base` is the UB base pointer.
-- **outputs:**
-  `%align_out` is the updated alignment/tail state and `%offset_out` is the
-  next offset after applying the selected post-update rule.
-- **constraints and limitations:**
-  The alignment state MUST be threaded in program order. A terminating flush
-  form such as `pto.vstar`/`pto.vstas` is still required to commit the buffered
-  tail bytes.
-
-**Mode tokens:** `POST_UPDATE`, `NO_POST_UPDATE`
-
----
-
-### `pto.vstus`
-
-- **syntax:** `%align_out, %base_out = pto.vstus %align_in, %offset, %value, %base, "MODE" : !pto.align, i32, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align, !pto.ptr<T, ub>`
-- **semantics:** Unaligned store with scalar offset and state update.
-- **inputs:**
-  `%align_in` is the incoming store-alignment state, `%offset` is the scalar
-  displacement, `%value` is the vector being stored, and `%base` is the UB base
-  pointer.
-- **outputs:**
-  `%align_out` is the updated buffered-tail state and `%base_out` is the next
-  base pointer when the lowering chooses a post-update form.
-- **constraints and limitations:**
-  This is the scalar-offset stateful form of the unaligned store family. The
-  scalar offset width and update mode MUST match the selected form, and a later
-  flush op is still required.
-
----
-
-### `pto.vstur`
-
-- **syntax:** `%align_out = pto.vstur %align_in, %value, %base, "MODE" : !pto.align, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align`
-- **semantics:** Unaligned store with residual flush and state update.
-- **inputs:**
-  `%align_in` is the incoming store-alignment state, `%value` is the vector to
-  store, and `%base` is the UB base pointer.
-- **outputs:**
-  `%align_out` is the updated residual state after the current partial store.
-- **constraints and limitations:**
-  This form exposes only the evolving state; it does not by itself guarantee
-  that all buffered bytes have reached memory. A compatible final flush is still
-  required unless the surrounding sequence is known to be complete.
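The `DINTLV_B32` and `INTLV_B32` loops given for `pto.vldx2` and `pto.vstx2` in the removed page can be checked on a host with a small model. The sketch below is only an illustration under stated assumptions: `UnifiedBuffer`, `Vreg`, `storeInterleavedB32`, and `loadDeinterleavedB32` are hypothetical names, UB is modeled as a flat byte array, and masks, alignment rules, and the 64-lane register width are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical host-side model, not part of PTO: UB is a flat byte array
// and a vector register is a std::vector<float>.
using UnifiedBuffer = std::vector<std::uint8_t>;
using Vreg = std::vector<float>;

static_assert(sizeof(float) == 4, "model assumes 32-bit float lanes");

// Mirrors the INTLV_B32 loop: low[i] and high[i] land in adjacent 4-byte slots.
void storeInterleavedB32(UnifiedBuffer& ub, std::size_t base,
                         const Vreg& low, const Vreg& high) {
    assert(low.size() == high.size());
    assert(base + 8 * low.size() <= ub.size());
    for (std::size_t i = 0; i < low.size(); ++i) {
        std::memcpy(&ub[base + 8 * i], &low[i], 4);
        std::memcpy(&ub[base + 8 * i + 4], &high[i], 4);
    }
}

// Mirrors the DINTLV_B32 loop: even elements go to .first, odd ones to .second.
std::pair<Vreg, Vreg> loadDeinterleavedB32(const UnifiedBuffer& ub,
                                           std::size_t base, std::size_t lanes) {
    assert(base + 8 * lanes <= ub.size());
    Vreg low(lanes), high(lanes);
    for (std::size_t i = 0; i < lanes; ++i) {
        std::memcpy(&low[i], &ub[base + 8 * i], 4);
        std::memcpy(&high[i], &ub[base + 8 * i + 4], 4);
    }
    return std::make_pair(low, high);
}
```

Storing an X/Y pair with the interleaving helper and reading it back with the deinterleaving helper should return the original two vectors, which is the ordered-pair property both instruction descriptions require.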
diff --git a/docs/mkdocs/src/docs/isa/vector/vector-load-store_zh.md b/docs/mkdocs/src/docs/isa/vector/vector-load-store_zh.md
deleted file mode 100644
index 55f83d41..00000000
--- a/docs/mkdocs/src/docs/isa/vector/vector-load-store_zh.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Vector Families: Vector Load/Store
-
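-This page is an auto-generated Chinese entry page that keeps Chinese navigation on the Chinese path.
-
-## Current Status
-
-- [Corresponding English page](vector-load-store.md)
-- [Chinese manual entry](../../PTO-Virtual-ISA-Manual_zh.md)
-- [Chinese ISA instruction reference entry](../README_zh.md)
-
-## Notes
-
-The new English manual structure for PTO ISA is already in place, but the corresponding Chinese text has not yet been fully filled in to match it. When you open this page from the Chinese navigation you stay on the Chinese path; for full details, use the English page above or the existing Chinese reference pages.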
diff --git a/include/pto/README.md b/include/pto/README.md
index 7c509eaa..3c85bd29 100644
--- a/include/pto/README.md
+++ b/include/pto/README.md
@@ -1,33 +1,85 @@
 # include/pto/
 
-This is the primary public header entry for PTO Tile Lib. It contains:
+This is the primary public header entry for PTO Tile Library. It contains:
 
-- The Tile type system and shared utilities
+- Tile type system and shared utilities
 - PTO instruction API declarations (Auto/Manual forms)
-- CPU simulation/stub support
+- CPU simulation / stub support
 - NPU instruction implementations (split by SoC generation)
 
 ## Recommended Include
 
-- `include/pto/pto-inst.hpp`: Unified entry header (recommended for upper-layer code)
-
-In CPU simulation scenarios, this header can include CPU stubs (for example, when `__CPU_SIM` is defined it pulls in `pto/common/cpu_stub.hpp`).
+| Scenario | Recommended Header |
+|----------|-------------------|
+| Upper-layer code (general) | `include/pto/pto-inst.hpp` — unified entry |
+| CPU simulation | `include/pto/pto-inst.hpp` with `__CPU_SIM` defined, which also pulls in `pto/common/cpu_stub.hpp` |
 
 ## Layout
 
-- `common/`: Platform-independent Tile and instruction infrastructure
-  - `pto_tile.hpp`: Core Tile types and layout
-  - `pto_instr.hpp`, `pto_instr_impl.hpp`: Instruction declarations and shared implementations
-  - `memory.hpp`, `constants.hpp`, `utils.hpp`, `type.hpp`: Common utilities and constants
-- `cpu/`: CPU-side simulation/debug support (if enabled)
-- `npu/`: NPU-side implementations, split by SoC version
-  - `npu/a2a3/`: Ascend A2/A3 series
-  - `npu/a5/`: Ascend A5 series
-- `comm/`: Communication instruction library
-  - `pto_comm_inst.hpp`: Unified entry header for communication instructions
-  - `comm_types.hpp`: Core type definitions for communication instructions
-  - `pto_comm_instr_impl.hpp`: Platform dispatch layer for communication instructions
+```
+include/pto/
+├── pto-inst.hpp              # Unified entry header (recommended)
+├── pto.hpp                    # Core header (includes pto-inst.hpp)
+│
+├── common/                    # Platform-independent infrastructure
+│   ├── pto_tile.hpp          # Core Tile types and layout
+│   ├── pto_instr.hpp         # PTO instruction declarations
+│   ├── pto_instr_impl.hpp    # PTO instruction shared implementations
+│   ├── memory.hpp             # Memory operations
+│   ├── constants.hpp          # Constant definitions
+│   ├── utils.hpp              # General utilities
+│   ├── type.hpp               # Type definitions
+│   └── cpu_stub.hpp           # CPU simulation stub
+│
+├── cpu/                       # CPU-side simulation (if enabled)
+│
+└── npu/                      # NPU-side implementations (split by SoC)
+    ├── a2a3/                 # Ascend A2/A3 series
+    │   ├── TAdd.hpp          # TADD implementation
+    │   ├── TMatmul.hpp       # TMATMUL implementation
+    │   ├── TLoad.hpp         # TLOAD implementation
+    │   └── ...               # Other instruction implementations
+    └── a5/                   # Ascend A5 series
+        ├── TAdd.hpp
+        ├── TMatmul.hpp
+        ├── TLoad.hpp
+        └── ...
+```
+
+## Common TileType ↔ Hardware Buffer Mapping
+
+| TileType | Hardware Buffer | Capacity | Typical Use |
+|----------|----------------|----------|-------------|
+| `Vec` | Unified Buffer (UB) | 256 KB | General elementwise operations |
+| `Mat` | L1 | 512 KB | Matrix multiply operands |
+| `Left` | L0A | 64 KB | Matmul A operand |
+| `Right` | L0B | 64 KB | Matmul B operand |
+| `Acc` | L0C | 256 KB | Matmul accumulator |
+
+## Typical Usage
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+// Define GlobalTensor (T, kGRows_, kGCols_ are assumed to be parameters of the enclosing kernel template)
+using DynShape = Shape<1, 1, 1, kGRows_, kGCols_>;
+using DynStride = Stride<1, 1, 1, kGCols_, 1>;
+using GlobalData = GlobalTensor<T, DynShape, DynStride>;
+GlobalData srcGlobal(src);
+
+// Define Tile
+using TileData = Tile<TileType::Vec, T, kTRows_, kTCols_, BLayout::RowMajor, -1, -1>;
+TileData srcTile(kTRows_, kTCols_);
+
+// PTO instruction calls (src0Tile, src1Tile, dstTile and dstGlobal are declared like srcTile and srcGlobal above)
+TLOAD(srcTile, srcGlobal);
+TADD(dstTile, src0Tile, src1Tile);
+TSTORE(dstGlobal, dstTile);
+```
 
 ## Related Docs
 
-- Instruction reference: `docs/isa/`
+- [ISA Instruction Reference](../../docs/isa/): Full PTO ISA instruction semantics
+- [Tile Programming Model](../../docs/coding/Tile.md): Tile type system in depth
+- [include/pto/README_zh.md](./README_zh.md): Chinese version
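The TileType table added to `include/pto/README.md` lends itself to a quick capacity check before picking tile dimensions. The sketch below is a hedged illustration: `BufferKind`, `bufferCapacityBytes`, and `tileFits` are hypothetical helpers that are not part of the PTO headers, and the capacities are copied from that table without confirming that they hold for every SoC generation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the TileType column; this is not the library's enum.
enum class BufferKind { Vec, Mat, Left, Right, Acc };

// Capacities in bytes, copied from the table (UB, L1, L0A, L0B, L0C).
inline std::size_t bufferCapacityBytes(BufferKind kind) {
    const std::size_t kib = 1024;
    switch (kind) {
        case BufferKind::Vec:   return 256 * kib;
        case BufferKind::Mat:   return 512 * kib;
        case BufferKind::Left:  return 64 * kib;
        case BufferKind::Right: return 64 * kib;
        case BufferKind::Acc:   return 256 * kib;
    }
    return 0;  // not reached for valid enumerators
}

// True when `copies` tiles of rows x cols elements of type T fit in the buffer.
template <typename T>
bool tileFits(BufferKind kind, std::size_t rows, std::size_t cols,
              std::size_t copies = 1) {
    assert(rows > 0 && cols > 0 && copies > 0);
    return rows * cols * sizeof(T) * copies <= bufferCapacityBytes(kind);
}
```

For example, a 128 x 256 `float` tile occupies 128 KB, so two copies (as in double buffering) exactly fill the 256 KB listed for `Vec`, while a third copy would not fit.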
diff --git a/include/pto/README_zh.md b/include/pto/README_zh.md
index 0eebf636..587da742 100644
--- a/include/pto/README_zh.md
+++ b/include/pto/README_zh.md
@@ -1,34 +1,85 @@
 # include/pto/
 
-This directory is the primary public header entry for PTO Tile Lib. It contains:
+This directory is the **primary public header entry** for PTO Tile Library. It contains:
 
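 - Tile type system and shared utilities
-- PTO instruction API declarations (Auto/Manual forms)
-- CPU simulation/Stub support
+- PTO instruction API declarations (Auto / Manual forms)
+- CPU simulation / stub support
 - NPU instruction implementations (split by SoC generation)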
 
-## Recommended Include
+## Recommended Way to Include
 
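-- `include/pto/pto-inst.hpp`: Unified entry header (upper-layer code should include this file directly)
-
-In CPU simulation scenarios, this header includes CPU stubs (for example, when `__CPU_SIM` is defined it pulls in `pto/common/cpu_stub.hpp`).
+| Scenario | Recommended Header |
+|------|-----------|
+| Upper-layer code (general) | `include/pto/pto-inst.hpp`, the unified entry |
+| CPU simulation | `include/pto/pto-inst.hpp` with `__CPU_SIM` defined, which also pulls in `pto/common/cpu_stub.hpp` |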
 
 ## Layout
 
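-- `common/`: Platform-independent Tile and instruction infrastructure
-  - `pto_tile.hpp`: Core Tile types and layout
-  - `pto_instr.hpp`, `pto_instr_impl.hpp`: Instruction declarations and shared implementations
-  - `memory.hpp`, `constants.hpp`, `utils.hpp`, `type.hpp`: Common utilities and constants
-- `cpu/`: CPU-side simulation/debug support (if enabled)
-- `npu/`: NPU-side implementations (split by SoC version)
-  - `npu/a2a3/`: Ascend A2/A3 series
-  - `npu/a5/`: Ascend A5 series
-- `comm/`: Communication instruction library
-  - `pto_comm_inst.hpp`: Unified entry header for communication instructions
-  - `comm_types.hpp`: Core type definitions for communication instructions
-  - `pto_comm_instr_impl.hpp`: Platform dispatch layer for communication instructions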
+```
+include/pto/
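+├── pto-inst.hpp              # Unified entry header (recommended)
+├── pto.hpp                    # Core header (includes pto-inst.hpp)
+│
+├── common/                    # Platform-independent infrastructure
+│   ├── pto_tile.hpp          # Core Tile types and layout
+│   ├── pto_instr.hpp         # PTO instruction declarations
+│   ├── pto_instr_impl.hpp    # PTO instruction shared implementations
+│   ├── memory.hpp             # Memory operations
+│   ├── constants.hpp          # Constant definitions
+│   ├── utils.hpp              # General utilities
+│   ├── type.hpp               # Type definitions
+│   └── cpu_stub.hpp           # CPU simulation stub
+│
+├── cpu/                       # CPU-side simulation (if enabled)
+│
+└── npu/                      # NPU-side implementations (split by SoC version)
+    ├── a2a3/                 # Ascend A2/A3 series
+    │   ├── TAdd.hpp          # TADD implementation
+    │   ├── TMatmul.hpp       # TMATMUL implementation
+    │   ├── TLoad.hpp         # TLOAD implementation
+    │   └── ...               # Other instruction implementations
+    └── a5/                   # Ascend A5 series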
+        ├── TAdd.hpp
+        ├── TMatmul.hpp
+        ├── TLoad.hpp
+        └── ...
+```
+
+## Common TileType ↔ Hardware Buffer Mapping
+
+| TileType | Hardware Buffer | Capacity | Typical Use |
+|----------|------------|------|----------|
+| `Vec` | Unified Buffer (UB) | 256 KB | General elementwise operations |
+| `Mat` | L1 | 512 KB | Matrix multiply operands |
+| `Left` | L0A | 64 KB | Matmul A operand |
+| `Right` | L0B | 64 KB | Matmul B operand |
+| `Acc` | L0C | 256 KB | Matmul accumulator |
+
+## Typical Usage
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
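+// Define GlobalTensor (T, kGRows_, kGCols_ are assumed to be parameters of the enclosing kernel template)
+using DynShape = Shape<1, 1, 1, kGRows_, kGCols_>;
+using DynStride = Stride<1, 1, 1, kGCols_, 1>;
+using GlobalData = GlobalTensor<T, DynShape, DynStride>;
+GlobalData srcGlobal(src);
+
+// Define Tile
+using TileData = Tile<TileType::Vec, T, kTRows_, kTCols_, BLayout::RowMajor, -1, -1>;
+TileData srcTile(kTRows_, kTCols_);
+
+// PTO instruction calls (src0Tile, src1Tile, dstTile and dstGlobal are declared like srcTile and srcGlobal above)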
+TLOAD(srcTile, srcGlobal);
+TADD(dstTile, src0Tile, src1Tile);
+TSTORE(dstGlobal, dstTile);
+```
 
 ## Related Docs
 
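-- Instruction reference: `docs/isa/`
-- Communication instruction reference: `docs/isa/comm/`
+- [ISA Instruction Reference](../../docs/isa/): Full PTO ISA instruction semantics
+- [Tile Programming Model](../../docs/coding/Tile_zh.md): Tile type system in depth
+- [include/pto/README_zh.md](./README_zh.md): Chinese version entry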
diff --git a/include/pto/npu/README.md b/include/pto/npu/README.md
index 53f44b4a..153844a0 100644
--- a/include/pto/npu/README.md
+++ b/include/pto/npu/README.md
@@ -2,10 +2,40 @@
 
 NPU-side PTO instruction implementations. Different SoC generations have different optimized implementations and pipeline details.
 
+## Choose by SoC
+
+| SoC Generation | Directory | Description |
+|----------------|----------|-------------|
+| Ascend A2/A3 | `a2a3/` | Ascend 910B / 910C series implementations |
+| Ascend A5 | `a5/` | Ascend 950 series implementations |
+| Kirin 9030 | `kirin9030/` | Kirin 9030-specific implementations |
+| Kirin X90 | `kirinX90/` | Kirin X90-specific implementations |
+
 ## Layout
 
-- `a2a3/`: Ascend A2/A3 implementations (e.g., `TAdd.hpp`, `TMatmul.hpp`, `TLoad.hpp`)
-- `a5/`: Ascend A5 implementations (e.g., `TAdd.hpp`, `TMatmul.hpp`, `TLoad.hpp`)
+```
+include/pto/npu/
+├── a2a3/                      # Ascend A2/A3 (910B/910C) series
+│   ├── TAdd.hpp              # TADD implementation
+│   ├── TSub.hpp              # TSUB implementation
+│   ├── TMul.hpp              # TMUL implementation
+│   ├── TMatmul.hpp           # TMATMUL implementation
+│   ├── TLoad.hpp             # TLOAD implementation
+│   ├── TStore.hpp            # TSTORE implementation
+│   └── ...                    # Other instruction implementations
+│
+├── a5/                        # Ascend A5 (950) series
+│   ├── TAdd.hpp
+│   ├── TSub.hpp
+│   ├── TMul.hpp
+│   ├── TMatmul.hpp
+│   ├── TLoad.hpp
+│   ├── TStore.hpp
+│   └── ...                    # Other implementations (A5-specific ops like TMATMUL_MX)
+│
+├── kirin9030/                 # Kirin 9030
+└── kirinX90/                  # Kirin X90
+```
 
 ## Selecting the SoC Version
 
@@ -14,4 +44,20 @@ SoC selection is controlled by the build system and test scripts:
 - `tests/script/run_st.py` / `tests/script/build_st.py`: select via `-v a3|a5`
 - `tests/npu/<soc>/src/st/CMakeLists.txt`: builds the corresponding ST targets and dependencies per SoC
 
-For an end-to-end walkthrough, start with `docs/getting-started.md`.
+## Key A2/A3 vs A5 Differences
+
+| Feature | A2/A3 | A5 |
+|---------|:------:|:--:|
+| Matrix multiply unit | CUBE | CUBE (enhanced) |
+| MXFP4/MXFP8 support | — | Supported |
+| Vector instructions | Emulated | Full hardware support |
+| Fractal layouts | Emulated | Full support |
+| FP8 types | — | Supported |
+
+## Related Docs
+
+| Document | Content |
+|----------|---------|
+| [include/pto/README.md](../README.md) | PTO header entry point |
+| [docs/getting-started.md](../../../docs/getting-started.md) | Complete getting started guide |
+| [include/pto/npu/README_zh.md](./README_zh.md) | Chinese version |
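The "Choose by SoC" and "Key A2/A3 vs A5 Differences" tables added to `include/pto/npu/README.md` can be read as a small lookup keyed by the `-v a3|a5` value that the test scripts accept. The sketch below only illustrates that reading: `SocInfo` and `lookupSoc` are hypothetical names, and the mapping of `a3` to `a2a3/` is an assumption inferred from the directory table, not taken from the build scripts.

```cpp
#include <string>

// Feature flags taken from the "Key A2/A3 vs A5 Differences" table.
struct SocInfo {
    std::string implDir;  // subdirectory under include/pto/npu/
    bool mxfp;            // MXFP4/MXFP8 support
    bool fp8;             // FP8 types
};

// Hypothetical helper: maps a `-v` value to one table row.
// Assumption: `a3` selects a2a3/ and `a5` selects a5/.
inline bool lookupSoc(const std::string& version, SocInfo& out) {
    if (version == "a3") {
        out = SocInfo{"a2a3", false, false};
        return true;
    }
    if (version == "a5") {
        out = SocInfo{"a5", true, true};
        return true;
    }
    return false;  // Kirin targets and unknown values are not modeled here
}
```

A caller could use the returned flags to skip MX or FP8 test cases when targeting A2/A3, instead of hard-coding the generation name in several places.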
diff --git a/include/pto/npu/README_zh.md b/include/pto/npu/README_zh.md
index 22d73c2e..a7cc15ab 100644
--- a/include/pto/npu/README_zh.md
+++ b/include/pto/npu/README_zh.md
@@ -1,17 +1,63 @@
 # include/pto/npu/
 
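-NPU-side PTO instruction implementations. Different SoC generations map to different optimized implementations and pipeline details.
+NPU-side PTO instruction implementations. Instruction implementations and pipeline details differ across SoC generations.
+
+## Choose by Platform
+
+| SoC Generation | Directory | Description |
+|----------|------|------|
+| Ascend A2/A3 | `a2a3/` | Ascend 910B / 910C series implementations |
+| Ascend A5 | `a5/` | Ascend 950 series implementations |
+| Kirin 9030 | `kirin9030/` | Kirin 9030-specific implementations |
+| Kirin X90 | `kirinX90/` | Kirin X90-specific implementations |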
 
 ## Layout
 
-- `a2a3/`: Ascend A2/A3 implementations (e.g., `TAdd.hpp`, `TMatmul.hpp`, `TLoad.hpp`)
-- `a5/`: Ascend A5 implementations (e.g., `TAdd.hpp`, `TMatmul.hpp`, `TLoad.hpp`)
+```
+include/pto/npu/
+├── a2a3/                      # Ascend A2/A3 (910B/910C) series
+│   ├── TAdd.hpp              # TADD implementation
+│   ├── TSub.hpp              # TSUB implementation
+│   ├── TMul.hpp              # TMUL implementation
+│   ├── TMatmul.hpp           # TMATMUL implementation
+│   ├── TLoad.hpp             # TLOAD implementation
+│   ├── TStore.hpp            # TSTORE implementation
+│   └── ...                    # Other instruction implementations
+│
+├── a5/                        # Ascend A5 (950) series
+│   ├── TAdd.hpp
+│   ├── TSub.hpp
+│   ├── TMul.hpp
+│   ├── TMatmul.hpp
+│   ├── TLoad.hpp
+│   ├── TStore.hpp
+│   └── ...                    # Other instruction implementations (A5-specific instructions such as TMATMUL_MX)
+│
+├── kirin9030/                 # Kirin 9030
+└── kirinX90/                  # Kirin X90
+```
 
 ## Selecting the SoC Version
 
-SoC selection is controlled by the build system and test scripts:
+SoC version selection is controlled by the build system and test scripts:
 
 - `tests/script/run_st.py` / `tests/script/build_st.py`: select via `-v a3|a5`
-- `tests/npu/<soc>/src/st/CMakeLists.txt`: builds the corresponding ST targets and dependencies per SoC
+- `tests/npu/<soc>/src/st/CMakeLists.txt`: builds the corresponding ST targets and dependencies for each SoC
+
+## A2/A3 与 A5 的主要差异
+
+| 特性 | A2/A3 | A5 |
+|------|:------:|:--:|
+| 矩阵乘法单元 | CUBE | CUBE（增强） |
+| MXFP4/MXFP8 支持 | — | 支持 |
+| Vector 指令 | 仿真 | 硬件完整支持 |
+| 分形布局 | 仿真 | 完整支持 |
+| FP8 类型 | — | 支持 |
+
+## 相关文档
 
-端到端流程建议从 `docs/getting-started.md` 开始。
+| 文档 | 内容 |
+|------|------|
+| [include/pto/README_zh.md](../README_zh.md) | PTO 头文件总入口 |
+| [docs/getting-started_zh.md](../../../docs/getting-started_zh.md) | 完整上手指南 |
+| [include/pto/npu/README.md](./README.md) | 英文版 |
diff --git a/include/pto/npu/a2a3/README.md b/include/pto/npu/a2a3/README.md
index 37e19784..5f0a4be6 100644
--- a/include/pto/npu/a2a3/README.md
+++ b/include/pto/npu/a2a3/README.md
@@ -1,13 +1,37 @@
 # include/pto/npu/a2a3/
 
-Ascend A2/A3 series PTO instruction implementation headers.
+Ascend A2/A3 (Ascend 910B / 910C) series PTO instruction implementation headers.
 
-## Overview
+## Key Files
 
-- Implementations are organized per instruction (or instruction family), for example: `TAdd.hpp`, `TMatmul.hpp`, `TLoad.hpp`, `TStore.hpp`
-- Some shared operator patterns are also provided (for example, Reduce/Expand/PartOp helpers)
+```
+include/pto/npu/a2a3/
+├── TAdd.hpp            # TADD element-wise addition
+├── TSub.hpp            # TSUB element-wise subtraction
+├── TMul.hpp            # TMUL element-wise multiplication
+├── TDiv.hpp            # TDIV element-wise division
+├── TMatmul.hpp         # TMATMUL matrix multiply (hardware CUBE)
+├── TLoad.hpp           # TLOAD GM → tile buffer
+├── TStore.hpp          # TSTORE tile buffer → GM
+├── TAssign.hpp         # TASSIGN resource binding
+├── TSync.hpp           # TSYNC synchronization
+└── ...                  # Other instruction implementations (Reduce, Expand, Layout, etc.)
+```
+
+## Key Differences from A5
+
+| Feature | A2/A3 | A5 |
+|---------|:------:|:--:|
+| Matrix multiply | Hardware CUBE | Enhanced CUBE |
+| MXFP4 / MXFP8 | Not supported | Supported |
+| Vector instructions | Emulated | Full hardware support |
+| Fractal layouts | Emulated | Full support |
+| FP8 types | Not supported | Supported |
 
 ## Related
 
-- ISA semantics and examples: `docs/isa/`
-- A2/A3 NPU ST tests: `tests/npu/a2a3/src/st/`
+| Document | Content |
+|----------|---------|
+| [docs/isa/](../../../../docs/isa/) | ISA semantics and examples |
+| [tests/npu/a2a3/src/st/](../../../../tests/npu/a2a3/src/st/) | A2/A3 NPU ST tests |
+| [include/pto/npu/](../README.md) | NPU implementation entry |
diff --git a/include/pto/npu/a2a3/README_zh.md b/include/pto/npu/a2a3/README_zh.md
index 56f65590..b8549106 100644
--- a/include/pto/npu/a2a3/README_zh.md
+++ b/include/pto/npu/a2a3/README_zh.md
@@ -1,13 +1,38 @@
 # include/pto/npu/a2a3/
 
-Ascend A2/A3 系列 PTO 指令实现头文件。
+Ascend A2/A3（Ascend 910B / 910C）系列 PTO 指令实现头文件。
 
-## 概览
+## 主要文件
 
-- 按指令（或指令族）组织实现，例如：`TAdd.hpp`、`TMatmul.hpp`、`TLoad.hpp`、`TStore.hpp`
-- 同时提供一些可复用的算子模式（例如 Reduce/Expand/PartOp 等辅助实现）
+```
+include/pto/npu/a2a3/
+├── TAdd.hpp            # TADD 逐元素加法
+├── TSub.hpp            # TSUB 逐元素减法
+├── TMul.hpp            # TMUL 逐元素乘法
+├── TDiv.hpp            # TDIV 逐元素除法
+├── TMatmul.hpp         # TMATMUL 矩阵乘法（硬件 CUBE）
+├── TLoad.hpp           # TLOAD GM → tile buffer
+├── TStore.hpp          # TSTORE tile buffer → GM
+├── TAssign.hpp         # TASSIGN 资源绑定
+├── TSync.hpp           # TSYNC 同步
+└── ...                  # 其他指令实现（Reduce、Expand、Layout 等）
+```
+
+## 与 A5 的主要差异
+
+| 特性 | A2/A3 | A5 |
+|------|:------:|:--:|
+| 矩阵乘法 | 硬件 CUBE | 增强版 CUBE |
+| MXFP4 / MXFP8 | 不支持 | 支持 |
+| Vector 指令 | 仿真 | 硬件完整支持 |
+| 分形布局 | 仿真 | 完整支持 |
+| FP8 类型 | 不支持 | 支持 |
+| Tile 指令 | 完整 | 完整 |
 
 ## 相关内容
 
-- ISA 语义与示例：`docs/isa/`
-- A2/A3 NPU ST 测试：`tests/npu/a2a3/src/st/`
+| 文档 | 内容 |
+|------|------|
+| [docs/isa/](../../../../docs/isa/) | ISA 语义与示例 |
+| [tests/npu/a2a3/src/st/](../../../../tests/npu/a2a3/src/st/) | A2/A3 NPU ST 测试 |
+| [include/pto/npu/](../README_zh.md) | NPU 实现总入口 |
diff --git a/include/pto/npu/a5/README.md b/include/pto/npu/a5/README.md
index 81323518..cba8d38a 100644
--- a/include/pto/npu/a5/README.md
+++ b/include/pto/npu/a5/README.md
@@ -1,13 +1,47 @@
 # include/pto/npu/a5/
 
-Ascend A5 series PTO instruction implementation headers.
+Ascend A5 (Ascend 950) series PTO instruction implementation headers.
 
-## Overview
+## Key Files
 
-- Implementations are organized per instruction (or instruction family), for example: `TAdd.hpp`, `TMatmul.hpp`, `TLoad.hpp`, `TStore.hpp`
-- Includes A5-specific operator patterns and utilities where applicable
+```
+include/pto/npu/a5/
+├── TAdd.hpp            # TADD element-wise addition
+├── TSub.hpp            # TSUB element-wise subtraction
+├── TMul.hpp            # TMUL element-wise multiplication
+├── TDiv.hpp            # TDIV element-wise division
+├── TMatmul.hpp         # TMATMUL matrix multiply (enhanced CUBE)
+├── TMatmulMx.hpp       # TMATMUL_MX matrix multiply with MX format
+├── TLoad.hpp           # TLOAD GM → tile buffer
+├── TStore.hpp          # TSTORE tile buffer → GM
+├── TAssign.hpp         # TASSIGN resource binding
+├── TSync.hpp           # TSYNC synchronization
+└── ...                  # Other instruction implementations (Reduce, Expand, Layout, etc.)
+```
+
+## A5-Specific Features
+
+| Feature | Description |
+|---------|-------------|
+| MXFP8 / MXFP4 | Mixed-precision matrix multiply supported on A5 hardware |
+| Fractal layouts | Full NZ / ZN / FR / RN fractal layout support |
+| Vector hardware | `pto.v*` instructions have full hardware support on A5 (not emulated) |
+| FP8 types | `f8e4m3`, `f8e5m2` data types supported |
+
+## Key Differences from A2/A3
+
+| Feature | A2/A3 | A5 |
+|---------|:------:|:--:|
+| Matrix multiply | Hardware CUBE | Enhanced CUBE |
+| MXFP4 / MXFP8 | Not supported | Supported |
+| Vector instructions | Emulated | Full hardware support |
+| Fractal layouts | Emulated | Full support |
+| FP8 types | Not supported | Supported |
 
 ## Related
 
-- ISA semantics and examples: `docs/isa/`
-- A5 NPU ST tests: `tests/npu/a5/src/st/`
+| Document | Content |
+|----------|---------|
+| [docs/isa/](../../../../docs/isa/) | ISA semantics and examples |
+| [tests/npu/a5/src/st/](../../../../tests/npu/a5/src/st/) | A5 NPU ST tests |
+| [include/pto/npu/](../README.md) | NPU implementation entry |
diff --git a/include/pto/npu/a5/README_zh.md b/include/pto/npu/a5/README_zh.md
index 242c3d9d..23a85949 100644
--- a/include/pto/npu/a5/README_zh.md
+++ b/include/pto/npu/a5/README_zh.md
@@ -1,13 +1,47 @@
 # include/pto/npu/a5/
 
-Ascend A5 系列 PTO 指令实现头文件。
+Ascend A5（Ascend 950）系列 PTO 指令实现头文件。
 
-## 概览
+## 主要文件
 
-- 按指令（或指令族）组织实现，例如：`TAdd.hpp`、`TMatmul.hpp`、`TLoad.hpp`、`TStore.hpp`
-- 包含 A5 专用的算子模式与工具（如适用）
+```
+include/pto/npu/a5/
+├── TAdd.hpp            # TADD 逐元素加法
+├── TSub.hpp            # TSUB 逐元素减法
+├── TMul.hpp            # TMUL 逐元素乘法
+├── TDiv.hpp            # TDIV 逐元素除法
+├── TMatmul.hpp         # TMATMUL 矩阵乘法（增强版 CUBE）
+├── TMatmulMx.hpp       # TMATMUL_MX 带 MX 格式的矩阵乘法
+├── TLoad.hpp           # TLOAD GM → tile buffer
+├── TStore.hpp          # TSTORE tile buffer → GM
+├── TAssign.hpp         # TASSIGN 资源绑定
+├── TSync.hpp           # TSYNC 同步
+└── ...                  # 其他指令实现（Reduce、Expand、Layout 等）
+```
+
+## A5 特有功能
+
+| 功能 | 说明 |
+|------|------|
+| MXFP8 / MXFP4 | A5 硬件支持的混合精度矩阵乘法 |
+| 分形布局 | NZ / ZN / FR / RN 分形布局完整支持 |
+| Vector 硬件 | `pto.v*` 指令在 A5 上硬件完整支持（非仿真） |
+| FP8 类型 | `f8e4m3`、`f8e5m2` 数据类型支持 |
+
+## 与 A2/A3 的主要差异
+
+| 特性 | A2/A3 | A5 |
+|------|:------:|:--:|
+| 矩阵乘法 | 硬件 CUBE | 增强版 CUBE |
+| MXFP4 / MXFP8 | 不支持 | 支持 |
+| Vector 指令 | 仿真 | 硬件完整支持 |
+| 分形布局 | 仿真 | 完整支持 |
+| FP8 类型 | 不支持 | 支持 |
 
 ## 相关内容
 
-- ISA 语义与示例：`docs/isa/`
-- A5 NPU ST 测试：`tests/npu/a5/src/st/`
+| 文档 | 内容 |
+|------|------|
+| [docs/isa/](../../../../docs/isa/) | ISA 语义与示例 |
+| [tests/npu/a5/src/st/](../../../../tests/npu/a5/src/st/) | A5 NPU ST 测试 |
+| [include/pto/npu/](../README_zh.md) | NPU 实现总入口 |
diff --git a/include/pto/npu/kirin9030/README.md b/include/pto/npu/kirin9030/README.md
index 682717f6..65f86d0b 100644
--- a/include/pto/npu/kirin9030/README.md
+++ b/include/pto/npu/kirin9030/README.md
@@ -1,13 +1,28 @@
 # include/pto/npu/kirin9030/
 
-Kirin9030 series PTO instruction implementation headers.
+Kirin 9030 series PTO instruction implementation headers.
 
-## Overview
+Kirin 9030 is an Ascend SoC variant targeting consumer scenarios. Its PTO instruction implementations may differ in certain details from those of the datacenter SoCs (A2/A3/A5).
 
-- Implementations are organized per instruction (or instruction family), for example: `TAdd.hpp`, `TMatmul.hpp`, `TLoad.hpp`, `TStore.hpp`
-- Includes Kirin9030-specific operator patterns and utilities where applicable
+## Key Files
+
+```
+include/pto/npu/kirin9030/
+├── TAdd.hpp
+├── TSub.hpp
+├── TMul.hpp
+├── TDiv.hpp
+├── TMatmul.hpp
+├── TLoad.hpp
+├── TStore.hpp
+├── TAssign.hpp
+├── TSync.hpp
+└── ...                  # Other instruction implementations
+```
 
 ## Related
 
-- ISA semantics and examples: `docs/isa/`
-- Kirin9030 NPU ST tests: `tests/npu/Kirin9030/src/st/`
+| Document | Content |
+|----------|---------|
+| [docs/isa/](../../../../docs/isa/) | ISA semantics and examples |
+| [include/pto/npu/](../README.md) | NPU implementation entry |
diff --git a/include/pto/npu/kirin9030/README_zh.md b/include/pto/npu/kirin9030/README_zh.md
index 68f11a20..ef3d1589 100644
--- a/include/pto/npu/kirin9030/README_zh.md
+++ b/include/pto/npu/kirin9030/README_zh.md
@@ -1,13 +1,28 @@
 # include/pto/npu/kirin9030/
 
-Kirin9030 系列 PTO 指令实现头文件。
+Kirin 9030 系列 PTO 指令实现头文件。
 
-## 概览
+Kirin 9030 是昇腾面向消费端场景的 SoC 变体，其 PTO 指令实现在部分细节上与数据中心 SoC（A2/A3/A5）有所不同。
 
-- 按指令（或指令族）组织实现，例如：`TAdd.hpp`、`TMatmul.hpp`、`TLoad.hpp`、`TStore.hpp`
-- 包含 Kirin9030 专用的算子模式与工具（如适用）
+## 主要文件
+
+```
+include/pto/npu/kirin9030/
+├── TAdd.hpp
+├── TSub.hpp
+├── TMul.hpp
+├── TDiv.hpp
+├── TMatmul.hpp
+├── TLoad.hpp
+├── TStore.hpp
+├── TAssign.hpp
+├── TSync.hpp
+└── ...                  # 其他指令实现
+```
 
 ## 相关内容
 
-- ISA 语义与示例：`docs/isa/`
-- Kirin9030 NPU ST 测试：`tests/npu/Kirin9030/src/st/`
+| 文档 | 内容 |
+|------|------|
+| [docs/isa/](../../../../docs/isa/) | ISA 语义与示例 |
+| [include/pto/npu/](../README_zh.md) | NPU 实现总入口 |
diff --git a/include/pto/npu/kirinX90/README.md b/include/pto/npu/kirinX90/README.md
index 70245746..2a006005 100644
--- a/include/pto/npu/kirinX90/README.md
+++ b/include/pto/npu/kirinX90/README.md
@@ -1,13 +1,28 @@
 # include/pto/npu/kirinX90/
 
-KirinX90 series PTO instruction implementation headers.
+Kirin X90 series PTO instruction implementation headers.
 
-## Overview
+Kirin X90 is an Ascend SoC variant targeting consumer scenarios, sharing test cases with Kirin 9030.
 
-- Implementations are organized per instruction (or instruction family), for example: `TAdd.hpp`, `TMatmul.hpp`, `TLoad.hpp`, `TStore.hpp`
-- Includes KirinX90-specific operator patterns and utilities where applicable
+## Key Files
+
+```
+include/pto/npu/kirinX90/
+├── TAdd.hpp
+├── TSub.hpp
+├── TMul.hpp
+├── TDiv.hpp
+├── TMatmul.hpp
+├── TLoad.hpp
+├── TStore.hpp
+├── TAssign.hpp
+├── TSync.hpp
+└── ...                  # Other instruction implementations
+```
 
 ## Related
 
-- ISA semantics and examples: `docs/isa/`
-- KirinX90 NPU ST tests: `tests/npu/KirinX90/src/st/`, share test cases with Kirin9030.
+| Document | Content |
+|----------|---------|
+| [docs/isa/](../../../../docs/isa/) | ISA semantics and examples |
+| [include/pto/npu/](../README.md) | NPU implementation entry |
diff --git a/include/pto/npu/kirinX90/README_zh.md b/include/pto/npu/kirinX90/README_zh.md
index a6572595..67d4849d 100644
--- a/include/pto/npu/kirinX90/README_zh.md
+++ b/include/pto/npu/kirinX90/README_zh.md
@@ -1,13 +1,28 @@
 # include/pto/npu/kirinX90/
 
-KirinX90 系列 PTO 指令实现头文件。
+Kirin X90 系列 PTO 指令实现头文件。
 
-## 概览
+Kirin X90 是昇腾面向消费端场景的 SoC 变体，与 Kirin 9030 共用测试用例。
 
-- 按指令（或指令族）组织实现，例如：`TAdd.hpp`、`TMatmul.hpp`、`TLoad.hpp`、`TStore.hpp`
-- 包含 KirinX90 专用的算子模式与工具（如适用）
+## 主要文件
+
+```
+include/pto/npu/kirinX90/
+├── TAdd.hpp
+├── TSub.hpp
+├── TMul.hpp
+├── TDiv.hpp
+├── TMatmul.hpp
+├── TLoad.hpp
+├── TStore.hpp
+├── TAssign.hpp
+├── TSync.hpp
+└── ...                  # 其他指令实现
+```
 
 ## 相关内容
 
-- ISA 语义与示例：`docs/isa/`
-- KirinX90 NPU ST 测试：`tests/npu/Kirin9030/src/st/`，与Kirin9030共用测试用例。
+| 文档 | 内容 |
+|------|------|
+| [docs/isa/](../../../../docs/isa/) | ISA 语义与示例 |
+| [include/pto/npu/](../README_zh.md) | NPU 实现总入口 |
diff --git a/kernels/README.md b/kernels/README.md
index 7406b0c2..e0c4d0e6 100644
--- a/kernels/README.md
+++ b/kernels/README.md
@@ -1,31 +1,68 @@
 # Kernels
 
-This directory contains kernel/operator implementations that complement PTO Tile Lib.
+This directory contains kernel/operator implementations that complement PTO Tile Library.
 
-Most kernel subdirectories are **self-contained mini-projects** (kernel + host + scripts) with their own `README.md`, `CMakeLists.txt`, and `run.sh`.
+Most subdirectories are **self-contained mini-projects** (kernel + host + scripts) with their own `README.md`, `CMakeLists.txt`, and `run.sh` for independent discovery and execution.
 
-## Where to start
+## Choose by Task
 
-- Manual (hand-tuned) NPU kernels: [manual](manual/README.md)
-- Custom operator scaffolding: [custom](custom/README.md)
-- End-to-end demos (including CPU): [demos](../demos/README.md)
+| Your goal | Start here |
+|-----------|-----------|
+| Learn PTO programming | [docs/coding/tutorial.md](../docs/coding/tutorial.md) |
+| High-performance GEMM | [manual/a2a3/gemm_performance/README.md](manual/a2a3/gemm_performance/README.md) |
+| Flash Attention | [manual/common/flash_atten/README.md](manual/common/flash_atten/README.md) |
+| Conv2D forward | [manual/a2a3/conv2d_forward/README.md](manual/a2a3/conv2d_forward/README.md) |
+| MXFP8 / MXFP4 Matmul | [manual/a5/matmul_mxfp8_performance/README.md](manual/a5/matmul_mxfp8_performance/README.md) |
+| Custom operator scaffolding | [custom/README.md](custom/README.md) |
 
-## Directory layout
+## Directory Layout
 
-- `manual/`: Hand-tuned kernels with explicit buffering/synchronization (NPU-focused)
-  - `manual/a2a3/`: Kernels for A2/A3 platforms
-    - `manual/a2a3/gemm_performance/`: High-performance GEMM example
-    - `manual/a2a3/conv2d_forward/`: Conv2D forward kernel example
-    - `manual/a2a3/topk/`: TopK kernel example
-  - `manual/a5/`: Kernels for A5 platforms
-    - `manual/a5/flash_atten/`: Flash-Attention kernel for A5
-    - `manual/a5/matmul_mxfp4_performance/`: MXFP4 matrix multiplication example
-    - `manual/a5/matmul_mxfp8_performance/`: MXFP8 matrix multiplication example
-  - `manual/common/`: Cross-platform kernels
-    - `manual/common/flash_atten/`: Flash-Attention kernel (A2/A3/A5)
-- `custom/`: Examples/scaffolding for custom kernel/operator extensions
+```
+kernels/
+├── manual/                    # Hand-tuned (hand-written, performance-oriented) NPU kernels
+│   ├── a2a3/                 # Ascend A2/A3 platforms
+│   │   ├── gemm_performance/ # High-performance GEMM — pipeline optimization, double-buffering, address planning
+│   │   ├── conv2d_forward/  # Conv2D forward kernel — img2col + GEMM
+│   │   ├── topk/            # TopK kernel — sorting and selection
+│   │   └── allgather_gemm/  # Multi-NPU GEMM — AllGather communication fused with GEMM
+│   │
+│   ├── a5/                  # Ascend A5 platforms
+│   │   ├── flash_atten/     # Flash Attention — A5-specific optimization
+│   │   ├── matmul_mxfp8_performance/  # MXFP8 matrix multiplication
+│   │   ├── matmul_mxfp4_performance/  # MXFP4 matrix multiplication
+│   │   ├── allgather_gemm/  # Multi-NPU GEMM (A5)
+│   │   └── engram_simt/     # Engram SIMT example
+│   │
+│   └── common/               # Cross-platform kernels (work on A2/A3/A5)
+│       └── flash_atten/      # Flash Attention — cross-platform generic version
+│
+└── custom/                   # Custom operator scaffolding and examples
+    └── fused_add_relu_mul/  # Fused operator example: Add + ReLU + Mul
+```
 
-## Notes
+## What Makes Manual Kernels Different
 
-- Public interfaces live in `include/`; tests live in `tests/`.
-- If you add a new kernel project here, prefer adding a small `README.md` and a `run.sh` so it can be discovered and executed consistently.
+Unlike examples in `demos/`, kernels under `manual/` target **production-grade performance tuning**:
+
+- **Explicit management** of tile buffer address allocation (`TASSIGN`)
+- **Explicit management** of pipeline synchronization (`set_flag`/`wait_flag`)
+- **Double/multi-buffering** to overlap data movement and compute
+- **Address alignment and planning** to maximize UB / L0 utilization
+- Microarchitectural tuning specific to each SoC (A2/A3 vs A5)
+
+## Related Docs
+
+| Document | Content |
+|----------|---------|
+| [kernels/README_zh.md](./README_zh.md) | 中文版入口 |
+| [demos/](../demos/README.md) | End-to-end demos (including CPU versions) |
+| [docs/coding/opt.md](../docs/coding/opt.md) | Performance optimization and bottleneck analysis |
+| [docs/isa/README.md](../docs/isa/README.md) | PTO ISA instruction reference |
+
+## Notes for Adding New Kernels
+
+When adding a new kernel project, please include:
+
+1. A short `README_zh.md` (Chinese) and `README.md` (English) explaining platform requirements and how to run
+2. A `run.sh` or equivalent script for consistent discovery and execution
+3. A `CMakeLists.txt` that uses `pto_add_kernel(<target_name>)`
diff --git a/kernels/README_zh.md b/kernels/README_zh.md
index 9e7fe9e1..aea5ddfa 100644
--- a/kernels/README_zh.md
+++ b/kernels/README_zh.md
@@ -1,31 +1,68 @@
 # Kernels
 
-本目录包含与 PTO Tile Lib 配套的 kernel / operator 实现。
+本目录包含与 PTO Tile Library 配套的 kernel / 算子实现。
 
-多数子目录都是**自包含的小工程**（kernel + host + 脚本），通常会包含自己的 `README.md`、`CMakeLists.txt` 与 `run.sh`，便于独立发现与运行。
+多数子目录都是**自包含的小工程**（kernel + host + 脚本），通常包含自己的 `README.md`、`CMakeLists.txt` 与 `run.sh`，便于独立发现与运行。
 
-## 从哪里开始
+## 按任务选择
 
-- 手工调优（manual）的 NPU kernels：[manual](manual/README_zh.md)
-- 自定义算子脚手架：[custom](custom/README_zh.md)
-- 端到端 demo（包含 CPU）：[demos](../demos/README_zh.md)
+| 你的目标 | 从这里开始 |
+|----------|----------|
+| 学习 PTO 编程 | [docs/coding/tutorial_zh.md](../docs/coding/tutorial_zh.md) |
+| 高性能 GEMM | [manual/a2a3/gemm_performance/README_zh.md](manual/a2a3/gemm_performance/README_zh.md) |
+| Flash Attention | [manual/common/flash_atten/README_zh.md](manual/common/flash_atten/README_zh.md) |
+| Conv2D forward | [manual/a2a3/conv2d_forward/README_zh.md](manual/a2a3/conv2d_forward/README_zh.md) |
+| MXFP8 / MXFP4 Matmul | [manual/a5/matmul_mxfp8_performance/README_zh.md](manual/a5/matmul_mxfp8_performance/README_zh.md) |
+| 自定义算子脚手架 | [custom/README_zh.md](custom/README_zh.md) |
 
 ## 目录结构
 
-- `manual/`：手工调优 kernels（显式管理 buffer / 同步 / 流水线，偏 NPU）
-  - `manual/a2a3/`：A2/A3 平台 kernels
-    - `manual/a2a3/gemm_performance/`：高性能 GEMM 示例
-    - `manual/a2a3/conv2d_forward/`：Conv2D 前向 kernel 示例
-    - `manual/a2a3/topk/`：TopK kernel 示例
-  - `manual/a5/`：A5 平台 kernels
-    - `manual/a5/flash_atten/`：A5 平台 Flash-Attention kernel
-    - `manual/a5/matmul_mxfp4_performance/`：MXFP4 矩阵乘法示例
-    - `manual/a5/matmul_mxfp8_performance/`：MXFP8 矩阵乘法示例
-  - `manual/common/`：跨平台 kernels
-    - `manual/common/flash_atten/`：Flash-Attention kernel（A2/A3/A5）
-- `custom/`：自定义 kernel / operator 扩展的示例与脚手架
-
-## 备注
-
-- 公共接口在 `include/`；测试在 `tests/`。
-- 新增 kernel 工程时，建议配套一个简短的 `README.md` 和一个 `run.sh`，方便统一发现与运行。
+```
+kernels/
+├── manual/                    # 手工调优（手写、面向性能）的 NPU kernels
+│   ├── a2a3/                 # Ascend A2/A3 平台
+│   │   ├── gemm_performance/ # 高性能 GEMM — 流水线优化、双缓冲、地址规划
+│   │   ├── conv2d_forward/   # Conv2D 前向 kernel — img2col + GEMM
+│   │   ├── topk/            # TopK kernel — 排序与选取
+│   │   └── allgather_gemm/  # 多卡 GEMM — AllGather 通信与 GEMM 融合
+│   │
+│   ├── a5/                  # Ascend A5 平台
+│   │   ├── flash_atten/     # Flash Attention — A5 专用优化版本
+│   │   ├── matmul_mxfp8_performance/  # MXFP8 矩阵乘法
+│   │   ├── matmul_mxfp4_performance/  # MXFP4 矩阵乘法
+│   │   ├── allgather_gemm/  # 多卡 GEMM（A5）
+│   │   └── engram_simt/     # Engram SIMT 示例
+│   │
+│   └── common/               # 跨平台 kernels（适用于 A2/A3/A5）
+│       └── flash_atten/      # Flash Attention — 跨平台通用版本
+│
+└── custom/                   # 自定义算子脚手架与示例
+    └── fused_add_relu_mul/  # 融合算子示例：Add + ReLU + Mul
+```
+
+## Manual kernels 的特点
+
+与 `demos/` 中的示例不同，`manual/` 下的 kernels 面向**生产级性能调优**：
+
+- **显式管理** tile buffer 地址分配（`TASSIGN`）
+- **显式管理** 流水线同步（`set_flag`/`wait_flag`）
+- **双缓冲 / 多缓冲** 重叠数据搬运与计算
+- **地址对齐与规划** 最大化 UB / L0 利用率
+- 针对特定 SoC（A2/A3 vs A5）的微架构特性调优
+
+## 相关文档
+
+| 文档 | 内容 |
+|------|------|
+| [kernels/README.md](./README.md) | 英文版入口 |
+| [demos/](../demos/README_zh.md) | 端到端示例（包含 CPU 版本） |
+| [docs/coding/opt_zh.md](../docs/coding/opt_zh.md) | 性能优化与瓶颈分析 |
+| [docs/isa/README_zh.md](../docs/isa/README_zh.md) | PTO ISA 指令参考 |
+
+## 新增 kernel 注意事项
+
+新增 kernel 工程时，建议配套：
+
+1. 一个简短的 `README_zh.md`（中文）和 `README.md`（英文），说明平台依赖与运行方式
+2. 一个 `run.sh` 或等效脚本，方便统一发现与运行
+3. 在 `CMakeLists.txt` 中使用 `pto_add_kernel(<target_name>)` 模板
diff --git a/kernels/custom/README_zh.md b/kernels/custom/README_zh.md
index 0a461de7..f89c7055 100644
--- a/kernels/custom/README_zh.md
+++ b/kernels/custom/README_zh.md
@@ -1,16 +1,21 @@
-# 自定义算子开发（Custom Operators）
+# Custom Operators
 
 本目录包含 **PTO 自定义算子开发示例**，展示如何从零开始实现自定义算子。
 
-如果你刚接触 PTO 编程，建议先从基础教程入手：
+## 按任务选择
 
-- 快速入门：[docs/getting-started_zh.md](../../docs/getting-started_zh.md)
-- 编程教程：[docs/coding/tutorial_zh.md](../../docs/coding/tutorial_zh.md)
-- Add 算子示例：[demos/baseline/add/README_zh.md](../../demos/baseline/add/README_zh.md)
+| 你的目标 | 从这里开始 |
+|----------|----------|
+| 第一次学习 PTO | [快速入门](../../docs/getting-started_zh.md) |
+| 编写第一个算子 | [上手教程](../../docs/coding/tutorial_zh.md) |
+| Add 算子示例 | [demos/baseline/add/README_zh.md](../../demos/baseline/add/README_zh.md) |
+| 算子融合 | [fused_add_relu_mul/README_zh.md](fused_add_relu_mul/README_zh.md) |
 
 ## 示例列表
 
-- `fused_add_relu_mul/`：算子融合示例，将 Add + ReLU + Mul 融合为一个 kernel，性能提升 2-3×。
+| 算子 | 说明 | 关键技术 |
+|------|------|----------|
+| [fused_add_relu_mul](./fused_add_relu_mul/README_zh.md) | 融合 Add + ReLU + Mul 为单个 kernel | 算子融合、Tile 级流水、2-3x 性能提升 |
 
 ## 如何运行
 
@@ -18,17 +23,27 @@
 
 - [fused_add_relu_mul/README_zh.md](fused_add_relu_mul/README_zh.md)
 
-## 开发自定义算子
+## 开发自定义算子步骤
 
 参考 `fused_add_relu_mul/` 示例，按以下步骤开发：
 
-1. 创建目录：`mkdir -p kernels/custom/my_operator`
-2. 实现 kernel：`my_operator_kernel.cpp`
-3. 编写测试：`main.cpp`
-4. 配置构建：`CMakeLists.txt`
-5. 运行验证：`./run.sh --sim`
+1. **创建目录**：`mkdir -p kernels/custom/my_operator`
+2. **实现 kernel**：`my_operator_kernel.cpp`
+3. **编写测试**：`main.cpp`
+4. **配置构建**：`CMakeLists.txt`
+5. **运行验证**：`./run.sh --sim`
 
-详细开发指南请参考：
+## 详细开发指南
 
-- [算子融合技术](../../docs/coding/operator-fusion_zh.md)
-- [性能优化指南](../../docs/coding/opt_zh.md)
+| 文档 | 内容 |
+|------|------|
+| [算子融合技术](../../docs/coding/operator-fusion_zh.md) | 融合多个算子的技术 |
+| [性能优化指南](../../docs/coding/opt_zh.md) | 性能瓶颈分析与调优 |
+
+## 相关文档
+
+| 文档 | 内容 |
+|------|------|
+| [kernels/README_zh.md](../README_zh.md) | kernels 总入口 |
+| [demos/README_zh.md](../../demos/README_zh.md) | 端到端示例 |
+| [docs/isa/README_zh.md](../../docs/isa/README_zh.md) | ISA 指令参考 |
diff --git a/kernels/manual/README.md b/kernels/manual/README.md
index 1ed02894..22d7368f 100644
--- a/kernels/manual/README.md
+++ b/kernels/manual/README.md
@@ -1,24 +1,54 @@
-# Manual kernels
+# Manual Kernels
 
-This folder contains **manual (hand-tuned) kernel examples** that use explicit buffering, synchronization, and pipeline control for maximum performance on supported NPUs.
+This folder contains **manual (hand-tuned) kernel examples** that use explicit buffering, synchronization, and pipeline control for maximum performance on Ascend NPUs.
 
 If you are new to PTO programming, start from the ISA and tutorials first:
 
 - Programming tutorials: [docs/coding/tutorial.md](../../docs/coding/tutorial.md)
 - Optimization notes: [docs/coding/opt.md](../../docs/coding/opt.md)
-- PTO ISA reference: [docs/PTOISA.md](../../docs/PTOISA.md)
+- PTO ISA reference: [docs/isa/README.md](../../docs/isa/README.md)
 
 ## Platforms
 
-- `a2a3/`: Manual kernels for Ascend A2/A3 platforms.
-- `a5/`: Manual kernels for Ascend A5 platforms.
-- `common/`: Cross-platform manual kernels (shared examples).
+| Platform | Directory | Typical Kernels |
+|----------|----------|----------------|
+| Ascend A2/A3 | `a2a3/` | GEMM, Conv2D, TopK, AllGather-GEMM |
+| Ascend A5 | `a5/` | Flash Attention, MXFP4/8 Matmul, AllGather-GEMM |
+| Cross-platform | `common/` | Flash Attention (shared across A2/A3/A5) |
 
-## How to run
+## Catalog
 
-Each subdirectory is a standalone example with its own build/run instructions. See:
+### A2/A3 Kernels (`a2a3/`)
 
-- [a2a3/README.md](a2a3/README.md)
-- [a5/README.md](a5/README.md)
-- [common/flash_atten/README.md](common/flash_atten/README.md)
+| Kernel | Description | Key Techniques |
+|--------|-------------|----------------|
+| [GEMM Performance](a2a3/gemm_performance/) | High-performance matrix multiplication | Double-buffering, L0A/L0B/L0C tiling, UB staging |
+| [Conv2D Forward](a2a3/conv2d_forward/) | Conv2D forward pass via img2col | img2col + GEMM fusion, fractal layout |
+| [TopK](a2a3/topk/) | Top-K element selection | Sorting-based selection, tile-level reduction |
+| [AllGather-GEMM](a2a3/allgather_gemm/) | Multi-NPU GEMM with AllGather | Collective communication fused with GEMM |
 
+### A5 Kernels (`a5/`)
+
+| Kernel | Description | Key Techniques |
+|--------|-------------|----------------|
+| [Flash Attention](a5/flash_atten/) | Flash Attention algorithm | Dynamic tiling, online softmax, A5-specific optimizations |
+| [MXFP8 Matmul](a5/matmul_mxfp8_performance/) | MXFP8 precision matrix multiplication | MXFP8 dequantization, FP8 compute, A5 hardware support |
+| [MXFP4 Matmul](a5/matmul_mxfp4_performance/) | MXFP4 precision matrix multiplication | MXFP4 dequantization, FP4 compute |
+| [AllGather-GEMM](a5/allgather_gemm/) | Multi-NPU GEMM (A5) | Collective communication fused with GEMM |
+| [Engram SIMT](a5/engram_simt/) | Engram SIMT example | SIMT-style programming on A5 |
+
+### Cross-Platform Kernels (`common/`)
+
+| Kernel | Description | Platforms |
+|--------|-------------|-----------|
+| [Flash Attention](common/flash_atten/) | Flash Attention algorithm | A2/A3/A5 |
+
+## How to Run
+
+Each subdirectory is a standalone example with its own build/run instructions. See the `README.md` in each folder.
+
+## See Also
+
+- [kernels/README.md](../README.md) — Parent directory entry
+- [docs/coding/opt.md](../../docs/coding/opt.md) — Performance bottleneck analysis
+- [docs/isa/README.md](../../docs/isa/README.md) — PTO ISA reference
diff --git a/kernels/manual/README_zh.md b/kernels/manual/README_zh.md
index 5cefee15..6b869606 100644
--- a/kernels/manual/README_zh.md
+++ b/kernels/manual/README_zh.md
@@ -1,23 +1,54 @@
-# 手工调优 kernels（Manual kernels）
+# 手工调优 kernels
 
-本目录包含**手工调优（手写、面向性能）**的 kernel 示例：需要显式管理 buffer、同步与流水线控制，以在支持的 NPU 上获得最佳性能。
+本目录包含**手工调优（手写、面向性能）**的 kernel 示例：在支持的昇腾 NPU 上使用显式 buffer 管理、同步与流水线控制，以获得最佳性能。
 
 如果你刚接触 PTO 编程，建议先从 ISA 与教程入手：
 
 - 编程教程：[docs/coding/tutorial_zh.md](../../docs/coding/tutorial_zh.md)
 - 优化笔记：[docs/coding/opt_zh.md](../../docs/coding/opt_zh.md)
-- PTO ISA 参考：[docs/PTOISA_zh.md](../../docs/PTOISA_zh.md)
+- PTO ISA 参考：[docs/isa/README_zh.md](../../docs/isa/README_zh.md)
 
-## 平台
+## 按平台选择
 
-- `a2a3/`：Ascend A2/A3 平台的手工调优 kernels。
-- `a5/`：Ascend A5 平台的手工调优 kernels。
-- `common/`：跨平台手工调优 kernels（共享示例）。
+| 平台 | 目录 | 典型 Kernel |
+|------|------|------------|
+| Ascend A2/A3 | `a2a3/` | GEMM、Conv2D、TopK、AllGather-GEMM |
+| Ascend A5 | `a5/` | Flash Attention、MXFP4/8 Matmul、AllGather-GEMM |
+| 跨平台 | `common/` | Flash Attention（A2/A3/A5 通用） |
+
+## 目录索引
+
+### A2/A3 Kernels（`a2a3/`）
+
+| Kernel | 说明 | 关键技术 |
+|--------|------|----------|
+| [GEMM Performance](a2a3/gemm_performance/) | 高性能矩阵乘法 | 双缓冲、L0A/L0B/L0C 分块、UB 暂存 |
+| [Conv2D Forward](a2a3/conv2d_forward/) | Conv2D 前向（img2col 方式） | img2col + GEMM 融合、分形布局 |
+| [TopK](a2a3/topk/) | Top-K 元素选取 | 基于排序的选取、tile 级归约 |
+| [AllGather-GEMM](a2a3/allgather_gemm/) | 多卡 GEMM 与 AllGather | 集合通信与 GEMM 融合 |
+
+### A5 Kernels（`a5/`）
+
+| Kernel | 说明 | 关键技术 |
+|--------|------|----------|
+| [Flash Attention](a5/flash_atten/) | Flash Attention 算法 | 动态分块、在线 softmax、A5 专用优化 |
+| [MXFP8 Matmul](a5/matmul_mxfp8_performance/) | MXFP8 精度矩阵乘法 | MXFP8 反量化、FP8 算子、A5 硬件支持 |
+| [MXFP4 Matmul](a5/matmul_mxfp4_performance/) | MXFP4 精度矩阵乘法 | MXFP4 反量化、FP4 算子 |
+| [AllGather-GEMM](a5/allgather_gemm/) | 多卡 GEMM（A5） | 集合通信与 GEMM 融合 |
+| [Engram SIMT](a5/engram_simt/) | Engram SIMT 示例 | A5 上的 SIMT 风格编程 |
+
+### 跨平台 Kernels（`common/`）
+
+| Kernel | 说明 | 平台 |
+|--------|------|------|
+| [Flash Attention](common/flash_atten/) | Flash Attention 算法 | A2/A3/A5 |
 
 ## 如何运行
 
-每个子目录都是一个独立示例，包含各自的构建/运行说明。请从这里开始：
+每个子目录都是独立示例，包含各自的构建/运行说明。参见各目录中的 `README_zh.md`。
+
+## 相关文档
 
-- [a2a3/README_zh.md](a2a3/README_zh.md)
-- [a5/README_zh.md](a5/README_zh.md)
-- [common/flash_atten/README_zh.md](common/flash_atten/README_zh.md)
+- [kernels/README_zh.md](../README_zh.md) — 父目录入口
+- [docs/coding/opt_zh.md](../../docs/coding/opt_zh.md) — 性能瓶颈分析与调优
+- [docs/isa/README_zh.md](../../docs/isa/README_zh.md) — PTO ISA 指令参考
diff --git a/tests/README.md b/tests/README.md
index 3ccd2f82..64f3b40a 100644
--- a/tests/README.md
+++ b/tests/README.md
@@ -1,97 +1,67 @@
 # tests/
 
-Tests and examples for PTO Tile Lib, covering both CPU simulation and NPU (including `sim` and on-board `npu` modes).
+Tests and examples for PTO Tile Library, covering both CPU simulation and NPU (including `sim` and on-board `npu` modes).
 
 ## Test Entry Points
 
-Common test entry points:
-
-- Full CPU Simulator run: `python3 tests/run_cpu.py --clean --verbose`
-- GEMM demo: `python3 tests/run_cpu.py --demo gemm --verbose`
-- Flash Attention demo: `python3 tests/run_cpu.py --demo flash_attn --verbose`
-- Single ST testcase: `python3 tests/script/run_st.py -r [sim|npu] -v [a3|a5] -t [TEST_CASE] -g [GTEST_FILTER_CASE]`
-- One-click scripts: `./tests/run_st.sh`, `./tests/run_cpu_tests.sh`
+| Scenario | Command |
+|---------|---------|
+| Full CPU Simulator run | `python3 tests/run_cpu.py --clean --verbose` |
+| GEMM demo | `python3 tests/run_cpu.py --demo gemm --verbose` |
+| Flash Attention demo | `python3 tests/run_cpu.py --demo flash_attn --verbose` |
+| Single ST testcase (NPU) | `python3 tests/script/run_st.py -r [sim\|npu] -v [a3\|a5] -t [TEST_CASE] -g [GTEST_FILTER_CASE]` |
+| One-click scripts | `./build.sh --run_all --a3 --sim` |
 
 ## Layout
 
-- `script/`: Recommended entry scripts
-  - `run_st.py`: Build and run NPU ST (`-r sim|npu -v a3|a5 -t <testcase> -g <gtest_filter>`)
-  - `build_st.py`: Build NPU ST only
-  - `all_cpu_tests.py`: Build and run CPU ST suites in batch
-  - `README.md`: Script usage
-- `cpu/`: CPU-side ST tests (gtest + CMake)
-  - `cpu/st/`: CPU ST projects and testcase data generation scripts
-- `npu/`: NPU-side ST tests split by SoC
-  - `npu/a2a3/src/st/`: A2/A3 compute ST
-  - `npu/a2a3/comm/st/`: A2/A3 communication ST
-  - `npu/a5/src/st/`: A5 compute ST
-  - `npu/a5/comm/st/`: A5 communication ST
-- `common/`: Shared test resources (if present)
-- `run_comm_test.sh`: One-click script for communication ST (see below)
-
-## Communication Tests (Comm ST)
-
-Communication tests verify multi-device PTO communication primitives (Put / Get / Broadcast / Gather / Scatter / Reduce / Notify / Wait / Test), built on MPI + HCCL.
-
-### Prerequisites: MPI Installation
-
-Communication tests require an MPI environment (MPICH or OpenMPI). Two components are needed at runtime:
-
-1. **`mpirun`**: launches multi-process execution
-2. **`libmpi.so`**: loaded at runtime via `dlopen`
-
-#### Install MPICH (Recommended)
-
-```bash
-# Ubuntu / Debian
-sudo apt install mpich libmpich-dev
-
-# CentOS / RHEL / EulerOS
-sudo yum install mpich mpich-devel
-# May need to load a module or add to PATH manually:
-export PATH=/usr/lib64/mpich/bin:$PATH
 ```
-
-
-#### Build MPICH from Source (No Root)
-
-```bash
-wget https://www.mpich.org/static/downloads/4.2.3/mpich-4.2.3.tar.gz
-tar xzf mpich-4.2.3.tar.gz && cd mpich-4.2.3
-./configure --prefix=$HOME/mpich --disable-fortran
-make -j$(nproc) && make install
-export MPI_HOME=$HOME/mpich
-export PATH=$MPI_HOME/bin:$PATH
+tests/
+├── script/                     # Test entry scripts (recommended entry point)
+│   ├── run_st.py              # Build and run NPU ST
+│   ├── build_st.py            # Build NPU ST only
+│   ├── all_cpu_tests.py       # Batch build and run CPU ST suites
+│   └── README.md              # Script usage guide
+│
+├── cpu/                        # CPU-side ST tests (gtest + CMake)
+│   └── st/                    # CPU ST projects and testcase data generation scripts
+│
+├── npu/                        # NPU-side ST tests split by SoC
+│   ├── a2a3/
+│   │   ├── src/st/            # A2/A3 compute ST
+│   │   └── comm/st/           # A2/A3 communication ST
+│   └── a5/
+│       ├── src/st/            # A5 compute ST
+│       └── comm/st/           # A5 communication ST
+│
+├── run_st.sh                   # NPU ST one-click run script
+└── run_comm_test.sh           # Communication ST one-click run script
 ```
 
-#### Environment Variables
+## Synchronous and Asynchronous Communication Tests
 
-| Variable | Description |
-|----------|-------------|
-| `MPI_HOME` | MPI installation root; the script searches `$MPI_HOME/bin/mpirun` |
-| `MPI_LIB_PATH` | Direct path to `libmpi.so` (overrides default search) |
+Communication tests verify multi-device PTO communication primitives (Put / Get / Broadcast / Gather / Scatter / Reduce / Notify / Wait / Test), built on MPI + HCCL.
 
-If `mpirun` is already on `PATH` and `libmpi.so` is in a standard library path, these variables are not required.
+Communication tests are divided into **synchronous** and **asynchronous** instruction categories:
 
-#### Verify Installation
+| Type | Test Examples | CANN Version Required |
+|------|--------------|----------------------|
+| Synchronous | `tput`, `tget`, `treduce`, `tbroadcast`, etc. | CANN 8.x+ |
+| Asynchronous | `tput_async`, `tget_async` | **CANN 9.0+** |
 
-```bash
-mpirun --version
-mpirun -n 2 echo "MPI OK"
-```
+> Asynchronous instructions depend on SDMA opapi interfaces introduced in CANN 9.0 (e.g., `aclnnShmemSdmaStarsQuery`). They will fail on older CANN versions due to missing symbols. `run_comm_test.sh` **does not include async tests by default**; use `-a` to opt in.
 
 ### Quick Start
 
 ```bash
-# Run all tests with 8 NPUs (default A2/A3)
+# 8-NPU full test (default A2/A3, no async tests)
 ./run_comm_test.sh
 
+# Include async tests (requires CANN 9.0+)
+./run_comm_test.sh -a
+
 # A5 SoC, 2 NPUs
 ./run_comm_test.sh -v a5 -n 2
 
-# Run only the tput testcase
-./run_comm_test.sh -t tput
-
 # Enable debug logging
 ./run_comm_test.sh -d -t tput
 ```
@@ -102,13 +72,18 @@ mpirun -n 2 echo "MPI OK"
 |------|-------------|---------|
 | `-n` | Number of available NPUs: 2, 4, or 8 | 8 |
 | `-v` | SoC version: `a3` (Ascend910B) or `a5` (Ascend910_9599) | a3 |
-| `-t` | Run specific testcase(s) (repeatable), e.g. `tput`, `treduce` | all |
-| `-d` | Enable debug mode with verbose init/sync logging | off |
-
-### How It Works
+| `-t` | Run specific testcase(s) (repeatable) | all |
+| `-a` | Include async instruction tests (requires CANN 9.0+) | off |
+| `-d` | Enable debug mode | off |
 
-The script automatically runs each testcase at each applicable rank count (2 / 4 / 8, up to `-n`), using GTest filters to select only the tests matching the current rank count. For example, with `-n 4` it first runs default tests at 2 ranks, then tests with the `4Ranks` suffix at 4 ranks, skipping 8-rank tests.
+## Suggested Reading Order
 
-## Suggested Reading
+| Order | Document |
+|-------|---------|
+| 1 (start here) | [docs/getting-started.md](../docs/getting-started.md) |
+| 2 | [docs/coding/tutorial.md](../docs/coding/tutorial.md) |
+| 3 | This page |
 
-- Getting started (recommended: CPU first, then NPU): [docs/getting-started.md](../docs/getting-started.md)
+For more complete environment setup and dependency details, see:
+- [docs/getting-started.md](../docs/getting-started.md)
+- [tests/README_zh.md](./README_zh.md) — 中文版
diff --git a/tests/README_zh.md b/tests/README_zh.md
index 4c8b38a0..fc4230db 100644
--- a/tests/README_zh.md
+++ b/tests/README_zh.md
@@ -1,94 +1,54 @@
 # tests/
 
-PTO Tile Lib 的测试与示例，覆盖 CPU 仿真与 NPU（`sim` 和板上 `npu` 两种模式）。
+PTO Tile Library 的测试与示例，覆盖 CPU 仿真与 NPU（`sim` 和板上 `npu` 两种模式）。
 
 ## 测试入口
 
-常见测试入口如下：
-
-- CPU Simulator 全量运行：`python3 tests/run_cpu.py --clean --verbose`
-- GEMM demo：`python3 tests/run_cpu.py --demo gemm --verbose`
-- Flash Attention demo：`python3 tests/run_cpu.py --demo flash_attn --verbose`
-- 单个 ST 用例：`python3 tests/script/run_st.py -r [sim|npu] -v [a3|a5] -t [TEST_CASE] -g [GTEST_FILTER_CASE]`
-- 一键脚本：`./tests/run_st.sh`、`./tests/run_cpu_tests.sh`
+| 场景 | 命令 |
+|------|------|
+| CPU Simulator 全量运行 | `python3 tests/run_cpu.py --clean --verbose` |
+| GEMM demo | `python3 tests/run_cpu.py --demo gemm --verbose` |
+| Flash Attention demo | `python3 tests/run_cpu.py --demo flash_attn --verbose` |
+| 单个 ST 用例（NPU） | `python3 tests/script/run_st.py -r [sim\|npu] -v [a3\|a5] -t [TEST_CASE] -g [GTEST_FILTER_CASE]` |
+| 一键脚本 | `./build.sh --run_all --a3 --sim` |
 
 ## 目录结构
 
-- `script/`：推荐的入口脚本
-  - `run_st.py`：构建并运行 NPU ST（`-r sim|npu -v a3|a5 -t <testcase> -g <gtest_filter>`）
-  - `build_st.py`：仅构建 NPU ST
-  - `all_cpu_tests.py`：批量构建并运行 CPU ST 套件
-  - `README.md`：脚本使用说明
-- `cpu/`：CPU 侧 ST 测试（gtest + CMake）
-  - `cpu/st/`：CPU ST 工程与 testcase 数据生成脚本
-- `npu/`：按 SoC 拆分的 NPU 侧 ST 测试
-  - `npu/a2a3/src/st/`：A2/A3 计算 ST
-  - `npu/a2a3/comm/st/`：A2/A3 通信 ST
-  - `npu/a5/src/st/`：A5 计算 ST
-  - `npu/a5/comm/st/`：A5 通信 ST
-- `common/`：共享测试资源（如存在）
-- `run_comm_test.sh`：通信 ST 一键运行脚本（详见下方说明）
-
-## 通信测试（Comm ST）
-
-通信测试验证多卡间的 PTO 通信原语（Put / Get / Broadcast / Gather / Scatter / Reduce / Notify / Wait / Test），基于 MPI + HCCL 实现。
-
-### 前置依赖：MPI 安装
-
-通信测试需要 MPI 环境（MPICH 或 OpenMPI 均可）。运行时需要两个组件：
-
-1. **`mpirun`**：用于启动多进程
-2. **`libmpi.so`**：运行时通过 `dlopen` 动态加载
-
-#### 安装 MPICH（推荐）
-
-```bash
-# Ubuntu / Debian
-sudo apt install mpich libmpich-dev
-
-# CentOS / RHEL / EulerOS
-sudo yum install mpich mpich-devel
-# 安装后可能需要加载 module 或手动加入 PATH：
-export PATH=/usr/lib64/mpich/bin:$PATH
 ```
-
-#### 从源码安装 MPICH（无 root 权限时）
-
-```bash
-wget https://www.mpich.org/static/downloads/4.2.3/mpich-4.2.3.tar.gz
-tar xzf mpich-4.2.3.tar.gz && cd mpich-4.2.3
-./configure --prefix=$HOME/mpich --disable-fortran
-make -j$(nproc) && make install
-export MPI_HOME=$HOME/mpich
-export PATH=$MPI_HOME/bin:$PATH
+tests/
+├── script/                     # 测试入口脚本（推荐从此入口）
+│   ├── run_st.py              # 构建并运行 NPU ST
+│   ├── build_st.py            # 仅构建 NPU ST
+│   ├── all_cpu_tests.py       # 批量构建并运行 CPU ST 套件
+│   └── README.md              # 脚本使用说明
+│
+├── cpu/                        # CPU 侧 ST 测试（gtest + CMake）
+│   └── st/                    # CPU ST 工程与 testcase 数据生成脚本
+│
+├── npu/                        # 按 SoC 拆分的 NPU 侧 ST 测试
+│   ├── a2a3/
+│   │   ├── src/st/            # A2/A3 计算 ST
+│   │   └── comm/st/           # A2/A3 通信 ST
+│   └── a5/
+│       ├── src/st/            # A5 计算 ST
+│       └── comm/st/           # A5 通信 ST
+│
+├── run_st.sh                   # NPU ST 一键运行脚本
+└── run_comm_test.sh           # 通信 ST 一键运行脚本
 ```
 
-#### 环境变量
-
-| 变量 | 说明 |
-|------|------|
-| `MPI_HOME` | MPI 安装根目录，脚本会自动搜索 `$MPI_HOME/bin/mpirun` |
-| `MPI_LIB_PATH` | 直接指定 `libmpi.so` 路径（覆盖默认搜索） |
-
-如果 `mpirun` 已在 `PATH` 中且 `libmpi.so` 在标准库路径下，则无需设置这些变量。
+## 同步与异步通信指令测试
 
-#### 验证安装
-
-```bash
-mpirun --version
-mpirun -n 2 echo "MPI OK"
-```
-
-### 同步与异步指令测试
+通信测试验证多卡间的 PTO 通信原语（Put / Get / Broadcast / Gather / Scatter / Reduce / Notify / Wait / Test），基于 MPI + HCCL 实现。
 
-通信测试分为**同步指令**（如 `tput`、`tget`）和**异步指令**（如 `tput_async`、`tget_async`）两类：
+通信测试分为**同步指令**和**异步指令**两类：
 
 | 类型 | 测试用例示例 | CANN 版本要求 |
 |------|-------------|--------------|
 | 同步指令 | `tput`、`tget`、`treduce`、`tbroadcast` 等 | CANN 8.x 及以上 |
 | 异步指令 | `tput_async`、`tget_async` | **CANN 9.0 及以上** |
 
-异步指令依赖 CANN 9.0 引入的 SDMA opapi 接口（如 `aclnnShmemSdmaStarsQuery`），在低版本 CANN 上会因符号缺失而运行失败。因此 `run_comm_test.sh` **默认不包含异步指令测试**，需通过 `-a` 参数显式启用。
+> 异步指令依赖 CANN 9.0 引入的 SDMA opapi 接口，在低版本 CANN 上会因符号缺失而运行失败。`run_comm_test.sh` **默认不包含异步指令测试**，需通过 `-a` 参数显式启用。
 
 ### 快速开始
 
@@ -99,43 +59,31 @@ mpirun -n 2 echo "MPI OK"
 # 包含异步指令测试（需 CANN 9.0+）
 ./run_comm_test.sh -a
 
-# 仅跑异步 tput 用例
-./run_comm_test.sh -t tput_async
-
 # 指定 A5 SoC，2 卡
 ./run_comm_test.sh -v a5 -n 2
 
-# 仅跑 tput 用例
-./run_comm_test.sh -t tput
-
 # 开启 debug 日志
 ./run_comm_test.sh -d -t tput
 ```
 
-也可以通过 `run_st.py` 直接运行，脚本会自动按 rank 数分轮执行：
-
-```bash
-# 自动分轮运行 tput_async（2/4/8 rank）
-python3 tests/script/run_st.py -r npu -v a3 -t comm/tput_async
-
-# 限制最多 2 rank
-python3 tests/script/run_st.py -r npu -v a3 -t comm/tput_async -n 2
-```
-
 ### 参数说明
 
 | 参数 | 说明 | 默认值 |
 |------|------|--------|
 | `-n` | 可用 NPU 数量：2、4 或 8 | 8 |
 | `-v` | SoC 版本：`a3`（Ascend910B）或 `a5`（Ascend910_9599） | a3 |
-| `-t` | 指定测试用例（可多次使用），如 `tput`、`treduce` | 全部 |
-| `-a` | 包含异步指令测试（`*_async`），需 CANN 9.0+ | 关闭 |
-| `-d` | 开启调试模式，打印详细初始化与同步日志 | 关闭 |
-
-### 运行机制
-
-脚本会根据 `-n` 指定的卡数，自动为每个测试用例分别以 2 / 4 / 8 rank 运行，通过 GTest Filter 确保每次只执行与当前 rank 数匹配的测试。例如 `-n 4` 时会先以 2 rank 跑默认用例，再以 4 rank 跑带 `4Ranks` 后缀的用例，跳过 8 rank 用例。
+| `-t` | 指定测试用例（可多次使用） | 全部 |
+| `-a` | 包含异步指令测试（需 CANN 9.0+） | 关闭 |
+| `-d` | 开启调试模式 | 关闭 |
 
 ## 建议阅读顺序
 
-- 入门指南（建议先 CPU，再 NPU）：[docs/getting-started.md](../docs/getting-started.md)
+| 顺序 | 文档 |
+|------|------|
+| 1（推荐先学） | [docs/getting-started_zh.md](../docs/getting-started_zh.md) |
+| 2 | [docs/coding/tutorial_zh.md](../docs/coding/tutorial_zh.md) |
+| 3 | 本页 |
+
+更完整的环境配置与依赖说明请参见：
+- [docs/getting-started_zh.md](../docs/getting-started_zh.md)
+- [tests/README.md](./README.md)：英文版