
perf(buffers): remove GPU syncs from CircularBuffer/DelayBuffer hot path#5395

Open
haixuanTao wants to merge 2 commits into isaac-sim:main from haixuanTao:fix/circular-buffer-remove-gpu-syncs

Conversation


haixuanTao commented Apr 24, 2026

Human written summary

Ran a performance check on WBC-Agile and noticed a ~10% FPS increase in simulation with the changes below.

This should not change any of the simulator logic.

AI Generated Summary

CircularBuffer.append and CircularBuffer.__getitem__ each used torch.any(...) to detect post-reset state, forcing a GPU→CPU synchronization on every call. For consumers that invoke the delay buffer once per physics step (DelayedPDActuator, RemotizedPDActuator, DelayedDCMotor, observation-history buffers in observation_manager), this produces two syncs per call and measurably starves the GPU at high num_envs.

Replace the torch.any() probes with a CPU-side _any_first_push_pending flag maintained by reset() (or buffer lazy-init) and cleared by append() once the first-push replication runs. The public API is unchanged:

  • _num_pushes stays a per-env long GPU tensor; the current_length property still returns min(num_pushes, max_len) per batch index.
  • First-push replication still fills the entire history for any env whose num_pushes is zero at append time. The replication is now written via torch.where with a broadcast mask so the scatter is dynamic-shape-free and cannot reintroduce a sync through advanced-indexing shape inference.
  • __getitem__ still raises RuntimeError on a read from an unwritten env — just gated on the CPU flag instead of a GPU probe.
  • All other error/shape invariants preserved.
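The flag bookkeeping above can be sketched with a toy, list-backed buffer. `MiniCircularBuffer` below is a hypothetical stand-in, not the Isaac Lab class: it keeps the real names (`_any_first_push_pending`, `_num_pushes`) but replaces tensors with plain lists so only the flag transitions are visible.

```python
class MiniCircularBuffer:
    """Toy, list-backed stand-in for CircularBuffer (illustration only)."""

    def __init__(self, max_len: int, batch_size: int):
        self.max_len = max_len
        self.batch_size = batch_size
        self._ptr = -1
        self._buffer = None                    # lazily allocated on first append
        self._num_pushes = [0] * batch_size    # per-env counter (a GPU tensor in the PR)
        self._any_first_push_pending = True    # CPU-side flag; replaces torch.any() probes

    def reset(self, batch_ids=None):
        ids = range(self.batch_size) if batch_ids is None else batch_ids
        for i in ids:
            self._num_pushes[i] = 0
        self._any_first_push_pending = True    # plain Python bool: no GPU→CPU sync

    def append(self, data):
        if self._buffer is None:
            self._buffer = [[None] * self.batch_size for _ in range(self.max_len)]
        self._ptr = (self._ptr + 1) % self.max_len
        for i in range(self.batch_size):       # normal ring write at the pointer
            self._buffer[self._ptr][i] = data[i]
        if self._any_first_push_pending:       # first-push replication, gated on CPU
            for i in range(self.batch_size):
                if self._num_pushes[i] == 0:   # fill the whole history for reset envs
                    for slot in range(self.max_len):
                        self._buffer[slot][i] = data[i]
            self._any_first_push_pending = False
        for i in range(self.batch_size):
            self._num_pushes[i] += 1

    def __getitem__(self, batch_ids):
        if self._any_first_push_pending or self._buffer is None:
            raise RuntimeError("buffer empty")  # CPU flag, not a torch.any() GPU probe
        # The real code gathers per-env delayed slots; slot 0 suffices for the sketch.
        return [self._buffer[0][i] for i in batch_ids]
```

The real class does the replication with a single batched `torch.where` rather than Python loops; the point of the sketch is only that every flag transition happens on the CPU, so the hot path never has to read a device tensor back to the host.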

DelayBuffer.compute also drops a redundant .clone() on the returned gather. CircularBuffer.__getitem__ uses advanced indexing (self._buffer[index_in_buffer, self._ALL_INDICES]) which already allocates fresh storage, so callers that mutate the return value in place do not affect the internal ring. Saves one (batch, *feat) D2D copy per call.
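The safety argument for dropping the `.clone()` rests on advanced indexing materializing a new tensor rather than a view. NumPy follows the same rule for integer-array indexing, so a minimal NumPy sketch (shapes chosen for illustration, not the PR's actual dimensions) shows the behavior:

```python
import numpy as np

# Toy (max_len, batch, feat) ring, analogous to CircularBuffer._buffer.
ring = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
all_indices = np.arange(3)                 # stand-in for self._ALL_INDICES
index_in_buffer = np.array([0, 1, 0])      # per-env delayed slot to read

# Integer-array (advanced) indexing materializes a fresh array, not a view...
out = ring[index_in_buffer, all_indices]
out[:] = -1.0                              # ...so in-place mutation of the result

# ...cannot reach back into the ring: no defensive .clone()/copy is needed.
assert ring.min() >= 0.0 and out.shape == (3, 4)
```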

Microbenchmark on CUDA (RTX 5090, num_envs=6144, features=12, history_length=10, resetting ~10% of envs every 50 steps):

upstream DelayBuffer.compute : 619 µs/call
patched  DelayBuffer.compute : 354 µs/call  (1.75x)

Per actuator step with three delay buffers: ~265 µs/call × 3 ≈ 800 µs saved per physics tick.

Validated against the existing test_circular_buffer.py (10 cases) and test_delay_buffer.py (3 cases) suites. End-to-end verified by re-running a 6144-env locomotion training job on the patched tree and confirming the wandb FPS plateau matches a prior run that used a hand-rolled sync-free sibling of DelayBuffer.

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.
List any dependencies that are required for this change.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (existing functionality will not work without user modification)
  • Documentation update

Screenshots

Please attach before and after screenshots of the change if applicable.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Haixuan Xavier Tao <tao.xavier@outlook.com>
github-actions bot added the labels bug (Something isn't working) and isaac-lab (Related to Isaac Lab team) on Apr 24, 2026

isaaclab-review-bot left a comment


🤖 Isaac Lab Review Bot

Summary

This PR eliminates GPU→CPU synchronizations from CircularBuffer.append() and __getitem__() by replacing torch.any() probes with a CPU-side boolean flag _any_first_push_pending. It also removes a redundant .clone() from DelayBuffer.compute(). The implementation correctly maintains the first-push replication invariant and preserves the public API, with a reported 1.75x speedup on the hot path.

Architecture Impact

Moderate blast radius, well-contained. The CircularBuffer and DelayBuffer are used by:

  • DelayedPDActuator, RemotizedPDActuator, DelayedDCMotor (actuator delay modeling)
  • ObservationManager (observation history buffers)
  • Any custom code using these utility classes

The semantic changes are minimal: the flag tracks the same predicate (any(num_pushes == 0)) but via bookkeeping rather than GPU probes. The .clone() removal in DelayBuffer.compute() relies on the documented behavior that advanced indexing allocates fresh storage—this is correct per PyTorch semantics.

Implementation Verdict

Minor fixes needed — One potential correctness issue with the torch.where rewrite, and a minor initialization concern.

Test Coverage

Good additions:

  • test_partial_reset_then_read_raises — guards the flag-setting behavior in reset()
  • test_interleaved_partial_reset_and_append — validates first-push replication across partial reset cycles

Missing: No explicit test that the .clone() removal is safe (i.e., mutating DelayBuffer.compute() output doesn't corrupt internal state). The existing tests likely don't exercise this path.

CI Status

No CI checks available yet — should verify all existing tests pass, especially on GPU.

Findings

🟡 Warning: circular_buffer.py:157 — torch.where broadcasts differently than the original indexed assignment

The original code:

self._buffer[:, is_first_push] = data[is_first_push]

writes data[is_first_push] (shape: [num_reset_envs, *feat]) to all max_length slots for each reset env.

The new code:

mask = is_first_push.view(1, -1, *([1] * (data.ndim - 1)))
self._buffer = torch.where(mask, data.unsqueeze(0), self._buffer)

broadcasts data.unsqueeze(0) (shape: [1, batch_size, *feat]) across the max_length dimension. This is semantically equivalent because torch.where will replicate the [1, ...] tensor across the first dimension when mask is [1, batch_size, 1, ...]. However, this replaces the entire self._buffer tensor rather than writing in-place, which could cause issues if any external code holds a reference to self._buffer. Given this is an internal implementation detail and not exposed, this is likely fine, but worth noting.
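The claimed equivalence is easy to verify numerically. `np.where` broadcasts the same way as `torch.where`, so a small sketch with illustrative shapes (not the PR's actual dimensions) confirms that the masked `where` reproduces the indexed assignment:

```python
import numpy as np

max_len, batch, feat = 4, 5, 3
rng = np.random.default_rng(0)
buf_a = rng.standard_normal((max_len, batch, feat)).astype(np.float32)
buf_b = buf_a.copy()
data = rng.standard_normal((batch, feat)).astype(np.float32)
is_first_push = np.array([True, False, True, False, False])

# Original: indexed assignment writes data[is_first_push] into every
# max_len slot of each freshly reset env.
buf_a[:, is_first_push] = data[is_first_push]

# Rewrite: broadcast mask plus where; the [1, batch, *feat] operands are
# replicated across the max_len dimension, matching the assignment above.
mask = is_first_push.reshape(1, -1, *([1] * (data.ndim - 1)))
buf_b = np.where(mask, data[None], buf_b)

assert np.array_equal(buf_a, buf_b)
```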

🔵 Improvement: circular_buffer.py:47-53 — Flag initialization duplicated across __init__ and lazy buffer creation

The flag _any_first_push_pending is set to False in __init__ but then set to True in the first append() when self._buffer is None. This works but is slightly confusing. Consider initializing _any_first_push_pending = True in __init__ since a freshly constructed buffer is semantically "pending first push" until data is appended. This would also remove the need for the _any_first_push_pending = True line at L143.

🔵 Improvement: circular_buffer.py:115 — Reset clears buffer to 0.0 even for partial resets

self._buffer[:, batch_ids, :] = 0.0

This writes zeros to all max_length slots for reset batch indices. While correct, this is slightly wasteful since the first append will overwrite with torch.where. The zeros are only needed for the .buffer property to return cleared data—consider documenting this rationale.

🔵 Improvement: delay_buffer.py:176-178 — Add test for clone removal safety

The comment correctly explains why .clone() is unnecessary, but adding a test that mutates the return value and verifies the internal buffer is unaffected would be valuable:

def test_compute_returns_independent_tensor(delay_buffer):
    result = delay_buffer.compute(data)
    result[:] = 999  # mutate
    # verify internal buffer unchanged

🟡 Warning: circular_buffer.py:143 — Flag set inside conditional may not trigger on all first-append paths

When self._buffer is None, we set _any_first_push_pending = True at L143. But this is inside the if self._buffer is None block. The subsequent if self._any_first_push_pending: check at L150 will correctly trigger. However, if the flag was already False from __init__ and someone called append() twice rapidly, the first would set it True and clear it, and the second would skip replication—which is correct. This logic is sound, just subtle.

🔵 Improvement: test_circular_buffer.py:171-197 — Test uses .buffer property which involves clone + roll

The test test_interleaved_partial_reset_and_append accesses circular_buffer.buffer which creates a cloned, rolled view. This is fine for correctness testing but doesn't directly verify the internal _buffer state. The test assertions are valid since .buffer correctly reflects the logical state.

greptile-apps bot (Contributor) commented Apr 24, 2026

Greptile Summary

This PR eliminates two per-call GPU→CPU synchronizations in CircularBuffer by replacing torch.any(num_pushes == 0) probes with a CPU-side _any_first_push_pending boolean flag, and removes a redundant .clone() in DelayBuffer.compute. The public API and first-push replication semantics are preserved.

  • P1 — false RuntimeError on empty reset(): reset() sets _any_first_push_pending = True unconditionally, including when batch_ids=[]. Any subsequent __getitem__ call before the next append() raises "buffer empty" spuriously. In RL training loops where done-env IDs are computed dynamically, an empty list at early steps is common and would surface as an intermittent crash.

Confidence Score: 4/5

Mostly safe, but the unconditional flag set in reset() introduces a false "buffer empty" error when called with an empty batch ID list, which can occur in RL training loops.

One P1 behavioral regression: reset(batch_ids=[]) sets _any_first_push_pending = True even though no batch is actually reset, causing the next __getitem__ to raise incorrectly. All other changes are logically sound and well-tested. Score is 4 rather than 5 because the P1 needs a guard before this is safe to merge.

source/isaaclab/isaaclab/utils/buffers/circular_buffer.py — the reset() method needs a guard to avoid setting the flag when batch_ids is empty.

Important Files Changed

  • source/isaaclab/isaaclab/utils/buffers/circular_buffer.py — Replaces the per-call torch.any() GPU→CPU sync with a CPU-side bool flag _any_first_push_pending; the flag is set unconditionally in reset(), including for empty batch_ids, causing a false "buffer empty" RuntimeError in __getitem__ until the next append().
  • source/isaaclab/isaaclab/utils/buffers/delay_buffer.py — Removes the redundant .clone() from the compute() return path; advanced indexing in CircularBuffer.__getitem__ already allocates fresh storage, so the clone was unnecessary.
  • source/isaaclab/test/utils/test_circular_buffer.py — Adds two tests: test_partial_reset_then_read_raises guards the flag-based empty check, and test_interleaved_partial_reset_and_append validates first-push replication across multiple partial reset cycles.
  • source/isaaclab/docs/CHANGELOG.rst — Adds a 0.54.4 changelog entry describing the GPU sync removal and clone elimination.
  • source/isaaclab/config/extension.toml — Bumps version from 0.54.3 to 0.54.4.
  • CONTRIBUTORS.md — Adds Haixuan Xavier Tao to the contributors list.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[append called] --> B{_buffer is None?}
    B -- Yes --> C[allocate _buffer\nset _any_first_push_pending = True]
    C --> D[advance _pointer\nwrite data to _buffer at pointer]
    B -- No --> D
    D --> E{_any_first_push_pending?}
    E -- Yes --> F[compute is_first_push = num_pushes == 0\nbuild broadcast mask]
    F --> G[self._buffer = torch.where mask data self._buffer\nset _any_first_push_pending = False]
    G --> H[num_pushes += 1]
    E -- No --> H

    I[reset called] --> J{batch_ids empty?}
    J -- No / None --> K[zero num_pushes for indices\nzero buffer slots\nset _any_first_push_pending = True]
    J -- Yes --> L[no-op — flag should NOT be set\n current code sets it anyway]

    M[__getitem__ called] --> N{_any_first_push_pending\nor _buffer is None?}
    N -- Yes --> O[raise RuntimeError\n'buffer empty']
    N -- No --> P[compute valid_keys\nreturn advanced-index gather\nno clone needed]

Comments Outside Diff (1)

  1. source/isaaclab/isaaclab/utils/buffers/circular_buffer.py, line 103-121 (link)

    P1 False-positive RuntimeError when reset(batch_ids=[]) is called

    reset() unconditionally sets _any_first_push_pending = True regardless of whether any batch index was actually zeroed. If a caller passes an empty batch_ids (e.g. when no environments terminated at a given step), the flag is raised even though every batch still has valid data. The next __getitem__ call then throws a spurious "buffer empty" RuntimeError until the next append() clears the flag. The original torch.any(self._num_pushes == 0) check was immune to this because it probed the actual tensor state.

    In locomotion/RL training it is common for vectorized code to pass a dynamically-sized list of done-env IDs that can be empty on the first steps; hitting this branch would surface as an intermittent crash.

Reviews (1): last reviewed commit: "perf(buffers): remove GPU syncs from Cir..."

Greptile flagged a regression in the prior commit: ``reset()``
unconditionally set ``_any_first_push_pending = True``, so a call with
a zero-length ``batch_ids`` — common when termination masks produce an
empty env-ids tensor at a given step — would spuriously raise
``RuntimeError`` on the next ``__getitem__`` even though every batch
still held valid data. The original ``torch.any(num_pushes == 0)``
guard was immune because it probed the real tensor state.

Early-return from ``reset()`` when ``batch_ids`` is a zero-length
sequence (list or tensor) so the CPU flag stays consistent with the
buffer state. Adds a regression test covering both empty-list and
empty-tensor inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
haixuanTao force-pushed the fix/circular-buffer-remove-gpu-syncs branch from 56fdaf3 to 4b3701e on April 24, 2026 at 17:44

Labels

bug Something isn't working isaac-lab Related to Isaac Lab team


1 participant