Add adaptive message batching to reduce overhead under load by SimonHeybrock · Pull Request #821 · scipp/esslivedata

SimonHeybrock · 2026-03-19T14:16:29Z

Summary

Under sustained load, fixed per-batch overhead (workflow graph evaluation, serialization, thread dispatch) dominates cycle time. AdaptiveMessageBatcher wraps SimpleMessageBatcher and widens the batch window when consecutive batches signal overload, then cautiously de-escalates when headroom is available. This is lossless: the dashboard updates less frequently but no data is dropped.

This might make #739 redundant, or at least less urgent.

Design

Uses a Multiplicative Increase, Additive Decrease (MIAD) strategy inspired by TCP congestion control:

Escalation (+2 half-steps = ×2 window): Triggered after 2 consecutive batches where processing time exceeds the batch window. Fast response to overload.
De-escalation (-1 half-step = ×1/√2 window): Triggered after 3 consecutive batches with >25% headroom. The asymmetric step sizes mean two de-escalation steps undo one escalation, providing natural damping.
Idle de-escalation: Wall-clock fallback after 3 idle windows with no data.
Batch window grid: base × √2^n (1.0s → 1.41s → 2.0s → 2.83s → 4.0s → ...), capped at base × 2^max_level. Fixed grid avoids floating-point drift.
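The escalation and de-escalation rules above can be sketched as a small controller. This is a minimal illustration, not the code from this PR: the class name `MiadWindow`, its field names, and the rule that a batch between 75% and 100% utilization resets both streaks are assumptions. The thresholds (2 overloaded batches, 3 batches with >25% headroom) and the √2 grid are taken from the description.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MiadWindow:
    """Toy MIAD controller on a sqrt(2) grid (hypothetical, not the PR's class)."""

    base_s: float = 1.0
    max_level: int = 3            # window is capped at base_s * 2**max_level
    escalate_after: int = 2       # consecutive overloaded batches
    deescalate_after: int = 3     # consecutive batches with >25% headroom
    headroom_ratio: float = 0.75  # utilization below this counts as underloaded
    half_step: int = 0
    _over: int = field(default=0, init=False, repr=False)
    _under: int = field(default=0, init=False, repr=False)

    @property
    def window_s(self) -> float:
        # Derived from the integer half-step each time, so repeated level
        # changes cannot accumulate floating-point error.
        return self.base_s * math.sqrt(2.0) ** self.half_step

    def report(self, processing_time_s: float) -> None:
        """Feed the processing time of one non-empty batch."""
        utilization = processing_time_s / self.window_s
        if utilization > 1.0:
            self._over += 1
            self._under = 0
        elif utilization < self.headroom_ratio:
            self._under += 1
            self._over = 0
        else:
            # Assumption: a batch between headroom and overload ends both streaks.
            self._over = 0
            self._under = 0

        if self._over >= self.escalate_after:
            self.half_step = min(self.half_step + 2, 2 * self.max_level)
            self._over = 0
        elif self._under >= self.deescalate_after:
            self.half_step = max(self.half_step - 1, 0)
            self._under = 0
```

Starting at a 1 s window, two reports of 1.5 s move the controller to half-step 2 (a 2 s window); three subsequent reports of 0.5 s bring it down one half-step to about 1.41 s.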

Key properties:

Idle poll cycles between batches do not reset the consecutive counters, which is essential for de-escalation under continuous light load (e.g., cosmic background after shutter close).
Empty time-gap batches (message_count=0) are treated as no-ops and don't interfere with escalation or de-escalation.
batch_interval_s flows through ServiceStatus → x5f2 → BackendStatusWidget.

Scenario tests

Comprehensive simulation-based tests in adaptive_batching_scenarios_test.py drive a MessageBatcher through realistic load patterns and assert on observable properties (escalation time, backlog bounds, oscillation count, de-escalation). All acceptance thresholds are collected in a central LIMITS table for tuning.

These tests are very long, so I created plots for each of them, shown in a comment below. I recommend reviewing those instead of the test cases.

Scenarios covered: step-function overload at four severity levels, no-escalation under light load (20-85% utilization), GC jitter resilience, boundary oscillation, creeping overload, de-escalation under continuous load (light, moderate, multi-level, partial), shutter open/close with cosmic background, backlog draining, and the 70-100% utilization dead zone.

Test plan

Verify adaptive batching behavior on a live dashboard under load

🤖 Generated with Claude Code

Under sustained load the fixed per-batch overhead (workflow graph evaluation, serialization, thread dispatch) dominates cycle time. AdaptiveMessageBatcher wraps SimpleMessageBatcher and widens the batch window (1s → 2s → 4s → 8s) when consecutive non-empty batches indicate the system cannot keep up, then de-escalates when idle cycles show spare capacity. This is lossless: the dashboard simply updates less frequently, but no data is dropped.

The current batch interval is reported in ServiceStatus and displayed in the backend status widget when above the base 1s value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… strategy

The old count-based heuristic used consecutive non-empty batches as a proxy for overload, causing slow escalation, inability to de-escalate under light continuous load, and false escalation when processing fits within the window. The new strategy uses actual processing_time_s feedback:

- Escalation: after 2 consecutive batches where processing exceeds the window
- De-escalation: after 5 consecutive batches with <70% window utilization
- Idle fallback: wall-clock de-escalation after 3 idle windows (unchanged)

Add precondition assertions to transition tests (de-escalation, backlog draining, repeated shutter) to prevent false positives when the expected intermediate state is never reached.

Parameterize light-load tests across utilization levels (20%-85%), jitter tests across RNG seeds, and severity tests across four overload intensities with min/max level bounds. Add stabilization-after-escalation test and consolidate related thin tests. Extract cyclic_cost helper.

The precondition guards exposed a real issue: the old test_deescalates_after_step_down_to_light_load never actually triggered escalation (0.95s < 1s window), passing trivially. With a corrected cost function, the test reveals a batcher limitation where idle poll cycles between batches reset the consecutive-underloaded counter, preventing de-escalation under continuous light load. Marked xfail(strict=True).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The batcher cannot de-escalate when data is flowing continuously because idle poll cycles between batches reset the consecutive-underloaded counter. Four new tests cover the scope of this issue:

- Light continuous load after heavy phase (level 1 -> 0)
- Moderate continuous load after heavy phase (level 1 -> 0)
- Multi-level de-escalation (level 2+ -> 0)
- Partial de-escalation (level 2+ -> 1)

All four fail, confirming the limitation. The higher the escalation level, the worse it gets: a larger window means more spare time, more idle polls, and less chance for the underload counter to accumulate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Idle poll cycles (report_batch(None)) between batches were resetting the consecutive-overloaded and consecutive-underloaded counters. At higher escalation levels, the large batch window means most of each cycle is spare time filled with idle polls, which prevented the underload counter from ever reaching the de-escalation threshold.

The fix: idle polls no longer reset consecutive counters. Genuine idleness is already handled by the wall-clock fallback path, and the overload counter is properly reset by non-overloaded real batches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three blind spots in the scenario test suite:

- Dead zone (70-100% utilization at escalated level): When processing fills the escalated window without enough headroom (<70%) for de-escalation, the batcher stays stuck even if a lower level would suffice. Test documents this as current behavior; if the strategy is improved to probe lower levels, the expected final level should change.
- Jitter-induced sticky escalation: When mean processing equals the batch window, jitter causes escalation (~25% chance of two consecutive overloaded batches). At the escalated level, processing lands in the dead zone, making escalation permanent. Test documents this.
- Time-gap batches (message_count=0): The SimpleMessageBatcher can return empty batches during data gaps. Tests verify these don't disrupt ongoing escalation or de-escalation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
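The ~25% figure for jitter-induced escalation can be reproduced under two simplifying assumptions that the commit message does not state: the jitter is symmetric around the mean, and successive batches are independent. With the mean processing time equal to the window $W$, each batch time $T_k$ then exceeds the window with probability $1/2$, so for any given pair of consecutive batches:

$$P(T_k > W \text{ and } T_{k+1} > W) = P(T_k > W)\,P(T_{k+1} > W) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$$

Since the escalation threshold is two consecutive overloaded batches, escalation becomes very likely over a long enough run at this load.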

The shutter-closed phase is not idle: cosmic background produces a continuous stream of ev44 messages with very few events. This means wall-clock idle de-escalation never applies; the batcher must de-escalate via the underload counter.

Update existing shutter tests to use cosmic background (overhead=0.2, per_s=0.01) instead of idle_cost() for the off-phase, making the simulation more realistic.

Add test_severe_overload_to_cosmic_background: after reaching level 2+ from severe overload, shutter close drops to cosmic background. Verifies de-escalation through all levels back to 0 via the underload path, the most operationally important recovery scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Escalation still doubles the window (+2 half-steps), but de-escalation now reduces by a factor of 1/sqrt(2) (-1 half-step). Two de-escalation steps undo one escalation, providing natural damping. The batch window lives on a fixed grid of base * sqrt(2)^n values, avoiding floating-point drift.

The asymmetric step sizes allow:

- Faster convergence to the right level (smaller probing steps down)
- Reduced dead zone (can explore windows between the old 2x levels)
- Lower consecutive-underload threshold (3 instead of 5) since each step is safer

Tuning changes:

- DEESCALATION_HEADROOM_RATIO: 0.7 -> 0.75
- DEESCALATION_UNDERLOAD_THRESHOLD: 5 -> 3
- ESCALATION_HALF_STEPS replaces ESCALATION_LEVEL_JUMP

The dead zone test now shows the batcher de-escalates from level 4 (4s) to level 3 (2.83s), previously stuck at level 4 (4s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

_set_half_step replaced the inner SimpleMessageBatcher on every level change, discarding messages buffered in _active_batch and _future_messages. Update the batch length in place instead so the current active batch completes normally and only the next boundary uses the new length.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
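The difference between replacing the inner batcher and updating it in place can be shown with a toy model. All classes below (`ToyBatcher`, `ReplacingWrapper`, `InPlaceWrapper`) are hypothetical stand-ins written for this illustration; they do not reproduce SimpleMessageBatcher or the real _set_half_step.

```python
class ToyBatcher:
    """Hypothetical stand-in for an inner batcher that buffers messages."""

    def __init__(self, batch_length_s: float) -> None:
        self.batch_length_s = batch_length_s
        self.buffered: list[str] = []

    def add(self, message: str) -> None:
        self.buffered.append(message)


class ReplacingWrapper:
    """Buggy pattern: every length change swaps in a fresh inner batcher."""

    def __init__(self) -> None:
        self.inner = ToyBatcher(1.0)

    def set_length(self, batch_length_s: float) -> None:
        # The old inner batcher, including its buffer, is dropped here.
        self.inner = ToyBatcher(batch_length_s)


class InPlaceWrapper:
    """Fixed pattern: only the length changes, the buffer is kept."""

    def __init__(self) -> None:
        self.inner = ToyBatcher(1.0)

    def set_length(self, batch_length_s: float) -> None:
        self.inner.batch_length_s = batch_length_s


buggy, fixed = ReplacingWrapper(), InPlaceWrapper()
for wrapper in (buggy, fixed):
    wrapper.inner.add("msg-1")
    wrapper.inner.add("msg-2")
    wrapper.set_length(2.0)
```

After the length change, `buggy.inner.buffered` is empty while `fixed.inner.buffered` still holds both messages.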

- Fix dead-zone threshold references in docstrings (70% → 75% to match DEESCALATION_HEADROOM_RATIO=0.75)
- Fix dead_zone_stuck docstring: 72.5% utilization at level 4 is underloaded (< 75%), not dead zone; batcher de-escalates to level 3 where it gets stuck at ~78%
- Normalize level terminology to consistently mean half-steps matching state.level (e.g., level 4 = 4.0s window, not "level 2")
- Rework test_no_oscillation_at_steady_load to use a load that actually triggers escalation, preventing the test from being vacuously true
- Add test for non-default base_batch_length_s=2.0 to catch scaling bugs
- Add re-escalation assertion to repeated_shutter_cycles to verify the batcher re-escalates during subsequent on-phases
- Replace unittest.mock.patch with clock injection via constructor parameter on AdaptiveMessageBatcher, eliminating mock usage in both test files
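Clock injection, as mentioned in the last item, generally looks like the sketch below. The names `IdleTracker`, `FakeClock`, and the `clock` parameter are assumptions made for this example; the actual constructor signature of AdaptiveMessageBatcher is not shown in this thread.

```python
import time
from collections.abc import Callable


class IdleTracker:
    """Hypothetical component that depends on wall-clock time."""

    def __init__(
        self,
        idle_timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock  # production code uses the default
        self._last_data = clock()

    def mark_data(self) -> None:
        self._last_data = self._clock()

    def is_idle(self) -> bool:
        return self._clock() - self._last_data >= self._idle_timeout_s


class FakeClock:
    """Test double: time only moves when the test says so."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
tracker = IdleTracker(idle_timeout_s=3.0, clock=clock)
clock.advance(2.0)
idle_after_2s = tracker.is_idle()
clock.advance(1.0)
idle_after_3s = tracker.is_idle()
```

Compared with patching a module-level time function, the test controls time through an ordinary argument, so nothing global is modified and no mock library is needed.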

SimonHeybrock · 2026-03-20T11:39:19Z

Scenario timeline visualizations

Script to generate timeline plots for all adaptive batching scenario tests:
https://gist.github.com/SimonHeybrock/53d183c337a74282493d59cfefbaebd1

Run with python plot_adaptive_batching_scenarios.py from the repo root (after pip install -e ".[test]"). Generates 25 PNG plots + an index.html for browsing.

Each plot shows 4 panels:

Level — batch level over time with assertion bounds
Processing time — dots colored by utilization class (green < 75%, yellow 75-100% dead zone, red > 100% overloaded) vs batch window
Utilization — ratio with headroom/dead-zone/overloaded bands
Backlog — accumulation with limit annotations

…test

Tighten 8 limits that had excessive headroom (e.g., max_backlog 5.0→1.0 when actual is 0.4) and loosen boundary_oscillation max_oscillations 4→5 to avoid flakiness (old limit exactly matched worst-case across 50 seeds).

Increase gc_jitter jitter_fraction from 0.5 to 1.2 so spikes actually enter the dead zone and occasionally exceed the batch window, testing that the batcher tolerates isolated overloads. Previously the test never left the headroom zone and was equivalent to a light-load test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge tests that run identical simulations but assert different facets into single tests with all assertions. This removes 6 tests and 130 lines without losing any coverage.

Merges:

- step_function_backlog + severity_moderate → test_moderate_overload
- backlog_drains + steady_load_oscillation → test_moderate_overload_stabilizes_and_drains
- severity_overhead_dominated + stabilization_after_step + backlog_peaks_and_decreases → test_overhead_dominated_overload
- boundary_oscillation + jitter_sticky_escalation → test_boundary_jitter_escalates_and_sticks
- 3 de-escalation tests → parametrized test_deescalates_when_load_drops

Also removes no_escalation_when_fits (subsumed by light_load parametrization).

Many scenario tests ran far longer than needed, spending 50%+ of their simulation in a flat steady state after all interesting events completed. Trimmed durations while keeping 20-30s of headroom after the last event. Also lowered the cycles_after() threshold in the overhead-dominated stabilization check from 60s to 30s to match the shorter simulation.

SimonHeybrock · 2026-03-23T05:57:44Z

Scenario timeline plots

Generated from the scenario test suite. Each plot shows 4 panels: batch level, processing time vs window, utilization ratio, and backlog.

Step-function escalation

Non-default base batch length

No escalation when not needed

Steady overload

Creeping overload

De-escalation

Realistic shutter scenarios

Processing-time awareness

Dead zone (known limitation)

SimonHeybrock · 2026-03-25T09:20:39Z

Look at the plots posted in an issue comment instead of reviewing the full file!

jl-wynen

This looks like a band aid over a deeper problem. Can we optimise the slow code instead? That would also benefit other users, not just livedata.
Judging by your description, the problem is with codec that does not depend on the number of events. So is it sciline's scheduler, the job scheduler in livedata, or something deeper?

jl-wynen · 2026-03-26T09:13:58Z

+        old_length = self.batch_length_s
+        self._half_step = new_half_step
+        new_length = self.batch_length_s
+        logger.warning(


Why is this a warning? Do you consider a batch change to be a config error? It seems to me like these changes will be common and part of normal operation.

I do not consider it a config error in general. I expect that in many cases the batch size can be the minimum (1 second), but for larger or unoptimized data reduction I expect that the batch size needs to be increased.

> It seems to me like these changes will be common and part of normal operation.

I think the scenarios may give a wrong impression - I hope that we can stay at the minimum batch size almost always. Batch size increases are there to deal with spikes in the backlog (such as GC running), or as a way to keep operating without dropping data, until we have scaled our system (e.g., by running more backend workers).

SimonHeybrock · 2026-03-26T10:31:10Z

> This looks like a band aid over a deeper problem.

It is not a band aid. It improves the service resilience, extending the envelope under which the service can operate reliably without dropping data or accumulating a backlog that it can never process.

> Can we optimise the slow code instead?

Did that already (see several recent PRs). But no matter how much we optimize, there will always be cases where dealing with an accumulating backlog is important.

> That would also benefit other users, not just livedata.

Did that many times in the past, e.g., improving speed of scipp.group.

> Judging by your description, the problem is with codec that does not depend on the number of events.

I can't remember saying anything about codec?

> So is it sciline's scheduler, the job scheduler in livedata, or something deeper?

All of the above? We are running a service that consumes, processes, and publishes data. There are many contributions to the overall performance. Individual bottlenecks (low-hanging fruit) have been optimized.

jl-wynen · 2026-03-26T13:58:02Z

> I can't remember saying anything about codec?

Sorry, this is a typo, I mean 'code'.

I am trying to understand what parts of the codebase cause fluctuations in processing time. The description only mentions 'load'.

We have some procedure $P$ that we apply to the data $E$. For simplicity, let's say we apply it twice in succession: $P(E_1)\ \text{then}\ P(E_2)$. Batching turns that into $P(E_1 \cup E_2)$. The run times of these are

$P(E_1)\ \text{then}\ P(E_2) = 2\mathcal{O}(1) + 2\mathcal{O}(N) + 2\omega(N^2)$
$P(E_1 \cup E_2) = \mathcal{O}(1) + \mathcal{O}(2N) + \omega((2N)^2)$

With $\omega((2N)^2) \ge 2\omega(N^2)$ and $\mathcal{O}(2N) = 2\mathcal{O}(N)$, we get that in order for the batched approach to be faster, the $\omega(N^2)$ term must be negligible for the values of $N$ that we use.

This means that we are optimising for code that does not depend on $N$. But the test scenarios include things like shutter opening and closing and cosmic background. Those only impact $N$, so batching does nothing here under the above assumption.

The only impact on time I can see (beyond noise) is CPU contention between threads if you run more workflows or visualisations. So are all test scenarios really realistic, or are we only dealing with discrete, potentially large steps in load? And if so, can we better predict the load?

SimonHeybrock · 2026-03-27T04:59:26Z

> > I can't remember saying anything about codec?

> Sorry, this is a typo, I mean 'code'.

> I am trying to understand what parts of the codebase cause fluctuations in processing time. The description only mentions 'load'.

Things that cause changes in processing time:

Number of running workflows
Number of neutron events in the current stream
Random stuff that might make processing slower for a short time, such as GC.
Network and Kafka, e.g., if broker was busy and we then suddenly get a larger batch of messages

[... maths ...]

> This means that we are optimising for code that does not depend on $N$. But the test scenarios include things like shutter opening and closing and cosmic background. Those only impact $N$, so batching does nothing here under the above assumption.

All our workflows have a significant constant per-call cost C (seconds) (e.g., per-pixel or allocating large arrays independent of event count), as well as per-event cost. Say the cost for processing all the events we get in 1 second is N (seconds), then we have (time to process 1 second of the stream):

Batch size 1 second: Processing time C + N
Batch size 2 seconds: Processing time C/2 + N

That is, we can deal with a higher event rate. It also helps in cases where C > 1 second.
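The arithmetic above generalizes to any batch length: a batch covering B seconds of stream costs C + N·B to process, so the cost per second of stream is C/B + N. The helper names below are made up for this illustration, and the example numbers are arbitrary rather than measured.

```python
import math


def cost_per_stream_second(c: float, n: float, batch_s: float) -> float:
    """Processing time needed per second of stream.

    c: constant per-call cost in seconds.
    n: per-event cost for the events arriving in one second, in seconds.
    A batch of length batch_s costs c + n * batch_s in total.
    """
    return c / batch_s + n


def min_sustainable_batch_s(c: float, n: float) -> float:
    """Smallest batch length for which processing keeps up (cost <= 1)."""
    if n >= 1.0:
        return math.inf  # per-event cost alone already exceeds real time
    return c / (1.0 - n)


# Arbitrary example: C = 0.6 s per call, N = 0.6 s per second of stream.
at_1s = cost_per_stream_second(0.6, 0.6, 1.0)   # about 1.2: falls behind
at_2s = cost_per_stream_second(0.6, 0.6, 2.0)   # about 0.9: keeps up
break_even = min_sustainable_batch_s(0.6, 0.6)  # about 1.5 s
```

With C = 1.5 s and N = 0.1 s, a 1 s batch can never keep up (cost 1.6), while a 2 s batch needs only 0.85 s per second of stream, which is the C > 1 second case mentioned above. Widening the window does nothing for N itself; it only dilutes C.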

> The only impact on time I can see (beyond noise) is CPU contention between threads if you run more workflows or visualisations. So are all test scenarios really realistic, or are we only dealing with discrete, potentially large steps in load?

The test scenarios were mainly written to exclude that the mechanism ends up in weird states such as oscillation, or getting stuck at unnecessarily high batch sizes after a period of high load.

> And if so, can we better predict the load?

If we could, so what? Does that help with dealing with C > 1, or being able to process higher event rates?

jl-wynen · 2026-03-27T08:39:54Z

> Number of neutron events in the current stream

This doesn't get better with increasing the time window for batches, as both our equations show.

> Random stuff that might make processing slower for a short time, such as GC.

> Network and Kafka, e.g., if broker was busy and we then suddenly get a larger batch of messages

Do these last longer than 1s? They should be short fluctuations that the adaptive batcher should ignore.

So we only have

> Number of running workflows

So wouldn't it be enough to use something simple like batch_time = A * n_workflows with a factor A that we determine with benchmarks?

On a separate note, I think you are seeing the limitations of a monolithic application. It becomes increasingly difficult to scale as you add features. Did you reconsider micro services?

SimonHeybrock · 2026-03-27T08:48:28Z

> > Number of neutron events in the current stream

> This doesn't get better with increasing the time window for batches as both our equations show

I did not say that. I said it influences the load. Decreasing the repeat rate of the constant cost reduces the overall processing cost, which makes it possible to process more events.

> > Random stuff that might make processing slower for a short time, such as GC.

> > Network and Kafka, e.g., if broker was busy and we then suddenly get a larger batch of messages

> Do these last longer than 1s? They should be short fluctuations that the adaptive batcher should ignore.

It does not matter if it is longer than 1s. If it takes 100ms, it eats into the 1 second budget we have to process a batch.

> So we only have

> > Number of running workflows

> So wouldn't it be enough to use something simple like batch_time = A * n_workflows with a factor A that we determine with benchmarks?

No, because (1) we do not know n_workflows ahead of time, and (2) it still depends on the event rate.

> On a separate note, I think you are seeing the limitations of a monolithic application. It becomes increasingly difficult to scale as you add features. Did you reconsider micro services?

Umm, I don't understand. Let us have a call.

jl-wynen

After an in-person discussion, I don't have anything else intelligent to say. At least for now this seems to be the best approach. Maybe we can reconsider if we end up using smaller, distributed services for the different banks.

SimonHeybrock and others added 13 commits March 19, 2026 13:55

Initial adaptive batching scenario tests

9c0e80b

Add batch time reporting

a2b9232

Inline

b7e9714

SimonHeybrock and others added 3 commits March 23, 2026 04:57

SimonHeybrock marked this pull request as ready for review March 23, 2026 07:21

SimonHeybrock added this to Development Board Mar 23, 2026

github-project-automation Bot moved this to In progress in Development Board Mar 23, 2026

SimonHeybrock moved this from In progress to Selected in Development Board Mar 23, 2026

SimonHeybrock requested a review from jl-wynen March 25, 2026 09:20

SimonHeybrock commented Mar 25, 2026

jl-wynen reviewed Mar 26, 2026

jl-wynen approved these changes Mar 27, 2026

SimonHeybrock enabled auto-merge March 27, 2026 10:18

Merge branch 'main' into adaptive-batching

a7311c3

SimonHeybrock merged commit 1841191 into main Mar 27, 2026
4 checks passed

SimonHeybrock deleted the adaptive-batching branch March 27, 2026 10:24

github-project-automation Bot moved this from Selected to Done in Development Board Mar 27, 2026

SimonHeybrock merged 17 commits into main from adaptive-batching
