Skip to content

Add adaptive message batching to reduce overhead under load#821

Merged
SimonHeybrock merged 17 commits into
mainfrom
adaptive-batching
Mar 27, 2026
Merged

Add adaptive message batching to reduce overhead under load#821
SimonHeybrock merged 17 commits into
mainfrom
adaptive-batching

Conversation

@SimonHeybrock
Copy link
Copy Markdown
Member

@SimonHeybrock SimonHeybrock commented Mar 19, 2026

Summary

Under sustained load, fixed per-batch overhead (workflow graph evaluation, serialization, thread dispatch) dominates cycle time. AdaptiveMessageBatcher wraps SimpleMessageBatcher and widens the batch window when consecutive batches signal overload, then cautiously de-escalates when headroom is available. This is lossless: the dashboard updates less frequently but no data is dropped.

This might make #739 redundant, or at least less urgent.

Design

Uses a Multiplicative Increase, Additive Decrease (MIAD) strategy inspired by TCP congestion control:

  • Escalation (+2 half-steps = x2 window): Triggered after 2 consecutive batches where processing time exceeds the batch window. Fast response to overload.
  • De-escalation (-1 half-step = x1/√2 window): Triggered after 3 consecutive batches with >25% headroom. The asymmetric step sizes mean two de-escalation steps undo one escalation, providing natural damping.
  • Idle de-escalation: Wall-clock fallback after 3 idle windows with no data.
  • Batch window grid: base × √2^n (1.0s → 1.41s → 2.0s → 2.83s → 4.0s → ...), capped at base × 2^max_level. Fixed grid avoids floating-point drift.

Key properties:

  • Idle poll cycles between batches do not reset the consecutive counters, which is essential for de-escalation under continuous light load (e.g., cosmic background after shutter close).
  • Empty time-gap batches (message_count=0) are treated as no-ops and don't interfere with escalation or de-escalation.
  • batch_interval_s flows through ServiceStatus → x5f2 → BackendStatusWidget.

Scenario tests

Comprehensive simulation-based tests in adaptive_batching_scenarios_test.py drive a MessageBatcher through realistic load patterns and assert on observable properties (escalation time, backlog bounds, oscillation count, de-escalation). All acceptance thresholds are collected in a central LIMITS table for tuning.

These tests are very long, therefore I created plots for each of them, shown in a comment below. I recommend reviewing those instead of the testcases.

Scenarios covered: step-function overload at four severity levels, no-escalation under light load (20-85% utilization), GC jitter resilience, boundary oscillation, creeping overload, de-escalation under continuous load (light, moderate, multi-level, partial), shutter open/close with cosmic background, backlog draining, and the 70-100% utilization dead zone.

Test plan

  • Verify adaptive batching behavior on a live dashboard under load

🤖 Generated with Claude Code

SimonHeybrock and others added 13 commits March 19, 2026 13:55
Under sustained load the fixed per-batch overhead (workflow graph evaluation,
serialization, thread dispatch) dominates cycle time. AdaptiveMessageBatcher
wraps SimpleMessageBatcher and widens the batch window (1s → 2s → 4s → 8s)
when consecutive non-empty batches indicate the system cannot keep up, then
de-escalates when idle cycles show spare capacity. This is lossless — the
dashboard simply updates less frequently, but no data is dropped.

The current batch interval is reported in ServiceStatus and displayed in the
backend status widget when above the base 1s value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… strategy

The old count-based heuristic used consecutive non-empty batches as a proxy
for overload, causing slow escalation, inability to de-escalate under light
continuous load, and false escalation when processing fits within the window.

The new strategy uses actual processing_time_s feedback:
- Escalation: after 2 consecutive batches where processing exceeds the window
- De-escalation: after 5 consecutive batches with <70% window utilization
- Idle fallback: wall-clock de-escalation after 3 idle windows (unchanged)
Add precondition assertions to transition tests (de-escalation, backlog
draining, repeated shutter) to prevent false positives when the expected
intermediate state is never reached. Parameterize light-load tests across
utilization levels (20%-85%), jitter tests across RNG seeds, and severity
tests across four overload intensities with min/max level bounds.

Add stabilization-after-escalation test and consolidate related thin
tests. Extract cyclic_cost helper.

The precondition guards exposed a real issue: the old
test_deescalates_after_step_down_to_light_load never actually triggered
escalation (0.95s < 1s window), passing trivially. With a corrected cost
function, the test reveals a batcher limitation where idle poll cycles
between batches reset the consecutive-underloaded counter, preventing
de-escalation under continuous light load. Marked xfail(strict=True).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The batcher cannot de-escalate when data is flowing continuously
because idle poll cycles between batches reset the consecutive-
underloaded counter. Four new tests cover the scope of this issue:

- Light continuous load after heavy phase (level 1 -> 0)
- Moderate continuous load after heavy phase (level 1 -> 0)
- Multi-level de-escalation (level 2+ -> 0)
- Partial de-escalation (level 2+ -> 1)

All four fail, confirming the limitation. The higher the escalation
level, the worse it gets: a larger window means more spare time,
more idle polls, and less chance for the underload counter to
accumulate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Idle poll cycles (report_batch(None)) between batches were resetting
the consecutive-overloaded and consecutive-underloaded counters.  At
higher escalation levels, the large batch window means most of each
cycle is spare time filled with idle polls, which prevented the
underload counter from ever reaching the de-escalation threshold.

The fix: idle polls no longer reset consecutive counters.  Genuine
idleness is already handled by the wall-clock fallback path, and the
overload counter is properly reset by non-overloaded real batches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three blind spots in the scenario test suite:

Dead zone (70-100% utilization at escalated level): When processing
fills the escalated window without enough headroom (<70%) for
de-escalation, the batcher stays stuck even if a lower level would
suffice. Test documents this as current behavior — if the strategy
is improved to probe lower levels, the expected final level should
change.

Jitter-induced sticky escalation: When mean processing equals the
batch window, jitter causes escalation (~25% chance of two consecutive
overloaded batches). At the escalated level, processing lands in the
dead zone, making escalation permanent. Test documents this.

Time-gap batches (message_count=0): The SimpleMessageBatcher can
return empty batches during data gaps. Tests verify these don't
disrupt ongoing escalation or de-escalation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The shutter-closed phase is not idle: cosmic background produces a
continuous stream of ev44 messages with very few events. This means
wall-clock idle de-escalation never applies; the batcher must
de-escalate via the underload counter.

Update existing shutter tests to use cosmic background (overhead=0.2,
per_s=0.01) instead of idle_cost() for the off-phase, making the
simulation more realistic.

Add test_severe_overload_to_cosmic_background: after reaching level
2+ from severe overload, shutter close drops to cosmic background.
Verifies de-escalation through all levels back to 0 via the underload
path — the most operationally important recovery scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Escalation still doubles the window (+2 half-steps), but de-escalation
now reduces by a factor of 1/sqrt(2) (-1 half-step). Two de-escalation
steps undo one escalation, providing natural damping.

The batch window lives on a fixed grid of base * sqrt(2)^n values,
avoiding floating-point drift. The asymmetric step sizes allow:
- Faster convergence to the right level (smaller probing steps down)
- Reduced dead zone (can explore windows between the old 2x levels)
- Lower consecutive-underload threshold (3 instead of 5) since each
  step is safer

Tuning changes:
- DEESCALATION_HEADROOM_RATIO: 0.7 -> 0.75
- DEESCALATION_UNDERLOAD_THRESHOLD: 5 -> 3
- ESCALATION_HALF_STEPS replaces ESCALATION_LEVEL_JUMP

The dead zone test now shows the batcher de-escalates from level 4
(4s) to level 3 (2.83s), previously stuck at level 4 (4s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_set_half_step replaced the inner SimpleMessageBatcher on every level
change, discarding messages buffered in _active_batch and _future_messages.
Update the batch length in place instead so the current active batch
completes normally and only the next boundary uses the new length.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix dead-zone threshold references in docstrings (70% → 75% to match
  DEESCALATION_HEADROOM_RATIO=0.75)
- Fix dead_zone_stuck docstring: 72.5% utilization at level 4 is
  underloaded (< 75%), not dead zone; batcher de-escalates to level 3
  where it gets stuck at ~78%
- Normalize level terminology to consistently mean half-steps matching
  state.level (e.g., level 4 = 4.0s window, not "level 2")
- Rework test_no_oscillation_at_steady_load to use a load that actually
  triggers escalation, preventing the test from being vacuously true
- Add test for non-default base_batch_length_s=2.0 to catch scaling bugs
- Add re-escalation assertion to repeated_shutter_cycles to verify the
  batcher re-escalates during subsequent on-phases
- Replace unittest.mock.patch with clock injection via constructor
  parameter on AdaptiveMessageBatcher, eliminating mock usage in both
  test files
@SimonHeybrock
Copy link
Copy Markdown
Member Author

SimonHeybrock commented Mar 20, 2026

Scenario timeline visualizations

Script to generate timeline plots for all adaptive batching scenario tests:
https://gist.github.com/SimonHeybrock/53d183c337a74282493d59cfefbaebd1

Run with python plot_adaptive_batching_scenarios.py from the repo root (after pip install -e ".[test]"). Generates 25 PNG plots + an index.html for browsing.

Each plot shows 4 panels:

  • Level — batch level over time with assertion bounds
  • Processing time — dots colored by utilization class (green < 75%, yellow 75-100% dead zone, red > 100% overloaded) vs batch window
  • Utilization — ratio with headroom/dead-zone/overloaded bands
  • Backlog — accumulation with limit annotations

SimonHeybrock and others added 3 commits March 23, 2026 04:57
…test

Tighten 8 limits that had excessive headroom (e.g., max_backlog 5.0→1.0
when actual is 0.4) and loosen boundary_oscillation max_oscillations
4→5 to avoid flakiness (old limit exactly matched worst-case across
50 seeds).

Increase gc_jitter jitter_fraction from 0.5 to 1.2 so spikes actually
enter the dead zone and occasionally exceed the batch window, testing
that the batcher tolerates isolated overloads. Previously the test
never left the headroom zone and was equivalent to a light-load test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge tests that run identical simulations but assert different facets
into single tests with all assertions. This removes 6 tests and 130
lines without losing any coverage.

Merges:
- step_function_backlog + severity_moderate → test_moderate_overload
- backlog_drains + steady_load_oscillation → test_moderate_overload_stabilizes_and_drains
- severity_overhead_dominated + stabilization_after_step + backlog_peaks_and_decreases → test_overhead_dominated_overload
- boundary_oscillation + jitter_sticky_escalation → test_boundary_jitter_escalates_and_sticks
- 3 de-escalation tests → parametrized test_deescalates_when_load_drops

Also removes no_escalation_when_fits (subsumed by light_load parametrization).
Many scenario tests ran far longer than needed, spending 50%+ of their
simulation in a flat steady state after all interesting events completed.
Trimmed durations while keeping 20-30s of headroom after the last event.

Also lowered the cycles_after() threshold in the overhead-dominated
stabilization check from 60s to 30s to match the shorter simulation.
@SimonHeybrock
Copy link
Copy Markdown
Member Author

Scenario timeline plots

Generated from the scenario test suite. Each plot shows 4 panels: batch level, processing time vs window, utilization ratio, and backlog.

Step-function escalation

step_escalation_time
moderate_overload_step
overhead_dominated_step
severity_severe
severity_extreme

Non-default base batch length

non_default_base

No escalation when not needed

light_load_20
light_load_60
light_load_80
light_load_85
gc_jitter

Steady overload

steady_moderate_overload
boundary_jitter

Creeping overload

creeping_overload
mild_creeping_overload

De-escalation

deescalation_to_idle
deescalation_to_light
deescalation_moderate
multi_level_deescalation
partial_deescalation

Realistic shutter scenarios

shutter_open_close
repeated_shutter_cycles
severe_to_cosmic

Processing-time awareness

fast_escalation_clear_overload

Dead zone (known limitation)

dead_zone_stuck

@SimonHeybrock SimonHeybrock marked this pull request as ready for review March 23, 2026 07:21
@github-project-automation github-project-automation Bot moved this to In progress in Development Board Mar 23, 2026
@SimonHeybrock SimonHeybrock moved this from In progress to Selected in Development Board Mar 23, 2026
@SimonHeybrock SimonHeybrock requested a review from jl-wynen March 25, 2026 09:20
Copy link
Copy Markdown
Member Author

@SimonHeybrock SimonHeybrock Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at the plots posted in an issue comment instead of reviewing the full file!

Copy link
Copy Markdown
Member

@jl-wynen jl-wynen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a band aid over a deeper problem. Can we optimise the slow code instead? That would also benefit other users, not just livedata.
Judging by your description, the problem is with codec that does not depend on the number of events. So is it sciline's scheduler, the job scheduler in livedata, or something deeper?

old_length = self.batch_length_s
self._half_step = new_half_step
new_length = self.batch_length_s
logger.warning(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a warning? Do you consider a batch change to be a config error? It seems to me like these changes will be common and part of normal operation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not consider it a config error in general. I expect in many cases that the batch size can be the minimum (1 second), but for larger or un-optimized data-reduction I expect that the batch size needs to be improved.

It seems to me like these changes will be common and part of normal operation.

I think the scenarios may give a wrong impression - I hope that we can stay at the minimum batch size almost always. Batch size increases are there to deal with spikes in the backlog (such as GC running), or as a way to keep operating without dropping data, until we have scaled our system (e.g., by running more backend workers).

@SimonHeybrock
Copy link
Copy Markdown
Member Author

SimonHeybrock commented Mar 26, 2026

This looks like a band aid over a deeper problem.

It is not a band aid, it improves the service resilience, extending the envelope under which can operate reliably without dropping data or accumulating a backlog without ability to ever process it.

Can we optimise the slow code instead?

Did that already (see several recent PRs). But no matter how much we optimize there will always be cases where dealing with an accumulating backlog is important.

That would also benefit other users, not just livedata.

Did that many times in the past, e.g., improving speed of scipp.group.

Judging by your description, the problem is with codec that does not depend on the number of events.

I can't remember saying anything about codec?

So is it sciline's scheduler, the job scheduler in livedata, or something deeper?

All of the above? We are running a service that consumes, processes, and publish data. There are many contributions to the overall performance. Individual bottlenecks (low hanging fruit) have been optimized.

@jl-wynen
Copy link
Copy Markdown
Member

I can't remember saying anything about codec?

Sorry, this is a typo, I mean 'code'.

I am trying to understand what parts of the codebase cause fluctuations in processing time. The description only mentions 'load'.

We have some procedure $P$ that we apply to the data $E$. For simplicity, let's say we apply it twice in succession: $P(E_1) \mathsf{then} P(E_2)$. Batching turns that into $P(E_1 \bigcup E_2)$. The run times of these are

  • $P(E_1) \mathsf{then} P(E_2) = 2\mathcal{O}(1) + 2\mathcal{O}(N) + 2\mathcal{\omega}(N^2)$
  • $P(E_1 \bigcup E_2) = \mathcal{O}(1) + \mathcal{O}(2N) + \mathcal{\omega}((2N)^2)$

With $\mathcal{\omega}((2N)^2) \ge 2\mathcal{\omega}(N^2)$ and $\mathcal{O}(2N) = 2\mathcal{O}(N)$, we get that in order for the batched approach to be faster, the $\mathcal{\omega}(N^2)$ must be negligible for the values $N$ that we use.

This means that we are optimising for code that does not depend on $N$. But the test scenarios include things like shutter opening and closing and cosmic background. Those only impact $N$, so batching does nothing here under the above assumption.

The only impact on time I can see (beyond noise) is CPU contention between threads if you run more workflows or visualisations. So are all test scenarios really realistic or are we only dealing with discreet, potentially large steps in load. And if so, can we better predict the load?

@SimonHeybrock
Copy link
Copy Markdown
Member Author

SimonHeybrock commented Mar 27, 2026

I can't remember saying anything about codec?

Sorry, this is a typo, I mean 'code'.

I am trying to understand what parts of the codebase cause fluctuations in processing time. The description only mentions 'load'.

Things that cause changes in processing time:

  • Number of running workflows
  • Number of neutron events in the current stream
  • Random stuff that might make processing slower for a short time, such as GC.
  • Network and Kafka, e.g., if broker was busy and we then suddenly get a larger batch of messages

[... maths ...]

This means that we are optimising for code that does not depend on N . But the test scenarios include things like shutter opening and closing and cosmic background. Those only impact N , so batching does nothing here under the above assumption.

All our workflows have a significant constant per-call cost C (seconds) (e.g., per-pixel or allocating large arrays independent of event count), as well as per-event cost. Say the cost for processing all the events we get in 1 second is N (seconds), then we have (time to process 1 second of the stream):

  • Batch size 1 second: Processing time C + N
  • Batch size 2 seconds: Processing time C/2 + N

That is, we can deal with a higher event rate. It also helps in cases where C > 1 second.

The only impact on time I can see (beyond noise) is CPU contention between threads if you run more workflows or visualisations. So are all test scenarios really realistic or are we only dealing with discreet, potentially large steps in load.

The test scenarios were mainly written to exclude that the mechanism ends up in weird states such as oscillation, or getting stuck at unnecessarily high batch sizes after a period of high load.

And if so, can we better predict the load?

If we could, so what? Does that help with dealing with C > 1, or being able to process higher event rates?

@jl-wynen
Copy link
Copy Markdown
Member

jl-wynen commented Mar 27, 2026

  • Number of neutron events in the current stream

This doesn't get better with increasing the time window for batches as both our equations show

  • Random stuff that might make processing slower for a short time, such as GC.
  • Network and Kafka, e.g., if broker was busy and we then suddenly get a larger batch of messages

Do these last longer than 1s? They should be short fluctuations that the adaptive batcher should ignore.

So we only have

  • Number of running workflows

So wouldn't it be enough to use something simple like batch_time = A * n_workflows with a factor A that we determine with benchmarks?

On a separate note, I think you are seeing the limitations of a monolithic application. It becomes increasingly difficult to scale as you add features. Did you reconsider micro services?

@SimonHeybrock
Copy link
Copy Markdown
Member Author

  • Number of neutron events in the current stream

This doesn't get better with increasing the time window for batches as both our equations show

I did not say that. I said it influences the load. Decreasing the repeat rate of the constant cost will reduce the overall performance, making processing more events possible.

  • Random stuff that might make processing slower for a short time, such as GC.
  • Network and Kafka, e.g., if broker was busy and we then suddenly get a larger batch of messages

Do these last longer than 1s? They should be short fluctuations that the adaptive batcher should ignore.

It does not matter if it is longer than 1s. If it takes 100ms it eats into the 1 second budget we have to process a batch.

So we only have

  • Number of running workflows

So wouldn't it be enough to use something simple like batch_time = A * n_workflows with a factor A that we determine with benchmarks?

No, because we (1) do not know n_workflows ahead of time, (2) it still depends on the events rate.

On a separate note, I think you are seeing the limitations of a monolithic application. It becomes increasingly difficult to scale as you add features. Did you reconsider micro services?

Umm, I don't understand. Let us have a call.

Copy link
Copy Markdown
Member

@jl-wynen jl-wynen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After an in person discussion, I don't have anything else intelligent to say. At least for now this seems to be the best approach. Maybe we can reconsider if we end up using smaller, distributed services for the different banks.

@SimonHeybrock SimonHeybrock enabled auto-merge March 27, 2026 10:18
@SimonHeybrock SimonHeybrock merged commit 1841191 into main Mar 27, 2026
4 checks passed
@SimonHeybrock SimonHeybrock deleted the adaptive-batching branch March 27, 2026 10:24
@github-project-automation github-project-automation Bot moved this from Selected to Done in Development Board Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants