Add adaptive message batching to reduce overhead under load#821
Conversation
Under sustained load the fixed per-batch overhead (workflow graph evaluation, serialization, thread dispatch) dominates cycle time. AdaptiveMessageBatcher wraps SimpleMessageBatcher and widens the batch window (1s → 2s → 4s → 8s) when consecutive non-empty batches indicate the system cannot keep up, then de-escalates when idle cycles show spare capacity. This is lossless — the dashboard simply updates less frequently, but no data is dropped. The current batch interval is reported in ServiceStatus and displayed in the backend status widget when above the base 1s value. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… strategy The old count-based heuristic used consecutive non-empty batches as a proxy for overload, causing slow escalation, inability to de-escalate under light continuous load, and false escalation when processing fits within the window. The new strategy uses actual processing_time_s feedback: - Escalation: after 2 consecutive batches where processing exceeds the window - De-escalation: after 5 consecutive batches with <70% window utilization - Idle fallback: wall-clock de-escalation after 3 idle windows (unchanged)
Add precondition assertions to transition tests (de-escalation, backlog draining, repeated shutter) to prevent false positives when the expected intermediate state is never reached. Parameterize light-load tests across utilization levels (20%-85%), jitter tests across RNG seeds, and severity tests across four overload intensities with min/max level bounds. Add stabilization-after-escalation test and consolidate related thin tests. Extract cyclic_cost helper. The precondition guards exposed a real issue: the old test_deescalates_after_step_down_to_light_load never actually triggered escalation (0.95s < 1s window), passing trivially. With a corrected cost function, the test reveals a batcher limitation where idle poll cycles between batches reset the consecutive-underloaded counter, preventing de-escalation under continuous light load. Marked xfail(strict=True). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The batcher cannot de-escalate when data is flowing continuously because idle poll cycles between batches reset the consecutive- underloaded counter. Four new tests cover the scope of this issue: - Light continuous load after heavy phase (level 1 -> 0) - Moderate continuous load after heavy phase (level 1 -> 0) - Multi-level de-escalation (level 2+ -> 0) - Partial de-escalation (level 2+ -> 1) All four fail, confirming the limitation. The higher the escalation level, the worse it gets: a larger window means more spare time, more idle polls, and less chance for the underload counter to accumulate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Idle poll cycles (report_batch(None)) between batches were resetting the consecutive-overloaded and consecutive-underloaded counters. At higher escalation levels, the large batch window means most of each cycle is spare time filled with idle polls, which prevented the underload counter from ever reaching the de-escalation threshold. The fix: idle polls no longer reset consecutive counters. Genuine idleness is already handled by the wall-clock fallback path, and the overload counter is properly reset by non-overloaded real batches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three blind spots in the scenario test suite: Dead zone (70-100% utilization at escalated level): When processing fills the escalated window without enough headroom (<70%) for de-escalation, the batcher stays stuck even if a lower level would suffice. Test documents this as current behavior — if the strategy is improved to probe lower levels, the expected final level should change. Jitter-induced sticky escalation: When mean processing equals the batch window, jitter causes escalation (~25% chance of two consecutive overloaded batches). At the escalated level, processing lands in the dead zone, making escalation permanent. Test documents this. Time-gap batches (message_count=0): The SimpleMessageBatcher can return empty batches during data gaps. Tests verify these don't disrupt ongoing escalation or de-escalation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The shutter-closed phase is not idle: cosmic background produces a continuous stream of ev44 messages with very few events. This means wall-clock idle de-escalation never applies; the batcher must de-escalate via the underload counter. Update existing shutter tests to use cosmic background (overhead=0.2, per_s=0.01) instead of idle_cost() for the off-phase, making the simulation more realistic. Add test_severe_overload_to_cosmic_background: after reaching level 2+ from severe overload, shutter close drops to cosmic background. Verifies de-escalation through all levels back to 0 via the underload path — the most operationally important recovery scenario. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Escalation still doubles the window (+2 half-steps), but de-escalation now reduces by a factor of 1/sqrt(2) (-1 half-step). Two de-escalation steps undo one escalation, providing natural damping. The batch window lives on a fixed grid of base * sqrt(2)^n values, avoiding floating-point drift. The asymmetric step sizes allow: - Faster convergence to the right level (smaller probing steps down) - Reduced dead zone (can explore windows between the old 2x levels) - Lower consecutive-underload threshold (3 instead of 5) since each step is safer Tuning changes: - DEESCALATION_HEADROOM_RATIO: 0.7 -> 0.75 - DEESCALATION_UNDERLOAD_THRESHOLD: 5 -> 3 - ESCALATION_HALF_STEPS replaces ESCALATION_LEVEL_JUMP The dead zone test now shows the batcher de-escalates from level 4 (4s) to level 3 (2.83s), previously stuck at level 4 (4s). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_set_half_step replaced the inner SimpleMessageBatcher on every level change, discarding messages buffered in _active_batch and _future_messages. Update the batch length in place instead so the current active batch completes normally and only the next boundary uses the new length. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix dead-zone threshold references in docstrings (70% → 75% to match DEESCALATION_HEADROOM_RATIO=0.75) - Fix dead_zone_stuck docstring: 72.5% utilization at level 4 is underloaded (< 75%), not dead zone; batcher de-escalates to level 3 where it gets stuck at ~78% - Normalize level terminology to consistently mean half-steps matching state.level (e.g., level 4 = 4.0s window, not "level 2") - Rework test_no_oscillation_at_steady_load to use a load that actually triggers escalation, preventing the test from being vacuously true - Add test for non-default base_batch_length_s=2.0 to catch scaling bugs - Add re-escalation assertion to repeated_shutter_cycles to verify the batcher re-escalates during subsequent on-phases - Replace unittest.mock.patch with clock injection via constructor parameter on AdaptiveMessageBatcher, eliminating mock usage in both test files
Scenario timeline visualizationsScript to generate timeline plots for all adaptive batching scenario tests: Run with Each plot shows 4 panels:
|
…test Tighten 8 limits that had excessive headroom (e.g., max_backlog 5.0→1.0 when actual is 0.4) and loosen boundary_oscillation max_oscillations 4→5 to avoid flakiness (old limit exactly matched worst-case across 50 seeds). Increase gc_jitter jitter_fraction from 0.5 to 1.2 so spikes actually enter the dead zone and occasionally exceed the batch window, testing that the batcher tolerates isolated overloads. Previously the test never left the headroom zone and was equivalent to a light-load test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge tests that run identical simulations but assert different facets into single tests with all assertions. This removes 6 tests and 130 lines without losing any coverage. Merges: - step_function_backlog + severity_moderate → test_moderate_overload - backlog_drains + steady_load_oscillation → test_moderate_overload_stabilizes_and_drains - severity_overhead_dominated + stabilization_after_step + backlog_peaks_and_decreases → test_overhead_dominated_overload - boundary_oscillation + jitter_sticky_escalation → test_boundary_jitter_escalates_and_sticks - 3 de-escalation tests → parametrized test_deescalates_when_load_drops Also removes no_escalation_when_fits (subsumed by light_load parametrization).
Many scenario tests ran far longer than needed, spending 50%+ of their simulation in a flat steady state after all interesting events completed. Trimmed durations while keeping 20-30s of headroom after the last event. Also lowered the cycles_after() threshold in the overhead-dominated stabilization check from 60s to 30s to match the shorter simulation.
Scenario timeline plotsGenerated from the scenario test suite. Each plot shows 4 panels: batch level, processing time vs window, utilization ratio, and backlog. Step-function escalationNon-default base batch lengthNo escalation when not neededSteady overloadCreeping overloadDe-escalationRealistic shutter scenariosProcessing-time awarenessDead zone (known limitation) |
There was a problem hiding this comment.
Look at the plots posted in an issue comment instead of reviewing the full file!
jl-wynen
left a comment
There was a problem hiding this comment.
This looks like a band aid over a deeper problem. Can we optimise the slow code instead? That would also benefit other users, not just livedata.
Judging by your description, the problem is with codec that does not depend on the number of events. So is it sciline's scheduler, the job scheduler in livedata, or something deeper?
| old_length = self.batch_length_s | ||
| self._half_step = new_half_step | ||
| new_length = self.batch_length_s | ||
| logger.warning( |
There was a problem hiding this comment.
Why is this a warning? Do you consider a batch change to be a config error? It seems to me like these changes will be common and part of normal operation.
There was a problem hiding this comment.
I do not consider it a config error in general. I expect in many cases that the batch size can be the minimum (1 second), but for larger or un-optimized data-reduction I expect that the batch size needs to be improved.
It seems to me like these changes will be common and part of normal operation.
I think the scenarios may give a wrong impression - I hope that we can stay at the minimum batch size almost always. Batch size increases are there to deal with spikes in the backlog (such as GC running), or as a way to keep operating without dropping data, until we have scaled our system (e.g., by running more backend workers).
It is not a band aid, it improves the service resilience, extending the envelope under which can operate reliably without dropping data or accumulating a backlog without ability to ever process it.
Did that already (see several recent PRs). But no matter how much we optimize there will always be cases where dealing with an accumulating backlog is important.
Did that many times in the past, e.g., improving speed of
I can't remember saying anything about codec?
All of the above? We are running a service that consumes, processes, and publish data. There are many contributions to the overall performance. Individual bottlenecks (low hanging fruit) have been optimized. |
Sorry, this is a typo, I mean 'code'. I am trying to understand what parts of the codebase cause fluctuations in processing time. The description only mentions 'load'. We have some procedure
With This means that we are optimising for code that does not depend on The only impact on time I can see (beyond noise) is CPU contention between threads if you run more workflows or visualisations. So are all test scenarios really realistic or are we only dealing with discreet, potentially large steps in load. And if so, can we better predict the load? |
Things that cause changes in processing time:
All our workflows have a significant constant per-call cost C (seconds) (e.g., per-pixel or allocating large arrays independent of event count), as well as per-event cost. Say the cost for processing all the events we get in 1 second is N (seconds), then we have (time to process 1 second of the stream):
That is, we can deal with a higher event rate. It also helps in cases where C > 1 second.
The test scenarios were mainly written to exclude that the mechanism ends up in weird states such as oscillation, or getting stuck at unnecessarily high batch sizes after a period of high load.
If we could, so what? Does that help with dealing with C > 1, or being able to process higher event rates? |
This doesn't get better with increasing the time window for batches as both our equations show
Do these last longer than 1s? They should be short fluctuations that the adaptive batcher should ignore. So we only have
So wouldn't it be enough to use something simple like On a separate note, I think you are seeing the limitations of a monolithic application. It becomes increasingly difficult to scale as you add features. Did you reconsider micro services? |
I did not say that. I said it influences the load. Decreasing the repeat rate of the constant cost will reduce the overall performance, making processing more events possible.
It does not matter if it is longer than 1s. If it takes 100ms it eats into the 1 second budget we have to process a batch.
No, because we (1) do not know
Umm, I don't understand. Let us have a call. |
jl-wynen
left a comment
There was a problem hiding this comment.
After an in person discussion, I don't have anything else intelligent to say. At least for now this seems to be the best approach. Maybe we can reconsider if we end up using smaller, distributed services for the different banks.

























Summary
Under sustained load, fixed per-batch overhead (workflow graph evaluation, serialization, thread dispatch) dominates cycle time.
AdaptiveMessageBatcherwrapsSimpleMessageBatcherand widens the batch window when consecutive batches signal overload, then cautiously de-escalates when headroom is available. This is lossless: the dashboard updates less frequently but no data is dropped.This might make #739 redundant, or at least less urgent.
Design
Uses a Multiplicative Increase, Additive Decrease (MIAD) strategy inspired by TCP congestion control:
base × √2^n(1.0s → 1.41s → 2.0s → 2.83s → 4.0s → ...), capped atbase × 2^max_level. Fixed grid avoids floating-point drift.Key properties:
message_count=0) are treated as no-ops and don't interfere with escalation or de-escalation.batch_interval_sflows throughServiceStatus→ x5f2 →BackendStatusWidget.Scenario tests
Comprehensive simulation-based tests in
adaptive_batching_scenarios_test.pydrive aMessageBatcherthrough realistic load patterns and assert on observable properties (escalation time, backlog bounds, oscillation count, de-escalation). All acceptance thresholds are collected in a centralLIMITStable for tuning.These tests are very long, therefore I created plots for each of them, shown in a comment below. I recommend reviewing those instead of the testcases.
Scenarios covered: step-function overload at four severity levels, no-escalation under light load (20-85% utilization), GC jitter resilience, boundary oscillation, creeping overload, de-escalation under continuous load (light, moderate, multi-level, partial), shutter open/close with cosmic background, backlog draining, and the 70-100% utilization dead zone.
Test plan
🤖 Generated with Claude Code