Skip to content

fix: ensure consistent node workload assignment by sorting nodes by gwid#1318

Open
ttimonen wants to merge 1 commit intopytest-dev:masterfrom
ttimonen:ttimonen-deterministic-loadscope-985
Open

fix: ensure consistent node workload assignment by sorting nodes by gwid#1318
ttimonen wants to merge 1 commit intopytest-dev:masterfrom
ttimonen:ttimonen-deterministic-loadscope-985

Conversation

@ttimonen
Copy link
Copy Markdown

Fixes the flaky test_workqueue_ordered_by_size test (#985 and #1248 )

Here's a quick checklist that should be present in PRs:

  • Make sure to include reasonable tests for your change if necessary
  • We use towncrier for changelog management, so please add a news file into the changelog folder.

…ateway ID.

Fixes the flaky test_workqueue_ordered_by_size test.
Copy link
Copy Markdown
Member

@RonnyPfannschmidt RonnyPfannschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That one is interesting

Ordered at that particular place shouldn't be necessary

We may have a timing issue in worker startup

@ttimonen
Copy link
Copy Markdown
Author

ttimonen commented Apr 1, 2026

That one is interesting

Ordered at that particular place shouldn't be necessary

We may have a timing issue in worker startup

Let me try to expand on what I think happens here, based on my understanding.

  • call to test_workqueue_ordered_by_size(...) =>
  1. We create test_a.py and test_b.py.
  2. We invoke: pytest -n2 --dist=loadscope test_a.py test_b.py. A lot of things happen.

from xdist’s perspective:
a) pytest_runtestloop() is called, triggering creation of LoadScopeScheduling() object.
b) worker_workerready() is called, triggering a call to sched.add_node().
c) worker_collectionfinish() is called, triggering a call to sched.schedule().
Didn’t verify, but I’m assuming ordering of a,b,c.

Now the interesting part is probably on the event processing in the workerready -side. So in b the add_node adds the nodes to the assigned_work dictionary in arbitrary order. While python dicts are all ordered nowadays, the insertion order varies here. I.e. add_node calls come from the DSession’s event queue, and the two gw remotes are competing in which one will submit the “workerready” event first (irrespective of the order they were started). So add_node(gw0) and add_node(gw1) will be called in arbitrary order.
This is ok in isolation. However…

  1. output of 2. has lines like
    3a:
    [gw0] [ 3%] PASSED test_b.py::test[0]
    [gw1] [ 6%] PASSED test_a.py::test[0]
    ...
    or 3b
    [gw1] [ 3%] PASSED test_b.py::test[0]
    [gw0] [ 6%] PASSED test_a.py::test[0]
    ..
    depending on which worker became ready first.

To make output deterministic, we could either synthesize the gw identity based on their order of becoming ready (might have other side-effects though), assign work on name-order (which is done with this patch and seems like the least amount of friction), or do something else (like hide the id from output, since it’s semantically irrelevant (yet part of the implicit API now), or make the test ignore the gw id; bit ambiguous what the test would be testing in that case though).

@ttimonen
Copy link
Copy Markdown
Author

ttimonen commented Apr 5, 2026

A minimal example to reproduce the problem and the effectiveness of the fix:

[ testing]$ nix shell --impure --expr 'let p = import <nixpkgs> {}; in [ p.nushell (p.python3.withPackages (x: with x; [ pytest execnet pytest-xdist ])) ]' \
  -c nu  -c '1..100 | par-each {pytest acceptance_test.py -k test_workqueue_ordered_by_size | complete} | group-by --to-table exit_code | update items {|x| $x.items | length}'
╭───┬───────────┬───────╮
│ # │ exit_code │ items │
├───┼───────────┼───────┤
│ 0 │ 1         │    31 │
│ 1 │ 0         │    69 │
╰───┴───────────┴───────╯

Now replace the default pytest-xdist from current branch with PYTHONPATH=../src

[ testing]$ nix shell --impure --expr 'let p = import <nixpkgs> {}; in [ p.nushell (p.python3.withPackages (x: with x; [ pytest execnet pytest-xdist ])) ]' \
  -c nu  -c '1..100 | par-each {PYTHONPATH=../src pytest acceptance_test.py -k test_workqueue_ordered_by_size | complete} | group-by --to-table exit_code | update items {|x| $x.items | length}'
╭───┬───────────┬───────╮
│ # │ exit_code │ items │
├───┼───────────┼───────┤
│ 0 │ 0         │   100 │
╰───┴───────────┴───────╯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants