Skip to content

Fix cancel race and replace flaky test#5493

Open
adamnovak wants to merge 1 commit intomasterfrom
issues/5448-fix-cancel-race
Open

Fix cancel race and replace flaky test#5493
adamnovak wants to merge 1 commit intomasterfrom
issues/5448-fix-cancel-race

Conversation

@adamnovak
Copy link
Copy Markdown
Member

This is on top of #5480 so that CI had a faint hope of passing. Only the last commit really goes here.

This will fix #5448.

Changelog Entry

To be copied to the draft changelog by merger:

  • Toil server can no longer have workflows get stuck CANCELING for several seconds when canceled before they fully start.

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

We used to handle this with a timeout, but we had a test to make sure
the timeout never actually had to happen. The timeout could in fact need
to happen, and the test was flaky.

I tried to get Anthropic Claude to solve this, and it noticed the race,
but was unable to come up with a design I liked for fixing it, so I did
it myself. I don't *really* like my design either, but at least it's
mine now.

This moves responsibility for marking a WES workflow as CANCELED from
the Celery task to the Celery task if it can and the ToilWorkflow
get_state() method otherwise. When somebody asks for the state of a
workflow, we ask Celery if it's actually stopped or not. If it has
stopped without error and is supposedly CANCELING, we declare it
CANCELED.

I'm removing the timeout-based way to go from CANCELING to CANCELED,
because if the task *is* still there and doing stuff, it can't really be
canceled yet.

It would still be nicer to have the responsibility in one place, but at
least this way I'm reducing and not increasing the number of weird
methods.

To test this I added a sleep that can make the cancel attempt win the
race, which involved adding an ugly argument to the fake-Celery code,
because we can't monkey-patch at class scope and expect a
Multiprocessing process to see it.
@adamnovak adamnovak force-pushed the issues/5448-fix-cancel-race branch from 4b7c31b to 2960d3b Compare April 17, 2026 03:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ToilWESServerWorkflowTest::test_run_and_cancel_workflows is flaky

1 participant