fix(workflow): add concurrency control and fix timer race conditions #1347

Open
KhawarHabibKhan wants to merge 3 commits into microsoft:main from KhawarHabibKhan:feat/workflow-concurrency-control

KhawarHabibKhan commented Mar 13, 2026

Description

  • Add asyncio.Lock to guard step_n shared state in the workflow engine, preventing race conditions when step_semaphore > 1
  • Fix TOCTOU race in RDAgentTimer.is_timeout() by using cached _remain_time_duration instead of calling datetime.now() twice
  • Clamp _remain_time_duration to non-negative to prevent negative timer values propagating through the system
  • Add defensive clamp on MLflow remain_time and remain_percent metrics
  • Update DataScienceRDLoop._check_exit_conditions_on_step subclass override to match new async signature
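The timer changes listed above can be illustrated with a minimal sketch. It is not the actual RDAgentTimer code: the class name `SketchTimer`, the `target_time` attribute, and the constructor are assumptions made for illustration. Only `is_timeout`, `update_remain_time`, and `_remain_time_duration` are names taken from this description. The idea is to read the clock once, clamp the result, and decide from the cached value.

```python
from datetime import datetime, timedelta


class SketchTimer:
    """Hypothetical stand-in for a timer; not the actual RDAgentTimer."""

    def __init__(self, budget: timedelta) -> None:
        # `target_time` is an assumed attribute name for the deadline.
        self.target_time = datetime.now() + budget
        self._remain_time_duration = max(budget, timedelta(0))

    def update_remain_time(self) -> None:
        # Read the clock exactly once, then clamp so the cached
        # remaining time can never become negative.
        remaining = self.target_time - datetime.now()
        self._remain_time_duration = max(remaining, timedelta(0))

    def is_timeout(self) -> bool:
        # Refresh the cache, then decide from the cached value only.
        # No second datetime.now() call, so the check and the stored
        # remaining time cannot disagree.
        self.update_remain_time()
        return self._remain_time_duration <= timedelta(0)
```

With this shape, any metric derived from `_remain_time_duration` is at least zero, although a separate clamp at the logging site remains a reasonable second line of defense.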

Motivation and Context

When step_semaphore is set above 1, multiple async steps can read and decrement step_n concurrently without synchronization, which can lead to over-execution or missed termination. The timer's is_timeout() also had a time-of-check to time-of-use (TOCTOU) race: update_remain_time() and a separate datetime.now() comparison could disagree. In addition, negative _remain_time_duration values could propagate into MLflow metrics.
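The step_n race can be sketched as follows. This is a simplified illustration, not the workflow engine's code: `SketchLoop`, `_claim_step`, `_worker`, and `run` are hypothetical names, and the `asyncio.sleep(0)` calls stand in for whatever awaited work sits between the exit check and the counter update. The lock makes the check-then-decrement sequence atomic with respect to other coroutines.

```python
import asyncio


class SketchLoop:
    """Hypothetical stand-in for a workflow loop; not the real engine."""

    def __init__(self, step_n: int, concurrency: int) -> None:
        self.step_n = step_n  # remaining step budget (shared state)
        self.executed = 0     # number of steps actually run
        self._semaphore = asyncio.Semaphore(concurrency)
        self._step_lock = asyncio.Lock()

    async def _claim_step(self) -> bool:
        # Hold the lock across the whole check-then-decrement sequence.
        async with self._step_lock:
            if self.step_n <= 0:
                return False
            # Stand-in for awaited work between the check and the update;
            # without the lock, other coroutines could pass the check here.
            await asyncio.sleep(0)
            self.step_n -= 1
            return True

    async def _worker(self) -> None:
        while await self._claim_step():
            async with self._semaphore:
                await asyncio.sleep(0)  # stand-in for the real step body
                self.executed += 1

    async def run(self, workers: int) -> int:
        await asyncio.gather(*(self._worker() for _ in range(workers)))
        return self.executed


async def main() -> tuple[int, int]:
    # Construct inside the running event loop so the lock and semaphore
    # bind to it on every supported Python version.
    loop = SketchLoop(step_n=5, concurrency=3)
    executed = await loop.run(workers=4)
    return executed, loop.step_n
```

Calling `asyncio.run(main())` runs four workers against a budget of five steps. With the lock in place, exactly five steps execute and step_n ends at zero; removing the lock in this sketch would allow several workers to pass the check before any of them decrements.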

Types of changes

  • Fix bugs
  • Add new feature

📚 Documentation preview 📚: https://RDAgent--1347.org.readthedocs.build/en/1347/
