Skip to content

Backend time/failure reporting fixes for timeouts of local jobs#188

Merged
AlexJones0 merged 2 commits intolowRISC:masterfrom
AlexJones0:backend_state_fixes
May 1, 2026
Merged

Backend time/failure reporting fixes for timeouts of local jobs#188
AlexJones0 merged 2 commits intolowRISC:masterfrom
AlexJones0:backend_state_fixes

Conversation

@AlexJones0
Copy link
Copy Markdown
Contributor

@AlexJones0 AlexJones0 commented Apr 30, 2026

This PR fixes two bugs in the runtime backends:

  • One issue is inherited from the original launchers (whilst trying to maintain their behaviour), where any jobs that timeout do not report the DVSim-maintained runtime in the final results. We now always make sure to go through this path in the code, and make some refactoring improvements at the same time to centralize the timeout reporting logic to to reduce the potential for inconsistent failure messages in the future with more backends.
  • The other is a small bug introduced in the LocalRuntimeBackend, where the error message that was being reported was not correct ("Job killed!" instead of "Job timed out after X minutes").

See the commit messages for more info.

Example change:

 ### Test Results
 
 |  Stage  |  Name  | Tests                  |  Max Job Runtime  |  Simulated Time  |  Passing  |  Total  |  Pass Rate  |
 |:-------:|:------:|:-----------------------|:-----------------:|:----------------:|:---------:|:-------:|:-----------:|
-|         |        | ate_bootstrap_disjoint |      0.000s       |     0.000us      |     0     |    1    |   0.00 %    |
+|         |        | ate_bootstrap_disjoint |      6.007s       |     0.000us      |     0     |    1    |   0.00 %    |
 |         |        | **TOTAL**              |                   |                  |     0     |    1    |   0.00 %    |
 |         |        | **TOTAL**              |                   |                  |     0     |    1    |   0.00 %    |
 
 ## Coverage Results
 ### [Coverage Dashboard](/opentitan/scratch/master/chip_earlgrey_asic-sim-vcs/cov_report/dashboard.html)
 
 ## Failure Buckets
-* `Job killed!` has 1 failure:
+* `Job timed out after * minutes` has 1 failure:
     * Test ate_bootstrap_disjoint has 1 failure.
         * 0.ate_bootstrap_disjoint.58088541501576402788773390777979312745996416705647052408538425973305215300754\
           Log /opentitan/scratch/master/chip_earlgrey_asic-sim-vcs/0.ate_bootstrap_disjoint/latest/run.log

Copy link
Copy Markdown
Contributor

@hcallahan-lowrisc hcallahan-lowrisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Thanks @AlexJones0

Comment thread src/dvsim/runtime/backend.py Outdated
This fixes an issue inherited from the old base launcher in DVSim, but
only for the new runtime backend interface (i.e. not backported to the
launchers, due to their fragility).

Jobs that are killed after they time out directly emit a completion
event instead of going through the regular `_finish_job` call, which
mean that we do not get the job runtime / simulated time reported in the
final completed spec, which is not correct.

To alleviate this issue, change the LocalRuntimeBackend logic to always
go through this `RuntimeBackend._finish_job` interface, where an exit
code of `None` is now used to denote the fact that the process did not
finish naturally and instead timed out. This case is the first thing
checked before any log statuses, and so will always be reported as a
timeout failure without the different backends needing to implement it
manually (which should reduce some duplication in the future, and
potential for inconsistencies in the failure messages).

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
This was a small bug introduced in the new local runtime backend.
Originally the `_kill_job` method was just used by `_kill_many` (for
when a SIGINT/SIGTERM is sent to the scheduler), so I used this
`kill_requested` flag on the handle to determine when to override the
kill reason.

However, I then reused this function for the timeout case to kill the
ongoing job, without realising that this flag was still being set in
that function. In reality, to fix this issue we should just set it
directly in `kill_many` to denote that we are manually trying to kill a
job.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
@AlexJones0 AlexJones0 force-pushed the backend_state_fixes branch from 5e27dff to e9a0d72 Compare April 30, 2026 16:50
Copy link
Copy Markdown
Contributor

@rswarbrick rswarbrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sorting this out.

The changes look right to me. For the "in reality..." comment on the second commit, should we add an issue tracking that this is worth doing?

@AlexJones0
Copy link
Copy Markdown
Contributor Author

For the "in reality..." comment on the second commit, should we add an issue tracking that this is worth doing?

I think my wording here might have been a bit misleading - that comment is what the commit is doing! No more issue to be tracked 😃

@AlexJones0 AlexJones0 added this pull request to the merge queue May 1, 2026
Merged via the queue into lowRISC:master with commit f734f22 May 1, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants