Skip to content

queue-runner: resolve CA derivations after dependency outputs land#1629

Open
amaanq wants to merge 6 commits intoNixOS:masterfrom
obsidiansystems:ca-resolve-two-phase
Open

queue-runner: resolve CA derivations after dependency outputs land#1629
amaanq wants to merge 6 commits intoNixOS:masterfrom
obsidiansystems:ca-resolve-two-phase

Conversation

@amaanq
Copy link
Copy Markdown
Member

@amaanq amaanq commented Mar 30, 2026

Instead of resolving at dispatch time and carrying two drv identities through the gRPC layer, resolve in succeed_step once outputs are in the DB. The resolved drv becomes a fresh Step that dispatch picks up normally; the builder only ever sees one drv. A new resolvedFromBuild/resolvedFromStep foreign key BuildSteps ties the resolved step back to the dependency that triggered it.

Comment thread subprojects/hydra/sql/hydra.sql Outdated
Copy link
Copy Markdown
Member

@Ericson2314 Ericson2314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the point where we try_resolve before, we do this:

  • If result is None or "same derivation":

    just continue with this step, same as we always did

  • If result is Some(drv) and that drv is not the same as the current drv:

    Step finished, status "resolved", fill in the DB pointer to the new step, and the new step now builds, this step is done.


the current "try resolve reverse dependencies" is a possible optimisation --- incrementally resolve reverse dependencies, reminiscent of union find --- but that's a more complex thing we shouldn't worry about at this time.

@amaanq amaanq force-pushed the ca-resolve-two-phase branch 2 times, most recently from ae692ac to 05eef39 Compare March 31, 2026 03:10
Comment thread subprojects/crates/db/src/models.rs Outdated
Comment thread subprojects/hydra/sql/hydra.sql Outdated
Comment thread subprojects/hydra-queue-runner/src/state/mod.rs Outdated
Comment thread subprojects/hydra-queue-runner/src/state/mod.rs Outdated
@Ericson2314 Ericson2314 force-pushed the ca-resolve-two-phase branch from 05eef39 to de9aed8 Compare March 31, 2026 18:45
@artemist artemist force-pushed the ca-resolve-two-phase branch 6 times, most recently from 0768499 to 80430bf Compare April 6, 2026 18:37
@artemist
Copy link
Copy Markdown
Member

artemist commented Apr 6, 2026

I think I have this working, but there are a few things I'm unclear on and would like to figure out:

  • Do the BuildSteps entries we create in the database properly map to the Step objects in the queue-runner, and do they have sufficient information to recover state after a restart and be shown in the UI?
  • How should we handle cached failures, and what is actually needed?
  • Are we doing any double-building, and how can we test that?

I also believe my change needs substantial refactoring, as I'm using only parts of several functions (e.g. create_step)

@artemist artemist force-pushed the ca-resolve-two-phase branch 2 times, most recently from 3c7b586 to 7eb2319 Compare April 7, 2026 18:49
@artemist
Copy link
Copy Markdown
Member

artemist commented Apr 7, 2026

Rebased onto #1642

@artemist artemist force-pushed the ca-resolve-two-phase branch from 7eb2319 to 62abe6f Compare April 7, 2026 20:03
Comment thread subprojects/hydra-queue-runner/src/state/mod.rs
Comment thread subprojects/hydra-queue-runner/src/state/mod.rs Outdated
Comment on lines +406 to +411
} else {
None
}
};

if let Some(resolved_path) = resolved {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can combine this with the previous branch? Or at least we could compute a "do we want to resolve" bool, and then have the actual resolving (which much succeed) and this stuff in the same branch.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doable, though it will take out a lock on the step's derivation for longer than strictly necessary.

Comment thread subprojects/hydra-queue-runner/src/state/mod.rs Outdated
resolved_result.set_start_time_now();
resolved_result.set_stop_time_now();
resolved_result.log_file.clone_from(&job.result.log_file);
finish_build_step(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to do everything in one transaction here. finish_build_step current creates its own transaction, so let's make take the transaction as an argument instead.


tx.set_resolved_to(build_id, step_nr, nr).await?;
tx.commit().await?;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO think about restart resiliency. I believe everything else after this point in this branch doesn't use the database. So that's good, we aren't logically splitting a transaction in two (assuming the previous stuff is done).

When is left to think about is ---- if the queue runner was restarted at this point, would the general "rebuild state machine from build database state" logic get us into the same step as what this code right here does?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make a variation of our CA tests where where the queue runner dies here and then restarts to test this

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately it looks like hydra-queue-runner is making duplicate BuildStep rows in the database on restart. I'm unsure if this also occurs for non-CA derivations, but it should be fixed. Restart currently doesn't fail, but it leaves the DB in a not very clean state.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that may be an issue with CA builds, since in create_step we assume that all CAFloating derivations are unfinished, instead of checking if the outputs exist.

It is still true that the queue runner does not attempt to load steps from the database on restart, it recalculates them.

Comment on lines +458 to +460
Arc::new(parking_lot::RwLock::new(HashSet::new())),
Arc::new(parking_lot::RwLock::new(HashSet::new())),
Arc::new(parking_lot::RwLock::new(HashSet::new())),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a big fat comment why it is correct to not give this function the mutable state it expects

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment made

Comment on lines +489 to +490
// If we're the root of a build then we need to become the new root,
// lest the Step get garbage collected.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment should be clear this is about Arc keeping things alive, not store keeping ,drv files alive.

We should also be clear that in the non-root case, the idea is that we registered some reverse des above that will keep us alive. Maybe we should assert that we did in fact go through that loop.

Copy link
Copy Markdown
Member

@Ericson2314 Ericson2314 Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to figure out how to keep the .drv file alive for all resolved derivations, and the GC roots that nix-eval-jobs made will not do the job since these are new drvs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment should be done

Comment thread subprojects/hydra/sql/hydra.sql Outdated
foreign key (build) references Builds(id) on delete cascade,
foreign key (propagatedFrom) references Builds(id) on delete cascade
foreign key (propagatedFrom) references Builds(id) on delete cascade,
foreign key (build, resolvedToStep) references BuildSteps(build, stepnr) on delete cascade
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So long term, when the database distinguishes derivations from build attempts, this should actually point to a derivation, not another build attempt.

A build attempt samples the state of the database at that point to resolve, and creates a new (resolved) derivation. That derivation can be attempted to be built multiple times (We don't really care to privilage the first attempt with a foreign key). However, subsequent attempts to build the original derivation --- if we rebuild other upstream derivations and they are content-addressed and non-deterministic --- will result in a different resolved derivations, and that too can be attempted to build multiple times.

Memoizing realization in this way does give us something close a derived build trace, but the fact that our build trace is thus keyed off attempts not not derivations gives us some ability to represent (and not choke on) incoherence.

@artemist artemist force-pushed the ca-resolve-two-phase branch 5 times, most recently from 4be2bbb to 0344518 Compare April 13, 2026 14:31
@Ericson2314
Copy link
Copy Markdown
Member

Unfortunately, once #1667 is merged in, this PR will hang.

@Ericson2314 Ericson2314 force-pushed the ca-resolve-two-phase branch 2 times, most recently from 1f18bcf to f040eaa Compare April 14, 2026 12:15
@artemist artemist force-pushed the ca-resolve-two-phase branch 5 times, most recently from 90af869 to 3e6dae1 Compare April 16, 2026 14:52
@Ericson2314 Ericson2314 force-pushed the ca-resolve-two-phase branch from 3e6dae1 to 9c67fd6 Compare April 24, 2026 02:56
@Ericson2314 Ericson2314 force-pushed the ca-resolve-two-phase branch 3 times, most recently from 04e98b4 to df3f3e9 Compare April 25, 2026 17:55
Ericson2314 and others added 6 commits April 25, 2026 16:26
Compute URLs and title strings upfront instead of repeating inline
ternaries throughout the template. Use consistent `${...}` variable
interpolation style.

No behavior changes.
Soon, we're going to make a new build step page which will share a
number of things with the build page.

In order to get ready for that, extract the following into a new
`build-common.tt` for reuse:

- `renderOutputs`
- `renderStepStatus`
- `renderStepDuration`
- `renderStepMachine`
- `renderLogButtons`

Likewise, move the `showLog` helper out of `Build.pm` into
`Hydra::Helper::LogEndpoints`. Both are needed by the BuildStep
controller introduced in a later commit.

As a bonus, the new `renderLogButtons` is already used twice within
`build.tt` itself (for the build log and the runcommand log),
deduplicating previously identical button markup.

No behavior changes.
Add a `renderOutputsTable` block in `build-common.tt` that renders
outputs as a nested info-table with name and path columns. Use it on
the build detail page, replacing the old inline `renderOutputs` which
only showed paths.

I like this much better because I want to see the output names at a
glance, not just the store paths. This layout is especially good for CA
derivations where we don't know the output paths in advanced of building
them.
Build steps are, in my view, an important concept that Hydra doesn't
yet give enough attention. This is my attempt to rectify that.

A new detail page at `/build/:id/step/:stepnr` shows output paths,
derivation, system, machine, duration, status, and log links, reusing
the shared blocks extracted in the previous commit. Clicking anywhere in
the build steps table row now navigates to this page, instead of the
step log page. The step log pages also have a link back to this page.

Also reflecting giving build steps more status, introduce
`Hydra::Controller::BuildStep`, chained off `/build/buildChain`, giving
them a first-class controller.

For a bit of back story, note that in the future, we might switch
associating build steps more with derivations than builds. This
reflects that we don't really care *why* something was scheduled (the
build/root derivation) as much as *what* was scheduled, especially when
multiple new builds would "race" to schedule the same derivation. It
also bodes well for a future where Hydra can act as a Nix derivation
that receives ad-hoc build requests, so the "build" in this case would
be rather lacking in metadata. Both these scenarios point to a world
where `BuildStep`s become more important than `Build`s, building (ahem)
atop this refactor.

The step log handling is now better suited to live as part of this
controller. Accordingly, it is moved from
`/build/:id/nixlog/:stepnr[/raw|/tail]` to
`/build/:id/step/:stepnr/log[/raw|/tail]`. The old `nixlog` URLs are
preserved as 301 redirects in `Build.pm`. All templates updated to
generate the new canonical URLs.
Instead of resolving at StepInfo construction and carrying two drv identities
through the gRPC layer, resolve in realise_drv_on_valid_machine once all deps
are built. If resolution yields a different drv, the original step is marked
Resolved and a new DB step is created for the resolved drv with a resolvedTo FK
linking them. The builder only ever sees one drv.

We create a new Step for that resolution and bunt it back to the
scheduler. This grants us more flexibility in execution and the
method can be used in the future for dynamic derivations,
which won't map 1:1 with the original derivations.

We also now only create database steps when we are sure they are
necessary, reducing the number of duplicate `BuildStep` rows.

In order to make tests more consistent, CA derivations will fail if they
cannot be fully resolved. Otherwise, there could be inconsistent
successes depending on which builder a step was performed on.

As part of this, add local outputs to resolution table

With the current queue-runner design, all dependency outputs of a
CAFloating derivation must be recorded in the hydra database.

This is true for things built or substituted by hydra, but until now not
by things found on the local nix store. This may occur for outputs that
are part of the system configuration.

Therefore, add all local outputs that are not already in the database to
the resolution table upon creating a step. This makes it possible to
build derivations from `contentAddressedByDefault` nixpkgs.

Co-Authored-By: Artemis Tosini <artemis.tosini@obsidian.systems>
Co-Authored-By: John Ericson <John.Ericson@Obsidian.Systems>
Co-Authored-By: Amaan Qureshi <git@amaanq.com>
@Ericson2314 Ericson2314 force-pushed the ca-resolve-two-phase branch from df3f3e9 to 7c69e41 Compare April 26, 2026 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants