
[WIP] Test: Optimize Semaphore CI for the uber calico image#12519

Draft
caseydavenport wants to merge 82 commits into projectcalico:master from caseydavenport:casey-uber-calico-ci

Conversation

@caseydavenport
Member

Temporary PR to test CI pipeline changes on Semaphore. Includes the uber calico image changes plus CI optimizations stacked on top. Not intended for merge - will close once validated.


Introduce a single "calico" uber binary that uses Cobra subcommands to
consolidate multiple component entry points: goldmane, guardian,
whisker-backend, key-cert-provisioner, typha, kube-controllers,
check-status, apiserver, webhooks, dikastes, healthz, and csi-driver.

Each component's core logic is extracted into an importable package
(e.g., kube-controllers/pkg/kubecontrollers, webhooks/pkg/webhook)
so both the standalone binary and the uber binary can reuse it.
Existing standalone binaries continue to work unchanged.

Replace duplicated logic in component main.go files with thin wrappers
that call the extracted packages (app-policy/pkg/dikastes,
app-policy/pkg/healthz, pod2daemon/pkg/csi, key-cert-provisioner/pkg/keycert,
kube-controllers/pkg/kubecontrollers, webhooks/pkg/webhook).

This ensures both the standalone binaries and the uber binary share
the same code path, and eliminates the kube-controllers global init()
flag registration.

Add the calico/calico uber image to the kind cluster build and deploy
pipeline so it gets built and loaded alongside the per-component images.

Fix apiserver subcommand to use DisableFlagParsing and SetArgs so that
flags passed by the operator (e.g., --secure-port=5443) are forwarded
correctly to the inner server command instead of being rejected as
unknown flags on the outer cobra command.
The operator now deploys typha, kube-controllers, apiserver, and CSI
using the calico/calico uber image with subcommand dispatch. Remove
these individual images from KIND_CALICO_IMAGES (no need to load them)
and KIND_IMAGE_MARKERS (no need to build them) since the uber image
provides all of them.

Also pass REPO to build-operator.sh so it can be pointed at an
operator fork/branch that has the uber image render changes.

Add new uber binary subcommands:
- calico ctl: wraps calicoctl docopt-based CLI dispatch
- calico cni plugin/ipam: wraps CNI plugin and IPAM entry points
- calico flexvol: wraps flexvol driver (extracted to pod2daemon/pkg/flexvol)

Add CNI dispatch in main() based on binary name and CNI_COMMAND env var,
so the uber binary works when installed as a CNI plugin on the host.
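The dispatch decision might look like the sketch below. The function name, return labels, and exact precedence are assumptions for illustration (a later commit in this PR tightens the CNI_COMMAND rule to apply only when no subcommand args are passed, which the sketch also reflects).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dispatchTarget decides which entry point the uber binary should run,
// based on how it was invoked: the basename it was installed as, any
// subcommand args, and the CNI_COMMAND env var set by container runtimes.
func dispatchTarget(argv0 string, args []string, cniCommand string) string {
	switch filepath.Base(argv0) {
	case "calico-ipam":
		return "cni-ipam"
	case "calicoctl":
		return "ctl"
	case "calico":
		// Only treat CNI_COMMAND as a CNI invocation when no subcommand
		// args were given, so a stray env var can't hijack normal use.
		if cniCommand != "" && len(args) == 0 {
			return "cni-plugin"
		}
	}
	return "cobra" // fall through to normal cobra subcommand parsing
}

func main() {
	fmt.Println(dispatchTarget(os.Args[0], os.Args[1:], os.Getenv("CNI_COMMAND")))
}
```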

Include csi-node-driver-registrar binary in the uber image by building
it from pod2daemon and copying into the Dockerfile.

Update kind cluster to build only 3 images: calico/calico (uber),
calico/node, and calico/whisker. All other components use the uber
image. Update calicoctl.yaml to use the uber image with calico ctl.

Add 'calico cni install' subcommand that wraps the CNI install package,
enabling the uber image to serve as the install-cni init container.

Add symlinks in the uber image Dockerfile from /opt/cni/bin/calico and
/opt/cni/bin/calico-ipam to /usr/bin/calico so the CNI install process
can find and copy the binaries to the host.

The base image (UBI 9 minimal) has no shell, so RUN commands fail.
Copy the uber binary directly to /opt/cni/bin/calico and
/opt/cni/bin/calico-ipam instead of creating symlinks.

The component keys in calico_versions.yml need to match those in the
operator's defaultImages map (e.g., 'calico/node' not 'node'). Also
remove entries for components not tracked by the operator (goldmane,
whisker, webhooks, etc.) since those are now in ignoredImages.

Use short-form key matching the new operator gen-versions format.

Typha uses docopt for CLI parsing via ParseCommandLineArgs(nil),
which reads os.Args. When invoked as "calico typha", docopt sees
"typha" as an unexpected argument. Reset os.Args to match the
standalone binary name so docopt parsing works correctly.
kube-controllers writes health status to /status/status.json and
profiling data to /profiles/. These directories need to exist in the
image with correct ownership for the non-root user (uid 999).

Add a "goldmane-check" subcommand to the uber binary that performs HTTP
health checks against goldmane's health endpoint. This replaces the
kubelet HTTP probes that fail on dual-stack clusters because the kubelet
resolves localhost to [::1] while goldmane binds to 127.0.0.1.

Also point the operator build at the casey-uber-operator-v2 branch which
knows how to deploy the uber image for all components.

- Restore DatastoreMigration controller that was accidentally dropped
  during the kube-controllers extract-to-package refactor
- Close resp.Body in goldmane-check health probe
- Handle error from GetString in dikastes client subcommand
- Remove dead image marker rules from lib.Makefile for components now
  provided by the uber image
- Revert unrelated Makefile changes (check-mockery-config removal,
  K8S_NETPOL_SUPPORTED_FEATURES change, kind-migration-test removal)

Restructure the uber binary CLI for consistency:

- Each component package now exports NewCommand() returning a *cobra.Command
  alongside its Run(ctx, Config) entry point. The uber binary wrappers in
  cmd/calico/ become trivial one-liners delegating to the package.

- Create a generic "calico health" command for HTTP health checks that works
  with any component using libcalico-go's HealthAggregator pattern. This
  replaces the goldmane-specific goldmane-check command.

- Fold the standalone "healthz" command into "calico dikastes health" since
  it's dikastes-specific (gRPC). Rename "check-status" to
  "kube-controllers-health" for clarity.

- Add HealthAggregator to kube-controllers alongside the existing file-based
  status, enabling HTTP health probes in the uber image.

- Fix goldmane HealthAggregator to bind all interfaces (empty host) instead
  of "localhost", fixing health probes on dual-stack clusters.

- Add Config structs to kube-controllers and CSI packages, replacing loose
  parameter lists with structured configuration.
Collapse the per-component wrapper files (goldmane.go, guardian.go, etc.)
into direct imports in main.go. Each package's NewCommand() is called
directly, eliminating 9 single-function files.

Also clean up the Dockerfile: remove unused /profiles directory that
no Go code references.

Replace typha's docopt-based CLI parsing with cobra. Typha only had two
flags (--config-file and --version), making this a straightforward
conversion.

- Add NewCommand() to typha/pkg/daemon/ with proper cobra flag handling
- Replace ParseCommandLineArgs with SetConfigFilePath for test use
- Add Run(ctx, configFile) as the clean programmatic entry point
- Remove InitializeAndServeForever (callers use NewCommand or Run directly)
- Remove the DisableFlagParsing shim — typha now uses cobra natively
- Update standalone binary and tests

Replace calico-node's custom flag-based dispatch with cobra subcommands
in a new node/pkg/node package. The command structure is:

  calico node felix              - run Felix policy agent
  calico node confd              - run confd configuration daemon
  calico node init               - privileged node initialisation
  calico node startup            - non-privileged startup routine
  calico node shutdown           - shutdown routine
  calico node health             - run health checks (felix/bird)
  calico node monitor-addresses  - monitor node IP changes
  calico node allocate-tunnel-addrs  - configure tunnel addresses
  calico node monitor-token      - watch k8s token changes
  calico node complete-startup   - mark node NetworkUnavailable=false
  calico node hostpath-init      - initialize hostpaths
  calico node status show|report - node status operations
  calico node bpf                - BPF debug tools
  calico node flows              - fetch/watch flow logs
  calico node version            - print version

The standalone calico-node binary includes a translateArgs shim that
maps legacy flag-style invocation (-felix, -confd, etc.) to cobra
subcommands for backward compatibility. Runit service scripts are
updated to use the new subcommand syntax directly.
Replace guardian's custom /health HTTP handler with the standard
libcalico-go HealthAggregator. This gives guardian the same /readiness
and /liveness endpoints as other components, and actually uses the
GUARDIAN_HEALTH_ENABLED and GUARDIAN_HEALTH_PORT config fields that
were previously defined but ignored (server was hardcoded to :9080).

Regenerate all generated files to pick up the new uber binary
dependencies. Add cmd/deps.txt for dependency tracking of the uber
binary. Fix lib.Makefile to read deps from cmd/deps.txt (matching
what gen-deps-files produces) instead of cmd/calico/deps.txt.

- Remove unused genericUnsupported function in pod2daemon/pkg/flexvol
  (lint failure from the package extraction)
- Update typha Makefile to use "calico-typha version" subcommand instead
  of "--version" flag (changed in the docopt-to-cobra migration)

Add cmd/calico to ImageReleaseDirs so the uber image is built and
published during releases and hashreleases. Add the uber image to the
archive images map for release tarballs.

The k8st tests previously exec'd into a dedicated calicoctl pod to run
calicoctl commands. Replace this with running "calico ctl" directly from
the test container, which has the uber binary volume-mounted.

- Add calicoctl binary name dispatch in main.go for backward compat
- Copy uber binary as /usr/bin/calicoctl in Dockerfile
- Mount uber binary into test container as /usr/local/bin/calico
- Change calicoctl() helper in utils.py to run locally instead of exec
- Remove calicoctl pod deployment from kind setup
- Replace "uber" with "combined" in Dockerfile labels and Makefile comments
- Remove translateArgs backward-compat shim from calico-node — runit
  scripts already use cobra subcommand syntax directly
- Add explanatory comments to the remaining shim files (apiserver,
  ctl, cni, kube-controllers-health) explaining why they exist

Build a full cobra command tree for calicoctl via NewCommand(), replacing
the docopt-based CLI parsing. The DisableFlagParsing shim in cmd/calico/ctl.go
is replaced with a direct import of NewCommand().

The cobra commands bridge to the existing common code via argsFromCRUDFlags()
which builds the map[string]any that ExecuteConfigCommand expects. For IPAM
commands, the internal functions are exported (CheckIPAM, ShowIP, SplitPool,
etc.) so cobra commands can call them directly.

Router commands (node, datastore, cluster) use a synthetic args bridge for
now — cobra flags are assembled into []string and passed to the existing
docopt-based functions. The old functions remain in their original files
for the standalone calicoctl binary.

Full subcommand tree:
  ctl create/apply/replace/delete/get/patch/label/validate
  ctl ipam check/release/show/split/configure
  ctl node status/diags/checksystem/run
  ctl datastore migrate export/import/lock/unlock
  ctl cluster diags
  ctl version
Add the combined calico binary to the node image at /usr/bin/calico.
The operator uses this for init containers, probes, and lifecycle hooks
via "calico node" subcommands. The binary is built once by cmd/calico
and copied into the node image during the Docker build, avoiding
duplicate compilation.

- Remove wait_pod_ready for the calicoctl pod (removed in earlier commit)
  which was hanging until timeout on every kind-up
- Pin goldmane to control-plane node so it starts as soon as that node's
  calico-node is ready, rather than waiting for a worker node
- Fix kube-controllers health check using wrong error variable for the
  apiserver readiness message (was using datastore check err, now uses
  result.Error() from the /healthz request)
- Include the configured log level value in CSI parse failure message

The node image now uses the uber calico binary built with CGO (for BPF
support) instead of a separate calico-node binary. This reduces the
number of binaries we produce and cuts CI build time.

Also removes the triple-copy from the uber Dockerfile — the CNI install
code now copies /usr/bin/calico to the host as both calico and
calico-ipam, saving ~300MB of image layer duplication.

Move all daemon/service entry points under 'calico component' to
reserve top-level subcommands for user-facing calicoctl operations.

Component commands:
  calico component felix / confd / typha / goldmane / ...
  calico component node startup / shutdown / init / ...

User-facing commands stay at root:
  calico ctl / health / version

Felix and confd are promoted from the node subtree to top-level
components since they are standalone daemons, not node lifecycle ops.

caseydavenport and others added 13 commits April 15, 2026 16:43

Drop redundant logrus alias, restore blank line before DatastoreDescribe,
place calico alphabetically in calico_versions.yml.

- Only consult CNI_COMMAND when basename is plain "calico" and no
  subcommand args were passed; a stray env var no longer hijacks
  calicoctl invocations.
- Preserve original argv[0] when dispatching from calicoctl to the ctl
  subcommand, so panic traces and argv[0]-based detection still see
  "calicoctl".
- Drop .exe basename cases from the Linux binary; calicoctl.exe never
  reaches it (there's no Windows calicoctl).
- Expose commands.MassageError and wrap all RunE in the ctl subtree
  with it, restoring the YAML/JSON error cleanup that the uber-binary
  path lost. The standalone calicoctl binary now reuses the exported
  function instead of its own copy.

No code reads or writes /profiles/*.pprof — the files were inherited from
kube-controllers Dockerfile copy-paste and never served a purpose. Drops
the chown 999 along with them.

USER 10001:10001 provides a non-root default in case the operator pod
spec omits a securityContext.

Covers the four load-bearing entry points: calicoctl basename, calico-ipam
basename, CNI_COMMAND env, and the plain-cobra fallback. Separates the
dispatch decision from the handler invocation so the rules can be unit
tested.

Adds a cmd/calico Semaphore block that runs "make ut" on changes.

Merge conflict: keep uber calico binary + CALICO_BUILD override in node/Makefile.

Fix k8st Permission denied: the calico binary mount pointed at
$(REPO_ROOT)/bin/calico-$(ARCH) which doesn't exist (Docker creates an
empty directory). The binary lives at cmd/calico/bin/calico-$(ARCH).

Fix cmd/calico UT build failure: go test with CGO_ENABLED=1 (the
container default) transitively compiles felix/bpf/libbpf which needs
libbpf.h. The test only covers pure Go dispatch logic, so disable CGO.
The libbpf package has a !cgo stub that satisfies the import.
cmd/calico/Makefile was not threading CROSS_CGO_LDFLAGS into its
CGO_LDFLAGS, so arm64 CGO builds reached GNU ld instead of lld and
failed with "unrecognised emulation mode: aarch64linux".  Mirror the
pattern from felix/Makefile so the cross linker is selected whenever
CROSS_SYSROOT is set.

This block was removed on master by 618537a when node switched to
native clang cross-compilation, but got re-introduced during a merge
restore.  Leaving it in forces docker to pull the ppc64le-flavoured
go-build image on an amd64 host with no binfmt/qemu registered,
failing with "Exec format error" at tini.

The install logic hardcoded binary names without the .exe suffix, so on
Windows it failed to find /opt/cni/bin/calico (which is actually
calico.exe). The install-cni init container exited and the Windows pod
never started, timing out the DaemonSet rollout.

Add the .exe suffix on Windows for both the source lookup and the target
filenames. While here, drop the duplicate calico-ipam.exe build step —
install.exe now copies calico.exe to both calico.exe and calico-ipam.exe
on the host, matching the Linux behaviour of sourcing a single binary.
# Conflicts:
#	hack/test/kind/deploy_resources.sh

Windows build and packaging now only require calico.exe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Windows felix FV setup bails out under set -e if image rollout or
infrastructure setup fails, which means get_logs never runs and the job
leaves no diagnostic data behind. Use an EXIT trap so the logs are
collected regardless of how the script exits.

Build a FIPS variant of the uber binary into bin/$(ARCH)-fips and tag
the image latest-fips so image-all's sub-image-fips-amd64 target works.
Forward FIPS from node/Makefile into the cmd/calico sub-make so the
node image picks up the boringcrypto-linked binary.

@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Apr 17, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Apr 17, 2026

caseydavenport and others added 4 commits April 17, 2026 06:50

The post-install branch checked name == "calico", which never matches
on Windows where installNames is {calico.exe, calico-ipam.exe}. Use
installNames[0] to match the skip branch.

The pod-logs directory isn't uploaded as a Semaphore artifact, so when
the rollout times out before tests run there's nothing in the job log
explaining why. Tee container logs to stdout and add describe ds,
describe pods, events, and a fetch of CalicoWindows/logs off the VM.

Pre-build and cache uber/node/whisker images in the Prerequisites stage,
then load them in E2E and KinD test blocks instead of rebuilding from
source. Consolidate 12 push-images promotions into one, move multi-arch
builds from per-component blocks into cmd/calico, and share a single
build cache group across components that are now in the uber binary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kube-controllers, typha, and cni-plugin FV tests now dispatch into the
combined calico/calico image via subcommands (component kube-controllers,
component typha, component cni install) rather than running per-component
images. The uber image is loaded from GCS in the prologues, so these
blocks no longer have to build their own image.

Flannel-migration FVs keep their dedicated image because they need
kubectl inside the container.

caseydavenport and others added 8 commits April 17, 2026 08:43

The Build: uber image block was gated on changes to cmd/calico, but the
per-component CI blocks now unconditionally load the cached image in their
prologues. When a PR doesn't touch cmd/calico, nothing populates the cache
and the load step fails.

Drop the gate so the image is always built and cached. Also switch the
save/load to a temp file + zstd --rm pattern so a partial docker save
can't silently produce a truncated tarball that decompresses cleanly but
has no manifest.

The test VM extracts working-copy.tgz but has no .git directory, so
git rev-parse --show-toplevel fails and set -e aborts the script
before any cached images are loaded.

The consolidated calico/calico image does not ship /etc/calico/typha.cfg,
so typha uses the default LogFilePath of /var/log/calico/typha.log, which
does not exist in the image and fatals at startup. Override the setting
via env var in the k8sfv run-test script.

The consolidated calico/calico image sets USER 10001, which can't write
to host tempdirs owned by the CI user. Match production, where the
operator sets runAsUser=0 on the install-cni container, by passing
--user 0 in the test's docker run.

Fix silent failure in cached-image upload/load path by splitting the
docker save | zstd pipe into discrete steps and loading images from a
file rather than stdin. Rename artifacts from .tar.zstd to .tar.zst.

Consolidate the six Felix FV blocks into a single "Felix: FV" block
with one job per test configuration. The blocks shared trigger,
dependencies, secrets, and prologue/epilogue, so the split was only
adding YAML noise and DAG nodes.

Split "Node: Build" (which ran make ci) into a pair of parallel jobs
running static-checks and ut. The image and image-windows targets from
make ci are already covered by the Prerequisites "Build: node image"
block and the "Node: multi-arch build" block. Rename the block to
"Node: Static checks and UT" to reflect what it actually does and
point "Node: multi-arch build" at Prerequisites directly.

Cache calico/go-build to GCS keyed by GO_BUILD_VER so downstream jobs
docker-load the 3.5GB image from the cache instead of pulling cold from
Docker Hub on every run. A new prerequisite block populates the cache
on miss; the global prologue best-effort loads from the cache before
any make commands on x86_64 agents.

Split the ClusterNetworkPolicy e2e tests into their own parallel job
running against the Felix routing variant only (previously run twice,
once per routing variant, serial with the conformance suite).

Calling exit 0 on a cache hit terminates Semaphore's command shell
before the job's post-command bookkeeping can run, so Semaphore marks
the step as failed even though our logic succeeded. Use if/else instead
so control falls through naturally.

The cmd/calico multi-arch build already cross-compiles the uber binary
for arm64, ppc64le, and s390x. All Felix Go code is linked into the uber
binary, so the Felix-specific multi-arch build doesn't add coverage.
