[WIP] Test: Optimize Semaphore CI for the uber calico image #12519
Draft
caseydavenport wants to merge 82 commits into projectcalico:master
Introduce a single "calico" uber binary that uses Cobra subcommands to consolidate multiple component entry points: goldmane, guardian, whisker-backend, key-cert-provisioner, typha, kube-controllers, check-status, apiserver, webhooks, dikastes, healthz, and csi-driver. Each component's core logic is extracted into an importable package (e.g., kube-controllers/pkg/kubecontrollers, webhooks/pkg/webhook) so both the standalone binary and the uber binary can reuse it. Existing standalone binaries continue to work unchanged.
Replace duplicated logic in component main.go files with thin wrappers that call the extracted packages (app-policy/pkg/dikastes, app-policy/pkg/healthz, pod2daemon/pkg/csi, key-cert-provisioner/pkg/keycert, kube-controllers/pkg/kubecontrollers, webhooks/pkg/webhook). This ensures both the standalone binaries and the uber binary share the same code path, and eliminates the kube-controllers global init() flag registration.
Add the calico/calico uber image to the kind cluster build and deploy pipeline so it gets built and loaded alongside the per-component images. Fix apiserver subcommand to use DisableFlagParsing and SetArgs so that flags passed by the operator (e.g., --secure-port=5443) are forwarded correctly to the inner server command instead of being rejected as unknown flags on the outer cobra command.
The operator now deploys typha, kube-controllers, apiserver, and CSI using the calico/calico uber image with subcommand dispatch. Remove these individual images from KIND_CALICO_IMAGES (no need to load them) and KIND_IMAGE_MARKERS (no need to build them) since the uber image provides all of them. Also pass REPO to build-operator.sh so it can be pointed at an operator fork/branch that has the uber image render changes.
Add new uber binary subcommands:
- calico ctl: wraps calicoctl docopt-based CLI dispatch
- calico cni plugin/ipam: wraps CNI plugin and IPAM entry points
- calico flexvol: wraps flexvol driver (extracted to pod2daemon/pkg/flexvol)
Add CNI dispatch in main() based on binary name and CNI_COMMAND env var, so the uber binary works when installed as a CNI plugin on the host. Include csi-node-driver-registrar binary in the uber image by building it from pod2daemon and copying into the Dockerfile. Update kind cluster to build only 3 images: calico/calico (uber), calico/node, and calico/whisker. All other components use the uber image. Update calicoctl.yaml to use the uber image with calico ctl.
Add 'calico cni install' subcommand that wraps the CNI install package, enabling the uber image to serve as the install-cni init container. Add symlinks in the uber image Dockerfile from /opt/cni/bin/calico and /opt/cni/bin/calico-ipam to /usr/bin/calico so the CNI install process can find and copy the binaries to the host.
The base image (UBI 9 minimal) has no shell, so RUN commands fail. Copy the uber binary directly to /opt/cni/bin/calico and /opt/cni/bin/calico-ipam instead of creating symlinks.
The component keys in calico_versions.yml need to match those in the operator's defaultImages map (e.g., 'calico/node' not 'node'). Also remove entries for components not tracked by the operator (goldmane, whisker, webhooks, etc.) since those are now in ignoredImages.
This reverts commit c89b8c2.
Use short-form key matching the new operator gen-versions format.
Typha uses docopt for CLI parsing via ParseCommandLineArgs(nil), which reads os.Args. When invoked as "calico typha", docopt sees "typha" as an unexpected argument. Reset os.Args to match the standalone binary name so docopt parsing works correctly.
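The argv reset described above can be sketched as follows. This is a minimal illustration, assuming the subcommand name is the first argument after the binary name; the helper name argsForStandalone and the standalone name "calico-typha" as used here are illustrative and not taken from the PR's code.

```go
package main

import (
	"fmt"
	"os"
)

// argsForStandalone returns a copy of args rewritten to look like a
// standalone invocation, e.g. ["calico", "typha", "-c", "x"] becomes
// ["calico-typha", "-c", "x"]. Hypothetical helper, for illustration only.
func argsForStandalone(args []string, sub, standalone string) []string {
	if len(args) < 2 || args[1] != sub {
		// Not invoked through the subcommand: leave the arguments as they are.
		return append([]string(nil), args...)
	}
	out := []string{standalone}
	return append(out, args[2:]...)
}

func main() {
	// A parser that reads os.Args directly would now no longer see "typha"
	// as a stray positional argument.
	os.Args = argsForStandalone(os.Args, "typha", "calico-typha")
	fmt.Println(os.Args)
}
```

Returning a copy rather than mutating the input keeps the helper easy to test without touching the process-wide os.Args.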
kube-controllers writes health status to /status/status.json and profiling data to /profiles/. These directories need to exist in the image with correct ownership for the non-root user (uid 999).
Add a "goldmane-check" subcommand to the uber binary that performs HTTP health checks against goldmane's health endpoint. This replaces the kubelet HTTP probes that fail on dual-stack clusters because the kubelet resolves localhost to [::1] while goldmane binds to 127.0.0.1. Also point the operator build at the casey-uber-operator-v2 branch which knows how to deploy the uber image for all components.
- Restore DatastoreMigration controller that was accidentally dropped during the kube-controllers extract-to-package refactor
- Close resp.Body in goldmane-check health probe
- Handle error from GetString in dikastes client subcommand
- Remove dead image marker rules from lib.Makefile for components now provided by the uber image
- Revert unrelated Makefile changes (check-mockery-config removal, K8S_NETPOL_SUPPORTED_FEATURES change, kind-migration-test removal)
…regator
Restructure the uber binary CLI for consistency:
- Each component package now exports NewCommand() returning a *cobra.Command alongside its Run(ctx, Config) entry point. The uber binary wrappers in cmd/calico/ become trivial one-liners delegating to the package.
- Create a generic "calico health" command for HTTP health checks that works with any component using libcalico-go's HealthAggregator pattern. This replaces the goldmane-specific goldmane-check command.
- Fold the standalone "healthz" command into "calico dikastes health" since it's dikastes-specific (gRPC). Rename "check-status" to "kube-controllers-health" for clarity.
- Add HealthAggregator to kube-controllers alongside the existing file-based status, enabling HTTP health probes in the uber image.
- Fix goldmane HealthAggregator to bind all interfaces (empty host) instead of "localhost", fixing health probes on dual-stack clusters.
- Add Config structs to kube-controllers and CSI packages, replacing loose parameter lists with structured configuration.
Collapse the per-component wrapper files (goldmane.go, guardian.go, etc.) into direct imports in main.go. Each package's NewCommand() is called directly, eliminating 9 single-function files. Also clean up the Dockerfile: remove unused /profiles directory that no Go code references.
Replace typha's docopt-based CLI parsing with cobra. Typha only had two flags (--config-file and --version), making this a straightforward conversion.
- Add NewCommand() to typha/pkg/daemon/ with proper cobra flag handling
- Replace ParseCommandLineArgs with SetConfigFilePath for test use
- Add Run(ctx, configFile) as the clean programmatic entry point
- Remove InitializeAndServeForever (callers use NewCommand or Run directly)
- Remove the DisableFlagParsing shim; typha now uses cobra natively
- Update standalone binary and tests
Replace calico-node's custom flag-based dispatch with cobra subcommands in a new node/pkg/node package. The command structure is:
  calico node felix                  - run Felix policy agent
  calico node confd                  - run confd configuration daemon
  calico node init                   - privileged node initialisation
  calico node startup                - non-privileged startup routine
  calico node shutdown               - shutdown routine
  calico node health                 - run health checks (felix/bird)
  calico node monitor-addresses      - monitor node IP changes
  calico node allocate-tunnel-addrs  - configure tunnel addresses
  calico node monitor-token          - watch k8s token changes
  calico node complete-startup       - mark node NetworkUnavailable=false
  calico node hostpath-init          - initialize hostpaths
  calico node status show|report     - node status operations
  calico node bpf                    - BPF debug tools
  calico node flows                  - fetch/watch flow logs
  calico node version                - print version
The standalone calico-node binary includes a translateArgs shim that maps legacy flag-style invocation (-felix, -confd, etc.) to cobra subcommands for backward compatibility. Runit service scripts are updated to use the new subcommand syntax directly.
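The legacy-flag translation can be sketched as a small lookup. This is an illustration under stated assumptions, not the shim from the PR: only the two flags named in the commit message are listed, and the rule that only the first argument is rewritten is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// legacyFlags maps legacy flag-style selectors to subcommand names.
// Only the two flags named in the commit message appear here; the real
// table is assumed to cover the remaining modes as well.
var legacyFlags = map[string]string{
	"-felix": "felix",
	"-confd": "confd",
}

// translateArgs returns a copy of args in which a leading legacy flag has
// been replaced by its subcommand name. All other arguments are untouched.
func translateArgs(args []string) []string {
	out := append([]string(nil), args...)
	if len(out) > 0 {
		if sub, ok := legacyFlags[out[0]]; ok {
			out[0] = sub
		}
	}
	return out
}

func main() {
	// os.Args[1:] is what a command tree would normally receive.
	fmt.Println(translateArgs(os.Args[1:]))
}
```

Arguments that are already in subcommand form pass through unchanged, so scripts using either style keep working while the shim exists.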
Replace guardian's custom /health HTTP handler with the standard libcalico-go HealthAggregator. This gives guardian the same /readiness and /liveness endpoints as other components, and actually uses the GUARDIAN_HEALTH_ENABLED and GUARDIAN_HEALTH_PORT config fields that were previously defined but ignored (server was hardcoded to :9080).
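Honouring the two settings rather than hardcoding the port could look roughly like this. It is a sketch only: the struct, the function name, and the defaults (enabled, port 9080 to match the previously hardcoded value) are assumptions, and the real code wires the values into libcalico-go's HealthAggregator rather than printing them.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// healthConfig holds the two settings named in the commit message.
type healthConfig struct {
	Enabled bool
	Port    int
}

// loadHealthConfig reads GUARDIAN_HEALTH_ENABLED and GUARDIAN_HEALTH_PORT
// through getenv. The defaults used here are assumptions for illustration.
func loadHealthConfig(getenv func(string) string) (healthConfig, error) {
	cfg := healthConfig{Enabled: true, Port: 9080}
	if v := getenv("GUARDIAN_HEALTH_ENABLED"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid GUARDIAN_HEALTH_ENABLED %q: %w", v, err)
		}
		cfg.Enabled = b
	}
	if v := getenv("GUARDIAN_HEALTH_PORT"); v != "" {
		p, err := strconv.Atoi(v)
		if err != nil || p < 1 || p > 65535 {
			return cfg, fmt.Errorf("invalid GUARDIAN_HEALTH_PORT %q", v)
		}
		cfg.Port = p
	}
	return cfg, nil
}

func main() {
	cfg, err := loadHealthConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !cfg.Enabled {
		fmt.Println("health server disabled")
		return
	}
	fmt.Printf("health server would listen on :%d\n", cfg.Port)
}
```

Passing getenv as a parameter keeps the parsing testable without modifying the process environment.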
Regenerate all generated files to pick up the new uber binary dependencies. Add cmd/deps.txt for dependency tracking of the uber binary. Fix lib.Makefile to read deps from cmd/deps.txt (matching what gen-deps-files produces) instead of cmd/calico/deps.txt.
- Remove unused genericUnsupported function in pod2daemon/pkg/flexvol (lint failure from the package extraction)
- Update typha Makefile to use "calico-typha version" subcommand instead of "--version" flag (changed in the docopt-to-cobra migration)
Add cmd/calico to ImageReleaseDirs so the uber image is built and published during releases and hashreleases. Add the uber image to the archive images map for release tarballs.
The k8st tests previously exec'd into a dedicated calicoctl pod to run calicoctl commands. Replace this with running "calico ctl" directly from the test container, which has the uber binary volume-mounted.
- Add calicoctl binary name dispatch in main.go for backward compat
- Copy uber binary as /usr/bin/calicoctl in Dockerfile
- Mount uber binary into test container as /usr/local/bin/calico
- Change calicoctl() helper in utils.py to run locally instead of exec
- Remove calicoctl pod deployment from kind setup
- Replace "uber" with "combined" in Dockerfile labels and Makefile comments
- Remove translateArgs backward-compat shim from calico-node; runit scripts already use cobra subcommand syntax directly
- Add explanatory comments to the remaining shim files (apiserver, ctl, cni, kube-controllers-health) explaining why they exist
Build a full cobra command tree for calicoctl via NewCommand(), replacing the docopt-based CLI parsing. The DisableFlagParsing shim in cmd/calico/ctl.go is replaced with a direct import of NewCommand().
The cobra commands bridge to the existing common code via argsFromCRUDFlags() which builds the map[string]any that ExecuteConfigCommand expects. For IPAM commands, the internal functions are exported (CheckIPAM, ShowIP, SplitPool, etc.) so cobra commands can call them directly.
Router commands (node, datastore, cluster) use a synthetic args bridge for now: cobra flags are assembled into []string and passed to the existing docopt-based functions. The old functions remain in their original files for the standalone calicoctl binary.
Full subcommand tree:
  ctl create/apply/replace/delete/get/patch/label/validate
  ctl ipam check/release/show/split/configure
  ctl node status/diags/checksystem/run
  ctl datastore migrate export/import/lock/unlock
  ctl cluster diags
  ctl version
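The flag-to-map bridge can be pictured roughly as below. This is a hedged sketch and not the PR's argsFromCRUDFlags: the struct, the function name docoptArgs, and the specific keys are illustrative assumptions about what a docopt-style argument map looks like.

```go
package main

import "fmt"

// crudFlags is a hypothetical holder for values that flag parsing has
// already extracted from the command line.
type crudFlags struct {
	Filename   string
	Namespace  string
	SkipExists bool
}

// docoptArgs builds a docopt-style map from typed flag values, so code that
// was written against docopt output can be called unchanged. The keys are
// illustrative, not the exact set the existing code expects.
func docoptArgs(action string, f crudFlags) map[string]any {
	args := map[string]any{
		action:          true, // docopt reports the chosen command as a true entry
		"--filename":    f.Filename,
		"--skip-exists": f.SkipExists,
	}
	if f.Namespace != "" {
		args["--namespace"] = f.Namespace
	}
	return args
}

func main() {
	args := docoptArgs("apply", crudFlags{Filename: "policy.yaml", Namespace: "default"})
	fmt.Println(args)
}
```

The appeal of this kind of bridge is that the command-line front end can change while the code consuming the map stays untouched.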
Add the combined calico binary to the node image at /usr/bin/calico. The operator uses this for init containers, probes, and lifecycle hooks via "calico node" subcommands. The binary is built once by cmd/calico and copied into the node image during the Docker build, avoiding duplicate compilation.
- Remove wait_pod_ready for the calicoctl pod (removed in earlier commit) which was hanging until timeout on every kind-up
- Pin goldmane to control-plane node so it starts as soon as that node's calico-node is ready, rather than waiting for a worker node
- Fix kube-controllers health check using wrong error variable for the apiserver readiness message (was using datastore check err, now uses result.Error() from the /healthz request)
- Include the configured log level value in CSI parse failure message
The node image now uses the uber calico binary built with CGO (for BPF support) instead of a separate calico-node binary. This reduces the number of binaries we produce and cuts CI build time. Also removes the triple-copy from the uber Dockerfile — the CNI install code now copies /usr/bin/calico to the host as both calico and calico-ipam, saving ~300MB of image layer duplication.
Move all daemon/service entry points under 'calico component' to reserve top-level subcommands for user-facing calicoctl operations.
Component commands:
  calico component felix / confd / typha / goldmane / ...
  calico component node startup / shutdown / init / ...
User-facing commands stay at root:
  calico ctl / health / version
Felix and confd are promoted from the node subtree to top-level components since they are standalone daemons, not node lifecycle ops.
Drop redundant logrus alias, restore blank line before DatastoreDescribe, place calico alphabetically in calico_versions.yml.
- Only consult CNI_COMMAND when basename is plain "calico" and no subcommand args were passed; a stray env var no longer hijacks calicoctl invocations.
- Preserve original argv[0] when dispatching from calicoctl to the ctl subcommand, so panic traces and argv[0]-based detection still see "calicoctl".
- Drop .exe basename cases from the Linux binary; calicoctl.exe never reaches it (there's no Windows calicoctl).
- Expose commands.MassageError and wrap all RunE in the ctl subtree with it, restoring the YAML/JSON error cleanup that the uber-binary path lost. The standalone calicoctl binary now reuses the exported function instead of its own copy.
…0001
No code reads or writes /profiles/*.pprof; the files were inherited from kube-controllers Dockerfile copy-paste and never served a purpose. Drops the chown 999 along with them. USER 10001:10001 provides a non-root default in case the operator pod spec omits a securityContext.
Covers the four load-bearing entry points: calicoctl basename, calico-ipam basename, CNI_COMMAND env, and the plain-cobra fallback. Separates the dispatch decision from the handler invocation so the rules can be unit tested. Adds a cmd/calico Semaphore block that runs "make ut" on changes.
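Separating the dispatch decision from the handlers can be sketched as a pure function over argv[0], the remaining arguments, and the CNI_COMMAND value. The type and function names here are hypothetical; the rules follow the commit messages above (calicoctl basename, calico-ipam basename, CNI_COMMAND only for a plain "calico" with no arguments, otherwise the cobra tree).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// mode identifies which entry point should handle the invocation.
type mode int

const (
	modeCobra     mode = iota // default: run the cobra command tree
	modeCtl                   // invoked as "calicoctl"
	modeCNIPlugin             // invoked as "calico" by a CNI runtime
	modeCNIIPAM               // invoked as "calico-ipam"
)

func (m mode) String() string {
	switch m {
	case modeCtl:
		return "ctl"
	case modeCNIPlugin:
		return "cni-plugin"
	case modeCNIIPAM:
		return "cni-ipam"
	default:
		return "cobra"
	}
}

// decide picks the entry point without invoking it, which keeps the rules
// unit-testable. CNI_COMMAND is only consulted for a plain "calico"
// basename with no arguments, so a stray env var cannot hijack other modes.
func decide(argv0 string, args []string, cniCommand string) mode {
	switch filepath.Base(argv0) {
	case "calicoctl":
		return modeCtl
	case "calico-ipam":
		return modeCNIIPAM
	case "calico":
		if cniCommand != "" && len(args) == 0 {
			return modeCNIPlugin
		}
	}
	return modeCobra
}

func main() {
	fmt.Println(decide(os.Args[0], os.Args[1:], os.Getenv("CNI_COMMAND")))
}
```

Because decide has no side effects, each of the four entry points can be covered by a table-driven test with no process or environment setup.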
# Conflicts:
#	webhooks/cmd/main.go
Merge conflict: keep uber calico binary + CALICO_BUILD override in node/Makefile.
Fix k8st Permission denied: the calico binary mount pointed at $(REPO_ROOT)/bin/calico-$(ARCH) which doesn't exist (Docker creates an empty directory). The binary lives at cmd/calico/bin/calico-$(ARCH).
Fix cmd/calico UT build failure: go test with CGO_ENABLED=1 (the container default) transitively compiles felix/bpf/libbpf which needs libbpf.h. The test only covers pure Go dispatch logic, so disable CGO. The libbpf package has a !cgo stub that satisfies the import.
cmd/calico/Makefile was not threading CROSS_CGO_LDFLAGS into its CGO_LDFLAGS, so arm64 CGO builds reached GNU ld instead of lld and failed with "unrecognised emulation mode: aarch64linux". Mirror the pattern from felix/Makefile so the cross linker is selected whenever CROSS_SYSROOT is set.
This block was removed on master by 618537a when node switched to native clang cross-compilation, but got re-introduced during a merge restore. Leaving it in forces docker to pull the ppc64le-flavoured go-build image on an amd64 host with no binfmt/qemu registered, failing with "Exec format error" at tini.
The install logic hardcoded binary names without the .exe suffix, so on Windows it failed to find /opt/cni/bin/calico (which is actually calico.exe). The install-cni init container exited and the Windows pod never started, timing out the DaemonSet rollout. Add the .exe suffix on Windows for both the source lookup and the target filenames. While here, drop the duplicate calico-ipam.exe build step — install.exe now copies calico.exe to both calico.exe and calico-ipam.exe on the host, matching the Linux behaviour of sourcing a single binary.
# Conflicts:
#	hack/test/kind/deploy_resources.sh
Windows build and packaging now only require calico.exe.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Windows felix FV setup bails out under set -e if image rollout or infrastructure setup fails, which means get_logs never runs and the job leaves no diagnostic data behind. Use an EXIT trap so the logs are collected regardless of how the script exits.
Build a FIPS variant of the uber binary into bin/$(ARCH)-fips and tag the image latest-fips so image-all's sub-image-fips-amd64 target works. Forward FIPS from node/Makefile into the cmd/calico sub-make so the node image picks up the boringcrypto-linked binary.
The post-install branch checked name == "calico", which never matches
on Windows where installNames is {calico.exe, calico-ipam.exe}. Use
installNames[0] to match the skip branch.
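The platform-dependent names can be sketched as follows. This is an illustration rather than the installer's code: installNames is named in the commit message above, but its signature and the way the copy loop is written here are assumptions.

```go
package main

import (
	"fmt"
	"runtime"
)

// installNames returns the file names written to the host for the given
// OS. On Windows both names carry the .exe suffix.
func installNames(goos string) []string {
	names := []string{"calico", "calico-ipam"}
	if goos == "windows" {
		for i := range names {
			names[i] += ".exe"
		}
	}
	return names
}

func main() {
	names := installNames(runtime.GOOS)
	// A single source binary is copied to every target name. Comparing
	// against names[0] instead of the literal "calico" keeps checks
	// correct on Windows too.
	source := names[0]
	for _, target := range names {
		fmt.Printf("copy %s -> %s\n", source, target)
	}
}
```

Deriving every comparison from the same slice avoids the mismatch described above, where a literal "calico" never matched the Windows name.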
The pod-logs directory isn't uploaded as a Semaphore artifact, so when the rollout times out before tests run there's nothing in the job log explaining why. Tee container logs to stdout and add describe ds, describe pods, events, and a fetch of CalicoWindows/logs off the VM.
Pre-build and cache uber/node/whisker images in the Prerequisites stage, then load them in E2E and KinD test blocks instead of rebuilding from source. Consolidate 12 push-images promotions into one, move multi-arch builds from per-component blocks into cmd/calico, and share a single build cache group across components that are now in the uber binary.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kube-controllers, typha, and cni-plugin FV tests now dispatch into the combined calico/calico image via subcommands (component kube-controllers, component typha, component cni install) rather than running per-component images. The uber image is loaded from GCS in the prologues, so these blocks no longer have to build their own image. Flannel-migration FVs keep their dedicated image because they need kubectl inside the container.
848043a to e4bba42
The Build: uber image block was gated on changes to cmd/calico, but the per-component CI blocks now unconditionally load the cached image in their prologues. When a PR doesn't touch cmd/calico, nothing populates the cache and the load step fails. Drop the gate so the image is always built and cached. Also switch the save/load to a temp file + zstd --rm pattern so a partial docker save can't silently produce a truncated tarball that decompresses cleanly but has no manifest.
The test VM extracts working-copy.tgz but has no .git directory, so git rev-parse --show-toplevel fails and set -e aborts the script before any cached images are loaded.
The consolidated calico/calico image does not ship /etc/calico/typha.cfg, so typha uses the default LogFilePath of /var/log/calico/typha.log, which does not exist in the image and fatals at startup. Override the setting via env var in the k8sfv run-test script.
The consolidated calico/calico image sets USER 10001, which can't write to host tempdirs owned by the CI user. Match production, where the operator sets runAsUser=0 on the install-cni container, by passing --user 0 in the test's docker run.
Fix silent failure in cached-image upload/load path by splitting the docker save | zstd pipe into discrete steps and loading images from a file rather than stdin. Rename artifacts from .tar.zstd to .tar.zst.
Consolidate the six Felix FV blocks into a single "Felix: FV" block with one job per test configuration. The blocks shared trigger, dependencies, secrets, and prologue/epilogue, so the split was only adding YAML noise and DAG nodes.
Split "Node: Build" (which ran make ci) into a pair of parallel jobs running static-checks and ut. The image and image-windows targets from make ci are already covered by the Prerequisites "Build: node image" block and the "Node: multi-arch build" block. Rename the block to "Node: Static checks and UT" to reflect what it actually does and point "Node: multi-arch build" at Prerequisites directly.
Cache calico/go-build to GCS keyed by GO_BUILD_VER so downstream jobs docker-load the 3.5GB image from the cache instead of pulling cold from Docker Hub on every run. A new prerequisite block populates the cache on miss; the global prologue best-effort loads from the cache before any make commands on x86_64 agents.
Split the ClusterNetworkPolicy e2e tests into their own parallel job running against the Felix routing variant only (previously run twice, once per routing variant, serial with the conformance suite).
Calling exit 0 on a cache hit terminates Semaphore's command shell before the job's post-command bookkeeping can run, so Semaphore marks the step as failed even though our logic succeeded. Use if/else instead so control falls through naturally.
The cmd/calico multi-arch build already cross-compiles the uber binary for arm64, ppc64le, and s390x. All Felix Go code is linked into the uber binary, so the Felix-specific multi-arch build doesn't add coverage.
1ebf66d to 6acb8de
Temporary PR to test CI pipeline changes on Semaphore. Includes the uber calico image changes plus CI optimizations stacked on top. Not intended for merge - will close once validated.