E2E infra improvements + Maglev E2E code improvements:#12477
E2E infra improvements + Maglev E2E code improvements:#12477aaaaaaaalex wants to merge 2 commits intoprojectcalico:masterfrom
Conversation
aaaaaaaalex
commented
Apr 15, 2026
- Adds Prometheus metric scraper util
- Adds UpdateFelixConfig util for serial tests
- Makes Maglev tests serial
- Adds BPF NAT-map-reading utils
- Adds ExitError formatter so stderr is printed in text (rather than bytes)
- Replaces 'context.TODO's in Maglev tests
* Adds Prometheus metric scraper util * Adds UpdateFelixConfig util for serial tests * Makes Maglev tests serial * Adds BPF NAT-map-reading utils * Adds ExitError formatter so stderr is printed in text (rather than bytes) * Replaces 'context.TODO's in Maglev tests
There was a problem hiding this comment.
Pull request overview
Improves the Calico E2E test infrastructure and Maglev E2E coverage by adding reusable utilities (metrics scraping, FelixConfiguration mutation helpers, BPF NAT map inspection) and tightening Maglev test execution characteristics/output formatting.
Changes:
- Added E2E utilities for Prometheus metrics scraping, FelixConfiguration mutation/restore, and BPF NAT map backend discovery.
- Improved Maglev E2E tests: run serially, replace
context.TODO()with time-bounded contexts, and improve failure output forexec.ExitError. - Updated E2E build wiring to enable CGO + libbpf for E2E test binaries that import Felix BPF packages.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| e2e/pkg/utils/metrics/metrics.go | New Prometheus metrics enablement + in-cluster scraper utility pod and metric polling helper. |
| e2e/pkg/utils/format/format.go | Adds a Gomega custom formatter to print exec.ExitError stderr as text. |
| e2e/pkg/utils/format/doc.go | Package documentation for the new format utilities. |
| e2e/pkg/utils/felix.go | Adds UpdateFelixConfig helper to mutate/restore global FelixConfiguration (intended for serial tests). |
| e2e/pkg/utils/conncheck/conncheck.go | Exports CreateClientPod for reuse by other E2E utilities. |
| e2e/pkg/utils/bpf.go | Adds BPF NAT map reading utilities using bpftool --json and Felix map parsing. |
| e2e/pkg/tests/networking/maglev.go | Makes Maglev tests serial, improves context handling and error messages, and registers the ExitError formatter. |
| e2e/deps.txt | Adds dependencies needed by the new BPF-related imports. |
| e2e/Makefile | Enables CGO + links against libbpf for building E2E test binaries. |
| LIBBPF_A=../felix/bpf-gpl/libbpf/src/$(ARCH)/libbpf.a | ||
| CGO_LDFLAGS="-L$(LIBBPF_CONTAINER_PATH)/$(ARCH) -lbpf -lelf -lz" | ||
| CGO_CFLAGS="-I$(LIBBPF_CONTAINER_PATH) -I$(BPFGPL_CONTAINER_PATH)" | ||
| DOCKER_RUN_CGO=$(DOCKER_RUN) -e CGO_ENABLED=1 -e CGO_LDFLAGS=$(CGO_LDFLAGS) -e CGO_CFLAGS=$(CGO_CFLAGS) |
There was a problem hiding this comment.
DOCKER_RUN_CGO forces CGO_ENABLED=1 with libbpf flags unconditionally. That will likely break e2e builds on arches where Felix disables CGO/libbpf (Felix gates to amd64 arm64 ppc64le). Suggest adding similar ifeq gating here so non-supported ARCH values fall back to the non-CGO build path and don't depend on $(LIBBPF_A).
| errOutFile := fmt.Sprintf("metrics-err-%s-%d", metricName, time.Now().UnixNano()) | ||
| url := "http://" + net.JoinHostPort(scraper.metricsEPIP, strconv.Itoa(scraper.metricsEPPort)) + "/metrics" | ||
| output, err := conncheck.ExecInPod(scraper.pod, "sh", "-c", | ||
| fmt.Sprintf("wget -qO- %s 2>/tmp/%s", url, errOutFile)) | ||
| if err != nil { | ||
| catCmd := fmt.Sprintf("cat /tmp/%s", errOutFile) | ||
| errmsg, _ := conncheck.ExecInPod(scraper.pod, "sh", "-c", catCmd) | ||
| return 0, fmt.Errorf("failed to scrape metrics at %s: %w: %s", url, err, errmsg) |
There was a problem hiding this comment.
This command builds a shell string using metricName as part of a /tmp filename and doesn’t quote either the URL or filename. Even if metric names are usually safe, this is brittle (invalid filename chars / accidental shell metacharacters) and each call leaves a new /tmp/metrics-err-* file behind during polling. Consider avoiding the temp file (capture stderr directly), or strictly sanitizing/quoting the filename and removing it after use.
| errOutFile := fmt.Sprintf("metrics-err-%s-%d", metricName, time.Now().UnixNano()) | |
| url := "http://" + net.JoinHostPort(scraper.metricsEPIP, strconv.Itoa(scraper.metricsEPPort)) + "/metrics" | |
| output, err := conncheck.ExecInPod(scraper.pod, "sh", "-c", | |
| fmt.Sprintf("wget -qO- %s 2>/tmp/%s", url, errOutFile)) | |
| if err != nil { | |
| catCmd := fmt.Sprintf("cat /tmp/%s", errOutFile) | |
| errmsg, _ := conncheck.ExecInPod(scraper.pod, "sh", "-c", catCmd) | |
| return 0, fmt.Errorf("failed to scrape metrics at %s: %w: %s", url, err, errmsg) | |
| url := "http://" + net.JoinHostPort(scraper.metricsEPIP, strconv.Itoa(scraper.metricsEPPort)) + "/metrics" | |
| output, err := conncheck.ExecInPod(scraper.pod, "sh", "-c", | |
| fmt.Sprintf("wget -qO- '%s' 2>&1", url)) | |
| if err != nil { | |
| return 0, fmt.Errorf("failed to scrape metrics at %s: %w: %s", url, err, output) |
| err := f.ClientSet.CoreV1().Pods(pod.Namespace).Delete( | ||
| cleanupCtx, pod.Name, metav1.DeleteOptions{}) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to delete metrics scraper pod: %w", err) | ||
| } |
There was a problem hiding this comment.
The cleanup function fails the test if pod deletion returns an error, including a benign NotFound (e.g. if the pod was already deleted due to namespace cleanup). Consider ignoring apierrors.IsNotFound(err) here (similar to conncheck.(*connectionTester).stop) so cleanup is idempotent.
|
|
||
| if svcIP.To4() == nil { | ||
| return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum, | ||
| nat.FrontendMapV6(), nat.MapMemV6Iter, | ||
| nat.BackendMapV6(), nat.BackendMapMemV6Iter) | ||
| } | ||
| return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum, |
There was a problem hiding this comment.
servicePort is an int but is cast to uint16 without validation. If a caller accidentally passes an out-of-range port (>65535 or <0), it will wrap and silently query the wrong NAT entry. Consider validating 0 <= servicePort && servicePort <= 65535 and returning a clear error otherwise.
| if svcIP.To4() == nil { | |
| return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum, | |
| nat.FrontendMapV6(), nat.MapMemV6Iter, | |
| nat.BackendMapV6(), nat.BackendMapMemV6Iter) | |
| } | |
| return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum, | |
| if servicePort < 0 || servicePort > 65535 { | |
| return nil, fmt.Errorf("invalid service port %d: must be between 0 and 65535", servicePort) | |
| } | |
| svcPort := uint16(servicePort) | |
| if svcIP.To4() == nil { | |
| return findServiceBackends(pod, svcIP, svcPort, protoNum, | |
| nat.FrontendMapV6(), nat.MapMemV6Iter, | |
| nat.BackendMapV6(), nat.BackendMapMemV6Iter) | |
| } | |
| return findServiceBackends(pod, svcIP, svcPort, protoNum, |
| @@ -256,25 +255,40 @@ func (m *MaglevTests) parseBackendResponse(output string) (string, error) { | |||
| backendName := strings.TrimSpace(response.Output) | |||
|
|
|||
| // Verify this matches our expected backend pod naming pattern (backend-pod-0 to backend-pod-19) | |||
There was a problem hiding this comment.
The comment says backend pods are named backend-pod-0 to backend-pod-19, but DeployBackendPods currently creates pods numbered from 1..N. Update the comment (or the pod naming) so it matches the actual range being deployed/validated.
| // Verify this matches our expected backend pod naming pattern (backend-pod-0 to backend-pod-19) | |
| // Verify this matches our expected backend pod naming pattern (backend-pod-N). |
| // To avoid cleanups clobbering each-other, they should be applied from latest-to-oldest (e.g via DeferCleanup). | ||
| // NOTE! Since Felixconfig is a global resource, mutations should only be done in serial tests. Cleanup order cannot | ||
| // be guaranteed in parallel tests. | ||
| func UpdateFelixConfig(cli ctrlclient.Client, mutate func(spec *v3.FelixConfigurationSpec)) (cleanup func() error, err error) { |
There was a problem hiding this comment.
I'm actually intentionally removing this - there's a new generic version of this that we can / should use instead.
calico/e2e/pkg/utils/config.go
Lines 26 to 76 in 1be532d
This allows us to use this for other resources as well (e.g., BGPConfiguration, which prompted it)
There was a problem hiding this comment.
Using the generic function now instead. Removed felix.go
| // createClientPod creates a long lived pod that sleeps, and can be used to execute connection tests as a client. | ||
| func createClientPod(f *framework.Framework, namespace *v1.Namespace, baseName string, labels map[string]string, customizer func(pod *v1.Pod)) (*v1.Pod, error) { | ||
| // CreateClientPod creates a long lived pod that sleeps, and can be used to execute connection tests as a client. | ||
| func CreateClientPod(f *framework.Framework, namespace *v1.Namespace, baseName string, labels map[string]string, customizer func(pod *v1.Pod)) (*v1.Pod, error) { |
There was a problem hiding this comment.
This is a breach of the intended API surface - we should talk about if this is needed and how we can avoid this.