E2E infra improvements + Maglev E2E code improvements: by aaaaaaaalex · Pull Request #12477 · projectcalico/calico

aaaaaaaalex · 2026-04-15T11:20:50Z

Adds Prometheus metric scraper util
Adds UpdateFelixConfig util for serial tests
Makes Maglev tests serial
Adds BPF NAT-map-reading utils
Adds ExitError formatter so stderr is printed in text (rather than bytes)
Replaces 'context.TODO's in Maglev tests

* Adds Prometheus metric scraper util * Adds UpdateFelixConfig util for serial tests * Makes Maglev tests serial * Adds BPF NAT-map-reading utils * Adds ExitError formatter so stderr is printed in text (rather than bytes) * Replaces 'context.TODO's in Maglev tests

Copilot

Pull request overview

Improves the Calico E2E test infrastructure and Maglev E2E coverage by adding reusable utilities (metrics scraping, FelixConfiguration mutation helpers, BPF NAT map inspection) and tightening Maglev test execution characteristics/output formatting.

Changes:

Added E2E utilities for Prometheus metrics scraping, FelixConfiguration mutation/restore, and BPF NAT map backend discovery.
Improved Maglev E2E tests: run serially, replace context.TODO() with time-bounded contexts, and improve failure output for exec.ExitError.
Updated E2E build wiring to enable CGO + libbpf for E2E test binaries that import Felix BPF packages.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
e2e/pkg/utils/metrics/metrics.go	New Prometheus metrics enablement + in-cluster scraper utility pod and metric polling helper.
e2e/pkg/utils/format/format.go	Adds a Gomega custom formatter to print `exec.ExitError` stderr as text.
e2e/pkg/utils/format/doc.go	Package documentation for the new `format` utilities.
e2e/pkg/utils/felix.go	Adds `UpdateFelixConfig` helper to mutate/restore global FelixConfiguration (intended for serial tests).
e2e/pkg/utils/conncheck/conncheck.go	Exports `CreateClientPod` for reuse by other E2E utilities.
e2e/pkg/utils/bpf.go	Adds BPF NAT map reading utilities using `bpftool --json` and Felix map parsing.
e2e/pkg/tests/networking/maglev.go	Makes Maglev tests serial, improves context handling and error messages, and registers the ExitError formatter.
e2e/deps.txt	Adds dependencies needed by the new BPF-related imports.
e2e/Makefile	Enables CGO + links against libbpf for building E2E test binaries.

Copilot · 2026-04-15T11:25:48Z

+LIBBPF_A=../felix/bpf-gpl/libbpf/src/$(ARCH)/libbpf.a
+CGO_LDFLAGS="-L$(LIBBPF_CONTAINER_PATH)/$(ARCH) -lbpf -lelf -lz"
+CGO_CFLAGS="-I$(LIBBPF_CONTAINER_PATH) -I$(BPFGPL_CONTAINER_PATH)"
+DOCKER_RUN_CGO=$(DOCKER_RUN) -e CGO_ENABLED=1 -e CGO_LDFLAGS=$(CGO_LDFLAGS) -e CGO_CFLAGS=$(CGO_CFLAGS)


DOCKER_RUN_CGO forces CGO_ENABLED=1 with libbpf flags unconditionally. That will likely break e2e builds on arches where Felix disables CGO/libbpf (Felix gates to amd64 arm64 ppc64le). Suggest adding similar ifeq gating here so non-supported ARCH values fall back to the non-CGO build path and don't depend on $(LIBBPF_A).

Copilot · 2026-04-15T11:25:49Z

+		errOutFile := fmt.Sprintf("metrics-err-%s-%d", metricName, time.Now().UnixNano())
+		url := "http://" + net.JoinHostPort(scraper.metricsEPIP, strconv.Itoa(scraper.metricsEPPort)) + "/metrics"
+		output, err := conncheck.ExecInPod(scraper.pod, "sh", "-c",
+			fmt.Sprintf("wget -qO- %s 2>/tmp/%s", url, errOutFile))
+		if err != nil {
+			catCmd := fmt.Sprintf("cat /tmp/%s", errOutFile)
+			errmsg, _ := conncheck.ExecInPod(scraper.pod, "sh", "-c", catCmd)
+			return 0, fmt.Errorf("failed to scrape metrics at %s: %w: %s", url, err, errmsg)


This command builds a shell string using metricName as part of a /tmp filename and doesn’t quote either the URL or filename. Even if metric names are usually safe, this is brittle (invalid filename chars / accidental shell metacharacters) and each call leaves a new /tmp/metrics-err-* file behind during polling. Consider avoiding the temp file (capture stderr directly), or strictly sanitizing/quoting the filename and removing it after use.

Suggested change

errOutFile := fmt.Sprintf("metrics-err-%s-%d", metricName, time.Now().UnixNano())

url := "http://" + net.JoinHostPort(scraper.metricsEPIP, strconv.Itoa(scraper.metricsEPPort)) + "/metrics"

output, err := conncheck.ExecInPod(scraper.pod, "sh", "-c",

fmt.Sprintf("wget -qO- %s 2>/tmp/%s", url, errOutFile))

if err != nil {

catCmd := fmt.Sprintf("cat /tmp/%s", errOutFile)

errmsg, _ := conncheck.ExecInPod(scraper.pod, "sh", "-c", catCmd)

return 0, fmt.Errorf("failed to scrape metrics at %s: %w: %s", url, err, errmsg)

url := "http://" + net.JoinHostPort(scraper.metricsEPIP, strconv.Itoa(scraper.metricsEPPort)) + "/metrics"

output, err := conncheck.ExecInPod(scraper.pod, "sh", "-c",

fmt.Sprintf("wget -qO- '%s' 2>&1", url))

if err != nil {

return 0, fmt.Errorf("failed to scrape metrics at %s: %w: %s", url, err, output)

Copilot · 2026-04-15T11:25:50Z

+		err := f.ClientSet.CoreV1().Pods(pod.Namespace).Delete(
+			cleanupCtx, pod.Name, metav1.DeleteOptions{})
+		if err != nil {
+			return fmt.Errorf("failed to delete metrics scraper pod: %w", err)
+		}


The cleanup function fails the test if pod deletion returns an error, including a benign NotFound (e.g. if the pod was already deleted due to namespace cleanup). Consider ignoring apierrors.IsNotFound(err) here (similar to conncheck.(*connectionTester).stop) so cleanup is idempotent.

Copilot · 2026-04-15T11:25:50Z

+
+	if svcIP.To4() == nil {
+		return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum,
+			nat.FrontendMapV6(), nat.MapMemV6Iter,
+			nat.BackendMapV6(), nat.BackendMapMemV6Iter)
+	}
+	return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum,


servicePort is an int but is cast to uint16 without validation. If a caller accidentally passes an out-of-range port (>65535 or <0), it will wrap and silently query the wrong NAT entry. Consider validating 0 <= servicePort && servicePort <= 65535 and returning a clear error otherwise.

Suggested change

if svcIP.To4() == nil {

return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum,

nat.FrontendMapV6(), nat.MapMemV6Iter,

nat.BackendMapV6(), nat.BackendMapMemV6Iter)

}

return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum,

if servicePort < 0 || servicePort > 65535 {

return nil, fmt.Errorf("invalid service port %d: must be between 0 and 65535", servicePort)

}

svcPort := uint16(servicePort)

if svcIP.To4() == nil {

return findServiceBackends(pod, svcIP, svcPort, protoNum,

nat.FrontendMapV6(), nat.MapMemV6Iter,

nat.BackendMapV6(), nat.BackendMapMemV6Iter)

}

return findServiceBackends(pod, svcIP, svcPort, protoNum,

Copilot · 2026-04-15T11:25:50Z

@@ -256,25 +255,40 @@ func (m *MaglevTests) parseBackendResponse(output string) (string, error) {
 	backendName := strings.TrimSpace(response.Output)

 	// Verify this matches our expected backend pod naming pattern (backend-pod-0 to backend-pod-19)


The comment says backend pods are named backend-pod-0 to backend-pod-19, but DeployBackendPods currently creates pods numbered from 1..N. Update the comment (or the pod naming) so it matches the actual range being deployed/validated.

Suggested change

// Verify this matches our expected backend pod naming pattern (backend-pod-0 to backend-pod-19)

// Verify this matches our expected backend pod naming pattern (backend-pod-N).

caseydavenport · 2026-04-15T14:11:50Z

+// To avoid cleanups clobbering each-other, they should be applied from latest-to-oldest (e.g via DeferCleanup).
+// NOTE! Since Felixconfig is a global resource, mutations should only be done in serial tests. Cleanup order cannot
+// be guaranteed in parallel tests.
+func UpdateFelixConfig(cli ctrlclient.Client, mutate func(spec *v3.FelixConfigurationSpec)) (cleanup func() error, err error) {


I'm actually intentionally removing this - there's a new generic version of this that we can / should use instead.

calico/e2e/pkg/utils/config.go

Lines 26 to 76 in 1be532d

// ConfigureWithCleanup fetches an object by key, applies a mutation, and

// returns a cleanup function that restores the original state. If the object

// doesn't exist, it is created with the mutation applied and the cleanup

// function deletes it.

//

// Typical usage with Ginkgo:

//

// restore, err := utils.ConfigureWithCleanup(cli, ctrlclient.ObjectKey{Name: "default"}, &v3.IPAMConfiguration{}, func(cfg *v3.IPAMConfiguration) {

// cfg.Spec.StrictAffinity = false

// })

// Expect(err).NotTo(HaveOccurred())

// DeferCleanup(restore)

func ConfigureWithCleanup[T ctrlclient.Object](cli ctrlclient.Client, key ctrlclient.ObjectKey, obj T, mutate func(T)) (func(), error) {

ctx := context.Background()

err := cli.Get(ctx, key, obj)

if apierrors.IsNotFound(err) {

obj.SetName(key.Name)

obj.SetNamespace(key.Namespace)

mutate(obj)

if err := cli.Create(ctx, obj); err != nil {

return nil, fmt.Errorf("failed to create %T: %w", obj, err)

}

return func() {

if err := cli.Delete(context.Background(), obj); err != nil && !apierrors.IsNotFound(err) {

framework.Logf("WARNING: failed to delete %T %s: %v", obj, key, err)

}

}, nil

}

if err != nil {

return nil, fmt.Errorf("failed to get %T: %w", obj, err)

}

original := obj.DeepCopyObject().(T)

mutate(obj)

if err := cli.Update(ctx, obj); err != nil {

return nil, fmt.Errorf("failed to update %T: %w", obj, err)

}

return func() {

// Re-fetch to get the current resourceVersion before restoring.

if err := cli.Get(context.Background(), key, obj); err != nil {

framework.Logf("WARNING: failed to get %T for restoration: %v", obj, err)

return

}

original.SetResourceVersion(obj.GetResourceVersion())

if err := cli.Update(context.Background(), original); err != nil {

framework.Logf("WARNING: failed to restore %T: %v", obj, err)

}

}, nil

}

This allows us to use this for other resources as well (e.g., BGPConfiguration, which prompted it)

Using the generic function now instead. Removed felix.go

caseydavenport · 2026-04-15T14:12:50Z

-// createClientPod creates a long lived pod that sleeps, and can be used to execute connection tests as a client.
-func createClientPod(f *framework.Framework, namespace *v1.Namespace, baseName string, labels map[string]string, customizer func(pod *v1.Pod)) (*v1.Pod, error) {
+// CreateClientPod creates a long lived pod that sleeps, and can be used to execute connection tests as a client.
+func CreateClientPod(f *framework.Framework, namespace *v1.Namespace, baseName string, labels map[string]string, customizer func(pod *v1.Pod)) (*v1.Pod, error) {


This is a breach of the intended API surface - we should talk about if this is needed and how we can avoid this.

aaaaaaaalex self-assigned this Apr 15, 2026

Copilot AI review requested due to automatic review settings April 15, 2026 11:20

aaaaaaaalex requested a review from a team as a code owner April 15, 2026 11:20

aaaaaaaalex added docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact labels Apr 15, 2026

marvin-tigera added this to the Calico v3.33.0 milestone Apr 15, 2026

Copilot started reviewing on behalf of aaaaaaaalex April 15, 2026 11:21 View session

Copilot AI reviewed Apr 15, 2026

View reviewed changes

caseydavenport reviewed Apr 15, 2026

View reviewed changes

Remove UpdateFelixConfig, use generic ConfigureWithCleanup instead

2ae8e34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

E2E infra improvements + Maglev E2E code improvements:#12477

E2E infra improvements + Maglev E2E code improvements:#12477
aaaaaaaalex wants to merge 2 commits intoprojectcalico:masterfrom
aaaaaaaalex:e2e/maglev-improvements

aaaaaaaalex commented Apr 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 15, 2026

Uh oh!

Copilot AI Apr 15, 2026

Uh oh!

Copilot AI Apr 15, 2026

Uh oh!

Copilot AI Apr 15, 2026

Uh oh!

Copilot AI Apr 15, 2026

Uh oh!

caseydavenport Apr 15, 2026

Uh oh!

aaaaaaaalex Apr 15, 2026

Uh oh!

caseydavenport Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		@@ -256,25 +255,40 @@ func (m *MaglevTests) parseBackendResponse(output string) (string, error) {
		backendName := strings.TrimSpace(response.Output)

		// Verify this matches our expected backend pod naming pattern (backend-pod-0 to backend-pod-19)

	// ConfigureWithCleanup fetches an object by key, applies a mutation, and
	// returns a cleanup function that restores the original state. If the object
	// doesn't exist, it is created with the mutation applied and the cleanup
	// function deletes it.
	//
	// Typical usage with Ginkgo:
	//
	// restore, err := utils.ConfigureWithCleanup(cli, ctrlclient.ObjectKey{Name: "default"}, &v3.IPAMConfiguration{}, func(cfg *v3.IPAMConfiguration) {
	// cfg.Spec.StrictAffinity = false
	// })
	// Expect(err).NotTo(HaveOccurred())
	// DeferCleanup(restore)
	func ConfigureWithCleanup[T ctrlclient.Object](cli ctrlclient.Client, key ctrlclient.ObjectKey, obj T, mutate func(T)) (func(), error) {
	ctx := context.Background()

	err := cli.Get(ctx, key, obj)
	if apierrors.IsNotFound(err) {
	obj.SetName(key.Name)
	obj.SetNamespace(key.Namespace)
	mutate(obj)
	if err := cli.Create(ctx, obj); err != nil {
	return nil, fmt.Errorf("failed to create %T: %w", obj, err)
	}
	return func() {
	if err := cli.Delete(context.Background(), obj); err != nil && !apierrors.IsNotFound(err) {
	framework.Logf("WARNING: failed to delete %T %s: %v", obj, key, err)
	}
	}, nil
	}
	if err != nil {
	return nil, fmt.Errorf("failed to get %T: %w", obj, err)
	}

	original := obj.DeepCopyObject().(T)
	mutate(obj)
	if err := cli.Update(ctx, obj); err != nil {
	return nil, fmt.Errorf("failed to update %T: %w", obj, err)
	}

	return func() {
	// Re-fetch to get the current resourceVersion before restoring.
	if err := cli.Get(context.Background(), key, obj); err != nil {
	framework.Logf("WARNING: failed to get %T for restoration: %v", obj, err)
	return
	}
	original.SetResourceVersion(obj.GetResourceVersion())
	if err := cli.Update(context.Background(), original); err != nil {
	framework.Logf("WARNING: failed to restore %T: %v", obj, err)
	}
	}, nil
	}

Conversation

aaaaaaaalex commented Apr 15, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

caseydavenport Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

aaaaaaaalex Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

caseydavenport Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants