Skip to content

E2E infra improvements + Maglev E2E code improvements:#12477

Open
aaaaaaaalex wants to merge 2 commits intoprojectcalico:masterfrom
aaaaaaaalex:e2e/maglev-improvements
Open

E2E infra improvements + Maglev E2E code improvements:#12477
aaaaaaaalex wants to merge 2 commits intoprojectcalico:masterfrom
aaaaaaaalex:e2e/maglev-improvements

Conversation

@aaaaaaaalex
Copy link
Copy Markdown
Contributor

  • Adds Prometheus metric scraper util
  • Adds UpdateFelixConfig util for serial tests
  • Makes Maglev tests serial
  • Adds BPF NAT-map-reading utils
  • Adds ExitError formatter so stderr is printed in text (rather than bytes)
  • Replaces 'context.TODO's in Maglev tests

* Adds Prometheus metric scraper util
* Adds UpdateFelixConfig util for serial tests
* Makes Maglev tests serial
* Adds BPF NAT-map-reading utils
* Adds ExitError formatter so stderr is printed in text (rather than bytes)
* Replaces 'context.TODO's in Maglev tests
@aaaaaaaalex aaaaaaaalex self-assigned this Apr 15, 2026
Copilot AI review requested due to automatic review settings April 15, 2026 11:20
@aaaaaaaalex aaaaaaaalex requested a review from a team as a code owner April 15, 2026 11:20
@aaaaaaaalex aaaaaaaalex added docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact labels Apr 15, 2026
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Apr 15, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves the Calico E2E test infrastructure and Maglev E2E coverage by adding reusable utilities (metrics scraping, FelixConfiguration mutation helpers, BPF NAT map inspection) and tightening Maglev test execution characteristics/output formatting.

Changes:

  • Added E2E utilities for Prometheus metrics scraping, FelixConfiguration mutation/restore, and BPF NAT map backend discovery.
  • Improved Maglev E2E tests: run serially, replace context.TODO() with time-bounded contexts, and improve failure output for exec.ExitError.
  • Updated E2E build wiring to enable CGO + libbpf for E2E test binaries that import Felix BPF packages.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
e2e/pkg/utils/metrics/metrics.go New Prometheus metrics enablement + in-cluster scraper utility pod and metric polling helper.
e2e/pkg/utils/format/format.go Adds a Gomega custom formatter to print exec.ExitError stderr as text.
e2e/pkg/utils/format/doc.go Package documentation for the new format utilities.
e2e/pkg/utils/felix.go Adds UpdateFelixConfig helper to mutate/restore global FelixConfiguration (intended for serial tests).
e2e/pkg/utils/conncheck/conncheck.go Exports CreateClientPod for reuse by other E2E utilities.
e2e/pkg/utils/bpf.go Adds BPF NAT map reading utilities using bpftool --json and Felix map parsing.
e2e/pkg/tests/networking/maglev.go Makes Maglev tests serial, improves context handling and error messages, and registers the ExitError formatter.
e2e/deps.txt Adds dependencies needed by the new BPF-related imports.
e2e/Makefile Enables CGO + links against libbpf for building E2E test binaries.

Comment thread e2e/Makefile
Comment on lines +10 to +13
LIBBPF_A=../felix/bpf-gpl/libbpf/src/$(ARCH)/libbpf.a
CGO_LDFLAGS="-L$(LIBBPF_CONTAINER_PATH)/$(ARCH) -lbpf -lelf -lz"
CGO_CFLAGS="-I$(LIBBPF_CONTAINER_PATH) -I$(BPFGPL_CONTAINER_PATH)"
DOCKER_RUN_CGO=$(DOCKER_RUN) -e CGO_ENABLED=1 -e CGO_LDFLAGS=$(CGO_LDFLAGS) -e CGO_CFLAGS=$(CGO_CFLAGS)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOCKER_RUN_CGO forces CGO_ENABLED=1 with libbpf flags unconditionally. That will likely break e2e builds on arches where Felix disables CGO/libbpf (Felix gates to amd64 arm64 ppc64le). Suggest adding similar ifeq gating here so non-supported ARCH values fall back to the non-CGO build path and don't depend on $(LIBBPF_A).

Copilot uses AI. Check for mistakes.
Comment on lines +112 to +119
errOutFile := fmt.Sprintf("metrics-err-%s-%d", metricName, time.Now().UnixNano())
url := "http://" + net.JoinHostPort(scraper.metricsEPIP, strconv.Itoa(scraper.metricsEPPort)) + "/metrics"
output, err := conncheck.ExecInPod(scraper.pod, "sh", "-c",
fmt.Sprintf("wget -qO- %s 2>/tmp/%s", url, errOutFile))
if err != nil {
catCmd := fmt.Sprintf("cat /tmp/%s", errOutFile)
errmsg, _ := conncheck.ExecInPod(scraper.pod, "sh", "-c", catCmd)
return 0, fmt.Errorf("failed to scrape metrics at %s: %w: %s", url, err, errmsg)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command builds a shell string using metricName as part of a /tmp filename and doesn’t quote either the URL or filename. Even if metric names are usually safe, this is brittle (invalid filename chars / accidental shell metacharacters) and each call leaves a new /tmp/metrics-err-* file behind during polling. Consider avoiding the temp file (capture stderr directly), or strictly sanitizing/quoting the filename and removing it after use.

Suggested change
errOutFile := fmt.Sprintf("metrics-err-%s-%d", metricName, time.Now().UnixNano())
url := "http://" + net.JoinHostPort(scraper.metricsEPIP, strconv.Itoa(scraper.metricsEPPort)) + "/metrics"
output, err := conncheck.ExecInPod(scraper.pod, "sh", "-c",
fmt.Sprintf("wget -qO- %s 2>/tmp/%s", url, errOutFile))
if err != nil {
catCmd := fmt.Sprintf("cat /tmp/%s", errOutFile)
errmsg, _ := conncheck.ExecInPod(scraper.pod, "sh", "-c", catCmd)
return 0, fmt.Errorf("failed to scrape metrics at %s: %w: %s", url, err, errmsg)
url := "http://" + net.JoinHostPort(scraper.metricsEPIP, strconv.Itoa(scraper.metricsEPPort)) + "/metrics"
output, err := conncheck.ExecInPod(scraper.pod, "sh", "-c",
fmt.Sprintf("wget -qO- '%s' 2>&1", url))
if err != nil {
return 0, fmt.Errorf("failed to scrape metrics at %s: %w: %s", url, err, output)

Copilot uses AI. Check for mistakes.
Comment on lines +99 to +103
err := f.ClientSet.CoreV1().Pods(pod.Namespace).Delete(
cleanupCtx, pod.Name, metav1.DeleteOptions{})
if err != nil {
return fmt.Errorf("failed to delete metrics scraper pod: %w", err)
}
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup function fails the test if pod deletion returns an error, including a benign NotFound (e.g. if the pod was already deleted due to namespace cleanup). Consider ignoring apierrors.IsNotFound(err) here (similar to conncheck.(*connectionTester).stop) so cleanup is idempotent.

Copilot uses AI. Check for mistakes.
Comment thread e2e/pkg/utils/bpf.go
Comment on lines +51 to +57

if svcIP.To4() == nil {
return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum,
nat.FrontendMapV6(), nat.MapMemV6Iter,
nat.BackendMapV6(), nat.BackendMapMemV6Iter)
}
return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum,
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

servicePort is an int but is cast to uint16 without validation. If a caller accidentally passes an out-of-range port (>65535 or <0), it will wrap and silently query the wrong NAT entry. Consider validating 0 <= servicePort && servicePort <= 65535 and returning a clear error otherwise.

Suggested change
if svcIP.To4() == nil {
return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum,
nat.FrontendMapV6(), nat.MapMemV6Iter,
nat.BackendMapV6(), nat.BackendMapMemV6Iter)
}
return findServiceBackends(pod, svcIP, uint16(servicePort), protoNum,
if servicePort < 0 || servicePort > 65535 {
return nil, fmt.Errorf("invalid service port %d: must be between 0 and 65535", servicePort)
}
svcPort := uint16(servicePort)
if svcIP.To4() == nil {
return findServiceBackends(pod, svcIP, svcPort, protoNum,
nat.FrontendMapV6(), nat.MapMemV6Iter,
nat.BackendMapV6(), nat.BackendMapMemV6Iter)
}
return findServiceBackends(pod, svcIP, svcPort, protoNum,

Copilot uses AI. Check for mistakes.
@@ -256,25 +255,40 @@ func (m *MaglevTests) parseBackendResponse(output string) (string, error) {
backendName := strings.TrimSpace(response.Output)

// Verify this matches our expected backend pod naming pattern (backend-pod-0 to backend-pod-19)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says backend pods are named backend-pod-0 to backend-pod-19, but DeployBackendPods currently creates pods numbered from 1..N. Update the comment (or the pod naming) so it matches the actual range being deployed/validated.

Suggested change
// Verify this matches our expected backend pod naming pattern (backend-pod-0 to backend-pod-19)
// Verify this matches our expected backend pod naming pattern (backend-pod-N).

Copilot uses AI. Check for mistakes.
Comment thread e2e/pkg/utils/felix.go Outdated
// To avoid cleanups clobbering each-other, they should be applied from latest-to-oldest (e.g via DeferCleanup).
// NOTE! Since Felixconfig is a global resource, mutations should only be done in serial tests. Cleanup order cannot
// be guaranteed in parallel tests.
func UpdateFelixConfig(cli ctrlclient.Client, mutate func(spec *v3.FelixConfigurationSpec)) (cleanup func() error, err error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually intentionally removing this - there's a new generic version of this that we can / should use instead.

// ConfigureWithCleanup fetches an object by key, applies a mutation, and
// returns a cleanup function that restores the original state. If the object
// doesn't exist, it is created with the mutation applied and the cleanup
// function deletes it.
//
// Typical usage with Ginkgo:
//
// restore, err := utils.ConfigureWithCleanup(cli, ctrlclient.ObjectKey{Name: "default"}, &v3.IPAMConfiguration{}, func(cfg *v3.IPAMConfiguration) {
// cfg.Spec.StrictAffinity = false
// })
// Expect(err).NotTo(HaveOccurred())
// DeferCleanup(restore)
func ConfigureWithCleanup[T ctrlclient.Object](cli ctrlclient.Client, key ctrlclient.ObjectKey, obj T, mutate func(T)) (func(), error) {
ctx := context.Background()
err := cli.Get(ctx, key, obj)
if apierrors.IsNotFound(err) {
obj.SetName(key.Name)
obj.SetNamespace(key.Namespace)
mutate(obj)
if err := cli.Create(ctx, obj); err != nil {
return nil, fmt.Errorf("failed to create %T: %w", obj, err)
}
return func() {
if err := cli.Delete(context.Background(), obj); err != nil && !apierrors.IsNotFound(err) {
framework.Logf("WARNING: failed to delete %T %s: %v", obj, key, err)
}
}, nil
}
if err != nil {
return nil, fmt.Errorf("failed to get %T: %w", obj, err)
}
original := obj.DeepCopyObject().(T)
mutate(obj)
if err := cli.Update(ctx, obj); err != nil {
return nil, fmt.Errorf("failed to update %T: %w", obj, err)
}
return func() {
// Re-fetch to get the current resourceVersion before restoring.
if err := cli.Get(context.Background(), key, obj); err != nil {
framework.Logf("WARNING: failed to get %T for restoration: %v", obj, err)
return
}
original.SetResourceVersion(obj.GetResourceVersion())
if err := cli.Update(context.Background(), original); err != nil {
framework.Logf("WARNING: failed to restore %T: %v", obj, err)
}
}, nil
}

This allows us to use this for other resources as well (e.g., BGPConfiguration, which prompted it)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the generic function now instead. Removed felix.go

// createClientPod creates a long lived pod that sleeps, and can be used to execute connection tests as a client.
func createClientPod(f *framework.Framework, namespace *v1.Namespace, baseName string, labels map[string]string, customizer func(pod *v1.Pod)) (*v1.Pod, error) {
// CreateClientPod creates a long lived pod that sleeps, and can be used to execute connection tests as a client.
func CreateClientPod(f *framework.Framework, namespace *v1.Namespace, baseName string, labels map[string]string, customizer func(pod *v1.Pod)) (*v1.Pod, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breach of the intended API surface - we should talk about if this is needed and how we can avoid this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required Docs not required for this change release-note-not-required Change has no user-facing impact

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants