fix(operator): add startup probe to proxyrunner deployment#5300

Merged
ChrisJBurns merged 2 commits into
stacklok:mainfrom
gabrielcosi:proxy-startup-probe
May 18, 2026

Conversation

@gabrielcosi
Contributor

Summary

The proxyrunner opens its listener on port 8080 only once the upstream MCP pod is Ready. With only a LivenessProbe (60s effective budget), any cold start slower than that is killed before the listener is up. This causes spurious proxy restarts during normal rollouts and whenever the upstream MCP server is slow to start.

  • Add a StartupProbe to the proxyrunner container in both deploymentForMCPServer and deploymentForMCPRemoteProxy. While the startup probe is active, liveness and readiness are suspended; once it passes, the existing probes take over unchanged.
  • Defaults: PeriodSeconds=5, TimeoutSeconds=3, FailureThreshold=18. This gives a 90s cold-start budget (18 failures × 5s period), 30s above the current effective liveness budget.

Fixes #5299

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Does this introduce a user-facing change?

The proxyrunner pod now has a StartupProbe. Existing liveness and readiness behaviour is unchanged. Net effect: proxy pods no longer restart during slow upstream cold starts.

Special notes for reviewers

Probe parameters are still hardcoded. A proper fix would expose them as overrides on MCPServer.spec.resourceOverrides.proxyDeployment and at the operator level; that is described as a follow-up in the linked issue and intentionally not part of this PR.

@github-actions github-actions Bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels May 18, 2026
@codecov

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.37%. Comparing base (5aa045b) to head (61d7bcd).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5300      +/-   ##
==========================================
+ Coverage   68.36%   68.37%   +0.01%     
==========================================
  Files         620      620              
  Lines       63419    63431      +12     
==========================================
+ Hits        43357    43374      +17     
+ Misses      16823    16817       -6     
- Partials     3239     3240       +1     

@ChrisJBurns ChrisJBurns merged commit f78bb42 into stacklok:main May 18, 2026
51 of 52 checks passed

Labels

size/XS Extra small PR: < 100 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Proxyrunner liveness probe restarts the pod when upstream MCPServer takes >60s to become Ready

2 participants