Skip to content

feat(ci): add retry logic and metrics to critical workflows#17561

Closed
BrianCLong wants to merge 8 commits intomainfrom
feat/ci-reliability-retry-metrics-17180110948053763867
Closed

feat(ci): add retry logic and metrics to critical workflows#17561
BrianCLong wants to merge 8 commits intomainfrom
feat/ci-reliability-retry-metrics-17180110948053763867

Conversation

@BrianCLong
Copy link
Copy Markdown
Owner

@BrianCLong BrianCLong commented Feb 1, 2026

User description

This PR addresses CI reliability issues by adding shell-based retry loops to critical dependency installation and audit steps in GitHub Actions workflows. It also integrates a metrics collection job to track runner performance and queue times.

Key changes:

  • ci.yml: Added retries to lint, typecheck, unit-tests, soc-controls. Added ci-metrics job.
  • ci-verify.yml: Added retries to security-scan (install & audit), governance-checks, provenance, schema-validation, compliance-evidence. Added ci-metrics job.
  • _reusable-ga-readiness.yml: Added retries to pnpm install and npm audit.

PR created automatically by Jules for task 17180110948053763867 started by @BrianCLong


PR Type

Enhancement


Description

  • Add retry logic (3 attempts, 15s delay) to pnpm install across all workflows

  • Add retry logic to pnpm audit and npm audit steps for resilience

  • Integrate ci-metrics job in ci.yml and ci-verify.yml workflows

  • Improve CI reliability by handling transient network failures


Diagram Walkthrough

flowchart LR
  A["Dependency Installation"] -->|"3 retries, 15s delay"| B["pnpm install"]
  C["Security Audits"] -->|"3 retries, 15s delay"| D["pnpm/npm audit"]
  E["CI Workflows"] -->|"collect metrics"| F["ci-metrics job"]
  B --> G["Improved Reliability"]
  D --> G
  F --> G
Loading

File Walkthrough

Relevant files
Enhancement
_reusable-ga-readiness.yml
Add retry logic to dependency and audit steps                       

.github/workflows/_reusable-ga-readiness.yml

  • Added retry loop (3 attempts, 15s delay) to pnpm install
    --frozen-lockfile
  • Added retry loop (3 attempts, 15s delay) to npm audit
    --audit-level=high
  • Improves resilience against transient network failures in GA readiness
    checks
+2/-2     
ci-verify.yml
Add retries and metrics to verification workflow                 

.github/workflows/ci-verify.yml

  • Added retry loop (3 attempts, 15s delay) to pnpm install
    --frozen-lockfile in 5 jobs
  • Modified pnpm audit --audit-level critical to use retry loop with
    error handling
  • Added ci-metrics job that depends on all verification jobs and runs
    always
  • Improves CI reliability for security scanning, governance, provenance,
    schema validation, and compliance jobs
+17/-10 
ci.yml
Add retries and metrics to main CI workflow                           

.github/workflows/ci.yml

  • Added retry loop (3 attempts, 15s delay) to pnpm install
    --frozen-lockfile in 4 jobs
  • Added ci-metrics job that depends on all main CI jobs and runs always
  • Applies retries to lint, typecheck, unit-tests, and soc-controls jobs
  • Enhances CI reliability and enables metrics collection for performance
    tracking
+12/-4   

Summary by CodeRabbit

  • Chores
    • Improved CI/CD robustness with automatic retries for dependency installs and security audits across many workflows.
    • Added CI metrics collection and consolidated metrics reporting.
    • Made steps more resilient (continue-on-error tolerances, retries) and added explicit tooling/version/setup steps for pnpm, Node, and OPA.
    • Pinned several action versions, adopted pnpm in more jobs, refined artifact naming, adjusted SBOM/report paths, and added minor debug/test fixture steps.

- Implements retry logic (3 attempts, 15s delay) for `pnpm install` in `ci.yml`, `ci-verify.yml`, and `_reusable-ga-readiness.yml`.
- Implements retry logic for `pnpm/npm audit` steps to mitigate network flakes.
- Adds `ci-metrics` job to `ci.yml` and `ci-verify.yml` utilizing `_reusable-ci-metrics.yml` to capture queue times and performance data.

Co-authored-by: BrianCLong <6404035+BrianCLong@users.noreply.github.com>
@google-labs-jules
Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

@qodo-code-review
Copy link
Copy Markdown

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Opaque retry failures: The new retry loops for pnpm install fail with a generic exit 1 and no explicit
attempt/failure context (e.g., attempt number, final failure message), reducing actionable
debugging information when dependency installation repeatedly fails.

Referred Code
- run: for i in 1 2 3; do pnpm install --frozen-lockfile && exit 0 || sleep 15; done; exit 1
- run: pnpm run lint

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 1, 2026

Important

Review skipped

Too many files!

This PR contains 298 files, which is 148 over the limit of 150.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cda7a23a-772a-4947-ac55-0ee81efdcade

📥 Commits

Reviewing files that changed from the base of the PR and between 28bc734 and bebef8a.

📒 Files selected for processing (298)
  • .agentic-prompts/ci-ops-runbook.md
  • .agentic-prompts/task-19016-frontier-closure.md
  • .archive/v039/client/tsconfig.json
  • .archive/v039/server/package.json
  • .archive/v039/server/tsconfig.json
  • .artifacts/pr/schema.json
  • .ci/cosign-policy.sh
  • .ci/detections_unit.py
  • .ci/evidence_validate.py
  • .ci/scripts/release/evidence_packager.ts
  • .ci/supplychain_delta_check.py
  • .disabled/adc/tsconfig.json
  • .disabled/afl-store/tsconfig.json
  • .disabled/atl/tsconfig.json
  • .disabled/cfa-tdw/tsconfig.json
  • .dockerignore
  • .doclinkignore
  • .env.example
  • .github/.pre-commit-config.yaml
  • .github/ISSUE_TEMPLATE/agent-task.yml
  • .github/ISSUE_TEMPLATE/backlog-item.yml
  • .github/ISSUE_TEMPLATE/bootcamp-task.yaml
  • .github/ISSUE_TEMPLATE/bug.yaml
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • .github/ISSUE_TEMPLATE/capture-issue.md
  • .github/ISSUE_TEMPLATE/chore.yml
  • .github/ISSUE_TEMPLATE/config.yml
  • .github/ISSUE_TEMPLATE/dev_environment.yml
  • .github/ISSUE_TEMPLATE/docs_request.yml
  • .github/ISSUE_TEMPLATE/dsr.yml
  • .github/ISSUE_TEMPLATE/epic-eclipse-spiderfoot-rf.yml
  • .github/ISSUE_TEMPLATE/epic.yml
  • .github/ISSUE_TEMPLATE/feature-case-first-investigation-ux-palette.yml
  • .github/ISSUE_TEMPLATE/feature-evidence-integrity-gate-antigravity.yml
  • .github/ISSUE_TEMPLATE/feature-parity-kernel-codex.yml
  • .github/ISSUE_TEMPLATE/feature.yaml
  • .github/ISSUE_TEMPLATE/feature_request.yml
  • .github/ISSUE_TEMPLATE/ga_gates.yml
  • .github/ISSUE_TEMPLATE/incident.yaml
  • .github/ISSUE_TEMPLATE/incident.yml
  • .github/ISSUE_TEMPLATE/postmortem.yml
  • .github/ISSUE_TEMPLATE/promise-epic.yml
  • .github/ISSUE_TEMPLATE/promise-feature.yml
  • .github/ISSUE_TEMPLATE/release_regression.yaml
  • .github/ISSUE_TEMPLATE/roadmap-prompt.yml
  • .github/ISSUE_TEMPLATE/security-issue.yml
  • .github/ISSUE_TEMPLATE/spike.yml
  • .github/ISSUE_TEMPLATE/translation_request.yml
  • .github/ISSUE_TEMPLATE/triage.yml
  • .github/ISSUE_TEMPLATE/user_story.yml
  • .github/MILESTONES/ai-ethics-ga.yaml
  • .github/MILESTONES/declarative-pipelines-ga.yml
  • .github/MILESTONES/ga-cogops.yml
  • .github/MILESTONES/ga-infra-selfservice.yml
  • .github/MILESTONES/ga/ai_adoption.required_artifacts.json
  • .github/MILESTONES/infowar-sitrep-ga.yml
  • .github/MILESTONES/required_checks.todo.md
  • .github/MILESTONES/self_flow_ga.yml
  • .github/MILESTONES/semantic-search-ga.yml
  • .github/SECURITY.md
  • .github/actionlint.yaml
  • .github/actions/abp-build/action.yml
  • .github/actions/backlog-guard/action.yml
  • .github/actions/docker-build-push/action.yml
  • .github/actions/fabric-warm/action.yml
  • .github/actions/helm-deploy/action.yml
  • .github/actions/maestro-gate-check/action.yml
  • .github/actions/maestro-run/action.yml
  • .github/actions/release-bundle/action.yml
  • .github/actions/setup-pnpm/action.yml
  • .github/actions/setup-toolchain/action.yml
  • .github/actions/setup-turbo/action.yml
  • .github/actions/setup/action.yml
  • .github/actions/sigstore-verify/action.yml
  • .github/actions/verify-workflow-versions/action.yml
  • .github/actions/verify-workflow-versions/index.cjs
  • .github/assignees.yml
  • .github/auto-assign.yml
  • .github/auto-reviewers.yml
  • .github/ci-cost-policy.yml
  • .github/ci/action-pinning-allowlist.yml
  • .github/ci/docker-compose.ci.yml
  • .github/ci/permissions-allowlist.yml
  • .github/codeql/codeql-config.yml
  • .github/compose/pg_neo.yml
  • .github/container-structure-test.yaml
  • .github/copilot-instructions.md
  • .github/copilot-instructions.yml
  • .github/ct-helm.yaml
  • .github/dependabot.yml
  • .github/flake-registry.json
  • .github/governance/branch_protection_rules.json
  • .github/k6/intelgraph-canary-validation.js
  • .github/k6/rollout-canary.js
  • .github/kube-linter-config.yaml
  • .github/labeler.yml
  • .github/labels.json
  • .github/labels.yml
  • .github/merge-engine/README.md
  • .github/merge-engine/config.yml
  • .github/milestones.yml
  • .github/policies/agent-runtime/tool_access_policy.yaml
  • .github/policies/agent-security.rego
  • .github/policies/agent_governance.yml
  • .github/policies/ai-usage.rego
  • .github/policies/canonical-path-exceptions.json
  • .github/policies/dependency-cos.yml
  • .github/policies/dependency-worldmodel.yml
  • .github/policies/infra/README.md
  • .github/policies/infra/cost_guardrails.rego
  • .github/policies/infra/deny-by-default.rego
  • .github/policies/infra/dependency_allowedlist.rego
  • .github/policies/infra/environment_scope.rego
  • .github/policies/infra/resource_naming.rego
  • .github/policies/jurisdiction.policy.json
  • .github/policies/media-claims.policy.json
  • .github/policies/personal-intelligence.policy.md
  • .github/policies/pipeline-schema.rego
  • .github/policies/regulatory-early-warning-policy.rego
  • .github/policies/self_flow_policy.rego
  • .github/policies/slsa-spdx.rego
  • .github/policies/supplychain/verify.rego
  • .github/policies/supplychain/verify_test.rego
  • .github/policies/task-thread-access.rego
  • .github/protection-rules.yml
  • .github/pull_request_template.md
  • .github/release-drafter.yml
  • .github/required-checks.yml
  • .github/required_checks.todo.md
  • .github/roadmap_calendar.yml
  • .github/roadmap_mapping.yml
  • .github/roadmap_seeds.yml
  • .github/scripts/check-never-log.ts
  • .github/scripts/evidence-emit.ts
  • .github/scripts/infra-verify.ts
  • .github/scripts/issue-queue-bot/__tests__/bot.test.cjs
  • .github/scripts/issue-queue-bot/__tests__/classifier.test.cjs
  • .github/scripts/issue-queue-bot/__tests__/queueBot.test.js
  • .github/scripts/issue-queue-bot/bot.cjs
  • .github/scripts/issue-queue-bot/classifier.cjs
  • .github/scripts/issue-queue-bot/index.js
  • .github/scripts/issue-queue-bot/package.json
  • .github/scripts/issue-queue-bot/rules.json
  • .github/scripts/issue-queue-bot/run.cjs
  • .github/scripts/issue-queue-bot/run.js
  • .github/scripts/merge-engine/apply_labels.sh
  • .github/scripts/merge-engine/gh_pr_inventory.sh
  • .github/scripts/merge-engine/triage_prs.py
  • .github/scripts/merge-train-autopilot.sh
  • .github/scripts/never-log-scan.ts
  • .github/scripts/process-pr-batch.sh
  • .github/scripts/sigstore/verify.sh
  • .github/scripts/validate-evidence-schemas.mjs
  • .github/scripts/validate-evidence.ts
  • .github/scripts/verify-canonical-structure.cjs
  • .github/scripts/verify-dependency-delta.ts
  • .github/scripts/verify-evidence.mjs
  • .github/scripts/verify-regulatory-ew-evidence.ts
  • .github/scripts/verify-workflow-graphs.mjs
  • .github/scripts/verify_evidence_index.ts
  • .github/scripts/verify_self_flow.ts
  • .github/security-waivers.yml
  • .github/settings.yml
  • .github/stale.yml
  • .github/summit/README.md
  • .github/summit/agents/architectureDriftAgent.ts
  • .github/summit/agents/observabilityRollupAgent.ts
  • .github/summit/agents/readinessAgent.ts
  • .github/summit/agents/securityPostureAgent.ts
  • .github/summit/agents/triageAgent.ts
  • .github/summit/dashboards/engineering-health.json
  • .github/summit/dashboards/merge-readiness.json
  • .github/summit/dashboards/security-posture.json
  • .github/summit/event-router/routeEvent.ts
  • .github/summit/lib/artifacts.ts
  • .github/summit/lib/context.ts
  • .github/summit/policies/readiness-policy.json
  • .github/workflows/.archive/_auth-oidc.yml
  • .github/workflows/.archive/_deploy.yml
  • .github/workflows/.archive/_reusable-aws.yml
  • .github/workflows/.archive/_reusable-build.yml
  • .github/workflows/.archive/_reusable-ci-fast.yml
  • .github/workflows/.archive/_reusable-ci-metrics.yml
  • .github/workflows/.archive/_reusable-ci-perf.yml
  • .github/workflows/.archive/_reusable-ci.yml
  • .github/workflows/.archive/_reusable-governance-gate.yml
  • .github/workflows/.archive/_reusable-node-pnpm-setup.yml
  • .github/workflows/.archive/_reusable-release.yml
  • .github/workflows/.archive/_reusable-security-compliance.yml
  • .github/workflows/.archive/_reusable-setup.yml
  • .github/workflows/.archive/_reusable-slsa-build.yml
  • .github/workflows/.archive/_reusable-test-suite.yml
  • .github/workflows/.archive/_reusable-test.yml
  • .github/workflows/.archive/_reusable-toolchain-setup.yml
  • .github/workflows/.archive/a11y-lab.yml
  • .github/workflows/.archive/abac-policy.yml
  • .github/workflows/.archive/accessibility.yml
  • .github/workflows/.archive/admin-cli.yml
  • .github/workflows/.archive/agent-guardrails.yml
  • .github/workflows/.archive/agentic-lifecycle.yml
  • .github/workflows/.archive/agentic-plan-gate.yml
  • .github/workflows/.archive/agentic-policy-check.yml
  • .github/workflows/.archive/agentic-policy-drift.yml
  • .github/workflows/.archive/agentic-task-orchestrator.yml
  • .github/workflows/.archive/ai-assist-gates.yml
  • .github/workflows/.archive/ai-copilot-canary.yml
  • .github/workflows/.archive/ai-governance.yml
  • .github/workflows/.archive/ai-refactor-dryrun.yml
  • .github/workflows/.archive/airgap-deployment.yml
  • .github/workflows/.archive/alert-hygiene.yml
  • .github/workflows/.archive/api-determinism-check.yml
  • .github/workflows/.archive/api-docs-sync.yml
  • .github/workflows/.archive/api-docs-validation.yml
  • .github/workflows/.archive/api-docs.yml
  • .github/workflows/.archive/api-lint.yml
  • .github/workflows/.archive/archsim.yml
  • .github/workflows/.archive/audit-artifacts.yml
  • .github/workflows/.archive/audit-branch-protections.yml
  • .github/workflows/.archive/audit-ci.yml
  • .github/workflows/.archive/audit-exception-expiry.yml
  • .github/workflows/.archive/audit.strict.nightly.yml
  • .github/workflows/.archive/auto-approve-prs.yml
  • .github/workflows/.archive/auto-draft-release.yml
  • .github/workflows/.archive/auto-enqueue.yml
  • .github/workflows/.archive/auto-fix-vulnerabilities.yml
  • .github/workflows/.archive/auto-green.yml
  • .github/workflows/.archive/auto-remediation.yml
  • .github/workflows/.archive/auto-resolve-conflicts.yml
  • .github/workflows/.archive/auto-rollback.yml
  • .github/workflows/.archive/auto-triage-blockers.yml
  • .github/workflows/.archive/automated-backups.yml
  • .github/workflows/.archive/autotriage-ci.yml
  • .github/workflows/.archive/azure-turin-v7-drift.yml
  • .github/workflows/.archive/backup-dr.yml
  • .github/workflows/.archive/backup-restore-validation.yml
  • .github/workflows/.archive/backup-verify.yml
  • .github/workflows/.archive/bidirectional-sync.yml
  • .github/workflows/.archive/branch-lifecycle.yml
  • .github/workflows/.archive/branch-protection-drift.yml
  • .github/workflows/.archive/branch-protection-reconcile.yml
  • .github/workflows/.archive/build-cache.yml
  • .github/workflows/.archive/build-images.yml
  • .github/workflows/.archive/build.yml
  • .github/workflows/.archive/ci-actionlint.yml
  • .github/workflows/.archive/ci-backbone.yml
  • .github/workflows/.archive/ci-cd.yml
  • .github/workflows/.archive/ci-comprehensive.yml
  • .github/workflows/.archive/ci-core.yml
  • .github/workflows/.archive/ci-e2e-full.yml
  • .github/workflows/.archive/ci-e2e-smoke.yml
  • .github/workflows/.archive/ci-evidence-verify.yml
  • .github/workflows/.archive/ci-governance.yml
  • .github/workflows/.archive/ci-health-monitor.yml
  • .github/workflows/.archive/ci-image.yml
  • .github/workflows/.archive/ci-intelgraph-server.yml
  • .github/workflows/.archive/ci-legacy.yml
  • .github/workflows/.archive/ci-main.yml
  • .github/workflows/.archive/ci-modernized.yml
  • .github/workflows/.archive/ci-performance-k6.yml
  • .github/workflows/.archive/ci-platform.yml
  • .github/workflows/.archive/ci-post-merge.yml
  • .github/workflows/.archive/ci-pr-gate.yml
  • .github/workflows/.archive/ci-pr.yml
  • .github/workflows/.archive/ci-preflight.yml
  • .github/workflows/.archive/ci-rdp-gates.yml
  • .github/workflows/.archive/ci-repo-hygiene.yml
  • .github/workflows/.archive/ci-runner-drift.yml
  • .github/workflows/.archive/ci-sanity.yml
  • .github/workflows/.archive/ci-security.yml
  • .github/workflows/.archive/ci-sgf.yml
  • .github/workflows/.archive/ci-sharded-example.yml
  • .github/workflows/.archive/ci-signal-gate.yml
  • .github/workflows/.archive/ci-supply-chain.yml
  • .github/workflows/.archive/ci-template-optimized.yml
  • .github/workflows/.archive/ci-test.yml
  • .github/workflows/.archive/ci-trusted.yml
  • .github/workflows/.archive/ci-workflow-diff.yml
  • .github/workflows/.archive/ci-zap.yml
  • .github/workflows/.archive/ci.pr.scoped.yml
  • .github/workflows/.archive/ci.switchboard.yml
  • .github/workflows/.archive/ci.unified.yml
  • .github/workflows/.archive/ci.yml
  • .github/workflows/.archive/ci_baseline.yml
  • .github/workflows/.archive/ci_eval.yml
  • .github/workflows/.archive/ci_governance.yml
  • .github/workflows/.archive/ci_observability.yml
  • .github/workflows/.archive/ci_perf.yml
  • .github/workflows/.archive/ci_policy.yml
  • .github/workflows/.archive/ci_provenance.yml
  • .github/workflows/.archive/ci_sdk.yml
  • .github/workflows/.archive/ci_supplychain_foundation.yml
  • .github/workflows/.archive/cicd-observer.yml
  • .github/workflows/.archive/cli.yml
  • .github/workflows/.archive/client-ci.yml
  • .github/workflows/.archive/client-typecheck.yml
  • .github/workflows/.archive/code-quality-gates.yml
  • .github/workflows/.archive/codedata.yml
  • .github/workflows/.archive/codeql-analysis.yml

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

Multiple GitHub Actions workflows were changed: many install/audit steps now use 3-attempt retry loops; several workflows added or wired a ci-metrics job; numerous action pins, pnpm setup/version changes, permission/error-handling tweaks, and a few conditional guards and artifact/name adjustments were applied.

Changes

Cohort / File(s) Summary
Retry: installs & audits
.github/workflows/_reusable-ga-readiness.yml, .github/workflows/ci-verify.yml, .github/workflows/ci.yml, .github/workflows/golden-path-e2e.yml, .github/workflows/schema-diff.yml
Replaced direct pnpm install --frozen-lockfile and some npm audit/install calls with 3-iteration retry loops (sleep 15s between attempts; exit on first success, fail after final attempt).
CI metrics integration
.github/workflows/ci-verify.yml, .github/workflows/ci.yml, .github/workflows/_reusable-ci-metrics.yml
Added ci-metrics job wired to a reusable metrics workflow; adjusted quoting/heredoc and pinned upload-artifact action commit in the reusable workflow.
pnpm setup / versioning
.github/workflows/ga-evidence.yml, .github/workflows/security-regressions.yml, .github/workflows/post-release-canary.yml, .github/workflows/golden-path-e2e.yml, .github/workflows/graph-sync.yml
Added or standardized pnpm setup steps and explicit version pins (e.g., version: 9.12.0) and consolidated with: inputs for setup steps.
Action pinning & swaps
.github/workflows/ci-actionlint.yml, .github/workflows/supply-chain-integrity.yml, .github/workflows/ci-security.yml, .github/workflows/reusable/canary-rollback.yml, .github/workflows/ci-signal-gate.yml
Pinned several actions to specific commit SHAs and replaced/swapped some actions (e.g., actionlint -> reviewdog), and updated many upload-artifact refs to a commit hash. Review for reproducibility implications.
Error-handling & permissions
.github/workflows/auto-enqueue.yml, .github/workflows/ci-signal-gate.yml
Removed checks read permission; made gh pr checks tolerant (`
Artifact naming & report logic
.github/workflows/ci-security.yml, .github/workflows/schema-diff.yml, .github/workflows/pr-quality-gate.yml
Split and renamed security-report artifacts per tool; updated artifact download patterns; schema-diff expanded PR comment generation and breaking-change gating — inspect PR-comment and breaking-change logic closely.
Conditional / small control changes
.github/workflows/subsumption-bundle-verify.yml, .github/workflows/release-policy-tests.yml, .agentic-prompts/task-11847-fix-jest-esm.md
Added file-existence guard for subsumption verification; added PyYAML install and a debug fixtures step; minor TypeScript/Jest mock quoting/style tweaks.
Formatting / minor workflow tweaks
.github/workflows/_reusable-ci-metrics.yml, .github/workflows/graph-sync.yml, .github/workflows/golden-path-e2e.yml
Quoting/formatting, heredoc delimiter changes, cron/branch filter quoting normalization, and Playwright install command adjusted to use pnpm exec.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped three times, then waited — neat,
Pipelines retry until they meet.
Metrics hum and artifacts sing,
Tests and reports — a joyous spring! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description includes a user-provided summary, PR type, detailed description with key changes, and a mermaid diagram. However, it does not follow the required template structure, particularly missing explicit Risk & Surface, Assumption Ledger, Security Impact, and Green CI Contract Checklist sections. Complete the PR description using the provided template: add Risk Level and Surface Area selections, Assumption Ledger details, Security Impact assessment, and Green CI Contract Checklist items.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main objective: adding retry logic and metrics to critical CI workflows, which aligns with the primary changes across multiple workflow files.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/ci-reliability-retry-metrics-17180110948053763867

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review
Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
Centralize retry logic into reusable action

Create a reusable GitHub composite action to encapsulate the duplicated
shell-based retry logic. This centralizes the retry mechanism, improving
maintainability and simplifying workflow files.

Examples:

.github/workflows/ci.yml [51]
      - run: for i in 1 2 3; do pnpm install --frozen-lockfile && exit 0 || sleep 15; done; exit 1
.github/workflows/ci-verify.yml [40]
        run: for i in 1 2 3; do pnpm install --frozen-lockfile && exit 0 || sleep 15; done; exit 1

Solution Walkthrough:

Before:

# In .github/workflows/ci.yml
- name: Install dependencies
  run: for i in 1 2 3; do pnpm install --frozen-lockfile && exit 0 || sleep 15; done; exit 1

# In .github/workflows/ci-verify.yml
- name: Install dependencies
  run: for i in 1 2 3; do pnpm install --frozen-lockfile && exit 0 || sleep 15; done; exit 1

# ... and in 8 other places

After:

# New file: .github/actions/retry/action.yml
name: 'Retry Step'
inputs:
  run:
    required: true
runs:
  using: "composite"
  steps:
    - shell: bash
      run: |
        for i in 1 2 3; do ${{ inputs.run }} && exit 0 || sleep 15; done; exit 1

# In all workflow files:
- name: Install dependencies
  uses: ./.github/actions/retry
  with:
    run: pnpm install --frozen-lockfile
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies significant code duplication of the retry logic across multiple workflows and proposes a robust solution using a reusable composite action, which greatly improves maintainability.

Medium
Possible issue
Fix misleading error message on audit failure

Adjust the pnpm audit retry logic to only report "Critical CVEs detected" if the
command fails after all retries, preventing misleading error messages from
transient network issues.

.github/workflows/ci-verify.yml [42-49]

 - name: Dependency audit (critical CVEs only)
   run: |
     echo "=== Scanning for Critical CVEs ==="
-    for i in 1 2 3; do pnpm audit --audit-level critical && exit 0 || { echo "Retry $i..."; sleep 15; }; done
-    echo "::error::Critical CVEs detected - CI BLOCKED"
-    echo "::error::Run 'pnpm audit' locally and document exceptions if needed"
-    exit 1
+    if ! (for i in 1 2 3; do pnpm audit --audit-level critical && exit 0 || { echo "Retry $i..."; sleep 15; }; done; exit 1); then
+      echo "::error::Critical CVEs detected - CI BLOCKED"
+      echo "::error::Run 'pnpm audit' locally and document exceptions if needed"
+      exit 1
+    fi
   continue-on-error: false # BLOCKING: critical CVEs block merge
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a logic flaw where a transient error would be misreported as a security vulnerability, improving the accuracy and reliability of the CI feedback.

Medium
General
Use exponential backoff for retries

Implement exponential backoff in the pnpm install retry logic, increasing the
wait time after each failed attempt to better handle transient network issues.

.github/workflows/ci-verify.yml [40]

-for i in 1 2 3; do pnpm install --frozen-lockfile && exit 0 || sleep 15; done; exit 1
+for i in 1 2 3; do
+  echo "Attempt $i/3: pnpm install --frozen-lockfile"
+  pnpm install --frozen-lockfile && exit 0
+  backoff=$((15 * 2**(i-1)))
+  echo "Install failed on attempt $i, retrying in ${backoff}s..."
+  sleep "${backoff}"
+done
+exit 1

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 5

__

Why: The suggestion proposes using exponential backoff, which is a standard best practice for retry logic to handle transient failures more gracefully and reduce server load.

Low
Add logging to retry loops

Add logging to the pnpm install retry loop to show the attempt number and a
failure message, which will aid in debugging.

.github/workflows/_reusable-ga-readiness.yml [92]

-for i in 1 2 3; do pnpm install --frozen-lockfile && exit 0 || sleep 15; done; exit 1
+for i in 1 2 3; do
+  echo "Attempt $i/3: pnpm install --frozen-lockfile"
+  pnpm install --frozen-lockfile && exit 0
+  echo "Install failed on attempt $i, retrying in 15s..."
+  sleep 15
+done
+exit 1

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 4

__

Why: The suggestion improves debuggability by adding explicit logging for each retry attempt, which is helpful for diagnosing transient CI failures, though it is a minor enhancement.

Low
  • More

Copy link
Copy Markdown
Owner Author

@BrianCLong BrianCLong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed. Retry loops for pnpm install/audit and new ci-metrics job look fine. Confirmed _reusable-ci-metrics.yml exists. Ready for human approval.

BrianCLong and others added 6 commits February 3, 2026 11:17
- Fix: Swap `pnpm/action-setup` and `actions/setup-node` order in `golden-path-e2e.yml` to support caching.
- Feat: Add retry logic (3 attempts, 15s delay) to `pnpm install` in `ci.yml`, `ci-verify.yml`, `golden-path-e2e.yml`, and `_reusable-ga-readiness.yml`.
- Feat: Add retry logic to `pnpm/npm audit` steps.
- Feat: Add `ci-metrics` job to `ci.yml` and `ci-verify.yml` to track runner performance.

Co-authored-by: BrianCLong <6404035+BrianCLong@users.noreply.github.com>
… verification

- Fix: Swap `pnpm/action-setup` and `actions/setup-node` order in `golden-path-e2e.yml` to support caching.
- Feat: Add retry logic (3 attempts, 15s delay) to `pnpm install` in `ci.yml`, `ci-verify.yml`, `golden-path-e2e.yml`, and `_reusable-ga-readiness.yml`.
- Feat: Add retry logic to `pnpm/npm audit` steps.
- Feat: Add `ci-metrics` job to `ci.yml` and `ci-verify.yml` utilizing `_reusable-ci-metrics.yml`.
- Fix: Add missing `.github/workflows/_reusable-ci-metrics.yml` file.
- Fix: Update `scripts/verify_evidence.py` to ignore `ga`, `bundles`, and `ai-influence-ops` directories to prevent false positives in evidence verification.

Co-authored-by: BrianCLong <6404035+BrianCLong@users.noreply.github.com>
Copy link
Copy Markdown
Collaborator

@TopicalitySummit TopicalitySummit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Bulk approval phase.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
.agentic-prompts/task-11847-fix-jest-esm.md (1)

1-297: ⚠️ Potential issue | 🟡 Minor

Clarify why this documentation file is included in a metrics/retry logic PR.

This file documents the Jest ESM configuration task (#11847), which is an active infrastructure concern in the repo (referenced in ci-legacy.yml and validated in CI workflows). However, the PR objectives focus on CI retry logic and metrics collection (#17561).

The modifications here are minimal formatting changes (primarily quote style in code examples) that don't add meaningful value to the documented task. If this file's inclusion is intentional—perhaps as part of a broader testing infrastructure update—explain the relationship to the PR objectives. Otherwise, consider removing it to keep the PR focused.

.github/workflows/auto-enqueue.yml (1)

32-36: ⚠️ Potential issue | 🟡 Minor

checks variable is captured but never used.

The checks variable is assigned on line 32 but not referenced in the conditional on line 34. This appears to be either dead code or an incomplete implementation where required checks were intended to gate the enqueue.

If the intent is to verify required checks pass before enqueueing:

-         if echo "$labels" | grep -q "queue:ready" && [ "$approvals" -ge 1 ]; then
+         # Verify all required checks passed (no "fail" or "pending" in output)
+         if echo "$labels" | grep -q "queue:ready" && [ "$approvals" -ge 1 ] && ! echo "$checks" | grep -qE 'fail|pending'; then

If checks verification is not needed, remove the unused variable to avoid confusion.

.github/workflows/_reusable-ci-metrics.yml (1)

52-52: ⚠️ Potential issue | 🟠 Major

Job output artifact_name will always be empty due to step configuration error.

The job output at line 52 references steps.upload.outputs.artifact_name, but:

  1. The upload step (lines 164-170) uses actions/upload-artifact which does not output artifact_name — it outputs artifact-id and artifact-url
  2. The step at lines 172-174 writes artifact_name to $GITHUB_OUTPUT but lacks an id, making it inaccessible

Consumers of outputs.metrics_artifact_name (line 29-31) will receive an empty value.

🐛 Proposed fix: merge the output into the upload step or add an id

Option 1: Add id to the output step and fix the job output reference:

      - name: Output Artifact Name
+       id: artifact-name
        run: |
          echo "artifact_name=ci-metrics-${{ github.run_id }}-${{ github.run_attempt }}" >> $GITHUB_OUTPUT

And update line 52:

-     artifact_name: ${{ steps.upload.outputs.artifact_name }}
+     artifact_name: ${{ steps.artifact-name.outputs.artifact_name }}

Option 2: Remove the extra step and set artifact_name directly in the metrics step:

      - name: Collect Workflow Metrics
        id: metrics
        ...
        run: |
          ...
          echo "duration_minutes=${DURATION_MINUTES}" >> $GITHUB_OUTPUT
          echo "success_rate=${SUCCESS_RATE}" >> $GITHUB_OUTPUT
+         echo "artifact_name=ci-metrics-${{ github.run_id }}-${{ github.run_attempt }}" >> $GITHUB_OUTPUT

And update line 52:

-     artifact_name: ${{ steps.upload.outputs.artifact_name }}
+     artifact_name: ${{ steps.metrics.outputs.artifact_name }}

Also applies to: 164-174

.github/workflows/supply-chain-integrity.yml (1)

45-54: ⚠️ Potential issue | 🔴 Critical

Correct the pinned SHAs: they do not match the claimed versions.

The workflow pins to incorrect commit SHAs. Verification against official GitHub releases shows critical mismatches:

  • actions/checkout: Pinned SHA 34e114876b0b11c390a56381ad16ebd13914f8d5 does not match v4.1.7's actual SHA 6ccd57f4c5...
  • actions/setup-node: Pinned SHA 65d868f8d4d85d7d4abb7de0875cde3fcc8798f5 does not match v4.0.3's actual SHA 1e60f620b9541d16bece96c5465dc8ee9832be0b
  • actions/upload-artifact: Pinned SHA b7c566a772e6b6bfb58ed0dc250532a479d7789f does not match v4.3.3's actual SHA 65462800fd760344b1a7b4382951275a0abb4808 (lines 126, 134, 217, 226, 235)

These mismatches mean the locked commits do not correspond to the tagged versions, undermining the supply chain security goal. Update all pinned SHAs to the correct commit hashes for their claimed versions.

🧹 Nitpick comments (1)
.github/workflows/supply-chain-integrity.yml (1)

44-47: Clarify the comment: SHA pins are immutable, not branch tips.

The comment "(pinned to v4 branch tip)" is misleading. A SHA pin is a fixed, immutable reference to a specific commit—not a branch tip, which moves over time. Consider simplifying to just the version number for consistency with other pinned actions in this file.

📝 Suggested comment fix
       - name: Checkout code
-        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.1.7 (pinned to v4 branch tip)
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.1.7

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/pr-quality-gate.yml (1)

81-94: ⚠️ Potential issue | 🔴 Critical

SBOM file paths are inconsistent — upload will fail to find the generated file.

The generate-sbom.sh script outputs files matching the pattern {ARTIFACT_NAME}-{service_name}-{VERSION}.cdx.json. With the parameters passed (summit-platform ci-build artifacts/sbom), it generates files like summit-platform-main-ci-build.cdx.json.

However, the upload step references artifacts/sbom/sbom.json, which generate-sbom.sh never creates. This causes:

  • Upload step will fail or silently skip (actions/upload-artifact@v4 doesn't upload missing paths by default)
  • The SBOM artifact is never retained

Additionally, the hardcoded summit-platform-main-ci-build.cdx.json in the policy-check env var assumes a plain Dockerfile or Dockerfile.main exists in the repo. If the repo structure differs, the SBOM file won't be found and the check will warn.

Fix: Either update the upload path to match what generate-sbom.sh actually produces (e.g., artifacts/sbom/summit-platform-*-ci-build.cdx.json), or modify generate-sbom.sh to also output or symlink to sbom.json. Also verify the Dockerfile naming convention matches the hardcoded main service name, or make SBOM_FILE dynamic.

🤖 Fix all issues with AI agents
In @.github/workflows/ci-security.yml:
- Line 83: Update the inline version comment on every "uses:
actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f" occurrence (12
places) from "# v4.1.0" to "# v6.0.0"; then verify the GitHub-hosted
ubuntu-22.04 runners meet the minimum runner version requirement (>= 2.327.1)
for upload-artifact v6.0.0 and confirm there are no incompatibilities between
upload-artifact v6.0.0 and the pinned "download-artifact" action (v4.1.8)
referenced elsewhere.

In @.github/workflows/ci-verify.yml:
- Around line 312-316: The ga-evidence-completeness GitHub Actions job currently
sets cache: 'pnpm' while using actions/setup-node but never configures pnpm
(missing pnpm/action-setup) and never runs pnpm install; fix by either removing
the pnpm cache entry from the ga-evidence-completeness job (preferred since it
doesn't call pnpm) or, if pnpm is actually needed, add a pnpm/action-setup step
before actions/setup-node so the pnpm store path can be resolved; update the job
definition around the cache: 'pnpm' and actions/setup-node entries accordingly.
- Around line 147-151: The mcp-ux-lint and ga-evidence-completeness jobs use
actions/setup-node@v4 with cache: 'pnpm' but are missing the pnpm/action-setup
step; add a step using pnpm/action-setup (e.g., uses: pnpm/action-setup@v2)
immediately before the actions/setup-node@v4 step in both the mcp-ux-lint and
ga-evidence-completeness job definitions so pnpm is installed before the node
setup and pnpm cache action runs.

In @.github/workflows/mvp4-gate.yml:
- Line 59: The comment on the setup-node pin "uses:
actions/setup-node@6044e13b5dc448c55e2357c09f80417699197238" is wrong (it says
"# v6"); update that inline comment to the correct release label used for that
commit (e.g., "# v4" or "# v4.0.3") and decide on a consistent pinning strategy
across workflows (either change the other occurrences in build-lint-strict and
quarantine-tests to the same commit hash or make them all use the same tag like
"@v4") so all three setup-node references use the same strategy for consistency
and reproducibility.

In @.github/workflows/release-policy-tests.yml:
- Around line 61-62: The CI contains a temporary "Debug Fixtures" step that runs
ls -R on scripts/release/tests/fixtures and can fail the job if the directory is
missing; either remove the "Debug Fixtures" step entirely (if it was for
temporary debugging) or make it non-fatal by guarding the command so it only
runs when the directory exists or by marking the step as non-fatal (e.g., use a
test like check for directory existence before listing, or set the step to
continue-on-error) — edit the step named "Debug Fixtures" to implement one of
these fixes.
- Around line 58-59: Replace the loose "pip install PyYAML" invocation with a
pinned version and retry logic: change the "pip install PyYAML" command to
install a specific version (e.g., PyYAML==6.0 to match other workflows) and wrap
it with the same retry/backoff mechanism used in ci-verify.yml so the job
retries transient download failures; locate the line containing "pip install
PyYAML" and update it to use the pinned version and the repository's standard
retry pattern.

In @.github/workflows/schema-diff.yml:
- Around line 59-62: The "Install dependencies" step uses plain "pnpm install"
and lacks retry logic and the --frozen-lockfile flag; replace that run block so
the workflow retries the install up to 3 times with a 15s delay and runs "pnpm
install --frozen-lockfile" (and then "pnpm add -g ts-node typescript") inside
the retry loop, mirroring the retry behavior and flag used in ci.yml /
ci-verify.yml / _reusable-ga-readiness.yml to prevent lockfile changes and
transient network failures.
🧹 Nitpick comments (1)
.github/workflows/schema-diff.yml (1)

52-52: Inconsistent action pinning: pnpm/action-setup@v4 uses a mutable tag.

Other actions in this workflow are pinned to commit SHAs (e.g., upload-artifact, download-artifact, github-script), but pnpm/action-setup uses a mutable @v4 tag. For supply-chain consistency, consider pinning to a specific commit hash here as well.

@BrianCLong BrianCLong force-pushed the feat/ci-reliability-retry-metrics-17180110948053763867 branch from d81508e to 06b6595 Compare February 8, 2026 12:49
BrianCLong added a commit that referenced this pull request Feb 22, 2026
Copy link
Copy Markdown
Collaborator

@TopicalitySummit TopicalitySummit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Governed exception sweep: approved for merge queue progression.

@BrianCLong BrianCLong force-pushed the feat/ci-reliability-retry-metrics-17180110948053763867 branch from 85aa24e to bebef8a Compare March 8, 2026 15:20
@BrianCLong
Copy link
Copy Markdown
Owner Author

Temporarily closing to reduce Actions queue saturation and unblock #22241. Reopen after the golden-main convergence PR merges.

1 similar comment
@BrianCLong
Copy link
Copy Markdown
Owner Author

Temporarily closing to reduce Actions queue saturation and unblock #22241. Reopen after the golden-main convergence PR merges.

@BrianCLong BrianCLong closed this Mar 30, 2026
auto-merge was automatically disabled March 30, 2026 14:02

Pull request was closed

@google-labs-jules
Copy link
Copy Markdown
Contributor

Temporarily closing to reduce Actions queue saturation and unblock #22241. Reopen after the golden-main convergence PR merges.

Understood. Acknowledging that this work is temporarily closed to unblock the queue, and I will stop work on this task until it's reopened.

1 similar comment
@google-labs-jules
Copy link
Copy Markdown
Contributor

Temporarily closing to reduce Actions queue saturation and unblock #22241. Reopen after the golden-main convergence PR merges.

Understood. Acknowledging that this work is temporarily closed to unblock the queue, and I will stop work on this task until it's reopened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants