fix: harden AWS release deployment workflow for go-live execution#22206
BrianCLong wants to merge 3 commits into main from
Conversation
Note: Gemini is unable to generate a review for this pull request because the file types involved are not currently supported.
Walkthrough

The GitHub Actions workflow for AWS deployment was restructured: renamed to reflect release-based triggers, changed from push-on-main to release events with optional inputs, replaced the reusable pre-deploy gate with inline checks, and enhanced the infrastructure and Kubernetes deployment stages with conditional execution, improved credential handling, and expanded deployment verification.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c28137fe9
.github/workflows/deploy-aws.yml
Outdated
```diff
-helm upgrade --install maestro charts/universal-app \
-  --set image.repository=$ECR_REGISTRY/summit/maestro \
-  --set image.tag=${{ github.sha }} \
-  --set service.targetPort=8001 \
-  --set ingress.enabled=true \
-  --set ingress.hosts[0].host=api.summit.internal \
-  --set ingress.hosts[0].paths[0].path=/maestro \
-  --set ingress.hosts[0].paths[0].pathType=Prefix \
-  --namespace default
 # Deploy Prov Ledger
+  --namespace "$K8S_NAMESPACE" \
+  --set image.repository="$ECR_REGISTRY/summit/maestro" \
+  --set image.tag="$RELEASE_TAG" \
+  --set service.targetPort=8001
```
Preserve Maestro ingress overrides during Helm upgrade
This Helm invocation no longer sets ingress.enabled=true (or the hosts/paths overrides) for maestro; because charts/universal-app/values.yaml defaults ingress.enabled to false, running this workflow will render the ingress manifest absent and Helm will remove the existing ingress on upgrade. In environments that rely on the /maestro ingress route, this change causes an external availability outage immediately after deployment.
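A minimal way to address this is to re-add the ingress overrides to the new invocation. A sketch, assuming the chart's standard ingress value layout; the host/path values mirror the lines that were removed:

```yaml
# Sketch: keep the new namespaced/quoted flags, but restore the ingress
# overrides that the previous invocation set explicitly.
- name: Deploy maestro
  run: |
    helm upgrade --install maestro charts/universal-app \
      --namespace "$K8S_NAMESPACE" \
      --set image.repository="$ECR_REGISTRY/summit/maestro" \
      --set image.tag="$RELEASE_TAG" \
      --set service.targetPort=8001 \
      --set ingress.enabled=true \
      --set "ingress.hosts[0].host=api.summit.internal" \
      --set "ingress.hosts[0].paths[0].path=/maestro" \
      --set "ingress.hosts[0].paths[0].pathType=Prefix"
```

The bracketed `--set` values are quoted so the shell does not glob-expand `[0]`; this is a sketch against the flags shown in the diff, not the final workflow step.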
```yaml
- name: Upload governance evidence
  uses: actions/upload-artifact@v4
  with:
    name: aws-deploy-evidence-${{ github.run_id }}
    path: evidence-artifacts/
```
Upload deployment evidence even when rollout fails
The artifact upload runs only on success by default, so any failure in rollout verification or verify-deployment.sh prevents evidence-artifacts/ from being published. That removes the caller/cluster evidence needed for incident triage and compliance in exactly the failure scenarios where it is most valuable; adding if: always() (or moving upload earlier) keeps evidence available for failed runs.
Actionable comments posted: 4
🧹 Nitpick comments (2)
.github/workflows/deploy-aws.yml (2)
199-204: Always upload the evidence collected for failed deploys.

If rollout verification or the smoke check fails, this step is skipped and the caller/cluster snapshots are lost for the exact attempt you need to debug. Make the upload unconditional, and consider including `github.run_attempt` so reruns keep distinct evidence.

Suggested change:
```diff
   - name: Upload governance evidence
+    if: ${{ always() }}
     uses: actions/upload-artifact@v4
     with:
-      name: aws-deploy-evidence-${{ github.run_id }}
+      name: aws-deploy-evidence-${{ github.run_id }}-${{ github.run_attempt }}
       path: evidence-artifacts/
       retention-days: 30
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/deploy-aws.yml around lines 199 - 204, The "Upload governance evidence" workflow step currently gets skipped on failed deploys; make the artifact upload unconditional by removing any surrounding if: or conditional job/step guards so the actions/upload-artifact@v4 step always runs, and update the artifact name expression to include the run attempt (e.g., aws-deploy-evidence-${{ github.run_id }}-attempt-${{ github.run_attempt }}) so reruns produce distinct evidence; keep the same path (evidence-artifacts/) and retention-days: 30.
122-136: Make Terraform consume the same region the rest of the workflow uses.

The AWS CLI/actions read `AWS_REGION`, but `terraform/environments/prod/main.tf` is wired to `var.aws_region`. If prod is not already setting that variable via tfvars, this job is either non-deterministic or will fail under `-input=false`. Export `TF_VAR_aws_region` here so the provider uses the same region as the rest of the deploy.

Suggested change:
```diff
   - name: Terraform apply
+    env:
+      TF_VAR_aws_region: ${{ env.AWS_REGION }}
     working-directory: terraform/environments/prod
     run: |
       set -euo pipefail
       terraform init -input=false
       terraform apply -auto-approve -input=false
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/deploy-aws.yml around lines 122 - 136, The Terraform step is not receiving the workflow AWS region, causing indeterministic runs; set the TF_VAR_aws_region environment variable from the workflow's AWS_REGION so Terraform's var.aws_region matches the actions' region. Update the "Terraform apply" step (or its environment for that job) to export or define TF_VAR_aws_region="${{ env.AWS_REGION }}" before running terraform init/apply so the provider uses the same region as the aws-actions/configure-aws-credentials step.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/deploy-aws.yml:
- Around line 19-20: The concurrency group currently uses a dynamic key ("group:
deploy-aws-${{ github.ref_name || github.sha }}") which allows separate runs to
race; change the concurrency group to a fixed name (for example
"deploy-aws-production") so all production deployments serialize into the same
group, keeping or adjusting cancel-in-progress as desired; update the
concurrency block that defines group/cancel-in-progress to use the fixed string
instead of the expression.
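The fix described above is a one-line change to the concurrency block. A sketch, assuming the workflow should queue (not cancel) in-flight production deploys; the `cancel-in-progress` value is an assumption, not taken from the PR:

```yaml
# Sketch: serialize all production deploys into one fixed concurrency group
# instead of keying the group on the triggering ref/sha.
concurrency:
  group: deploy-aws-production
  cancel-in-progress: false  # assumption: queue rather than cancel running deploys
```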
- Around line 6-16: The workflow currently rebuilds and repushes unconditionally
when workflow_dispatch.image_tag is supplied; change the build-and-push logic so
manual runs that specify image_tag do not overwrite existing tags: modify the
build/push job or steps (the steps that perform the build and ECR push) to run
only when no image_tag was passed (e.g., add an if condition checking
github.event.inputs.image_tag is empty) or, if you must allow a push, first
validate the checked-out ref matches the requested tag (use git to resolve the
ref/tag and compare to github.event.inputs.image_tag) and fail if they differ;
update the conditional on the build/push steps or job to use github.event_name
== 'workflow_dispatch' && github.event.inputs.image_tag == '' (or the equivalent
validation check) so manual deploys with image_tag run in deploy-only mode or
safely validate before pushing.
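The simpler of the two options suggested above (skip the rebuild when a tag is pinned) could look like the following sketch; the step name and build commands are illustrative, not the actual workflow:

```yaml
# Sketch: run the build/push only when no image_tag input was supplied,
# so manual deploys that pin an existing tag run in deploy-only mode.
# On release events github.event.inputs is empty, so the build still runs.
- name: Build and push image
  if: ${{ github.event.inputs.image_tag == '' }}
  run: |
    docker build -t "$ECR_REGISTRY/summit/${{ matrix.service }}:$RELEASE_TAG" "${{ matrix.service }}"
    docker push "$ECR_REGISTRY/summit/${{ matrix.service }}:$RELEASE_TAG"
```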
- Around line 168-196: The rollout checks fail because the chart's templates
render deployments as "<release>-universal-app" while the later kubectl checks
expect "<release>" names; update each Helm invocation (the helm upgrade
--install calls for maestro, prov-ledger, and policy-lac) to set the release
name as the intended deployment basename by adding --set
fullnameOverride=<release-name> (i.e., --set fullnameOverride=maestro for the
maestro release, --set fullnameOverride=prov-ledger for prov-ledger, and --set
fullnameOverride=policy-lac for policy-lac) so the generated Deployment names
match the kubectl rollout status checks.
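Applied to the maestro release, the override suggested above would look like the sketch below; the flags other than `fullnameOverride` mirror the ones already shown in the diff:

```yaml
# Sketch: pin the rendered Deployment name to "maestro" so a later
# `kubectl rollout status deployment/maestro` check finds it.
- name: Deploy maestro
  run: |
    helm upgrade --install maestro charts/universal-app \
      --namespace "$K8S_NAMESPACE" \
      --set fullnameOverride=maestro \
      --set image.repository="$ECR_REGISTRY/summit/maestro" \
      --set image.tag="$RELEASE_TAG" \
      --set service.targetPort=8001
```

The same `fullnameOverride` flag would be repeated with `prov-ledger` and `policy-lac` for those releases.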
- Around line 88-97: The Dependency audit step currently runs safety without
using maestro's pinned deps; update the "Dependency audit" step so when
matrix.service == "maestro" you either install maestro's locked deps (e.g., run
python -m pip install --upgrade pip && pip install -r maestro/requirements.txt)
before invoking safety or call safety against the manifest (e.g., safety check
-r maestro/requirements.txt) so safety actually scans maestro's dependency
manifest instead of the environment's default packages.
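A sketch of the adjusted audit step, assuming the matrix layout described above (a `matrix.service` entry named `maestro` with a `maestro/requirements.txt` manifest):

```yaml
# Sketch: point safety at maestro's pinned manifest instead of whatever
# packages happen to be installed on the runner.
- name: Dependency audit (Python)
  if: ${{ matrix.service == 'maestro' }}
  run: |
    python -m pip install --upgrade pip safety
    safety check -r maestro/requirements.txt
```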
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b153727a-feb9-4764-aac2-edc9acfc75f6
📒 Files selected for processing (1)
.github/workflows/deploy-aws.yml
Queue note: branch-local workflow issues have been repaired and the AWS deploy workflow was tightened in
Closing in favor of #22241 for the active governed surface. The remaining non-main delta only targeted the archived deploy workflow path and was intentionally excluded from the convergence train. |
Motivation

- Trigger deployments from releases and `workflow_dispatch` so go-live promotions are auditable and repeatable.

Description
- Reworked `.github/workflows/deploy-aws.yml` to trigger on `release.published` and `workflow_dispatch` with inputs `image_tag` and `apply_infra`, and normalized concurrency and env vars such as `ECR_REGISTRY`, `EKS_CLUSTER`, `K8S_NAMESPACE`, and `AWS_ROLE_NAME`.
- Added a `preflight` job that validates presence of `secrets.AWS_ACCOUNT_ID`, `terraform/environments/prod/main.tf`, `charts/universal-app/Chart.yaml`, and `scripts/verify-deployment.sh`, to fail early when prerequisites are missing.
- Rebuilt `build-and-push` with explicit `pnpm`/`node` setup, a per-service dependency audit (`safety` for Python and `pnpm audit` for Node), consistent ECR tag resolution (`RELEASE_TAG`), and robust `docker build`/`push` semantics for `maestro`, `prov-ledger`, and `policy-lac`.
- Added `deploy-infra`, controlled by `apply_infra`, to run Terraform in `terraform/environments/prod`, and enhanced `deploy-k8s` to configure Helm/kubectl, update kubeconfig, deploy workloads using the `RELEASE_TAG`, verify rollouts with `kubectl rollout status`, run `./scripts/verify-deployment.sh`, and upload governance evidence artifacts.

Testing
- Parsed the workflow with `yaml.safe_load`, which succeeded (YAML_OK).
- Verified existence of `terraform/environments/prod/main.tf`, `charts/universal-app/Chart.yaml`, and `scripts/verify-deployment.sh` (all present).
- Attempted `actionlint .github/workflows/deploy-aws.yml`, but `actionlint` is not installed in this environment, so static GitHub Actions linting was not performed (recommend running `actionlint` in CI or locally before merge).

```json
{"agent":"codex","change_type":"patch","affected_paths":[".github/workflows/deploy-aws.yml"],"ci_checks":["yaml-parse","file-existence","actionlint:skipped"],"rollback":"re-run workflow with previous image_tag or use helm rollback for affected releases","post_deploy_watch":"60m"}
```

Codex Task
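As context for the rollback note above ("re-run workflow with previous image_tag"), the tag-resolution fallback the workflow relies on can be sketched as follows; the variable names are illustrative, and in the real workflow the values would come from `${{ inputs.image_tag }}` and `${{ github.event.release.tag_name }}`:

```shell
# Sketch: prefer an explicit image_tag input, fall back to the release tag.
IMAGE_TAG_INPUT=""           # would be ${{ inputs.image_tag }} on a manual run
RELEASE_TAG_NAME="v1.2.3"    # would be ${{ github.event.release.tag_name }}
RELEASE_TAG="${IMAGE_TAG_INPUT:-$RELEASE_TAG_NAME}"
echo "RELEASE_TAG=$RELEASE_TAG"
```

Re-running with a previous `image_tag` simply makes the first branch win, so the deploy jobs roll the cluster back to the older image.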