
fix: harden AWS release deployment workflow for go-live execution#22206

Closed
BrianCLong wants to merge 3 commits into main from codex/deploy-release-on-aws

Conversation

@BrianCLong
Owner

@BrianCLong BrianCLong commented Mar 29, 2026

Motivation

  • Make the AWS release path deterministic and safe to run from a published release or manually via workflow_dispatch so go-live promotions are auditable and repeatable.
  • Fail fast on missing deployment prerequisites and capture governance evidence for traceability during production deploys.
  • Reduce operational risk by consolidating build/push, optional infra apply, and k8s rollout verification into a single hardened workflow.

Description

  • Reworked .github/workflows/deploy-aws.yml to trigger on release.published and workflow_dispatch with inputs image_tag and apply_infra and normalized concurrency and env vars such as ECR_REGISTRY, EKS_CLUSTER, K8S_NAMESPACE, and AWS_ROLE_NAME.
  • Added a preflight job that validates presence of secrets.AWS_ACCOUNT_ID, terraform/environments/prod/main.tf, charts/universal-app/Chart.yaml, and scripts/verify-deployment.sh to fail early when prerequisites are missing.
  • Hardened build-and-push with explicit pnpm/node setup, dependency audit per-service (safety for Python and pnpm audit for Node), consistent ECR tag resolution (RELEASE_TAG), and robust docker build/push semantics for maestro, prov-ledger, and policy-lac.
  • Added optional deploy-infra controlled by apply_infra to run Terraform in terraform/environments/prod, and enhanced deploy-k8s to configure Helm/kubectl, update kubeconfig, deploy workloads using the RELEASE_TAG, verify rollouts with kubectl rollout status, run ./scripts/verify-deployment.sh, and upload governance evidence artifacts.
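
The trigger, input, and environment surface described above can be sketched as a rough YAML fragment. Everything beyond the names listed in these bullets (the region, cluster name, role name, and default values) is an assumption, not copied from the actual workflow:

```yaml
name: Deploy AWS Release

on:
  release:
    types: [published]
  workflow_dispatch:
    inputs:
      image_tag:
        description: Image tag to deploy (defaults to the release tag)
        required: false
        type: string
      apply_infra:
        description: Run Terraform in terraform/environments/prod before deploying
        type: boolean
        default: false

# Serialize production deploys instead of letting concurrent runs race.
concurrency:
  group: deploy-aws-production
  cancel-in-progress: false

env:
  AWS_REGION: us-east-1                                    # assumed
  ECR_REGISTRY: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.us-east-1.amazonaws.com  # assumed layout
  EKS_CLUSTER: summit-prod                                 # assumed
  K8S_NAMESPACE: default
  AWS_ROLE_NAME: github-actions-deploy                     # assumed
```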

Testing

  • Parsed the modified workflow YAML using PyYAML with yaml.safe_load which succeeded (YAML_OK).
  • Verified required referenced files exist via automated checks: terraform/environments/prod/main.tf, charts/universal-app/Chart.yaml, and scripts/verify-deployment.sh (all present).
  • Attempted to run actionlint .github/workflows/deploy-aws.yml but actionlint is not installed in this environment so static GitHub Actions linting was not performed (recommend running actionlint in CI or locally before merge).
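
The parse and file-existence checks above can be reproduced locally with a short script. This is a minimal sketch: the required paths come from this PR's description, while the function name and return shape are illustrative. PyYAML is treated as optional so the file checks still run without it.

```python
# Minimal reproduction of the checks described above: confirm the files the
# workflow references exist, and (when PyYAML is available) that the workflow
# parses with yaml.safe_load. Paths are taken from this PR's description.
from pathlib import Path

try:
    import yaml  # PyYAML; third-party, so treated as optional here
except ImportError:
    yaml = None

REQUIRED_FILES = [
    "terraform/environments/prod/main.tf",
    "charts/universal-app/Chart.yaml",
    "scripts/verify-deployment.sh",
]
WORKFLOW = ".github/workflows/deploy-aws.yml"


def check_workflow(repo_root: str) -> dict:
    """Return which required files are missing and whether the workflow parses."""
    root = Path(repo_root)
    missing = [p for p in REQUIRED_FILES + [WORKFLOW] if not (root / p).is_file()]
    yaml_ok = None  # unknown until the workflow can be read and parsed
    wf = root / WORKFLOW
    if yaml is not None and wf.is_file():
        try:
            yaml.safe_load(wf.read_text())
            yaml_ok = True
        except yaml.YAMLError:
            yaml_ok = False
    return {"missing": missing, "yaml_ok": yaml_ok}
```

Running `check_workflow(".")` from the repository root should report an empty `missing` list and `yaml_ok` as `True` if the branch is in the state this PR describes.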

{"agent":"codex","change_type":"patch","affected_paths":[".github/workflows/deploy-aws.yml"],"ci_checks":["yaml-parse","file-existence","actionlint:skipped"],"rollback":"re-run workflow with previous image_tag or use helm rollback for affected releases","post_deploy_watch":"60m"}


Codex Task

Summary by CodeRabbit

  • Chores
    • Deployment workflow now triggers on releases instead of automatic pushes to main branch.
    • Added manual deployment option with configurable image tag and infrastructure update settings.
    • Enhanced pre-deployment validation and Kubernetes rollout verification processes.
    • Updated compliance and governance evidence tracking.

@BrianCLong added the codex (Codex-owned implementation work) label on Mar 29, 2026 via ChatGPT Codex Connector
@gemini-code-assist
Contributor

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@coderabbitai

coderabbitai bot commented Mar 29, 2026

Warning

Rate limit exceeded

@BrianCLong has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 31 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 3 minutes and 31 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 69eb6520-94cf-4e8b-896d-db8f22998f2e

📥 Commits

Reviewing files that changed from the base of the PR and between 3c28137 and aeed8ad.

📒 Files selected for processing (2)
  • .github/workflows/deploy-aws.yml
  • docker-compose.dev.yaml

Walkthrough

The GitHub Actions workflow for AWS deployment was restructured: renamed to reflect release-based triggers, changed from push-on-main to release events with optional inputs, replaced reusable pre-deploy gate with inline checks, and enhanced infrastructure and Kubernetes deployment stages with conditional execution, improved credential handling, and expanded deployment verification.

Changes

Cohort / File(s) Summary
AWS Deployment Workflow
.github/workflows/deploy-aws.yml
Renamed workflow, changed event triggers from push to release with workflow_dispatch inputs. Replaced reusable pre-deploy gate with inline preflight job, removed Trivy scan. Refactored image tagging from github.sha to computed RELEASE_TAG. Added conditional deploy-infra job with Terraform deployment when apply_infra == true. Updated Kubernetes deployment to use Helm/kubectl setup, explicit kubeconfig generation, and parameterized image tag/namespace. Expanded rollout verification for additional deployments. Updated governance evidence artifact naming and retention policy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hoppy Deployment Days

Releases trigger, no more main push,
Preflight gates in a single rush,
Terraform builds when the choice is made,
Kubernetes rolls with the RELEASE TAG parade,
Evidence logged, safely stored for days!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): The description is comprehensive and covers motivation, implementation details, and testing approach. However, it is missing several required template sections, including Risk & Surface Area, Assumption Ledger, Execution Governor & Customer Impact, Green CI Contract Checklist, and Investigation Trust Doctrine Checklist. Resolution: select Risk Level and Surface Area, fill in the Assumption Ledger, specify Customer Impact and Rollback Plan, and complete the Green CI Contract and Investigation Trust Doctrine checklists.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title accurately describes the main change, hardening the AWS release deployment workflow for go-live execution, which aligns with the PR's core objective of making deployments deterministic and safe.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c28137fe9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +173 to +177

The previous invocation (removed by this PR):

```shell
helm upgrade --install maestro charts/universal-app \
  --set image.repository=$ECR_REGISTRY/summit/maestro \
  --set image.tag=${{ github.sha }} \
  --set service.targetPort=8001 \
  --set ingress.enabled=true \
  --set ingress.hosts[0].host=api.summit.internal \
  --set ingress.hosts[0].paths[0].path=/maestro \
  --set ingress.hosts[0].paths[0].pathType=Prefix \
  --namespace default

# Deploy Prov Ledger
```

The new invocation:

```shell
helm upgrade --install maestro charts/universal-app \
  --namespace "$K8S_NAMESPACE" \
  --set image.repository="$ECR_REGISTRY/summit/maestro" \
  --set image.tag="$RELEASE_TAG" \
  --set service.targetPort=8001
```


P1: Preserve Maestro ingress overrides during Helm upgrade

This Helm invocation no longer sets ingress.enabled=true (or the hosts/paths overrides) for maestro; because charts/universal-app/values.yaml defaults ingress.enabled to false, running this workflow will render the ingress manifest absent and Helm will remove the existing ingress on upgrade. In environments that rely on the /maestro ingress route, this change causes an external availability outage immediately after deployment.

Useful? React with 👍 / 👎.
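
One way to address this finding, following the comment, is to carry the ingress overrides forward into the new invocation. A sketch, with the host and path values taken from the previous workflow and the other flags from the new one:

```shell
helm upgrade --install maestro charts/universal-app \
  --namespace "$K8S_NAMESPACE" \
  --set image.repository="$ECR_REGISTRY/summit/maestro" \
  --set image.tag="$RELEASE_TAG" \
  --set service.targetPort=8001 \
  --set ingress.enabled=true \
  --set ingress.hosts[0].host=api.summit.internal \
  --set ingress.hosts[0].paths[0].path=/maestro \
  --set ingress.hosts[0].paths[0].pathType=Prefix
```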

Comment on lines +199 to +203

```yaml
- name: Upload governance evidence
  uses: actions/upload-artifact@v4
  with:
    name: aws-deploy-evidence-${{ github.run_id }}
    path: evidence-artifacts/
```


P2: Upload deployment evidence even when rollout fails

The artifact upload runs only on success by default, so any failure in rollout verification or verify-deployment.sh prevents evidence-artifacts/ from being published. That removes the caller/cluster evidence needed for incident triage and compliance in exactly the failure scenarios where it is most valuable; adding if: always() (or moving upload earlier) keeps evidence available for failed runs.

Useful? React with 👍 / 👎.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
.github/workflows/deploy-aws.yml (2)

199-204: Always upload the evidence collected for failed deploys.

If rollout verification or the smoke check fails, this step is skipped and the caller/cluster snapshots are lost for the exact attempt you need to debug. Make the upload unconditional, and consider including github.run_attempt so reruns keep distinct evidence.

Suggested change

```diff
   - name: Upload governance evidence
+    if: ${{ always() }}
     uses: actions/upload-artifact@v4
     with:
-      name: aws-deploy-evidence-${{ github.run_id }}
+      name: aws-deploy-evidence-${{ github.run_id }}-${{ github.run_attempt }}
       path: evidence-artifacts/
       retention-days: 30
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-aws.yml around lines 199 - 204, The "Upload
governance evidence" workflow step currently gets skipped on failed deploys;
make the artifact upload unconditional by removing any surrounding if: or
conditional job/step guards so the actions/upload-artifact@v4 step always runs,
and update the artifact name expression to include the run attempt (e.g.,
aws-deploy-evidence-${{ github.run_id }}-attempt-${{ github.run_attempt }}) so
reruns produce distinct evidence; keep the same path (evidence-artifacts/) and
retention-days: 30.

122-136: Make Terraform consume the same region the rest of the workflow uses.

The AWS CLI/actions read AWS_REGION, but terraform/environments/prod/main.tf is wired to var.aws_region. If prod is not already setting that variable via tfvars, this job is either non-deterministic or will fail under -input=false. Export TF_VAR_aws_region here so the provider uses the same region as the rest of the deploy.

Suggested change

```diff
   - name: Terraform apply
+    env:
+      TF_VAR_aws_region: ${{ env.AWS_REGION }}
     working-directory: terraform/environments/prod
     run: |
       set -euo pipefail
       terraform init -input=false
       terraform apply -auto-approve -input=false
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-aws.yml around lines 122 - 136, The Terraform step
is not receiving the workflow AWS region, causing indeterministic runs; set the
TF_VAR_aws_region environment variable from the workflow's AWS_REGION so
Terraform's var.aws_region matches the actions' region. Update the "Terraform
apply" step (or its environment for that job) to export or define
TF_VAR_aws_region="${{ env.AWS_REGION }}" before running terraform init/apply so
the provider uses the same region as the aws-actions/configure-aws-credentials
step.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/deploy-aws.yml:
- Around line 19-20: The concurrency group currently uses a dynamic key ("group:
deploy-aws-${{ github.ref_name || github.sha }}") which allows separate runs to
race; change the concurrency group to a fixed name (for example
"deploy-aws-production") so all production deployments serialize into the same
group, keeping or adjusting cancel-in-progress as desired; update the
concurrency block that defines group/cancel-in-progress to use the fixed string
instead of the expression.
- Around line 6-16: The workflow currently rebuilds and repushes unconditionally
when workflow_dispatch.image_tag is supplied; change the build-and-push logic so
manual runs that specify image_tag do not overwrite existing tags: modify the
build/push job or steps (the steps that perform the build and ECR push) to run
only when no image_tag was passed (e.g., add an if condition checking
github.event.inputs.image_tag is empty) or, if you must allow a push, first
validate the checked-out ref matches the requested tag (use git to resolve the
ref/tag and compare to github.event.inputs.image_tag) and fail if they differ;
update the conditional on the build/push steps or job to use github.event_name
== 'workflow_dispatch' && github.event.inputs.image_tag == '' (or the equivalent
validation check) so manual deploys with image_tag run in deploy-only mode or
safely validate before pushing.
- Around line 168-196: The rollout checks fail because the chart's templates
render deployments as "<release>-universal-app" while the later kubectl checks
expect "<release>" names; update each Helm invocation (the helm upgrade
--install calls for maestro, prov-ledger, and policy-lac) to set the release
name as the intended deployment basename by adding --set
fullnameOverride=<release-name> (i.e., --set fullnameOverride=maestro for the
maestro release, --set fullnameOverride=prov-ledger for prov-ledger, and --set
fullnameOverride=policy-lac for policy-lac) so the generated Deployment names
match the kubectl rollout status checks.
- Around line 88-97: The Dependency audit step currently runs safety without
using maestro's pinned deps; update the "Dependency audit" step so when
matrix.service == "maestro" you either install maestro's locked deps (e.g., run
python -m pip install --upgrade pip && pip install -r maestro/requirements.txt)
before invoking safety or call safety against the manifest (e.g., safety check
-r maestro/requirements.txt) so safety actually scans maestro's dependency
manifest instead of the environment's default packages.
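
As a rough sketch, the last two fixes above might look like the following. The release name, chart path, and requirements path come from the comments; the remaining flags and the rollout timeout are illustrative:

```shell
# Audit maestro's pinned dependencies directly instead of the ambient environment
safety check -r maestro/requirements.txt

# Pin the rendered Deployment name so the later rollout check matches
helm upgrade --install maestro charts/universal-app \
  --namespace "$K8S_NAMESPACE" \
  --set fullnameOverride=maestro \
  --set image.tag="$RELEASE_TAG"

kubectl rollout status deployment/maestro -n "$K8S_NAMESPACE" --timeout=300s
```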

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b153727a-feb9-4764-aac2-edc9acfc75f6

📥 Commits

Reviewing files that changed from the base of the PR and between 12cad4a and 3c28137.

📒 Files selected for processing (1)
  • .github/workflows/deploy-aws.yml

@BrianCLong
Owner Author

Queue note: branch-local workflow issues have been repaired and the AWS deploy workflow was tightened in aeed8adba33bd799f7ea7cd16dc8fa8248cbe12a, so current CI is exercising the intended branch state. Remaining DIRTY status is intentionally constrained to the compose base repair in #22220; once that merges, refresh this branch from main and rerun build, evidence, test, and tests.

@BrianCLong
Owner Author

Closing in favor of #22241 for the active governed surface. The remaining non-main delta only targeted the archived deploy workflow path and was intentionally excluded from the convergence train.

@BrianCLong BrianCLong closed this Mar 29, 2026
