Add Packer template for macOS GHA runner AMIs #8091

Open
malfet wants to merge 4 commits into main from malfet/add-macos-packer

@malfet malfet commented May 16, 2026

Summary

Mirrors aws/ami/windows/ layout for macOS. Builds reusable arm64 macOS AMIs for the PyTorch GHA CI runner fleet, baking the host-shape-independent steps that today run at first boot via pytorch-gha-infra/macos-runners/playbooks/bootstrap-runner.yml.

The template uses an Ansible provisioner that runs a bakeable subset of that playbook: Homebrew packages (gh, jq, tmux, libomp, pstree, miniconda cask), conda init, the runner user, the SSM agent, the CloudWatch agent binary + plist, /opt/runner_scripts/, and boto3/botocore. Per-instance steps — IAM role attach, GH runner registration, starting the CloudWatch daemon with the live config — stay in the runtime playbooks.

A single arm64 base AMI is portable across every Mac2 instance family (mac2, mac2-m2, mac2-m2pro, mac2-m4*) because AWS publishes one arm64_mac base AMI per macOS version, not per chip. The template hardcodes arm64; an x86_64 fork can be added later if needed.

build_macos_ami.py wraps packer init + packer build, resolves the dedicated host's AZ via DescribeHosts so the launch always matches the host, and supports building multiple macOS versions back-to-back on one host. This amortizes the 24h Mac host billing minimum.
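The wrapper's flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of build_macos_ami.py: the function names `resolve_host_az`, `packer_build_cmd`, and `plan_builds` are hypothetical, the `-var` names are taken from the test plan in this PR, and `ec2_client` is assumed to be a boto3 EC2 client (or any object exposing `describe_hosts`).

```python
from typing import Any, List


def resolve_host_az(ec2_client: Any, host_id: str) -> str:
    """Look up the dedicated host's Availability Zone via DescribeHosts."""
    resp = ec2_client.describe_hosts(HostIds=[host_id])
    hosts = resp.get("Hosts", [])
    if not hosts:
        raise ValueError(f"dedicated host {host_id} not found")
    return hosts[0]["AvailabilityZone"]


def packer_build_cmd(host_id: str, az: str, macos_version: str) -> List[str]:
    """Assemble one `packer build` invocation (hypothetical helper)."""
    return [
        "packer", "build",
        "-var", f"host_id={host_id}",
        "-var", f"availability_zone={az}",
        "-var", f"macos_version={macos_version}",
        ".",
    ]


def plan_builds(ec2_client: Any, host_id: str, macos_versions: List[str]) -> List[List[str]]:
    """Resolve the AZ once, then emit one build command per macOS version."""
    az = resolve_host_az(ec2_client, host_id)
    return [packer_build_cmd(host_id, az, v) for v in macos_versions]
```

Each returned command list could then be passed to `subprocess.run(cmd, check=True)` in order, so that all versions are built on the same host within one billing window.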

Validation

Manually verified end-to-end against a fresh mac2.metal host in us-east-2c:

  • Packer launches successfully (with IMDSv2 enforced via metadata_options, as required by org SCP).
  • Ansible playbook runs cleanly on the live instance: 13 tasks, ok=13, changed=9, failed=0.
  • AMI registration succeeds via aws ec2 create-image.

Test plan

  • Reviewer runs packer init . and packer validate -var host_id=h-... -var availability_zone=us-east-2c -var macos_version=14 . against the macos dir
  • CI: none yet — first cut. Follow-up PR can add a manual-dispatch workflow analogous to the Windows AMI builder.
  • Follow-up in pytorch-gha-infra: trim the now-bakeable steps from bootstrap-runner.yml once an AMI is consumed by Terraform's ami_filter_macos_*.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 16, 2026
@malfet malfet force-pushed the malfet/add-macos-packer branch 2 times, most recently from e0cd8ac to e9c1eb0 Compare May 17, 2026 00:29
Mirrors the layout of aws/ami/windows. The template uses an Ansible
provisioner that runs a bakeable subset of pytorch-gha-infra's
bootstrap-runner.yml: Homebrew packages (gh, jq, tmux, libomp, pstree,
miniconda), conda init, the runner user, the SSM agent, the CloudWatch
agent binary + plist, /opt/runner_scripts/, and boto3/botocore.

Per-instance steps (IAM role attach, GH runner registration, starting
the CloudWatch daemon) stay in the runtime playbooks.

build_macos_ami.py wraps `packer init` + `packer build`, resolves the
host's AZ via DescribeHosts so launches always match the dedicated
host, and supports building multiple macOS versions back-to-back on
one host (sequential, waiting out the ~1-2h scrub window between
builds) so the 24h Mac host billing minimum is amortized.

A single arm64 base AMI is portable across every Mac2 instance family
(mac2 / mac2-m2 / mac2-m2pro / mac2-m4*), so the template hardcodes
arm64; an x86_64 fork can be added later if needed.
@malfet malfet force-pushed the malfet/add-macos-packer branch from e9c1eb0 to a7a277d Compare May 17, 2026 13:24
malfet added 3 commits May 21, 2026 13:24
community.general.homebrew[_cask] treats brew's stderr progress and "already
installed" warnings on Homebrew 5.x as hard failures. Switch to shell loops
that mirror the workaround already documented in
pytorch-gha-infra/macos-runners/playbooks/install-runner.yml.
huydhn commented May 22, 2026

Are you going to add a workflow to run this later, like https://github.com/pytorch/test-infra/blob/main/.github/workflows/build-windows-ami.yml?

)
p.add_argument(
    "--availability-zone",
    default="us-east-1a",
I'm not sure how easy it is to manually find an AZ with an available runner. Maybe it's easier to just set the region and sequentially try all AZs in that region until an available one is secured.
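The sequential-AZ idea could look something like the sketch below. It is only an illustration of the suggestion, not code from this PR: `first_az_with_capacity` and `try_allocate` are hypothetical names, and `try_allocate` is assumed to wrap whatever call secures a host in a given AZ (for example EC2 AllocateHosts), returning a host ID on success and `None` when the AZ has no capacity.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_az_with_capacity(
    zones: Iterable[str],
    try_allocate: Callable[[str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Try each AZ in order and return (az, host_id) for the first success.

    Returns None if no AZ in the list yields a host.
    """
    for az in zones:
        host_id = try_allocate(az)
        if host_id is not None:
            return az, host_id
    return None
```

With this shape, the script would only need a `--region` argument, and the AZ list could come from a DescribeAvailabilityZones call for that region.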

register: create_runner_user
changed_when: create_runner_user.stdout != 'RUNNER_USER EXISTS'
ansible.builtin.shell: |
  sudo /opt/runner_scripts/create-runner-user.sh
Nit: sudo here is a bit out of place when the rest of the playbook uses become_user: root

register: create_runner_user
changed_when: create_runner_user.stdout != 'RUNNER_USER EXISTS'
ansible.builtin.shell: |
  sudo /opt/runner_scripts/create-runner-user.sh
I have a question about this new runner user (1001). The step makes sense to me, as it's the same user GitHub runners use. But the Homebrew and conda installation steps later are for ec2-user. Is that expected?

I also saw RUNNER_USER set to runner in https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/macos-runners/playbooks/bootstrap-runner.yml#L44 and to ec2-user in https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/macos-runners/playbooks/install-runner.yml#L37, and got confused about which user is actually used.

@huydhn huydhn left a comment

Have a couple of questions, but the approach looks good to me!

@jeanschmidt jeanschmidt left a comment

I believe we need a better way to handle available hosts.

If we want to run this in CI, we need a RELIABLE way to claim one piece of hardware to build on. That probably means first searching for any host that doesn't have an instance allocated, then trying a host whose instance is not connected to GitHub / available as a runner, then an instance whose runner is idle, and otherwise smartly draining an instance by having it refuse to pick up new jobs once it finishes the current one.

If building images is solely a manual operation, this can be OK, but I would still strongly advise automating it, as properly draining an instance can be a bothersome, lengthy, multi-command process if we need to.

We already have code for all this logic, so it should be a matter of refactoring / reengineering. But the solution already exists.
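As a rough illustration of the preference order described in this comment (not the existing code it refers to), host selection could be expressed as a ranking. Everything here is hypothetical: the `HostState` fields, the `preference` ranks, and `pick_build_host` are invented for the sketch, and gathering the state of each host is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class HostState:
    host_id: str
    has_instance: bool        # an instance is currently allocated on the host
    registered_runner: bool   # that instance is connected to GitHub as a runner
    runner_busy: bool         # the runner is currently executing a job


def preference(h: HostState) -> int:
    """Lower is better: empty host, unregistered, idle runner, then busy."""
    if not h.has_instance:
        return 0
    if not h.registered_runner:
        return 1
    if not h.runner_busy:
        return 2
    return 3  # would need draining: stop accepting jobs after the current one


def pick_build_host(hosts: Iterable[HostState]) -> Optional[HostState]:
    """Return the most preferable host, or None if there are no hosts."""
    return min(hosts, key=preference, default=None)
```

A host that comes back with rank 3 would still need the drain step before a build can start; the sketch only decides which host to go after.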
