Add Packer template for macOS GHA runner AMIs by malfet · Pull Request #8091 · pytorch/test-infra

malfet · 2026-05-16T02:21:28Z

Summary

Mirrors aws/ami/windows/ layout for macOS. Builds reusable arm64 macOS AMIs for the PyTorch GHA CI runner fleet, baking the host-shape-independent steps that today run at first boot via pytorch-gha-infra/macos-runners/playbooks/bootstrap-runner.yml.

The template uses an Ansible provisioner that runs a bakeable subset of that playbook: Homebrew packages (gh, jq, tmux, libomp, pstree, miniconda cask), conda init, the runner user, the SSM agent, the CloudWatch agent binary + plist, /opt/runner_scripts/, and boto3/botocore. Per-instance steps — IAM role attach, GH runner registration, starting the CloudWatch daemon with the live config — stay in the runtime playbooks.

A single arm64 base AMI is portable across every Mac2 instance family (mac2, mac2-m2, mac2-m2pro, mac2-m4*) because AWS publishes one arm64_mac base AMI per macOS version, not per chip. The template hardcodes arm64; an x86_64 fork can be added later if needed.

build_macos_ami.py wraps packer init + packer build, resolves the dedicated host's AZ via DescribeHosts so the launch always matches the host, and supports building multiple macOS versions back-to-back on one host. This amortizes the 24h Mac host billing minimum.
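As a rough illustration of the flow described above (not the actual contents of build_macos_ami.py), the AZ lookup and the per-version Packer invocations could be sketched as follows. The function names and the fake-client-friendly signature are assumptions made for this example; the `-var` names are the ones used in the test plan, and `describe_hosts` is the boto3 EC2 client call behind DescribeHosts.

```python
from typing import Any, List


def resolve_host_az(ec2_client: Any, host_id: str) -> str:
    """Return the Availability Zone of a dedicated host via DescribeHosts.

    `ec2_client` is expected to behave like boto3.client("ec2").
    """
    resp = ec2_client.describe_hosts(HostIds=[host_id])
    hosts = resp.get("Hosts", [])
    if not hosts:
        raise ValueError(f"dedicated host {host_id} not found")
    return hosts[0]["AvailabilityZone"]


def packer_commands(
    host_id: str,
    az: str,
    macos_versions: List[str],
    template_dir: str = ".",
) -> List[List[str]]:
    """Build the argv lists: one `packer init`, then one `packer build` per version."""
    cmds = [["packer", "init", template_dir]]
    for version in macos_versions:
        cmds.append(
            [
                "packer", "build",
                "-var", f"host_id={host_id}",
                "-var", f"availability_zone={az}",
                "-var", f"macos_version={version}",
                template_dir,
            ]
        )
    return cmds
```

Each returned list could be handed to `subprocess.run(cmd, check=True)` in order, so that every requested macOS version is built on the same host within one billing window.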

Validation

Manually verified end-to-end against a fresh mac2.metal host in us-east-2c:

Packer launches successfully (with IMDSv2 enforced via metadata_options, as required by org SCP).
Ansible playbook runs cleanly on the live instance: 13 tasks, ok=13, changed=9, failed=0.
AMI registration succeeds via aws ec2 create-image.

Test plan

Reviewer runs packer init . and packer validate -var host_id=h-... -var availability_zone=us-east-2c -var macos_version=14 . against the macos dir
CI: none yet — first cut. Follow-up PR can add a manual-dispatch workflow analogous to the Windows AMI builder.
Follow-up in pytorch-gha-infra: trim the now-bakeable steps from bootstrap-runner.yml once an AMI is consumed by Terraform's ami_filter_macos_*.

vercel · 2026-05-16T02:21:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
torchci	Ignored	Preview	May 22, 2026 12:05am

Mirrors the layout of aws/ami/windows. The template uses an Ansible provisioner that runs a bakeable subset of pytorch-gha-infra's bootstrap-runner.yml: Homebrew packages (gh, jq, tmux, libomp, pstree, miniconda), conda init, the runner user, the SSM agent, the CloudWatch agent binary + plist, /opt/runner_scripts/, and boto3/botocore. Per-instance steps (IAM role attach, GH runner registration, starting the CloudWatch daemon) stay in the runtime playbooks. build_macos_ami.py wraps `packer init` + `packer build`, resolves the host's AZ via DescribeHosts so launches always match the dedicated host, and supports building multiple macOS versions back-to-back on one host (sequential, waiting out the ~1-2h scrub window between builds) so the 24h Mac host billing minimum is amortized. A single arm64 base AMI is portable across every Mac2 instance family (mac2 / mac2-m2 / mac2-m2pro / mac2-m4*), so the template hardcodes arm64; an x86_64 fork can be added later if needed.

community.general.homebrew[_cask] treats brew's stderr progress and "already installed" warnings on Homebrew 5.x as hard failures. Switch to shell loops that mirror the workaround already documented in pytorch-gha-infra/macos-runners/playbooks/install-runner.yml.

huydhn · 2026-05-22T07:03:29Z

Are you going to add a workflow to run this later, like https://github.com/pytorch/test-infra/blob/main/.github/workflows/build-windows-ami.yml?

huydhn · 2026-05-22T07:12:55Z

+    )
+    p.add_argument(
+        "--availability-zone",
+        default="us-east-1a",


I'm not sure how easy it is to manually find an AZ with an available runner. Maybe it's easier to just set the region and sequentially try all AZs in the region until an available one is secured.
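The suggestion above could look roughly like the sketch below. It is only an illustration: `first_available_az` and `try_allocate` are hypothetical names, and in practice the zone list might come from DescribeAvailabilityZones while `try_allocate` might wrap an AllocateHosts call that returns None when the AZ has no capacity.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_available_az(
    zones: Iterable[str],
    try_allocate: Callable[[str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Try each AZ in order; return (az, host_id) for the first that succeeds.

    `try_allocate` returns a host ID on success, or None when the AZ
    cannot provide a host. Returns None if every AZ fails.
    """
    for az in zones:
        host_id = try_allocate(az)
        if host_id is not None:
            return az, host_id
    return None
```

With this shape the caller would only pass a region, and `--availability-zone` would become an optional override rather than a required guess.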

huydhn · 2026-05-22T07:35:20Z

+      register: create_runner_user
+      changed_when: create_runner_user.stdout != 'RUNNER_USER EXISTS'
+      ansible.builtin.shell: |
+        sudo /opt/runner_scripts/create-runner-user.sh


Nit: sudo here is a bit out of place when the rest of the playbook uses become_user: root

huydhn · 2026-05-22T07:48:02Z

+      register: create_runner_user
+      changed_when: create_runner_user.stdout != 'RUNNER_USER EXISTS'
+      ansible.builtin.shell: |
+        sudo /opt/runner_scripts/create-runner-user.sh


I have a question about this new runner user (1001). The step makes sense to me, as it's the same user GitHub runners use. The Homebrew and conda installation steps later are for ec2-user. Is that expected?

I also saw RUNNER_USER set to either runner on https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/macos-runners/playbooks/bootstrap-runner.yml#L44 or ec2-user on https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/macos-runners/playbooks/install-runner.yml#L37, and got confused about which user is actually used.

huydhn

Have a couple of questions, but the approach looks good to me!

jeanschmidt

I believe we need a better way to handle available hosts.

If we want to run this on CI, we need a RELIABLE way to pick one machine to build on. That probably means first searching for any host that doesn't have an instance allocated, then trying a host whose instance is not connected to GitHub / available as a runner, then trying an instance whose runner is idle, and otherwise smartly draining an instance by having it refuse new jobs once the current one finishes.

If building images is solely a manual op, this can be OK, but I would still strongly advise automating it, as properly draining an instance can be a bothersome, lengthy, multi-command process if we need to do it.

We already have code for all this logic, so it should be a matter of refactoring / reengineering. But the solution already exists.
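To make the preference order above concrete, here is a minimal sketch. All names (`HostState`, `Host`, `pick_build_host`, `needs_drain`) are hypothetical and are not taken from the existing code mentioned above; how a host's state is actually determined is left out.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class HostState(IntEnum):
    """Lower value = cheaper to take over for an AMI build."""

    NO_INSTANCE = 0              # host has no instance allocated
    INSTANCE_NOT_REGISTERED = 1  # instance exists but is not a GitHub runner
    RUNNER_IDLE = 2              # registered runner, no job in progress
    RUNNER_BUSY = 3              # registered runner, currently running a job


@dataclass
class Host:
    host_id: str
    state: HostState


def pick_build_host(hosts: List[Host]) -> Optional[Host]:
    """Pick the host that is least disruptive to use; None if there are no hosts."""
    if not hosts:
        return None
    # min() keeps the first host among equally good candidates.
    return min(hosts, key=lambda h: h.state)


def needs_drain(host: Host) -> bool:
    """A busy runner must stop accepting new jobs and finish its current one first."""
    return host.state == HostState.RUNNER_BUSY
```

A CI workflow could call `pick_build_host` on the fleet and only fall back to draining when `needs_drain` is true for the chosen host.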

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 16, 2026

malfet force-pushed the malfet/add-macos-packer branch 2 times, most recently from e0cd8ac to e9c1eb0 Compare May 17, 2026 00:29

malfet force-pushed the malfet/add-macos-packer branch from e9c1eb0 to a7a277d Compare May 17, 2026 13:24

malfet added 3 commits May 21, 2026 13:24

[macOS AMI] Fix lint: PYFMT + E501

51b1806

Fix PYFMT lint on build_macos_ami.py

f1990ab

huydhn reviewed May 22, 2026


huydhn approved these changes May 22, 2026


jeanschmidt requested changes May 22, 2026


malfet wants to merge 4 commits into main from malfet/add-macos-packer

3 participants
