Add Packer template for macOS GHA runner AMIs #8091
e0cd8ac to e9c1eb0
Mirrors the layout of aws/ami/windows. The template uses an Ansible provisioner that runs a bakeable subset of pytorch-gha-infra's bootstrap-runner.yml: Homebrew packages (gh, jq, tmux, libomp, pstree, miniconda), conda init, the runner user, the SSM agent, the CloudWatch agent binary + plist, /opt/runner_scripts/, and boto3/botocore. Per-instance steps (IAM role attach, GH runner registration, starting the CloudWatch daemon) stay in the runtime playbooks.

build_macos_ami.py wraps `packer init` + `packer build`, resolves the host's AZ via DescribeHosts so launches always match the dedicated host, and supports building multiple macOS versions back-to-back on one host (sequential, waiting out the ~1-2h scrub window between builds) so the 24h Mac host billing minimum is amortized.

A single arm64 base AMI is portable across every Mac2 instance family (mac2 / mac2-m2 / mac2-m2pro / mac2-m4*), so the template hardcodes arm64; an x86_64 fork can be added later if needed.
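The AZ lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in build_macos_ami.py: the function name `resolve_host_az` is hypothetical, and the client is passed in so the sketch stays testable without AWS credentials. It assumes a boto3 EC2 client, whose `describe_hosts` call returns a `Hosts` list with an `AvailabilityZone` field per host.

```python
def resolve_host_az(ec2_client, host_id):
    """Return the Availability Zone of a dedicated host.

    `ec2_client` is expected to behave like boto3.client("ec2").
    `resolve_host_az` is an illustrative name, not taken from the PR.
    """
    response = ec2_client.describe_hosts(HostIds=[host_id])
    hosts = response.get("Hosts", [])
    if not hosts:
        # Defensive check: fail loudly instead of launching in a guessed AZ.
        raise LookupError(f"dedicated host {host_id} not found")
    return hosts[0]["AvailabilityZone"]
```

Deriving the AZ from the host itself means the launch cannot be pointed at a zone where the host does not live, which is the failure mode a hardcoded default invites.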
e9c1eb0 to a7a277d
community.general.homebrew[_cask] treats brew's stderr progress and "already installed" warnings on Homebrew 5.x as hard failures. Switch to shell loops that mirror the workaround already documented in pytorch-gha-infra/macos-runners/playbooks/install-runner.yml.
Are you going to add a workflow to run this later, like https://github.com/pytorch/test-infra/blob/main/.github/workflows/build-windows-ami.yml?
    )
    p.add_argument(
        "--availability-zone",
        default="us-east-1a",
I'm not sure how easy it is to manually find an AZ with an available runner. Maybe it's easier to just set the region and try all AZs in the region sequentially until an available one is secured.
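The fallback suggested here could look something like the sketch below. It is only an illustration of the idea: `allocate_in_first_available_az` and the `allocate` callback are hypothetical names, not part of the PR. With boto3, the zone list would likely come from `describe_availability_zones` and the callback would wrap `allocate_hosts`, narrowing the caught exception to botocore's `ClientError`.

```python
def allocate_in_first_available_az(zones, allocate):
    """Try each AZ in order and return (zone, host_id) for the first success.

    `allocate` is a caller-supplied function (hypothetical here) that takes
    an AZ name and returns a host ID, raising an exception when the zone has
    no capacity.
    """
    errors = {}
    for zone in zones:
        try:
            return zone, allocate(zone)
        except Exception as exc:  # a real caller would catch ClientError only
            errors[zone] = exc
    raise RuntimeError(f"no AZ could provide a host: {errors}")
```

Collecting the per-zone errors keeps the final failure message useful when every zone in the region is out of capacity.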
    register: create_runner_user
    changed_when: create_runner_user.stdout != 'RUNNER_USER EXISTS'
    ansible.builtin.shell: |
      sudo /opt/runner_scripts/create-runner-user.sh
Nit: sudo here is a bit out of place when the rest of the playbook uses become_user: root
    register: create_runner_user
    changed_when: create_runner_user.stdout != 'RUNNER_USER EXISTS'
    ansible.builtin.shell: |
      sudo /opt/runner_scripts/create-runner-user.sh
I have a question about this new runner user (1001). The step makes sense to me, as it's the same user GitHub runners use. However, the Homebrew and conda installation steps later are for ec2-user. Is that expected?
I also saw RUNNER_USER set to either runner on https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/macos-runners/playbooks/bootstrap-runner.yml#L44 or ec2-user on https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/macos-runners/playbooks/install-runner.yml#L37, and got confused about which user is actually used.
huydhn left a comment
Have a couple of questions, but the approach looks good to me!
I believe we need a better way to handle available hosts.
If we want to run this on CI, we need a RELIABLE way to secure one piece of hardware to build on. That probably means searching first for any host that doesn't have an instance allocated, then for a host whose instance is not connected to GitHub or available as a runner, then for an instance whose runner is idle, and otherwise smartly draining an instance by preventing it from picking up new jobs once it finishes its current one.
If building images is solely a manual operation, this can be OK, but I would still strongly advise automating it, since properly draining an instance can be a bothersome, lengthy, multi-command process if we need to do it.
We already have code for all this logic, so it should be a matter of refactoring / re-engineering. The solution already exists.
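The selection order described in this comment could be sketched as a simple ranking. Everything here is hypothetical: `HostInfo`, its fields, and `pick_build_host` are illustrative names, and this is not the existing code the comment refers to. It only shows the priority order: empty host first, then an instance with no registered runner, then an idle runner, and draining a busy runner as the last resort.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class HostInfo:
    """Hypothetical snapshot of one dedicated Mac host."""
    host_id: str
    has_instance: bool
    runner_registered: bool = False
    runner_busy: bool = False


def rank(host: HostInfo) -> int:
    """Lower rank means less disruption when taking the host for a build."""
    if not host.has_instance:
        return 0  # nothing allocated on the host
    if not host.runner_registered:
        return 1  # instance exists but is not connected to GitHub
    if not host.runner_busy:
        return 2  # runner is idle
    return 3      # runner is busy and would have to be drained first


def pick_build_host(hosts: Sequence[HostInfo]) -> Optional[Tuple[HostInfo, bool]]:
    """Return (host, needs_drain) for the least disruptive host, or None."""
    if not hosts:
        return None
    best = min(hosts, key=rank)
    return best, rank(best) == 3
```

Returning the `needs_drain` flag alongside the host lets the caller decide whether to start the drain procedure or proceed straight to the build.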
Summary

Mirrors the `aws/ami/windows/` layout for macOS. Builds reusable arm64 macOS AMIs for the PyTorch GHA CI runner fleet, baking the host-shape-independent steps that today run at first boot via `pytorch-gha-infra/macos-runners/playbooks/bootstrap-runner.yml`.

The template uses an Ansible provisioner that runs a bakeable subset of that playbook: Homebrew packages (`gh`, `jq`, `tmux`, `libomp`, `pstree`, `miniconda` cask), `conda init`, the `runner` user, the SSM agent, the CloudWatch agent binary + plist, `/opt/runner_scripts/`, and `boto3`/`botocore`. Per-instance steps (IAM role attach, GH runner registration, starting the CloudWatch daemon with the live config) stay in the runtime playbooks.

A single arm64 base AMI is portable across every Mac2 instance family (`mac2`, `mac2-m2`, `mac2-m2pro`, `mac2-m4*`) because AWS publishes one `arm64_mac` base AMI per macOS version, not per chip. The template hardcodes arm64; an x86_64 fork can be added later if needed.

`build_macos_ami.py` wraps `packer init` + `packer build`, resolves the dedicated host's AZ via `DescribeHosts` so the launch always matches the host, and supports building multiple macOS versions back-to-back on one host. This amortizes the 24h Mac host billing minimum.

Validation

Manually verified end-to-end against a fresh `mac2.metal` host in `us-east-2c`:

- [...] `metadata_options`, as required by org SCP).
- [...] `aws ec2 create-image`.

Test plan

- `packer init .` and `packer validate -var host_id=h-... -var availability_zone=us-east-2c -var macos_version=14 .` against the macos dir
- `pytorch-gha-infra`: trim the now-bakeable steps from `bootstrap-runner.yml` once an AMI is consumed by Terraform's `ami_filter_macos_*`.
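The back-to-back build flow from the summary can be approximated as below. This is a sketch under stated assumptions, not the contents of `build_macos_ami.py`: the function names are hypothetical, the `-var` names are the ones shown in the test plan, and it assumes the dedicated host reports a state other than `available` while it is being scrubbed between builds. The state lookup, command runner, and sleep function are injected so the logic can be exercised without AWS or Packer.

```python
import time


def packer_build_command(host_id, availability_zone, macos_version):
    """Assemble a `packer build` argv using the -var names from the test plan."""
    return [
        "packer", "build",
        "-var", f"host_id={host_id}",
        "-var", f"availability_zone={availability_zone}",
        "-var", f"macos_version={macos_version}",
        ".",
    ]


def wait_for_host(get_state, poll_seconds=300, timeout_seconds=3 * 3600,
                  sleep=time.sleep):
    """Poll until the host reports 'available', or raise TimeoutError."""
    waited = 0
    while get_state() != "available":
        if waited >= timeout_seconds:
            raise TimeoutError("host did not become available in time")
        sleep(poll_seconds)
        waited += poll_seconds


def build_all(versions, run, get_state, host_id, availability_zone,
              **wait_kwargs):
    """Build each macOS version in sequence on a single dedicated host."""
    for version in versions:
        # The host is unusable while it is scrubbed after the previous build.
        wait_for_host(get_state, **wait_kwargs)
        run(packer_build_command(host_id, availability_zone, version))
```

In practice `run` might be `subprocess.run` with `check=True`, and `get_state` might read the `State` field from a `describe_hosts` response; both choices are assumptions rather than details confirmed by the PR.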