155 changes: 155 additions & 0 deletions aws/ami/macos/README.md
@@ -0,0 +1,155 @@
# Build macOS AMIs

This folder uses Packer to bake reusable macOS AMIs for PyTorch GHA CI
runners, mirroring the layout under `../windows`.

The baked AMI contains everything host-shape-independent — Homebrew
packages (`gh`, `jq`, `tmux`, `libomp`, `pstree`, `miniconda` cask),
`conda init`, the `runner` user, the SSM agent, the CloudWatch agent
binary + plist, `/opt/runner_scripts/`, and `boto3`/`botocore` for the
runtime Ansible plays. Per-instance steps (IAM role attach, GH runner
registration, starting the CloudWatch daemon with the live config)
remain in the runtime playbooks under
`pytorch-gha-infra/macos-runners/playbooks`.

## Why per-arch AMIs are enough

AWS publishes one `arm64_mac` base AMI per macOS version, and one
`x86_64_mac` base AMI per macOS version. There is no per-chip variant
(no separate M1/M2/M2-Pro/M4 AMI). A custom AMI built from one of those
base images is portable across every Apple Silicon Mac instance family
(`mac2.metal`, `mac2-m2.metal`, `mac2-m2pro.metal`, `mac-m4*.metal`).
The build matrix is therefore `(arch, macos_version)`: 2-4 AMIs total in
practice, not 10+.

## Why a Python driver instead of plain `packer build`

EC2 Mac instances require a Dedicated Host. Dedicated Mac hosts have:

- A **24-hour minimum billing window**. Releasing earlier still costs a
full day.
- A **~1-2 hour scrubbing window** after every instance terminates,
during which the host cannot accept a new launch.

Letting Packer allocate and release a host per build would cost one
host-day per AMI. The driver script (`build_macos_ami.py`) allocates a
single host, runs N packer builds sequentially against it (waiting out
the scrub window between builds), and leaves it allocated by default so
you don't pay for a fresh day on the next invocation.
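
The cost argument above can be made concrete with a small calculation. This is a minimal sketch, not part of the driver: the function names are hypothetical, the build and scrub durations are illustrative inputs, and it assumes that time beyond the 24-hour minimum is billed in proportion to use.

```python
MIN_BILLED_HOURS = 24  # minimum billing window for a Dedicated Mac host


def billed_hours_host_per_build(n_builds: int, build_hours: float) -> float:
    """Each build allocates and releases its own host, so each one
    pays at least the full minimum window."""
    return n_builds * max(MIN_BILLED_HOURS, build_hours)


def billed_hours_shared_host(
    n_builds: int, build_hours: float, scrub_hours: float
) -> float:
    """All builds run sequentially on one host, waiting out the scrub
    window between consecutive builds."""
    if n_builds == 0:
        return 0.0
    total = n_builds * build_hours + (n_builds - 1) * scrub_hours
    return max(MIN_BILLED_HOURS, total)


if __name__ == "__main__":
    # Illustrative numbers only: three builds of 2h each, 2h scrub between them.
    print(billed_hours_host_per_build(3, 2.0))    # one minimum window per build
    print(billed_hours_shared_host(3, 2.0, 2.0))  # a single minimum window
```

With these illustrative inputs the shared host fits all three builds inside one 24-hour window, while the host-per-build approach pays for three.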

## Setup

1. Configure AWS credentials (`AWS_PROFILE=fbossci` for the PyTorch CI
   account).
2. Install Packer
   ([instructions](https://developer.hashicorp.com/packer/tutorials/docker-get-started/get-started-install-cli)).
3. Install Ansible locally (Packer's Ansible provisioner runs it from
   the build host, not the target):
   ```bash
   pip install ansible boto3
   ansible-galaxy install -r ansible/requirements.yml
   ```
4. `cd` here and run `packer init .` (the driver also does this).

## Usage

### Host discovery

If `--host-id` is not passed, the driver looks for an existing Dedicated
Host tagged `Name=packer-macos-arm64-builder` in `--region`. The first
idle match (state `available`, no running instances) is reused;
otherwise the driver allocates a fresh host with that same tag. Pass
`--no-reuse` to force allocation, or `--host-id h-...` to pin to a
specific host.

This means the common case is one command, no manual host bookkeeping:

```bash
AWS_PROFILE=fbossci python build_macos_ami.py --region us-east-2 --macos-version 14
```
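
The host-discovery rule described in this section can be sketched as a pure function. This is an illustration under stated assumptions, not the driver's actual code: `pick_idle_host` is a hypothetical name, and the input is assumed to be the `Hosts` list from an EC2 `DescribeHosts` response that was already filtered by the `Name` tag.

```python
from typing import Optional

# Tag value the driver looks for, per the description above.
BUILDER_TAG = "packer-macos-arm64-builder"


def pick_idle_host(hosts: list[dict]) -> Optional[str]:
    """Return the ID of the first idle host, or None if a fresh host is needed.

    "Idle" means state `available` with no instances on the host.
    Assumption: `hosts` could be obtained with boto3 along the lines of
    ec2.describe_hosts(
        Filters=[{"Name": "tag:Name", "Values": [BUILDER_TAG]}]
    )["Hosts"]
    """
    for host in hosts:
        if host.get("State") == "available" and not host.get("Instances"):
            return host["HostId"]
    return None


if __name__ == "__main__":
    sample = [
        {"HostId": "h-busy", "State": "available",
         "Instances": [{"InstanceId": "i-0abc"}]},
        {"HostId": "h-idle", "State": "available", "Instances": []},
    ]
    print(pick_idle_host(sample))
```

A `None` result corresponds to the "allocate a fresh host with that same tag" branch.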

### Build all supported macOS versions on one host (cost-optimal)

```bash
AWS_PROFILE=fbossci python build_macos_ami.py \
  --region us-east-2 \
  --macos-version 14 \
  --macos-version 15 \
  --macos-version 26
```

Mac dedicated hosts have a 24h billing minimum, so amortizing multiple
builds across one host avoids paying for multiple host-days.

### Smoke-test the provisioners without creating an AMI

```bash
AWS_PROFILE=fbossci python build_macos_ami.py \
  --region us-east-2 --macos-version 14 --skip-create-ami
```

### Release the host when fully done

```bash
aws ec2 release-hosts --host-ids h-0123456789abcdef0 --region us-east-2
```

Alternatively, pass `--release-after` to the driver. The host is still
billed for the full 24-hour minimum.

## Multi-region publication

The template defaults to publishing the AMI to both `us-east-1` and
`us-east-2` (the regions PyTorch CI currently runs Mac runners in).
Packer registers in the build region first, then issues `CopyImage` to
the other regions in the list — each copy creates a fresh AMI ID and a
fresh EBS snapshot in that region.

To narrow or widen the set, pass `ami_regions` through:

```bash
python build_macos_ami.py \
  --host-id h-... --region us-east-2 --macos-version 14 \
  --packer-extra-arg='-var=ami_regions=["us-east-1","us-east-2","us-west-2"]'
```
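
When the driver is invoked from another script rather than a shell, the `ami_regions` override shown above can be built without hand-written quoting. A minimal sketch, assuming the `--packer-extra-arg` flag accepts the `=value` form used in the shell example; `ami_regions_arg` is a hypothetical helper.

```python
import json


def ami_regions_arg(regions: list[str]) -> str:
    """Build one argv element overriding the `ami_regions` Packer variable.

    The list is serialized as compact JSON, which matches the literal used
    in the shell example. No shell quoting is needed when the result is
    passed as a list element to subprocess.run.
    """
    encoded = json.dumps(list(regions), separators=(",", ":"))
    return f"--packer-extra-arg=-var=ami_regions={encoded}"


if __name__ == "__main__":
    print(ami_regions_arg(["us-east-1", "us-east-2", "us-west-2"]))
```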

The `CopyImage` API call itself carries no charge, but each destination
region incurs snapshot storage (~$0.05/GB-month) and a one-time
inter-region data-transfer charge for the snapshot.

## Consuming the AMI from Terraform

Mirror the Windows pattern in
`pytorch-gha-infra/runners/regions/us-east-1/main.tf`:

```hcl
ami_owners_macos_arm64 = ["<this-account-id>"]
ami_filter_macos_arm64 = {
  name         = ["pytorch-ci-macos-14-arm64-*"]
  architecture = ["arm64_mac"]
}
```

Because the same AMI name lands in every region in `ami_regions` (with
different IDs), Terraform's per-region lookup naturally resolves to the
local copy without extra configuration. The AMI name embeds
`(macos_version, arch, timestamp)`, so filters can be as broad or
narrow as needed.
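
To see how broad and narrow name filters behave, the wildcard matching can be approximated locally. This is a sketch under assumptions: the AMI names below are made up (the real timestamp format is not specified here), and Python's `fnmatchcase` is only a stand-in for the EC2 filter's `*` wildcard.

```python
from fnmatch import fnmatchcase


def matching_amis(names: list[str], pattern: str) -> list[str]:
    """Return the names that match a wildcard pattern, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Hypothetical AMI names following (macos_version, arch, timestamp).
    names = [
        "pytorch-ci-macos-14-arm64-20250101000000",
        "pytorch-ci-macos-15-arm64-20250101000000",
        "pytorch-ci-macos-14-x86_64-20250101000000",
    ]
    # Narrow: one macOS version, one architecture.
    print(matching_amis(names, "pytorch-ci-macos-14-arm64-*"))
    # Broad: any macOS version on arm64.
    print(matching_amis(names, "pytorch-ci-macos-*-arm64-*"))
```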

## Layout

```
macos/
├── README.md                  # this file
├── plugins.pkr.hcl            # required packer plugins (amazon, ansible)
├── variables.pkr.hcl          # input variables
├── macos.pkr.hcl              # source + build blocks
├── build_macos_ami.py         # host-lifecycle driver
├── ansible/
│   ├── bake.yml               # tasks baked into the AMI
│   └── requirements.yml       # ansible-galaxy deps
├── scripts/                   # shipped to /opt/runner_scripts/ in AMI
│   ├── create-runner-user.sh
│   └── install-ssm-agent.sh
└── configs/
    └── cloudwatch_config.json # staged for the runtime playbook
```
128 changes: 128 additions & 0 deletions aws/ami/macos/ansible/bake.yml
@@ -0,0 +1,128 @@
---
# Tasks here are baked into the AMI. Anything that depends on per-instance
# state (IAM role attach, GH runner registration, instance-id-derived names)
# must stay in the runtime playbooks under pytorch-gha-infra/macos-runners.

- name: Bake macOS GHA runner AMI
  hosts: all
  module_defaults:
    shell:
      executable: /bin/zsh
  become: true
  become_user: ec2-user
  tasks:
    - name: Ensure boto libraries are installed
      ansible.builtin.pip:
        name:
          - boto3
          - botocore
        executable: pip3

    - name: Transfer runner scripts
      become: true
      become_user: root
      ansible.builtin.copy:
        src: ../scripts/
        dest: /opt/runner_scripts/
        directory_mode: true
        mode: '0755'

    - name: Create post-job log directory
      become: true
      become_user: root
      ansible.builtin.file:
        path: /var/log/post_job
        state: directory
        mode: '0777'

    - name: Create runner user
      environment:
        RUNNER_USER: runner
      register: create_runner_user
      changed_when: create_runner_user.stdout != 'RUNNER_USER EXISTS'
      ansible.builtin.shell: |
        sudo /opt/runner_scripts/create-runner-user.sh
Review comment (Contributor):

Nit: `sudo` here is a bit out of place when the rest of the playbook uses `become_user: root`.

Review comment (Contributor):

I have a question about this new `runner` user (1001). The step makes sense to me, as it's the same user GitHub runners use. However, the Homebrew and conda installation steps later run as `ec2-user`. Is that expected?

I also saw `RUNNER_USER` set to `runner` in https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/macos-runners/playbooks/bootstrap-runner.yml#L44 and to `ec2-user` in https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/macos-runners/playbooks/install-runner.yml#L37, and got confused about which user is actually used.

        echo -n "RUNNER_USER EXISTS"

    - name: Install SSM Agent
      become: true
      become_user: root
      register: install_ssm
      changed_when: install_ssm.stdout != 'SSM INSTALLED'
      ansible.builtin.shell: |
        bash /opt/runner_scripts/install-ssm-agent.sh

    # Avoid community.general.homebrew_cask / .homebrew: Homebrew 5.x emits
    # "already installed" / progress lines on stderr, which the modules treat
    # as hard failures. Same workaround as macos-runners/playbooks/install-runner.yml.
    - name: Install Homebrew cask dependencies
      register: brew_cask_install
      changed_when: "'Installing' in brew_cask_install.stdout"
      ansible.builtin.shell: |
        set -eu
        export PATH="/opt/homebrew/bin:/opt/homebrew/sbin:${PATH:-}"
        for cask in miniconda; do
          if /opt/homebrew/bin/brew list --cask "$cask" >/dev/null 2>&1; then
            echo "Present: $cask"
          else
            echo "Installing: $cask"
            HOMEBREW_NO_AUTO_UPDATE=1 NONINTERACTIVE=1 \
              /opt/homebrew/bin/brew install --cask "$cask"
          fi
        done

    - name: Install Homebrew dependencies
      register: brew_install
      changed_when: "'Installing' in brew_install.stdout"
      ansible.builtin.shell: |
        set -eu
        export PATH="/opt/homebrew/bin:/opt/homebrew/sbin:${PATH:-}"
        for pkg in gh jq tmux libomp pstree; do
          if /opt/homebrew/bin/brew list --formula "$pkg" >/dev/null 2>&1; then
            echo "Present: $pkg"
          else
            echo "Installing: $pkg"
            HOMEBREW_NO_AUTO_UPDATE=1 NONINTERACTIVE=1 \
              /opt/homebrew/bin/brew install "$pkg"
          fi
        done

    - name: Initialize conda in shell rc files
      register: conda_init
      changed_when: '"modified" in conda_init.stdout'
      ansible.builtin.shell: |
        /opt/homebrew/bin/conda init --all || /usr/local/bin/conda init --all

    # CloudWatch agent: download from the public bucket, install into /opt/aws,
    # and place the LaunchDaemon plist. The first-boot playbook is still
    # responsible for `fetch-config -s` (which starts the daemon with the
    # current config file).
    - name: Determine CloudWatch agent download URL
      ansible.builtin.set_fact:
        cw_agent_url: >-
          {{ 'https://amazoncloudwatch-agent.s3.amazonaws.com/darwin/arm64/latest/amazon-cloudwatch-agent.pkg'
             if ansible_architecture == 'arm64'
             else 'https://amazoncloudwatch-agent.s3.amazonaws.com/darwin/amd64/latest/amazon-cloudwatch-agent.pkg' }}

    - name: Download CloudWatch agent installer
      become: true
      become_user: root
      ansible.builtin.get_url:
        url: "{{ cw_agent_url }}"
        dest: /tmp/amazon-cloudwatch-agent.pkg
        mode: '0644'

    - name: Install CloudWatch agent
      become: true
      become_user: root
      ansible.builtin.command:
        cmd: installer -pkg /tmp/amazon-cloudwatch-agent.pkg -target /
        creates: /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl

    - name: Stage CloudWatch config (consumed at first boot)
      become: true
      become_user: root
      ansible.builtin.copy:
        src: ../configs/cloudwatch_config.json
        dest: /opt/runner_scripts/cloudwatch_config.json
        mode: '0644'
4 changes: 4 additions & 0 deletions aws/ami/macos/ansible/requirements.yml
@@ -0,0 +1,4 @@
---
collections:
  - name: community.general
  - name: amazon.aws