fix(ci): harden Android NativeAOT instrumentation test scripts against transient failures #1247

Open
agneszitte wants to merge 1 commit into master from dev/agzi/fix-android-naot-ci-flakiness

agneszitte commented Apr 2, 2026

Summary

Fixes transient CI flakiness in the Android+Skia+NativeAOT Instrumentation Test pipeline introduced by #1245.

Three flakiness modes were observed:

  1. sdkmanager --install dying at "33% Unzipping": it exits with code 1 on transient network/IO issues, and set -euxo pipefail aborts the entire pipeline immediately with no retry. First observed in CI build 204968.
  2. Emulator ANR crashing the boot-wait script: adb shell input keyevent returns exit 255 while dismissing an "Application Not Responding: system" dialog during emulator boot, and set -e kills the script. First observed in canary build 205041 (Attempt 1).
  3. Job never reaching adb shell am instrument: adb install can fail transiently right after emulator boot, again aborting due to set -e.

Changes

Fix 1 — sdkmanager_install() retry wrapper (android-sdk-emu.inc.sh)

  • Wraps all sdkmanager --install calls in a function that retries up to 3 times with increasing back-off (1 s, 2 s).
  • Addresses the "33% Unzipping" flakiness observed in CI build 204968.
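
For illustration only, a retry wrapper of this shape could look like the sketch below. It is not the code from the PR: the 3-attempt limit and the 1 s / 2 s back-off come from the description above, while the loop structure and the final error message are assumptions. It also assumes sdkmanager is on PATH.

```shell
# Sketch of a retry wrapper around `sdkmanager --install` (assumed structure).
sdkmanager_install() {
  local max_attempts=3 attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Running the command as an `if` condition keeps `set -e` from aborting
    # the script when a single attempt fails.
    if sdkmanager --install "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "sdkmanager --install $* failed (attempt $attempt/$max_attempts), retrying in ${attempt}s..."
      sleep "$attempt"   # back-off grows with the attempt number: 1 s, then 2 s
    fi
    attempt=$((attempt + 1))
  done
  # Assumed final message; the real script may report the failure differently.
  echo "sdkmanager --install $* failed after $max_attempts attempts" >&2
  return 1
}
```

A caller would then use sdkmanager_install "platforms;android-36" in place of a bare sdkmanager --install, so the pipeline aborts only once all attempts have failed.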

Fix 2 — Stop downloading unused API-36 system image (android-sdk-emu.inc.sh)

  • install_android_sdk 36 was downloading the full system-images;android-36;google_apis_playstore;x86_64 (~1.5 GB), but the emulator AVD is always created with API-34.
  • Now only installs platforms;android-36 (needed for build-tools/apkanalyzer), saving ~1.5 GB of download and reducing the window for transient failures.
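
As a rough sketch of the package selection described above, the hypothetical helper below prints the sdkmanager package IDs for an API level. The helper name, its "with-image" argument, and the reuse of the google_apis_playstore;x86_64 image flavor for other API levels are assumptions made for illustration; they are not taken from android-sdk-emu.inc.sh.

```shell
# Hypothetical helper: list the sdkmanager packages needed for an API level.
#   $1 = API level (e.g. 36)
#   $2 = "with-image" to also request the emulator system image (assumed flag)
sdk_packages_for() {
  local api="$1" mode="${2:-platform-only}"
  # The platform package is always needed (build-tools/apkanalyzer).
  echo "platforms;android-${api}"
  # The system image (~1.5 GB for API 36) is only needed for the emulator's API level.
  if [ "$mode" = "with-image" ]; then
    echo "system-images;android-${api};google_apis_playstore;x86_64"
  fi
}
```

With this shape, sdk_packages_for 36 yields only platforms;android-36, while the API level used for the AVD would be requested with the extra "with-image" argument.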

Fix 3 — adb install retry (android-test-run.sh)

  • Adds a 3-attempt retry loop with back-off around adb install -r, since the emulator may be transiently unresponsive right after boot.
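
A minimal sketch of such a loop follows, assuming adb is on PATH. The function name is hypothetical, and the back-off durations (1 s, 2 s) are an assumption, since the description above only states that there are 3 attempts with back-off.

```shell
# Hypothetical helper: retry `adb install -r` while the emulator settles after boot.
adb_install_with_retry() {
  local apk="$1" max_attempts=3 attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # As an `if` condition, a failed install does not trigger `set -e`.
    if adb install -r "$apk"; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "adb install failed (attempt $attempt/$max_attempts), retrying in ${attempt}s..."
      sleep "$attempt"   # assumed back-off: 1 s, then 2 s
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

The non-zero return after the last attempt still lets set -e abort the job, so a persistent install failure is not hidden.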

Fix 4 — Guard adb keyevent calls against transient exit 255 (android-uitest-wait-systemui.sh)

  • Appends || true to all non-critical adb shell input keyevent and adb shell calls in the emulator boot-wait loop and post-boot setup.
  • Under set -e, a transient exit 255 from adb shell input keyevent KEYCODE_ENTER (while dismissing an ANR dialog) was killing the entire script. The loop can now retry on the next iteration instead of aborting.
  • Directly addresses the emulator ANR failure in canary build 205041 (Attempt 1).
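
To show why the guard matters, here is a simplified boot-wait loop. It is a sketch, not the contents of android-uitest-wait-systemui.sh: the function name, the polling of sys.boot_completed, the 1-second interval, and the default of 60 tries are all assumptions.

```shell
# Hypothetical boot-wait loop: poll until the emulator reports boot completion.
wait_for_boot() {
  local tries=0 max_tries="${1:-60}"
  while [ "$tries" -lt "$max_tries" ]; do
    # Assumed readiness check; `tr` strips the carriage return adb may append.
    if [ "$(adb shell getprop sys.boot_completed 2>/dev/null | tr -d '\r')" = "1" ]; then
      return 0
    fi
    # Non-critical: try to dismiss a possible ANR dialog. Without `|| true`,
    # a transient exit 255 here would abort the whole script under `set -e`.
    adb shell input keyevent KEYCODE_ENTER || true
    tries=$((tries + 1))
    sleep 1
  done
  return 1
}
```

Only the keyevent call is guarded; the function still returns non-zero if the emulator never finishes booting, so genuine boot failures remain visible.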

Related

Note: No related issue (CI maintenance).

Copilot AI left a comment

Pull request overview

This PR reduces Android NativeAOT instrumentation test flakiness in CI by adding retry logic around known transient failure points and by avoiding downloading an unused Android system image.

Changes:

  • Add a sdkmanager --install retry wrapper with back-off to handle transient install/unzip failures.
  • Stop downloading the unused API-36 emulator system image; install only the API-36 platform package.
  • Add retry logic for adb install -r during emulator startup.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
build/scripts/android-sdk-emu.inc.sh | Adds sdkmanager_install() retry helper and skips downloading API-36 system image while still installing API-36 platform.
build/scripts/android-test-run.sh | Adds retry loop around adb install -r to tolerate transient emulator unresponsiveness.

agneszitte force-pushed the dev/agzi/fix-android-naot-ci-flakiness branch from 41e2828 to c659a93 April 2, 2026 00:27
agneszitte changed the title from "fix(ci): add retry logic to Android NativeAOT instrumentation test scripts" to "fix(ci): harden Android NativeAOT instrumentation test scripts against transient failures" Apr 2, 2026
agneszitte requested a review from Copilot April 2, 2026 00:28
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.


…t transient failures

- Add sdkmanager_install() retry wrapper (3 attempts with back-off)
  for all sdkmanager --install calls (android-sdk-emu.inc.sh)
- Stop downloading unused API-36 system image (~1.5 GB saving);
  only install platforms;android-36 needed for build-tools
- Add adb install retry loop (3 attempts) in android-test-run.sh
- Guard adb shell keyevent/settings calls with || true in
  android-uitest-wait-systemui.sh to prevent set -e abort on
  transient exit 255 during emulator ANR dismissal
agneszitte force-pushed the dev/agzi/fix-android-naot-ci-flakiness branch from c659a93 to 3c1d368 April 2, 2026 00:54
agneszitte requested a review from Copilot April 2, 2026 01:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


agneszitte requested a review from jonpryor April 2, 2026 01:11
agneszitte marked this pull request as ready for review April 2, 2026 01:32
    fi
    if (( attempt < max_attempts )); then
      echo "sdkmanager --install $* failed (attempt $attempt/$max_attempts), retrying in ${attempt}s..."
      sleep "$attempt"

Maybe 10x that value; 1~3 s might not be enough to escape the same fate as the previous attempt(s).
In the grand scheme of things, 10~30 s (1 min total) is really nothing if it lets us move forward.

The problem is that we don't understand why it's failing in the first place. It just up and dies, with nothing in the log files. This makes it slightly annoying to reason about.
