Validate pull requests in TaskCluster by jugglinmike · Pull Request #12657 · web-platform-tests/wpt

jugglinmike · 2018-08-23T18:08:30Z

Extend the configuration for the TaskCluster service to generate tasks in
response to GitHub pull requests. Each pull request should create one cluster
containing four tasks, all of which operate on the affected tests as identified
by wpt tests-affected:

- run the tests in Chrome and publish the results as a gzipped artifact
- run the tests in Firefox and publish the results as a gzipped artifact
- verify the stability of the tests in Chrome
- verify the stability of the tests in Firefox

The former two tasks are intended for future use in comparing test results with
data available on https://wpt.fyi. This will be informative only. The latter
two tasks are intended to verify that the affected tests are stable. This will
influence whether or not the pull request may be merged.
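As a rough illustration of the "gzipped artifact" step, the sketch below compresses a results file with Python's standard library. The function name `gzip_file` and the file name used in the usage note are hypothetical and not taken from this patch.

```python
import gzip
import shutil


def gzip_file(source_path, dest_path=None):
    """Compress source_path with gzip and return the path of the archive."""
    if dest_path is None:
        dest_path = source_path + ".gz"
    # Stream the contents so a large results file is never held in memory.
    with open(source_path, "rb") as source, gzip.open(dest_path, "wb") as dest:
        shutil.copyfileobj(source, dest)
    return dest_path
```

A task could call `gzip_file("wpt_report.json")` (an assumed file name) after the run and publish the resulting `.gz` file as its artifact.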

A demonstration of expected functionality is available on Bocoup's fork of WPT:

- master branch (this functionality was implemented previously, and this pull
  request is not intended to influence the behavior we see today)
  - master branch of WPT fork
  - Task cluster
- Pull request validation
  - pull request
  - Task cluster

It seems wise to vet this in WPT for a bit before allowing the results to
control whether pull requests may be merged. The configuration proposed here
will permit unstable pull requests, but it stores the status of the check in a
dedicated artifact. After some time, we can use this to compare the behavior
between the current TravisCI-powered checks and this new one. If they align, we
can revert the second commit.
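One way to picture "permit unstable pull requests, but store the status in a dedicated artifact" is the minimal sketch below. It is not the code from this patch: `run_and_record` and the status file are hypothetical names, and the real configuration may record the status differently.

```python
import subprocess


def run_and_record(command, status_path):
    """Run command, write its exit status to status_path, and return it."""
    status = subprocess.call(command)
    # The status is kept in a file (to be published as an artifact) so that
    # it can later be compared with the TravisCI result for the same patch.
    with open(status_path, "w") as handle:
        handle.write("%d\n" % status)
    return status
```

A caller that ignores the return value and exits with status 0 lets the task succeed even when the stability check fails, which is roughly the temporary behavior described above.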

Due to some changes to indentation, using -w or the ?w=1 query string
parameter may make this change set easier to review.

[fixes #10503]

jugglinmike · 2018-08-24T00:59:53Z

I had to iterate on this a bit after changing the upstream repository, but this patch is functioning as expected. I've rebased and pushed up a couple of commits to demonstrate how the CI handles patches with no affected tests and how it handles patches with one or more affected tests (and unstable ones, at that). Those final two commits should not be merged to master.

jgraham

Very very excited to see this land. But I think the implementation got a little hard to maintain by trying to force everything to go via a shell script.

jgraham · 2018-08-24T10:12:03Z

+                owner: ${event.pusher.email}
+                source: ${event.repository.url}
+              payload:
+                image: jugglinmike/web-platform-tests:0.18


We really need to set up a shared dockerhub account :)

jgraham · 2018-08-24T10:13:07Z

+            - name: wpt-${browser.name}-${browser.channel}-stability
+              description: >-
+                Verify that all tests affected by a pull request are stable
+                when executed in ${browser.name}. As of 2018-08-23, this task


Remove the extra text from the description; I don't think it's that helpful (we're more likely to forget to remove it later).

I'd like there to be some contributor-facing indication of this behavior.
I appreciate the concern about bitrot, though, so I've moved it to a statement
logged to standard out immediately following the command invocation. Does that
seem safe enough to you?

jgraham · 2018-08-24T10:19:09Z

+              # Bash removes null bytes from string values when set as
+              # environment variables. This invalidates the output of `wpt
+              # affected-tests` because it uses the null byte as a separator
+              # between test names. The list of effected tests is


That should be affected. But this complexity seems like a clue that a better solution is required.

jgraham · 2018-08-24T10:20:04Z

+# to be the name of the browser under test. This restricts the syntax available
+# to consumers: value-accepting options must be specified using the equals sign
+# (`=`).
+for argument in $@; do


I feel like this exceeded my complexity threshold for a bash script. If we want a single entry point for all the things, let's make it a python script instead.
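To make the suggestion concrete, here is a hedged sketch of how a Python entry point could separate the browser name from the remaining arguments with `argparse`, rather than scanning `$@` by hand. The `--commit-range` option and the overall interface are assumptions for illustration, not the interface of the script that eventually landed.

```python
import argparse


def parse_args(argv):
    """Return (known, passthrough) for a hypothetical CI entry point."""
    parser = argparse.ArgumentParser(
        description="Run the tests affected by a range of commits"
    )
    # Hypothetical option naming the commits whose tests should be run.
    parser.add_argument("--commit-range", default=None)
    # The browser under test is the only positional argument.
    parser.add_argument("browser")
    # Unrecognised options are returned separately so that they can be
    # forwarded to `wpt run` unchanged.
    return parser.parse_known_args(argv)
```

For example, `parse_args(["--commit-range", "master...", "firefox", "--verify"])` yields `firefox` as the browser and leaves `--verify` in the pass-through list. One caveat: an unrecognised option that takes a separate value would still need the `=` form, because `argparse` would otherwise read that value as the positional argument.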

gsnedders · 2018-08-24T15:42:08Z

From #10503:

In TravisCI today, we script unnecessary jobs to exit early. This avoids unnecessary work, but it also makes task reports a little noisy (since they include meaningless results for those no-op jobs).

A nice-to-have improvement would be to use TaskCluster's "decision tasks" feature to avoid scheduling unnecessary tasks.

At the same time, I'm unconvinced we should block doing this on having that working, because it's not a regression currently.

jugglinmike · 2018-08-29T02:25:00Z

@jgraham I've re-implemented ci_taskcluster.sh as taskcluster-run.py (the
ci_ prefix seemed superfluous given its location in a directory named ci/).
You were right to recommend this; the logic that wired the commands together
was definitely too complex to be readable in Bash.

This introduces more indirection, and I'm on the fence about whether it's
justified. The alternative is to thin out the wrapper script, and while that
makes the .taskcluster.yml file more literal, it also requires a fair amount
of repetition. I've implemented the more concise version with a separate commit
so you can more clearly see what I'm talking about. Do you have an opinion
about this?

jgraham · 2018-08-29T12:06:47Z

+import subprocess
+
+browser_specific_args = {
+    "firefox": ["--install-browser", "--reftest-internal"]


--reftest-internal is the default, I think.

I think you're right

tools/wptrunner/wptrunner/wptcommandline.py:    if kwargs["reftest_internal"] is None:
tools/wptrunner/wptrunner/wptcommandline.py-        # Default to the internal reftest implementation on Linux and OSX
tools/wptrunner/wptrunner/wptcommandline.py:        kwargs["reftest_internal"] = sys.platform.startswith("linux") or sys.platform.startswith("darwin")

I wanted to verify with the person who added this flag, and it turns out that was you:

Explicitly set Firefox to use the fast reftest runner.

This should happen by default on Linux, but it doesn't hurt to be explicit.

Have you had a change of heart? Or should we keep the flag for the sake of explicitness?

Let's remove the flag and aim for a situation where we can use the same flags for all browsers.

You got it.

jgraham · 2018-08-29T12:10:04Z

@@ -0,0 +1,101 @@
+#!/usr/bin/env python


So, I think I'm OK with landing this as-is, but I wonder what the effect would be of moving the logic in this file into wpt run directly? If one added --commit-range as an argument to that function then the only things that wouldn't directly fit would be getting the arguments right per-browser and gzipping artifacts. Those could perhaps be moved out into the task definitions.

jgraham · 2018-08-29T12:11:48Z

+}
+
+def tests_affected(commit_range):
+    output = subprocess.check_output([


You could, of course, just import and use the function directly rather than going via a process.

@jgraham wouldn't that require hacking sys.path? If so, I'm in favour of forking.

Yes. I think it makes more long term sense to move all of this into wpt run and not have another wrapper script at all. So I don't object to deferring that change here.

Hexcles

The implementation LGTM. I didn't look into the Travis & TC failures though. Could you take a look?

Hexcles · 2018-08-29T15:37:39Z

+}
+
+def tests_affected(commit_range):
+    output = subprocess.check_output([


@jgraham wouldn't that require hacking sys.path? If so, I'm in favour of forking.

jugglinmike · 2018-08-29T18:56:38Z

@jgraham Instead of installing Firefox via wpt run, we could install it via wpt install firefox browser from within start.sh. Admittedly, that script is a little more difficult to maintain since it's built into the Docker image. However, refactoring like this will make the script's behavior more consistent, and (depending on how you feel about the technically-unnecessary argument), it could allow us to remove all browser-specific run arguments.

jugglinmike · 2018-08-29T19:06:31Z

Yes. I think it makes more long term sense to move all of this into wpt run and not have another wrapper script at all. So I don't object to deferring that change here.

This aspect of the wrapper is motivated by my naive attempt to print the affected tests to standard out. Without that, it would be a simple matter of piping wpt tests-affected into wpt run. Then again, after reviewing the xargs docs, it looks like the --verbose flag would get us the functionality without the need for all the encoding nonsense:

$ ./wpt tests-affected --null master... | xargs --verbose --null ./wpt run firefox
INFO:manifest:Updating manifest
./wpt run firefox dom/events/CustomEvent.html 
Using certutil /usr/bin/certutil
(etc.)

That serves the same purpose as the logging statement in this patch:

logger.info("Executing command: %s" % " ".join(command))

Considering that we may not need browser-specific arguments, either, we may be able to compose these tasks in the .taskcluster.yml file, after all. Look ma, no wrapper. I'd be happy to keep iterating in this direction, but since we're also considering the long-term, it would be helpful for me to understand your design sensibilities.

Do either of you see value in maintaining separation between the various sub-commands? I don't mean to push the UNIX philosophy just for the sake of it, and I know that technically speaking, these are all implemented in the same application. That said, as someone who's been collecting results for a while, I appreciate the ability to compose tasks without patching WPT. I also find it easier to find features that are exposed in their own terms rather than as arguments to the increasingly-large wpt run subcommand. This could also allow us to be more strict in the contract we offer to consumers--if we were able to formally document a stable Python API (or remove it altogether), then I think the contribution experience would be even better.

jgraham · 2018-08-30T10:26:08Z

@jgraham Instead of installing Firefox via wpt run, we could install it via wpt install firefox browser from within start.sh. Admittedly, that script is a little more difficult to maintain since it's built into the Docker image. However, refactoring like this will make the script's behavior more consistent, and (depending on how you feel about the technically-unnecessary argument), it could allow us to remove all browser-specific run arguments.

I was thinking about the opposite: making wpt install chrome browser (and hence wpt run --install-browser chrome) actually work in the case that you're on Ubuntu/Debian Linux and can get root (and maybe adding more configurations in the future). Ideally I'd like to have less in the start.sh script because updating the docker image is such a pain.

Do either of you see value in maintaining separation between the various sub-commands? I don't mean to push the UNIX philosophy just for the sake of it, and I know that technically speaking, these are all implemented in the same application. That said, as someone who's been collecting results for a while, I appreciate the ability to compose tasks without patching WPT. I also find it easier to find features that are exposed in their own terms rather than as arguments to the increasingly-large wpt run subcommand. This could also allow us to be more strict in the contract we offer to consumers--if we were able to formally document a stable Python API (or remove it altogether), then I think the contribution experience would be even better.

So, I think there are just a series of tradeoffs here and no clearly obvious right answer. Having a series of independent subcommands is great because, as you say, we can script different things together using common utilities, and we get a cleaner separation between concerns for testing. But on the other hand it has real and serious disadvantages:

- It makes it harder to reproduce results locally. Having a single command that can just be copied and pasted locally is a big advantage for reproducing results, especially compared to something that requires CI-specific setup (gecko CI has an egregious case of this where the way things run on CI is via a wrapper application that's very difficult to get running on a local machine, and it's awful. Nothing we are doing is that bad, but it's a concern).
- It's not very portable. Bash scripts aren't going to work on Windows unless people are using WSL, so it becomes particularly hard for Windows users to figure out how to reproduce what they saw in CI.
- We end up with a collection of single-purpose shell scripts that themselves represent significant complexity; although I like the idea of lots of small composable utilities (and indeed that's how the wpt cli is structured in large part), it is often overlooked that the composition itself is a program that has to be maintained.

So I don't have a single answer at the moment, but my inclination, such as it is, is to keep building composable utilities for flexibility and experimentation, but to make the most common patterns baked-in features (particularly of wpt run, which is largely a wrapper around other functionality anyway) so that they are portable and easy to run locally.

The simplest solution would be to perform a full repository fetch for every pull request. I don't know what effect this would have on build times, though maybe we could use TaskCluster's caching feature to optimize that.

So, the current start.sh script is supposed to perform a shallow clone and then if we don't have the commit we are looking for extend it to a full clone. But I think that for PRs we probably want to modify the script to pull from the refs/pulls namespace directly rather than first pulling in some of master and then extending it with the relevant PR. We can get the number of commits in the PR from the GH API so can figure out how much we need to pull to get the whole PR plus the merge base.
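The depth calculation described here can be sketched as follows. This is an illustration under stated assumptions rather than the logic of start.sh: it assumes the pull request branch is linear (so the merge base is the parent of the oldest pull request commit), that the commit count comes from the GitHub API, and that `pull_request_fetch_command` is a hypothetical helper name.

```python
def pull_request_fetch_command(pr_number, commit_count, remote="origin"):
    """Build a shallow `git fetch` covering a pull request and its base."""
    # One commit more than the pull request contains, so that the parent of
    # the oldest pull request commit (the assumed merge base) is included.
    depth = commit_count + 1
    return [
        "git", "fetch", "--depth", str(depth),
        remote, "refs/pull/%d/head" % pr_number,
    ]
```

If the branch itself contains merges from master, the fixed depth may not reach the merge base, so a fallback to a deeper fetch would still be needed.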

jugglinmike · 2018-08-31T02:29:17Z

So I don't have a single answer at the moment, but my inclination, such as it
is, is to keep building composable utilities for flexibility and
experimentation, but to make the most common patterns baked-in features
(particularly of wpt run, which is largely a wrapper around other
functionality anyway) so that they are portable and easy to run locally.

Sounds good to me. Thanks for taking the time to write all of that up!

We can get the number of commits in the PR from the GH API so can figure out
how much we need to pull to get the whole PR plus the merge base.

I've pushed up a commit to do this. It necessitated a change to start.sh
because we can't use git clone to initialize a repository from an arbitrary
reference such as refs/pull/1/head (the command only supports branch names
and tag names). I renamed the variable REV to REVISION in order to avoid
confusion with the new REF variable.

That said, the strategy may be incomplete. If we test from the tip of the pull
request branch, then we will not be including any commits that may have landed
in master since the pull request was opened. That means the result reported
by TaskCluster might not reflect what we'd see after the patch was merged.

Contributors could get a more accurate picture by rebasing their patch, but I
don't know if many people would expect this. We could perform a merge in
TaskCluster (or maybe use the merge_commit_sha from GitHub). Would it be an
issue that we couldn't validate unmergeable pull requests?

That idea highlights an underlying problem: we will only run these tasks in
response to a pull request event. Even with a "test merge" strategy, the
results will fall out of date if master advances and no new commits are
pushed. So staleness may be unavoidable to some extent.

These are all questions that I'm sure TravisCI et al. have worked out already,
but at the moment, I can't bring to mind any evidence of how they operate. I'll
look into that tomorrow, but I thought I'd leave some rambling here in case
@jgraham (or anyone else) feels like this issue is worth discussing.

jugglinmike · 2018-08-31T20:56:08Z

I think wpt run will do this by default now, if possible.

wpt run does attempt to download a manifest file by default, but wpt tests-affected (which precedes run in this script) does not. That seems appropriate for a lower-level utility like wpt tests-affected, so I've persisted the invocation of wpt manifest-download.

Regarding revision selection: a quick review of some TravisCI logs showed that they use a GitHub-provided reference I was not aware of: refs/pull/{pull request number}/merge. Since TravisCI has precedent in this project and on GitHub at large, I've updated the TaskCluster configuration to mimic this behavior. It means that the results will tend to reflect how the patch will behave after being merged to master, regardless of whether the author has rebased. There are a few other benefits, too. The diffing logic is simplified since we can just compare the first parent. Also, because the git repository now includes a commit present in master, we are downloading the generated manifest (rather than creating it from scratch).
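To illustrate "compare the first parent": in the test merge commit that GitHub publishes, the first parent is expected to be the tip of the base branch, so the changes introduced by the pull request are the difference between that parent and the merge commit. The helper names below are hypothetical, and the exact range passed to wpt tests-affected in this patch may differ.

```python
def merge_ref(pr_number):
    """Return the GitHub-provided test-merge reference for a pull request."""
    return "refs/pull/%d/merge" % pr_number


def first_parent_range(commit="HEAD"):
    """Return a git range from the first parent of commit to commit itself."""
    # `^1` selects the first parent of a merge commit.
    return "%s^1..%s" % (commit, commit)
```

With the merge commit checked out, a range such as `HEAD^1..HEAD` could then be handed to a diffing command to list the files the pull request touches.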

I'm removing the "do not merge yet" label; this should be ready for another review. @jgraham and/or @Hexcles: would you mind?

jugglinmike · 2018-09-04T02:11:13Z

@jgraham @Hexcles In master today, TravisCI is configured to verify stability using the command ./wpt check-stability. This patch configures TaskCluster to verify stability using the command wpt run --verify. I chose this command because check-stability performs extra git/GitHub operations and because it includes TravisCI-specific code. I assumed the validation heuristics of the two commands were equivalent, but while researching an issue filed against the results-collection project, I learned that they differ.

I walked through an example in that issue report, but to summarize:

- ./wpt check-stability interleaves the tests
- ./wpt run --verify executes all iterations of each test separately

Is this difference intentional? The former allowed an unstable test to pass through undetected, but that doesn't necessarily mean the latter is technically superior.
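The two orderings can be pictured with the simplified sketch below. It models only the order in which tests execute; the real commands do considerably more, and the function names are illustrative.

```python
def interleaved_order(tests, iterations):
    """Run the whole list once per iteration: A B C, A B C, ..."""
    return [test for _ in range(iterations) for test in tests]


def per_test_order(tests, iterations):
    """Run every iteration of one test before moving on: A A, B B, C C."""
    return [test for test in tests for _ in range(iterations)]
```

For two tests and two iterations, the first function gives `a, b, a, b` and the second gives `a, a, b, b`. A test whose result depends on state left behind by a neighbouring test can therefore behave differently under the two schemes.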

jgraham · 2018-09-04T13:47:58Z

The difference is semi-intentional in that the --verify behaviour matches what other suites do on mozilla-central, where the approach to test verification is to verify every test one at a time until some time limit is reached. This is an alternative approach to the problem of how to handle time limits in the case that many tests need to be validated. I guess it also means that if one test is leaking state the others aren't affected (although that's bad in itself; doing this really really well would require running whole directories, which has a significant time penalty).

Anyway, my general feeling is that the difference is OK and we should try out this patch as is without trying to change the semantics.

jgraham · 2018-09-04T13:50:20Z

+import subprocess
+
+browser_specific_args = {
+    "firefox": ["--install-browser", "--reftest-internal"]


Let's remove the flag and aim for a situation where we can use the same flags for all browsers.

`--reftest-internal` is enabled by default in GNU/Linux environments.

jugglinmike · 2018-09-06T00:42:07Z

I've included another commit to avoid a runtime exception for pull requests that have zero affected tests. We can validate that prior to merging when we remove the intentional instability.
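One way to guard against an empty test list is sketched below. It is an assumption about the shape of the fix rather than the code in the commit, and it reuses the logging statement quoted earlier in this thread; `build_command` and `run_affected` are hypothetical names.

```python
import logging
import subprocess

logger = logging.getLogger("taskcluster-run")


def build_command(browser, tests, extra_args=()):
    """Assemble a `wpt run` invocation for the given browser and tests."""
    return ["./wpt", "run"] + list(extra_args) + [browser] + list(tests)


def run_affected(browser, tests, extra_args=()):
    """Run the affected tests, or return early when there are none."""
    if not tests:
        # Nothing to execute; report success without invoking wpt at all.
        logger.info("No tests affected; skipping test execution")
        return 0
    command = build_command(browser, tests, extra_args)
    logger.info("Executing command: %s" % " ".join(command))
    return subprocess.call(command)
```

Returning early keeps the task green for patches that touch no tests, instead of invoking the runner with an empty list.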

jugglinmike · 2018-09-07T01:39:32Z

@jgraham I've resolved the conflicts introduced by gh-12679 and triggered both types of jobs:

github-push (via https://github.com/bocoup/wpt/commits/master): https://tools.taskcluster.net/task-group-inspector/#/K3hP_1FMSC-qyjJ0FQTtTg
github-pull-request (via this pull request): https://tools.taskcluster.net/task-group-inspector/#/UF5mqKDYRYWT5q_TtWkaXQ

Could you take another look?

jugglinmike · 2018-09-11T00:17:15Z

gh-12878 was recently merged to master. That introduced a conflict in this branch: it modified tools/ci/ci_taskcluster.sh which this patch removes in favor of a new Python script, taskcluster-run.py. The new script includes the semantics that gh-12878 introduced in the old script, so no change was necessary to resolve the conflict.

Hexcles

LGTM

Hexcles · 2018-09-12T15:51:34Z

+              owner: ${event.pull_request.user.login}@users.noreply.github.com
+              source: ${event.repository.url}
+            payload:
+              image: jugglinmike/web-platform-tests:0.21


A somewhat unrelated question: how can we create a public/shareable DockerHub account?

I don't know, but I think we all want that

jugglinmike requested review from Hexcles, gsnedders and jgraham August 23, 2018 18:08

wpt-pr-bot added ci infra docker dom labels Aug 23, 2018

wpt-pr-bot requested review from annevk, jdm and zqzhang August 23, 2018 22:13

jugglinmike added 2 commits August 23, 2018 20:33

[infra] Validate pull requests in TaskCluster

541acf3

[infra] Temporarily ignore stability status in CI

c3f109b

jugglinmike force-pushed the taskcluster-stability-2 branch from 70c22b2 to c3f109b Compare August 24, 2018 00:37

jugglinmike added 2 commits August 23, 2018 20:39

Demonstrate task with zero affected tests

a1f01d0

DO NOT MERGE Introduce instability

ee8d57d

jgraham requested changes Aug 24, 2018

jugglinmike force-pushed the taskcluster-stability-2 branch from 82e452d to ad8f76b Compare August 29, 2018 01:57

jugglinmike added 2 commits August 28, 2018 22:03

Re-implement script in Python

7808fb1

Reduce repetition

0d6c2a2

jugglinmike force-pushed the taskcluster-stability-2 branch from 569c1b6 to 0d6c2a2 Compare August 29, 2018 02:03

Trigger TaskCluster

c1792ee

jgraham approved these changes Aug 29, 2018

Hexcles approved these changes Aug 29, 2018

jugglinmike added 2 commits August 29, 2018 15:11

Trigger CI

f422774

Correct commit range

150058e

jugglinmike added the do not merge yet label Aug 30, 2018

Incorporate review feedback

30b4ae4

jugglinmike added 4 commits August 31, 2018 16:04

Test at GitHub-provided merge commit

3371a4c

This improves the authenticity of the reported results because it simulates how the patch will behave after it is merged. This also mimics the behavior of the TravisCI continuous integration platform.

Simplify reference

cbc779c

The `start.sh` script now supports all git references, so this computation is no longer necessary.

Remove superflous command

887512c

Revert "Remove superflous command"

d9a12c3

This reverts commit 887512c.

jugglinmike removed the do not merge yet label Aug 31, 2018

jugglinmike mentioned this pull request Sep 4, 2018

cookies/prefix/__secure.header.https.html results seem off. web-platform-tests/results-collection#599

Closed

jgraham reviewed Sep 4, 2018

jugglinmike added 2 commits September 5, 2018 20:24

Remove superfluous argument

80714c9

`--reftest-internal` is enabled by default in GNU/Linux environments.

Accommodate jobs with zero affected tests

c4ea324

Merge branch 'master' into taskcluster-stability-2

dba6e62

jugglinmike force-pushed the taskcluster-stability-2 branch from 849fece to dba6e62 Compare September 6, 2018 16:55

jgraham mentioned this pull request Sep 6, 2018

Fix #12876: set -ex on ci_taskcluster.sh #12878

Merged

Merge branch 'master' into taskcluster-stability-2

5c46835

jgraham mentioned this pull request Sep 12, 2018

Use --enable-experimental-web-platform-features for chrome-dev on Taskcluster #12908

Closed

jgraham approved these changes Sep 12, 2018

Hexcles approved these changes Sep 12, 2018

Revert "DO NOT MERGE Introduce instability"

e79a708

This reverts commit ee8d57d.

jgraham merged commit 1ede22e into web-platform-tests:master Sep 12, 2018

jugglinmike mentioned this pull request Sep 24, 2018

Verify correctness of Taskcluster PR checks #13194

Closed

jugglinmike mentioned this pull request Oct 7, 2018

Unify ./wpt check-stability and ./wpt run --verify #13406

Open
