Validate pull requests in TaskCluster #12657
70c22b2 to c3f109b
I had to iterate on this a bit after changing the upstream repository, but this patch is functioning as expected. I've rebased and pushed up a couple of commits to demonstrate how the CI handles patches with no affected tests and how it handles patches with one or more affected tests (and unstable ones, at that). Those final two commits should not be merged to |
jgraham left a comment
Very very excited to see this land. But I think the implementation got a little hard to maintain by trying to force everything to go via a shell script.
owner: ${event.pusher.email}
source: ${event.repository.url}
payload:
  image: jugglinmike/web-platform-tests:0.18
We really need to set up a shared dockerhub account :)
- name: wpt-${browser.name}-${browser.channel}-stability
  description: >-
    Verify that all tests affected by a pull request are stable
    when executed in ${browser.name}. As of 2018-08-23, this task
Remove the extra text from the description; I don't think it's that helpful (we're more likely to forget to remove it later).
I'd like there to be some contributor-facing indication of this behavior.
I appreciate the concern about bitrot, though, so I've moved it to a statement
logged to standard out immediately following the command invocation. Does that
seem safe enough to you?
# Bash removes null bytes from string values when set as
# environment variables. This invalidates the output of `wpt
# affected-tests` because it uses the null byte as a separator
# between test names. The list of effected tests is
That should be "affected". But this complexity seems like a clue that a better solution is required.
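One way to avoid the shell's handling of null bytes is to consume the separator-delimited output as bytes in Python. The following is only a sketch of that idea, not the code from this patch; the function name `split_null_separated` and the sample test paths are hypothetical.

```python
def split_null_separated(output):
    """Split null-delimited bytes into a list of text items."""
    items = output.split(b"\0")
    # A trailing separator produces one empty item at the end; drop it.
    if items and not items[-1]:
        items.pop()
    return [item.decode("utf-8") for item in items]


# Hypothetical sample resembling a null-delimited list of test paths.
sample = b"dom/a.html\0css/b c.html\0"
print(split_null_separated(sample))
```

Because the bytes never pass through a shell variable, names containing spaces or newlines survive intact.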
# to be the name of the browser under test. This restricts the syntax available
# to consumers: value-accepting options must be specified using the equals sign
# (`=`).
for argument in $@; do
I feel like this exceeded my complexity threshold for a bash script. If we want a single entry point for all the things, let's make it a python script instead.
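As a rough illustration of what a Python entry point could look like, the sketch below uses `argparse` to pick out the browser name and pass unrecognized options through untouched. It is not the script from this patch: the function name, the `firefox`/`chrome` choices, and the declared `--channel` option are assumptions made for the example.

```python
import argparse


def parse_arguments(argv):
    """Return (browser, channel, passthrough) for a hypothetical CI entry point."""
    parser = argparse.ArgumentParser(description="Hypothetical CI entry point")
    parser.add_argument("browser", choices=["firefox", "chrome"])
    # Declaring a value-accepting option lets argparse consume its value,
    # so callers may write `--channel nightly` as well as `--channel=nightly`.
    parser.add_argument("--channel")
    # parse_known_args() returns unrecognized arguments instead of failing.
    args, passthrough = parser.parse_known_args(argv)
    return args.browser, args.channel, passthrough


print(parse_arguments(["--channel", "nightly", "firefox", "--log-tbpl=-"]))
```

Options that are not declared on the parser still need the equals-sign form to be passed through unambiguously, since the parser cannot know whether they accept a value.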
From #10503:
At the same time, I'm unconvinced we should block doing this on having that working, because it's not a regression currently.
82e452d to ad8f76b
569c1b6 to 0d6c2a2
@jgraham I've re-implemented This introduces more indirection, and I'm on the fence about whether it's |
import subprocess


browser_specific_args = {
    "firefox": ["--install-browser", "--reftest-internal"]
--reftest-internal is the default, I think.
I think you're right
tools/wptrunner/wptrunner/wptcommandline.py: if kwargs["reftest_internal"] is None:
tools/wptrunner/wptrunner/wptcommandline.py- # Default to the internal reftest implementation on Linux and OSX
tools/wptrunner/wptrunner/wptcommandline.py: kwargs["reftest_internal"] = sys.platform.startswith("linux") or sys.platform.startswith("darwin")
I wanted to verify with the person who added this flag, and it turns out that was you:
Explicitly set Firefox to use the fast reftest runner.
This should happen by default on Linux, but it doesn't hurt to be explicit.
Have you had a change of heart? Or should we keep the flag for the sake of explicitness?
Let's remove the flag and aim for a situation where we can use the same flags for all browsers.
@@ -0,0 +1,101 @@
#!/usr/bin/env python
So, I think I'm OK with landing this as-is, but I wonder what the effect would be of moving the logic in this file into wpt run directly? If one added --commit-range as an argument to that function then the only things that wouldn't directly fit would be getting the arguments right per-browser and gzipping artifacts. Those could perhaps be moved out into the task definitions.
}


def tests_affected(commit_range):
    output = subprocess.check_output([
You could, of course, just import and use the function directly rather than going via a process.
@jgraham wouldn't that require hacking sys.path? If so, I'm in favour of forking.
Yes. I think it makes more long term sense to move all of this into wpt run and not have another wrapper script at all. So I don't object to deferring that change here.
Hexcles left a comment
The implementation LGTM. I didn't look into the Travis & TC failures though. Could you take a look?
@jgraham Instead of installing Firefox via |
This aspect of the wrapper is motivated by my naive attempt to print the affected tests to standard out. Without that, it would be a simple matter of piping That serves the same purpose as the logging statement in this patch: Considering that we may not need browser-specific arguments, either, we may be able to compose these tasks in the Do either of you see value in maintaining separation between the various sub-commands? I don't mean to push the UNIX philosophy just for the sake of it, and I know that technically speaking, these are all implemented in the same application. That said, as someone who's been collecting results for a while, I appreciate the ability to compose tasks without patching WPT. I also find it easier to find features that are exposed in their own terms rather than as arguments to the increasingly-large |
I was thinking about the opposite; making
So, I think there are just a series of tradeoffs here and no clearly obvious right answer. Having a series of independent subcommands is great because, as you say, we can script different things together using common utilities, and we get a cleaner separation between concerns for testing. But on the other hand it has real and serious disadvantages:
So I don't have a single answer at the moment, but my inclination, such as it is, is to keep building composable utilities for flexibility and experimentation, but to make the most common patterns baked-in features (particularly of
So, the current |
Sounds good to me. Thanks for taking the time to write all of that up!
I've pushed up a commit to do this. It necessitated a change to That said, the strategy may be incomplete. If we test from the tip of the pull Contributors could get a more accurate picture by rebasing their patch, but I That idea highlights an underlying problem: we will only run these tasks in These are all questions that I'm sure TravisCI et al. have worked out already,
This improves the authenticity of the reported results because it simulates how the patch will behave after it is merged. This also mimics the behavior of the TravisCI continuous integration platform.
The `start.sh` script now supports all git references, so this computation is no longer necessary.
This reverts commit 887512c.
Regarding revision selection: a quick review of some TravisCI logs showed that they use a GitHub-provided reference I was not aware of: I'm removing the "do not merge yet label"; this should be ready for another review. @jgraham and/or @Hexcles: would you mind? |
@jgraham @Hexcles In I walked through an example in that issue report, but to summarize:
Is this difference intentional? The former allowed an unstable test to pass through undetected, but that doesn't necessarily mean the latter is technically superior. |
The difference is semi-intentional in that the Anyway, my general feeling is that the difference is OK and we should try out this patch as is without trying to change the semantics. |
`--reftest-internal` is enabled by default in GNU/Linux environments.
I've included another commit to avoid a runtime exception for pull requests that have zero affected tests. We can validate that prior to merging when we remove the intentional instability. |
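A guard for the zero-tests case could look roughly like the sketch below. It does not reproduce the commit mentioned above, and the cause of the original exception is not assumed; the function name `build_run_command` and the `./wpt run` command shape are illustrative assumptions.

```python
def build_run_command(browser, tests):
    """Return the command to execute, or None when there is nothing to run."""
    if not tests:
        # No affected tests: signal the caller to skip the run entirely.
        return None
    # Assumed command shape: product name followed by the test paths.
    return ["./wpt", "run", browser] + list(tests)


for affected in ([], ["dom/a.html", "css/b.html"]):
    command = build_run_command("firefox", affected)
    if command is None:
        print("No affected tests; skipping")
    else:
        print("Would run:", " ".join(command))
```

Returning `None` keeps the decision to skip in one place, so the caller never launches the runner with an empty test list.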
849fece to dba6e62
@jgraham I've resolved the conflicts introduced by gh-12679 and triggered both types of jobs:
Could you take another look? |
gh-12878 was recently merged to |
owner: ${event.pull_request.user.login}@users.noreply.github.com
source: ${event.repository.url}
payload:
  image: jugglinmike/web-platform-tests:0.21
A somewhat unrelated question: how can we create a public/shareable DockerHub account?
I don't know, but I think we all want that
This reverts commit ee8d57d.
Extend the configuration for the TaskCluster service to generate tasks in
response to GitHub pull requests. Each pull request should create one cluster
containing four tasks, all of which operate on the affected tests as identified
by `wpt tests-affected`:

The former two tasks are intended for future use in comparing test results with
data available on https://wpt.fyi. This will be informative only. The latter
two tasks are intended to verify that the affected tests are stable. This will
influence whether or not the pull request may be merged.
A demonstration of expected functionality is available on Bocoup's fork of WPT:
- `master` branch (this functionality was implemented previously, and this pull request is not intended to influence the behavior we see today)
- `master` branch of WPT fork

It seems wise to vet this in WPT for a bit before allowing the results to
control whether pull requests may be merged. The configuration proposed here
will permit unstable pull requests, but it stores the status of the check in a
dedicated artifact. After some time, we can use this to compare the behavior
between the current TravisCI-powered checks and this new one. If they align, we
can revert the second commit.
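The "record but do not fail" behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the configuration in this pull request: the function name `run_and_record`, the JSON layout, and the artifact file name are hypothetical.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_and_record(command, artifact_path):
    """Run `command`, store its exit status in an artifact, and report success."""
    status = subprocess.call(command)
    with open(artifact_path, "w") as handle:
        json.dump({"command": command, "exit_status": status}, handle)
    # Always succeed so that an unstable result does not block the pull request.
    return 0


# Demonstration with a stand-in command that exits with status 3.
artifact = os.path.join(tempfile.mkdtemp(), "stability-status.json")
exit_code = run_and_record([sys.executable, "-c", "import sys; sys.exit(3)"], artifact)
with open(artifact) as handle:
    recorded = json.load(handle)
print(exit_code, recorded["exit_status"])
```

Once the recorded statuses have been compared with the existing checks, returning `status` instead of `0` would make the task enforce stability.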
Due to some changes to indentation, using `-w` or the `?w=1` query string parameter may make this change set easier to review.
[fixes #10503]