[LV] Don't skip VPlan cost model for div/rem instructions#187056
[LV] Don't skip VPlan cost model for div/rem instructions#187056
Conversation
|
@llvm/pr-subscribers-llvm-transforms Author: David Sherwood (david-arm) ChangesIn LoopVectorizationPlanner::precomputeCosts we are skipping calculation of costs using the VPlan cost model, instead preferring to use the legacy costs. This helps to prevent the legacy and vplan cost model assert firing, but really we should be encouraging full use of the VPlan cost model. I've created this initial PR to stop skipping the computation costs for udiv/urem/sdiv/srem instructions. The VPlan costs seem to match up nicely. I intend to follow up with more PRs to move more opcodes across. Full diff: https://github.com/llvm/llvm-project/pull/187056.diff 3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index ac9b790c739bf..59b039a75eec2 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7009,8 +7009,21 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
});
Cost += ForcedCost;
}
+
+ auto UseVPlanCostModel = [](Instruction *I) -> bool {
+ switch (I->getOpcode()) {
+ case Instruction::SDiv:
+ case Instruction::UDiv:
+ case Instruction::SRem:
+ case Instruction::URem:
+ return true;
+ default:
+ return false;
+ }
+ };
for (const auto &[Scalarized, ScalarCost] : CM.InstsToScalarize[VF]) {
- if (CostCtx.skipCostComputation(Scalarized, VF.isVector()))
+ if (UseVPlanCostModel(Scalarized) ||
+ CostCtx.skipCostComputation(Scalarized, VF.isVector()))
continue;
CostCtx.SkipCostComputation.insert(Scalarized);
LLVM_DEBUG({
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll b/llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
index 1f3949172b758..983c1b9c2b902 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
@@ -13,7 +13,7 @@ target triple = "aarch64--linux-gnu"
; %var4 a lower scalarization overhead.
;
; COST-LABEL: predicated_udiv_scalarized_operand
-; COST: Cost of 5 for VF 2: profitable to scalarize %var4 = udiv i64 %var2, %var3
+; COST: Cost of 5 for VF 2: REPLICATE ir<%var4> = udiv ir<%var2>, ir<%var3> (S->V)
;
;
define i64 @predicated_udiv_scalarized_operand(ptr %a, i64 %x) {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
index d84a6e27e5473..944632a796bdb 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
@@ -19,7 +19,7 @@ target triple = "aarch64--linux-gnu"
; (udiv(2) + extractelement(8) + insertelement(4)) / 2 = 7
;
; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
-; CHECK: Cost of 7 for VF 2: profitable to scalarize %tmp4 = udiv i32 %tmp2, %tmp3
+; CHECK: Cost of 7 for VF 2: REPLICATE ir<%tmp4> = udiv ir<%tmp2>, ir<%tmp3> (S->V)
;
define i32 @predicated_udiv(ptr %a, ptr %b, i1 %c, i64 %n) {
entry:
@@ -135,8 +135,8 @@ for.end:
;
; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x
; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
-; CHECK: Cost of 5 for VF 2: profitable to scalarize %tmp4 = udiv i32 %tmp2, %tmp3
; CHECK: Cost of 3 for VF 2: profitable to scalarize %tmp3 = add nsw i32 %tmp2, %x
+; CHECK: Cost of 5 for VF 2: REPLICATE ir<%tmp4> = udiv ir<%tmp2>, ir<%tmp3> (S->V)
;
define i32 @predicated_udiv_scalarized_operand(ptr %a, i1 %c, i32 %x, i64 %n) {
@@ -233,11 +233,11 @@ for.end:
; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2
; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x
; CHECK: Scalarizing and predicating: store i32 %tmp5, ptr %tmp0, align 4
-; CHECK: Cost of 7 for VF 2: profitable to scalarize %tmp3 = sdiv i32 %tmp1, %tmp2
-; CHECK: Cost of 7 for VF 2: profitable to scalarize %tmp4 = udiv i32 %tmp3, %tmp2
; CHECK: Cost of 2 for VF 2: profitable to scalarize store i32 %tmp5, ptr %tmp0, align 4
; CHECK: Cost of 3 for VF 2: profitable to scalarize %tmp5 = sub i32 %tmp4, %x
; CHECK: Cost of 1 for VF 2: WIDEN ir<%tmp2> = add ir<%tmp1>, ir<%x>
+; CHECK: Cost of 7 for VF 2: REPLICATE ir<%tmp3> = sdiv ir<%tmp1>, ir<%tmp2>
+; CHECK: Cost of 5 for VF 2: REPLICATE ir<%tmp4> = udiv ir<%tmp3>, ir<%tmp2>
;
define void @predication_multi_context(ptr %a, i1 %c, i32 %x, i64 %n) {
entry:
|
|
@llvm/pr-subscribers-vectorizers Author: David Sherwood (david-arm) ChangesIn LoopVectorizationPlanner::precomputeCosts we are skipping calculation of costs using the VPlan cost model, instead preferring to use the legacy costs. This helps to prevent the legacy and vplan cost model assert firing, but really we should be encouraging full use of the VPlan cost model. I've created this initial PR to stop skipping the computation costs for udiv/urem/sdiv/srem instructions. The VPlan costs seem to match up nicely. I intend to follow up with more PRs to move more opcodes across. Full diff: https://github.com/llvm/llvm-project/pull/187056.diff 3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index ac9b790c739bf..59b039a75eec2 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7009,8 +7009,21 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
});
Cost += ForcedCost;
}
+
+ auto UseVPlanCostModel = [](Instruction *I) -> bool {
+ switch (I->getOpcode()) {
+ case Instruction::SDiv:
+ case Instruction::UDiv:
+ case Instruction::SRem:
+ case Instruction::URem:
+ return true;
+ default:
+ return false;
+ }
+ };
for (const auto &[Scalarized, ScalarCost] : CM.InstsToScalarize[VF]) {
- if (CostCtx.skipCostComputation(Scalarized, VF.isVector()))
+ if (UseVPlanCostModel(Scalarized) ||
+ CostCtx.skipCostComputation(Scalarized, VF.isVector()))
continue;
CostCtx.SkipCostComputation.insert(Scalarized);
LLVM_DEBUG({
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll b/llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
index 1f3949172b758..983c1b9c2b902 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
@@ -13,7 +13,7 @@ target triple = "aarch64--linux-gnu"
; %var4 a lower scalarization overhead.
;
; COST-LABEL: predicated_udiv_scalarized_operand
-; COST: Cost of 5 for VF 2: profitable to scalarize %var4 = udiv i64 %var2, %var3
+; COST: Cost of 5 for VF 2: REPLICATE ir<%var4> = udiv ir<%var2>, ir<%var3> (S->V)
;
;
define i64 @predicated_udiv_scalarized_operand(ptr %a, i64 %x) {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
index d84a6e27e5473..944632a796bdb 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
@@ -19,7 +19,7 @@ target triple = "aarch64--linux-gnu"
; (udiv(2) + extractelement(8) + insertelement(4)) / 2 = 7
;
; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
-; CHECK: Cost of 7 for VF 2: profitable to scalarize %tmp4 = udiv i32 %tmp2, %tmp3
+; CHECK: Cost of 7 for VF 2: REPLICATE ir<%tmp4> = udiv ir<%tmp2>, ir<%tmp3> (S->V)
;
define i32 @predicated_udiv(ptr %a, ptr %b, i1 %c, i64 %n) {
entry:
@@ -135,8 +135,8 @@ for.end:
;
; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x
; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
-; CHECK: Cost of 5 for VF 2: profitable to scalarize %tmp4 = udiv i32 %tmp2, %tmp3
; CHECK: Cost of 3 for VF 2: profitable to scalarize %tmp3 = add nsw i32 %tmp2, %x
+; CHECK: Cost of 5 for VF 2: REPLICATE ir<%tmp4> = udiv ir<%tmp2>, ir<%tmp3> (S->V)
;
define i32 @predicated_udiv_scalarized_operand(ptr %a, i1 %c, i32 %x, i64 %n) {
@@ -233,11 +233,11 @@ for.end:
; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2
; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x
; CHECK: Scalarizing and predicating: store i32 %tmp5, ptr %tmp0, align 4
-; CHECK: Cost of 7 for VF 2: profitable to scalarize %tmp3 = sdiv i32 %tmp1, %tmp2
-; CHECK: Cost of 7 for VF 2: profitable to scalarize %tmp4 = udiv i32 %tmp3, %tmp2
; CHECK: Cost of 2 for VF 2: profitable to scalarize store i32 %tmp5, ptr %tmp0, align 4
; CHECK: Cost of 3 for VF 2: profitable to scalarize %tmp5 = sub i32 %tmp4, %x
; CHECK: Cost of 1 for VF 2: WIDEN ir<%tmp2> = add ir<%tmp1>, ir<%x>
+; CHECK: Cost of 7 for VF 2: REPLICATE ir<%tmp3> = sdiv ir<%tmp1>, ir<%tmp2>
+; CHECK: Cost of 5 for VF 2: REPLICATE ir<%tmp4> = udiv ir<%tmp3>, ir<%tmp2>
;
define void @predication_multi_context(ptr %a, i1 %c, i32 %x, i64 %n) {
entry:
|
There was a problem hiding this comment.
Pull request overview
This PR updates LoopVectorize’s cost precomputation logic to stop bypassing the VPlan cost model for scalarized div/rem instructions, aligning the debug-cost output and tests with VPlan-based costing for those opcodes.
Changes:
- Adjusted
LoopVectorizationPlanner::precomputeCoststo avoid precomputing legacy scalarization costs forsdiv/udiv/srem/urem, allowing VPlan recipe costing to account for them. - Updated AArch64 LoopVectorize debug-cost tests to expect
REPLICATE ...VPlan cost lines instead of legacy “profitable to scalarize” lines for affected div/rem instructions. - Refreshed corresponding FileCheck patterns in AArch64 predication cost tests.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
Changes cost precomputation to defer div/rem scalarized instruction costing to the VPlan model. |
llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll |
Updates cost-model assertions to match VPlan REPLICATE debug output for div/rem. |
llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll |
Updates expected cost-model debug output for udiv scalarization to VPlan REPLICATE form. |
You can also share your feedback on Copilot code review. Take the survey.
| for (const auto &[Scalarized, ScalarCost] : CM.InstsToScalarize[VF]) { | ||
| if (CostCtx.skipCostComputation(Scalarized, VF.isVector())) | ||
| if (UseVPlanCostModel(Scalarized) || | ||
| CostCtx.skipCostComputation(Scalarized, VF.isVector())) | ||
| continue; |
| ; CHECK: Cost of 7 for VF 2: REPLICATE ir<%tmp3> = sdiv ir<%tmp1>, ir<%tmp2> | ||
| ; CHECK: Cost of 5 for VF 2: REPLICATE ir<%tmp4> = udiv ir<%tmp3>, ir<%tmp2> |
llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
Outdated
Show resolved
Hide resolved
fhahn
left a comment
There was a problem hiding this comment.
I did run this on a set of workloads and it looks like there's some divergences at least on AAarch64. Will prepare a reproducer soon
OK thanks. I also independently found that the current VPlan cost model for udiv/sdiv is incorrect. I have another patch I intend to upstream that may fix the divergence you have found. It relates to uses of the udiv - if the only use is within the same block then it will be scalar and no inserts of the result into a vector are required. This is one area of divergence between the VP context getScalarizationOverhead and the scalar costs calculated in computePredInstDiscounts, which only calculates the cost of the result inserts if it's scalar with predication. |
Any update here @fhahn? I've improved the cost model to better reflect the VPlan model and the generated LLVM IR. I have more patches to follow in this area that are currently gated on this patch. |
| // Is this recipe only used by other recipes in the same block? If so, the | ||
| // result does not need scalarizing since it's only use will be scalar. |
There was a problem hiding this comment.
I am surprised this does not trigger the divergence assertion, as this doesn't match the legacy result as per the test changes?
There was a problem hiding this comment.
Yeah, although in this case I believe the vplan cost model to be more accurate. It would be a shame to deliberately over-cost simply to align with the incorrect legacy cost model.
There was a problem hiding this comment.
This change is also required for follow-on patches, such as using the vplan replicate cost model for more instructions such as Instruction::FAdd where the legacy cost model was producing lower costs than the vplan model.
There was a problem hiding this comment.
I am happy to provide the existing test case that shows this problem if that helps?
The legacy cost model computes and passes RHSInfo both when widening and replicating. Match behavior in VPlan-based cost model. The added test shows that we now compute the same cost as the legacy cost model. Without this change, the test added in llvm/test/Transforms/LoopVectorize/AArch64/predicated-costs.ll would crash with llvm#187056.
Sorry took a bit longer to get a good reproducer. It looks like a separate issue to insert/extract costs. I think with #188126, the version of the PR without the latest change didn't trigger any crashes with a set of different configs in my testing. I've not yet tried with the latest change, but it may be worth splitting it up? |
OK sure, happy to split up. Does that mean the original version of this PR essentially depends upon #188126 to avoid crashes? |
Yep, I expect that this may surface a few more issues by using the VPlan-based cost model more, but at least the issue in #188126 seems to be a genuine bug in the VPlan based cost model (+ one in the AArch64 cost model perhaps) |
In LoopVectorizationPlanner::precomputeCosts we are skipping calculation of costs using the VPlan cost model, instead preferring to use the legacy costs. This helps to prevent the legacy and vplan cost model assert firing, but really we should be encouraging full use of the VPlan cost model. I've created this initial PR to stop skipping the computation costs for udiv/urem/sdiv/srem instructions. The VPlan costs seem to match up nicely.
9e4a2e8 to
efe5082
Compare
|
Hi @fhahn, I've reverted the other cost model changes for now and will move them to a separate PR. |
The legacy cost model computes and passes RHSInfo both when widening and replicating. Match behavior in VPlan-based cost model. The added test shows that we now compute the same cost as the legacy cost model. Without this change, the test added in llvm/test/Transforms/LoopVectorize/AArch64/predicated-costs.ll would crash with #187056. PR: #188126
…e. (#188126) The legacy cost model computes and passes RHSInfo both when widening and replicating. Match behavior in VPlan-based cost model. The added test shows that we now compute the same cost as the legacy cost model. Without this change, the test added in llvm/test/Transforms/LoopVectorize/AArch64/predicated-costs.ll would crash with llvm/llvm-project#187056. PR: llvm/llvm-project#188126
| ; (sdiv(2) + extractelement(8) + insertelement(4)) / 2 = 7 | ||
| ; Cost of udiv: | ||
| ; (udiv(2) + extractelement(8) + insertelement(4)) / 2 = 7 | ||
| ; (udiv(2) + extractelement(4) + insertelement(4)) / 2 = 5 |
There was a problem hiding this comment.
Ah there's still a difference even with the other improvements removed. Do you know where this is coming from?
I'm running the patch on a large test-set now to see if it triggers any more issues, but if possible to avoid this divergence, that would probably be more reliable
There was a problem hiding this comment.
I think it's because of this code in VPCostContext::getScalarizationOverhead:
// Compute the cost of scalarizing the operands, skipping ones that do not
// require extraction/scalarization and do not incur any overhead.
SmallPtrSet<const VPValue *, 4> UniqueOperands;
SmallVector<Type *> Tys;
for (auto *Op : Operands) {
if (isa<VPIRValue>(Op) ||
(!AlwaysIncludeReplicatingR &&
isa<VPReplicateRecipe, VPPredInstPHIRecipe>(Op)) ||
(isa<VPReplicateRecipe>(Op) &&
cast<VPReplicateRecipe>(Op)->getOpcode() == Instruction::Load) ||
!UniqueOperands.insert(Op).second)
continue;
Tys.push_back(toVectorizedTy(Types.inferScalarType(Op), VF));
}
return ScalarizationCost +
TTI.getOperandsScalarizationOverhead(Tys, CostKind, VIC);
So if it's introducing a divergence between legacy and vplan cost model for udiv/urem then it's presumably also introducing the same divergence everywhere we call getScalarizationOverhead. Having said that, it feels like a backward step to deliberately regress the vplan cost model just to match the inaccurate legacy one.
I would prefer simply disabling the assert if there are replicating recipes in the vplan.
…8126) The legacy cost model computes and passes RHSInfo both when widening and replicating. Match behavior in VPlan-based cost model. The added test shows that we now compute the same cost as the legacy cost model. Without this change, the test added in llvm/test/Transforms/LoopVectorize/AArch64/predicated-costs.ll would crash with llvm#187056. PR: llvm#188126
…8126) The legacy cost model computes and passes RHSInfo both when widening and replicating. Match behavior in VPlan-based cost model. The added test shows that we now compute the same cost as the legacy cost model. Without this change, the test added in llvm/test/Transforms/LoopVectorize/AArch64/predicated-costs.ll would crash with llvm#187056. PR: llvm#188126
…8126) The legacy cost model computes and passes RHSInfo both when widening and replicating. Match behavior in VPlan-based cost model. The added test shows that we now compute the same cost as the legacy cost model. Without this change, the test added in llvm/test/Transforms/LoopVectorize/AArch64/predicated-costs.ll would crash with llvm#187056. PR: llvm#188126
|
Any thoughts on whether this can be merged now @fhahn? |
The legacy cost model computes and passes RHSInfo both when widening and replicating. Match behavior in VPlan-based cost model. The added test shows that we now compute the same cost as the legacy cost model. Without this change, the test added in llvm/test/Transforms/LoopVectorize/AArch64/predicated-costs.ll would crash with llvm/llvm-project#187056. PR: llvm/llvm-project#188126 (cherry picked from commit 86c1510)
Almost all recipes now go through ::computeCost to properly compute their costs using the VPlan-based cost model. There are currently no known cases where the VPlan-based cost model returns an incorrect cost vs the legacy cost model. I check the remaining open issues with reports of the assertion triggering and in all cases the VPlan-based cost model is more accurate, which is causing the divergence. There are still some fall-back paths, mostly via precomputeCosts, but those cannot be easily removed without triggering the assert, as the VPlan-based cost model is more accurate for those cases. An example of this is llvm#187056. Fixes llvm#38575. Fixes llvm#149651. Fixes llvm#182646. Fixes llvm#183739. Fixes llvm#187523.
In LoopVectorizationPlanner::precomputeCosts we are skipping calculation of costs using the VPlan cost model, instead preferring to use the legacy costs. This helps to prevent the legacy and vplan cost model assert firing, but really we should be encouraging full use of the VPlan cost model.
I've created this initial PR to stop skipping the computation costs for udiv/urem/sdiv/srem instructions. The VPlan costs seem to match up nicely.
I intend to follow up with more PRs to move more opcodes across.