Fixes for Digi Morphing: Limiting Histogram Size and Decoupling for TrackerTraits#49021
Conversation
|
cms-bot internal usage |
|
enable gpu |
|
solves #48885 |
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-49021/46221
|
|
A new Pull Request was created by @AdrianoDee for master. It involves the following packages:
@Dr15Jones, @bsunanda, @civanch, @cmsbuild, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
test parameters:
|
|
please test |
|
type bug-fix |
|
assign heterogeneous |
| #include "HeterogeneousCore/AlpakaInterface/interface/warpsize.h" | ||
|
|
||
| //#define GPU_DEBUG | ||
| // #define GPU_DEBUG |
There was a problem hiding this comment.
Could you undo the extra whitespace change ?
| ALPAKA_ASSERT_ACC((alpaka::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u] <= maxElements)); | ||
|
|
||
| constexpr unsigned int maxIter = maxIterGPU * maxElements; | ||
| const unsigned int maxIter = TrackerTraits::maxIterClustering * maxElements; |
There was a problem hiding this comment.
The declaration of the arrays nn[maxIter][maxNeighbours] and nnn[maxIter] should not be allowed if maxIter is not constexpr or anyway known at compile time ?
There was a problem hiding this comment.
In the host code we have tolerated variable-length arrays as a non-standard extension (for reasons that can be debated elsewhere). I don't know to what extent VLAs work in nvcc or hipcc.
There was a problem hiding this comment.
CUDA does not seem to like it, compiling this
__global__
void kernel(bool more) {
const int size = more ? 42 : 21;
float data[size];
if (threadIdx.x < size) {
data[size] = 0;
}
}
int main(void) {
kernel<<<1,1>>>(true);
return 0;
}fails with
$ /usr/local/cuda-12.9/bin/nvcc -c test.cu -o test.o -arch sm_75
test.cu(4): error: expression must have a constant value
float data[size];
^
test.cu(4): note #2689-D: the value of variable "size" (declared at line 3) cannot be used as a constant
float data[size];
^
1 error detected in the compilation of "test.cu".
There was a problem hiding this comment.
Although the compilation of the tests seems to be progressing fine ?
And CUDA does support alloca() 🤔, in fact this compiles:
__global__
void kernel(bool more) {
const int size = more ? 42 : 21;
//float data[size];
float* data = static_cast<float *>(alloca(size * sizeof(float)));
if (threadIdx.x < size) {
data[size] = 0;
}
}There was a problem hiding this comment.
The alternative I see is just sizing it with the maximum possible (so TrackerTraits::maxElementsPerBlockMorph), basically wasting some of it.
There was a problem hiding this comment.
And to be honest, I wasn't expecting this to compile either.
There was a problem hiding this comment.
After looking a bit better into it I think I understand why it works:
- on CPU the value of
maxIterdepends on whether morphing is enabled or not, but it works because as Matti pointed out we allow variable sized arrays; - on GPU the value of
maxIteris actually independent whether morphing is enabled or not, so the compiler can determine it at compile time.
There was a problem hiding this comment.
One smart compiler!
|
|
||
| static constexpr uint32_t maxPixInModule = 6000; | ||
| static constexpr uint32_t maxPixInModuleForMorphing = maxPixInModule; | ||
| static constexpr uint32_t maxIterClustering = 16; |
There was a problem hiding this comment.
Can we derive this from maxPixInModule or maxPixInModuleForMorphing ?
There was a problem hiding this comment.
Yes, we could. But it depends on how we want to handle the number of blocks and threads for FindClus. As is, we fix maxPixInModule, maxIterClustering, blocks, and extrapolate maxElementsPerBlock so that maxElementsPerBlock = maxPixInModule/(maxIterClustering * blocks).
There was a problem hiding this comment.
If I follow the code correctly, now we have
maxPixInModule,maxPixInModuleForMorphingandmaxIterClusteringfixed heremaxElementsPerBlock = maxPixInModule / maxIterClustering, round up to the next multiple of 64maxElementsPerBlockMorph = maxPixInModuleForMorphing / maxIterClustering, round up to the next multiple of 64maxElements- on CPU it is either
maxElementsPerBlockormaxElementsPerBlockMorph - on GPU it is always
1.
- on CPU it is either
maxIter = maxIterClustering × maxElements- on CPU it is
maxPixInModuleormaxPixInModuleForMorphing
rounded up to the next multiple of(maxIterClustering × 64) - on GPU it is
maxIterClustering
- on CPU it is
Which results in
| Phase2 | Phase1 | HIonPhase1 | |
|---|---|---|---|
maxPixInModule |
6000 | 6000 | 10000 |
maxPixInModuleForMorphing |
6000 | 8400 | 11000 |
maxIterClustering |
16 | 24 | 32 |
maxElementsPerBlock |
384 | 256 | 320 |
maxElementsPerBlockMorph |
384 | 384 | 384 |
maxElements (CPU, enableDigiMorphing = false) |
384 | 256 | 320 |
maxElements (CPU, enableDigiMorphing = true) |
384 | 384 | 384 |
maxElements (GPU) |
1 | 1 | 1 |
maxIter (CPU, enableDigiMorphing = false) |
6144 | 4096 | 5120 |
maxIter (CPU, enableDigiMorphing = true) |
6144 | 9216 | 12288 |
maxIter (GPU) |
16 | 24 | 32 |
Then maxIter is used to allocate the arrays of nearest neighbours.
My suggestion would be to
- fix
maxPixInModuleandmaxPixInModuleForMorphinglike in this PR - determine
maxElementsPerBlockbased on what works and gives good performance on the T4 and/or L4 GPUs, and keep it fixed (hopefully using the same value with and without morphing) - derive
maxIterClusteringfrommaxPixInModuleForMorphingormaxPixInModule, depending if morphing is enabled or not.
What do you think ?
There was a problem hiding this comment.
I was implementing this (I agree it's a better set of fixed variables), but: does this imply that maxIter is not fixed at compile time on GPU and the issue above would manifest?
There was a problem hiding this comment.
Shall we stay with the current schema for the moment? Just to get it in for the next 15_1_X release.
There was a problem hiding this comment.
OK, I don't have a better suggestions, so let's keep it as it is for the moment 🤷🏻♂️
| static constexpr uint16_t last_barrel_detIndex = 864; | ||
|
|
||
| static constexpr uint32_t maxPixInModule = 6000; | ||
| static constexpr uint32_t maxPixInModuleForMorphing = maxPixInModule; |
There was a problem hiding this comment.
Given the name maxPixInModuleForMorphing, it would make sense for this constant to indicate how many pixels at most one can expect to be recovered by the morphing step, rather than the total of original plus recovered pixels ?
There was a problem hiding this comment.
Right, makes sense.
|
+1 Size: This PR adds an extra 48KB to repository Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
AMD_MI300X Comparison SummarySummary:
NVIDIA_H100 Comparison SummarySummary:
NVIDIA_L40S Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
NVIDIA_T4 Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
|
|
Pull request #49021 was updated. @Dr15Jones, @bsunanda, @civanch, @cmsbuild, @fwyzard, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth can you please check and sign again. |
|
please test |
|
+1 Size: This PR adds an extra 40KB to repository Comparison SummarySummary:
AMD_MI300X Comparison SummarySummary:
AMD_W7900 Comparison SummarySummary:
NVIDIA_H100 Comparison SummarySummary:
NVIDIA_L40S Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
NVIDIA_T4 Comparison SummaryThere are some workflows for which there are errors in the baseline: Summary:
|
|
urgent |
|
+1 |
|
+heterogeneous Thanks Adriano for the fix and addressing the various comments. |
|
@cms-sw/geometry-l2 ping |
|
+1 |
|
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2) |
|
+1 |
PR description:
This PR proposes a couple of fixes for the digi morphing to work properly for differen conditions (when acitve or not):
maxPixInModuleForMorphingconstant depending on theTrackerTraitsto be used to define the number of threads for theFindCluskernel. This also sizes the histogram holding the pixels in a module:maxIterGPUper topology (given the different number of pixels affects the number of iterations we can use to cover the full module);maxFakesInModuleconfiguration parameter to take into account themaxPixInModuleForMorphingmax to avoid the histogram overflowing.PR validation:
160.03502run.Running successfully the test from @henriettepetersen :
Backport is needed to
15_1_Xfor HI data taking.