WebGPURenderer: Adjust getArrayBufferAsync behavior, add support for partial readback#33322
sunag merged 19 commits into mrdoob:dev
Conversation
I think the idea of sharing buffers is great, but the API proposed in PR #33315 seems simpler and can cover the same use cases as the current one, with the exception of offset/size, which should be easy to implement. Cloning the ArrayBuffer after obtaining it seems a bit contradictory to the observations we made earlier. I would prefer to see a simpler function signature with parameter and return overloads:

```js
class ReadbackBuffer {

	name: string;
	size: number;

	constructor( size: number );

	release();
	dispose();

}
```

```js
getArrayBufferAsync( attribute ) : ArrayBuffer
getArrayBufferAsync( attribute, offset, size ) : ArrayBuffer
getArrayBufferAsync( attribute, readbackBuffer ) : ArrayBuffer
getArrayBufferAsync( attribute, readbackBuffer, offset, size ) : ArrayBuffer
```
It does not cover the same use cases. It includes all the same problematic behavior I called out in the original PR and this one. I would appreciate it if you start asking questions about why I prefer something and have recommended it rather than making sweeping claims like this. I have not suggested this approach for no reason.
All of my comments on why this is a problem seem to continually be ignored and I'm at a complete loss for why. First of all, I never asked for this and as a user I do not want this behavior. This doesn't solve anything I asked for and mandates incredibly confusing functionality, not to mention breaks existing code unnecessarily. This code will no longer work as expected because the buffers are now set to the same buffer:

```js
let buff0, buff1;

renderer.compute( kernel, [ 1000, 1, 1 ] );
buff0 = await renderer.getArrayBufferAsync( attr );

renderer.compute( kernel, [ 1000, 1, 1 ] );
buff1 = await renderer.getArrayBufferAsync( attr );
```

And this code will throw an error whereas it does not here nor in r183:

```js
let pr0, pr1;

renderer.compute( kernel, [ 1000, 1, 1 ] );
pr0 = renderer.getArrayBufferAsync( attr );

renderer.compute( kernel, [ 1000, 1, 1 ] );
pr1 = renderer.getArrayBufferAsync( attr );

const [ buff0, buff1 ] = await Promise.all( [ pr0, pr1 ] );
```

Not having to worry about managing these buffers and race conditions is a benefit. I will frequently write code quickly using less efficient mechanisms specifically so I can get something working, then switch to a more optimal system (like ReadbackBuffer in this case), so I plan to continue using this simple path in dev. This is even before we get to the fact that there is nearly no benefit and a ton of downsides.
Bottom line is this hasn't actually solved a problem anyone has raised. If we're going to make such dramatic changes to an existing API, can we at least wait for someone to actually complain about the code path first? Again, this isn't anything I wanted, and I raised the issue. The ReadbackBuffer otherwise enables the behavior I'm looking for.
It absolutely does. The buffer attached to "ReadBuffer" is the buffer that has been mapped and is the buffer that will be unmapped. This means the given buffer can be set to null or neutered at any time. So passing the "ReadBuffer" instance to other parts of an application means they can attach "release" or "dispose" events to it to determine exactly when that buffer will have been invalidated or changed. It also serves as a flag indicating whether the buffer is currently mapped or not denoting that it's "in use".
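The lifecycle described above can be sketched with a simplified stand-in class (this is an illustrative sketch, not the actual three.js `ReadbackBuffer` implementation; the `inUse` getter is an assumed convenience, not part of the proposal):

```javascript
// Simplified stand-in for the proposed ReadbackBuffer lifecycle: the
// "buffer" field holds the mapped ArrayBuffer while a readback is live
// and is nulled out on release/dispose, so it doubles as an "in use" flag.
class ReadbackBufferSketch {

	constructor( size ) {

		this.size = size;
		this.buffer = null; // set by the renderer when the GPU buffer is mapped

	}

	// hypothetical convenience: other parts of an app can check whether
	// the mapped data is still valid before reading from it
	get inUse() {

		return this.buffer !== null;

	}

	release() {

		// listeners attached to release/dispose would be notified here
		this.buffer = null;

	}

	dispose() {

		this.release();

	}

}

const rb = new ReadbackBufferSketch( 1024 );
rb.buffer = new ArrayBuffer( 1024 ); // what the renderer would do on map
console.log( rb.inUse ); // true
rb.release();
console.log( rb.inUse ); // false
```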
We can discuss rearranging the arguments, but across the project the "target" object is consistently the final object in the argument list in math functions etc., and is returned from the function. I'm not sure why this pattern should be different here. And if the function is going to work without passing a "readbackBuffer" as an option, then my proposed order still allows you to read back a smaller buffer w/ count and offset.
Could you make a simple fiddle, as you usually do, for your use case based on this PR?
I think we should avoid cloning ArrayBuffer in the core; the user can easily do this using slice after obtaining it if needed. It is more common for the user to manipulate the ArrayBuffer in one frame and expect the changes in another; if they need something more specific like this, they can still do it:

```js
let buff0, buff1;

renderer.compute( kernel, [ 1000, 1, 1 ] );
buff0 = ( await renderer.getArrayBufferAsync( attr ) ).slice( 0 );

renderer.compute( kernel, [ 1000, 1, 1 ] );
buff1 = ( await renderer.getArrayBufferAsync( attr ) ).slice( 0 );
```
Sorry if I wasn't able to be that specific, but this problem could be solved in the way below using the changes made in #33315. If the user wants to obtain multiple fractions of the same GPU buffer in parallel, they can use:

```js
async function getArrayBuffer() {

	const buffer = new THREE.ReadbackBuffer( waveArray.value.array.byteLength );
	const array = ( await renderer.getArrayBufferAsync( waveArray.value, buffer ) ).slice( 0 );
	buffer.dispose();

	return array;

}

let c1 = getArrayBuffer();
let c2 = getArrayBuffer();

const [ waveBuffer1, waveBuffer2 ] = await Promise.all( [ c1, c2 ] );
const wave = new Float32Array( waveBuffer2 );
```
We have gone through this kind of API design issue before. This API expects too much from the developer and would produce hard-to-find bugs if used incorrectly. Instead, we always try to make things work even if it's not optimal. An example of this is:

three.js/src/core/Raycaster.js, Lines 214 to 226 in f9152b7

Ideally the developer would pass an array to reuse, but if we tried to reuse an internal array for it, it would most likely result in a really bad experience for the developer.
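The Raycaster pattern referenced above can be sketched like this (a self-contained paraphrase of the optional-target idiom, not the actual Raycaster source; the placeholder intersection result is purely illustrative):

```javascript
// Paraphrase of the optional-target pattern used by Raycaster.intersectObject:
// if the caller supplies a target array it is reused, otherwise a fresh one is
// allocated, so the call "just works" even when used non-optimally.
function intersectObject( object, recursive = true, intersects = [] ) {

	// ...real ray/object intersection tests would push results here...
	intersects.push( { object: object, distance: 0 } ); // placeholder result

	intersects.sort( ( a, b ) => a.distance - b.distance );

	return intersects;

}

// Works without a target (allocates per call)...
const hits = intersectObject( { name: 'mesh' } );

// ...or with a reused target array for hot paths.
const pool = [];
intersectObject( { name: 'mesh' }, true, pool );
```

The design choice being argued for: the optimal path (reusing a target) is available but optional, and the naive path never hands the developer an internally recycled object that can mutate behind their back.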
Let's try to find an API design that works for @gkjohnson 🙏
@mrdoob It makes sense: even if we cloned the CPU buffer, we would still have the GPU buffer, but does it really make sense to recreate GPU buffers every time we use the function? If we were to use it the way it was, or is here, I really wouldn't recommend it. What I want most is to be able to resolve the issue that @gkjohnson brought up with great suggestions, but I would also like to have an easy-to-use API that balances performance and simplicity.
In that case, it is also preferable to change the function signature and deprecate the previous usage, warning the user and providing an internal fallback.

```js
await renderer.getArrayBufferAsync( attribute, readbackBuffer, offset, size )
```
Making it mandatory seems like a safer dev experience. It could even be like this:

```js
await renderer.getArrayBufferAsync( readbackBuffer, attribute, offset, size )
```
Seems like this is the main issue... Have you measured this? Maybe it's a premature optimization?
By the way, I'm putting the release on hold until this issue/PR is resolved. |
I think projects like Tetrament, for example, will benefit if we solve this issue; this would be just one of the possible use cases that would use this in the update cycle. I think that if we update the order of the parameters now, we can avoid patches later too. I will make these changes in my alternative PR; this should help with the decisions as well.
@sunag Perhaps the question was unclear. In #33322 (comment) you say that recreating a GPU buffer for copying on every function call is so performance-intensive that we should prevent users from using the function in such a way:
How have you tested the performance of creating a new buffer? How have you determined that it is so intensive? |
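One way to produce such numbers is a microbenchmark harness like the sketch below. This is a generic illustration, not code from the PR; in a real measurement the timed callback would be the WebGPU buffer creation, which requires a live `GPUDevice`, so a plain allocation stands in for it here:

```javascript
// Generic microbenchmark sketch. In an actual WebGPU test the timed
// callback would be something like:
//   device.createBuffer( { size, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ } )
// followed by buffer.destroy() outside the timed region.
function median( values ) {

	const sorted = [ ...values ].sort( ( a, b ) => a - b );
	return sorted[ Math.floor( sorted.length / 2 ) ];

}

function benchmark( fn, runs = 100 ) {

	const samples = [];
	for ( let i = 0; i < runs; i ++ ) {

		const start = performance.now();
		fn();
		samples.push( performance.now() - start );

	}

	// median is more robust than mean against GC pauses and warmup spikes
	return median( samples );

}

// Placeholder workload standing in for GPU buffer creation:
const ms = benchmark( () => new ArrayBuffer( 4 * 1024 ) );
console.log( `median: ${ ms.toFixed( 4 ) } ms` );
```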
I created a fiddle; you can control it in 1k floats because bytes are usually irrelevant. WebGL seems to be 2.5× slower than WebGPU. https://jsfiddle.net/eLbr76dx/ I understand it may be a premature optimization, or maybe we are just ignoring a micro-optimization, depending on the use case and point of view. In any case, I would like @gkjohnson @mrdoob to consider the analysis of the alternative. I updated #33315; it should resolve this issue about caching the buffer and alert the user. It was what I managed to improve by collecting the feedback from here.
Here are concrete numbers, which are necessary to have a discussion around this. And this is the fiddle. The WebGL path does not require an intermediate buffer to copy data, so the timing in that case is irrelevant.

So to be clear: we are discussing thousandths of a millisecond as an "optimization". And in order to "optimize" this away we're discussing retaining GPU memory that the user has no control over, silently doubling the footprint of the storage buffer until its disposal, as well as reducing the simple usability of this code path. I'm having a hard time seeing how this is a benefit. Can someone please explain this? It seems closer to a memory leak for any user more than it is an optimization. And again, I am asking for this code path because I would like to use it during development and prototyping, and I also know I will be bitten by this memory thing in the future, as will other users.

Here is my proposal: I have adjusted this PR to use the suggested signature (attribute, target, offset, count) because I agree it's an improvement. This also places the "target" object in the second field so we can make it mandatory at a later time if we feel the need. But generally I would like to wait to do anything around this GPU-side buffer creation until it's actually proven to be a problem, which it has not yet been. Creating and disposing the buffer allows for a simple, easy-to-use path that will work for any use case without memory accumulation, and the ReadbackBuffer serves well for optimized paths. If this is not acceptable then I will adjust the API so providing a ReadbackBuffer is required.

Regarding #33315, it still does not address the core goals of the original issue. I have run out of energy for repeating the goals of the readback buffer at this point, unfortunately, and I'm sorry for that. I have tested this PR with the use case that spurred this need (a more complex version of this demo) and have verified that it works, reduces the memory usage, and improves the performance of the task (16 sec to 14 sec, or 2 sec / ~14% improvement) when ReadbackBuffer is used.
Fix #33281
Fix #33282
This PR fixes a number of issues introduced in #33300, adjusts support for "ReadbackBuffer" (detailed below), and adds support for passing "offset" and "count" to the function for mapping and reading back a portion of the buffer attribute.
Signature

This is the new call signature. The function optionally takes a target ArrayBuffer to write into, or a ReadbackBuffer to copy the data into and map. If no "target" is provided then a new ArrayBuffer is constructed and returned, as is done in r183, meaning this function is backwards compatible. Both "count" and "offset" are in bytes. And the "target" object is the one returned from the function, aligning with the patterns used for functions that take a "target" object across the project.

The ReadbackBuffer shape has changed like so: it represents an intermediate buffer of "maxByteLength" in size on the GPU for copying data into and mapping to the CPU. The "buffer" field is set to the mapped WebGPU-API-provided buffer and is subsequently set to "null" when releasing or disposing the readback buffer instance.
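Based on the (attribute, target, offset, count) parameter order described in this thread, the signature presumably looks like the sketch below; the default values shown are assumptions, not confirmed from the source:

```
getArrayBufferAsync( attribute, target = null, offset = 0, count = null ) : Promise<ArrayBuffer>
```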
Behavior Changes

- BufferAttribute.array is no longer read in either the WebGPU or WebGL codepaths, so users can continue to truncate the CPU-side arrays after upload, as designed.
- The renderer no longer retains internal ReadbackBuffers when the user passes only a BufferAttribute. This was creating a lot of unnecessary retained GPU memory, created confusing overwrite behavior when running multiple interleaved kernel operations and readbacks, and caused an error to throw when trying to issue a subsequent readback before the first one finished.

Testing
Running this code snippet with WebGPURenderer demonstrates the code working as expected and the ability to run multiple readbacks in parallel. This works with both "forceWebGL" set to true and false.
Tested with basic integration on a small buffer in this three-edge-projection demo which runs multiple compute kernels and readbacks in parallel to perform hidden edge removal.