[3.4] Split Vertex Buffer Stream in Positions and Attributes #46574

The-O-King · 2021-03-01T21:55:04Z

Provide an import option for enabling vertex stream splitting - where the vertex
buffer is organized into a contiguous block of position data and a contiguous
block of interleaved attribute data

This is an optimization particularly for mobile, tile-based renderers which can
use less memory bandwidth during their binning phase

Implementation of godotengine/godot-proposals#2350

fire · 2021-03-01T22:32:42Z

Is there a godot master version I can review?

The-O-King · 2021-03-01T22:40:16Z

At the moment I only have this implemented in 3.2 since 4.0 is still undergoing work on the rendering side from what I understand

I think discussion is still ongoing as to what/if 4.0 will need something like this implemented (see the proposal linked for discussion)

clayjohn · 2021-05-06T23:50:23Z

What do you think about making this a ProjectSetting instead of an import setting. I don't see a reason to import some meshes using the split vertex stream and some without. To me it seems like Desktop should use non-split and mobile should use split. In which case there can be a check in the rasterizer. I think making this a project setting would result in a simpler PR and would be more user-friendly.

Please let me know if there is a good reason to have this in the importer.

lawnjelly · 2021-05-07T07:00:32Z

Also have you got any timing figures for the performance with and without the change on different hardware? It would be kind of nice to have some figures to see the difference.

The-O-King · 2021-05-08T01:35:44Z

Ah that's a good thought and not something that I had considered before, looking at it I think that a project setting would make more sense also, I'll look into what that would take! Thank you for pointing me towards that!

For performance, I know that for mobile it helps primarily with memory bandwidth metrics (and was the primary motivation for this change, 50% improvement) rather than gpu frame times, and I did do some cursory testing on desktop (nvidia 2070 super Max-Q laptop) and did not find much in the way of performance difference, but I will try to get more concrete numbers for that as well to share

The-O-King · 2021-05-10T19:35:44Z

Implementation question - is there any way to trigger a reimport of all mesh assets when a project setting is changed? At the moment someone could change the project setting, but all previously imported assets would still have the old mesh data format and they would have to reimport the assets manually

Calinou · 2021-05-10T20:29:41Z

Implementation question - is there any way to trigger a reimport of all mesh assets when a project setting is changed? At the moment someone could change the project setting, but all previously imported assets would still have the old mesh data format and they would have to reimport the assets manually

As far as I know, there is no way to do that yet (aside of removing the .import/ folder). This is also an issue when changing the VRAM compressed texture formats in the Project Settings.

The-O-King · 2021-05-10T21:00:13Z

Implementation question - is there any way to trigger a reimport of all mesh assets when a project setting is changed? At the moment someone could change the project setting, but all previously imported assets would still have the old mesh data format and they would have to reimport the assets manually

As far as I know, there is no way to do that yet (aside of removing the .import/ folder). This is also an issue when changing the VRAM compressed texture formats in the Project Settings.

Gotcha, thanks for letting me know! Maybe we can make that an option in the future haha :)

For now I was able to update the implementation to use project settings rather than import settings for splitting the vertex buffer streams, and will have similar functionality to the way that the VRAM texture compression settings will behave then

I also removed the commit that added the mesh compression options since it is much less relevant now to this PR, and created a new PR with that feature addition here #48625

clayjohn · 2021-05-11T04:22:58Z

drivers/gles2/rasterizer_storage_gles2.cpp

-					attribs[i].type = _GL_HALF_FLOAT_OES;
-					stride += attribs[i].size * 2;
-					uses_half_float = true;
+					attribs[i].type = GL_HALF_FLOAT;


I'm guessing this is a copy-paste error from GLES3. This line is causing the CI to fail because it should be _GL_HALF_FLOAT_OES

clayjohn · 2021-05-11T04:24:42Z

Implementation looks good on a preliminary review!

Have you had any luck getting performance numbers on mobile? I would be really interested to see if there is a measurable difference. The TPS demo may be a good test project if you don't already have one.

servers/visual_server.cpp

The-O-King · 2021-05-11T15:57:20Z

I have figured out the clang-format issue so hopefully static checker will be happy from now on :)

Oh yes I definitely have the numbers on mobile to show this haha, I'm working on getting the screenshots right now :)

Also working on getting the numbers for desktop using nvidia nsights, will also report that as well when I get them!

The-O-King · 2021-05-12T17:24:38Z

I also collected data for desktop and at the moment it looks like performance is basically identical - This was in a static scene with 90,000,000 verts that were offscreen, so I would say having it on desktop will at least keep the same performance

In Nvidia Nsight, the bottleneck here is in VRAM (both have it pinned at 97% SOL) and the L2 cache behavior is also basically the same (46% hit rate in this scene). There was a reduction in frame time (average 43ms to 41ms) but I would chalk that up to occasional power management/throttling

Unsplit:

Split:

I also did a test with the same scene but an additional dynamic light/shadow (but reduced down to 60,000,000 to make it a more reasonable framerate) and I got the same exact results between split and unsplit - although when I was checking things in renderdoc - I did notice that the shadow map generating draw calls had the vertex normal enabled - and I was wondering what they were used for in this scenario? As that could be a good area for improvement on both desktop and mobile

This is also more anecdotal evidence, but I would also note from performance profiling other games on desktop (such as Apex Legends, or the default vertex buffer format in Unreal Engine) they all usually have some sort of vertex buffer stream splitting in place where positions are their own stream and attributes are distributed amongst other separate buffers, grouped by what data is most commonly used together in a shader/renderpass

The-O-King · 2021-05-12T17:33:26Z

Ah sorry I think I misunderstood before about the static vs dynamic mesh thing - yes this is applied to static meshes only

I see definitely see where you're coming from in terms of seeing real-world improvements, but I think the TPS scene in particular isn't the best way to show the improvements one would get from something like this because it's unreasonably pixel bound haha, at least on mobile :P I think in all scenarios though it will come out with same or improved performance

We do have the data for static mesh scenes (single or not shouldn't make a difference) on mobile since that was what the first scenario I presented was testing (to be fair it wasn't a "single" static mesh but 30 copies of the same high poly single mesh, which I would argue end up with similar effect)

And for tile based renderers (which are all mobile GPUs) - rendering to a single tile vs offscreen would show the same performance improvements that the first test scenario I presented showed, just with the added pixel shader work for fragments that end up on screen, regardless of the tile - the binning process occurs for ALL vertices before shading fragments (or the rest of the vertex calculations) ever begins

For the power argument, I don't think I have a good way to measure the power improvement without running an example load/app on a device and running it until it dies haha - but we know that's it's a recommendation that both ARM[1][2] and Qualcomm make in their best practices guides
nvidia generally mentions stream splitting can help in some places for doing things like shadow/occlusion passes

lawnjelly · 2021-05-12T17:37:12Z

If performance is within a gnat's todger with the split vertex format for static buffers on desktop, then to me that does sound a fair argument for just using split as the path for all static buffers, and not bothering with the split / non-split option. Unless there is a need for backward compatibility. What do you think @clayjohn ? 😄

clayjohn · 2021-05-18T04:22:25Z

@lawnjelly I agree that we need to do more performance analysis before we commit to anything.

In theory this PR should result in performance benefits in some cases on mobile. And in theory it should degrade performance on desktop. The reduction in performance comes from the cache pressure resulting from reading the vertex position from one end of the vertex array while reading the other vertex attributes from another position in the array. Again, the performance consideration is only theoretical and may pan out to be negligible.

Calinou · 2021-05-18T15:49:39Z

@The-O-King Can you upload the vertex-heavy test project you used for testing (not the TPS demo)? This way, we can check the performance impact on desktop platforms. Thanks in advance 🙂

The-O-King · 2021-05-19T18:47:07Z

Oh yes sure thing - here's a google drive link to a zip of the project I've been using to test things - there's a sphere_scene_desktop.tscn which is what I used to measure the performance - currently there's a moving point light with shadows as well for when I was profiling that scenario

https://drive.google.com/file/d/1X_Dc9NJfN6p8xYf4-g64SVuOEJCxJzrW/view?usp=sharing

Let me know if this drive link works, or if there's a better/preferred way to share the project :)

Calinou · 2021-05-19T18:52:19Z

Let me know if this drive link works, or if there's a better/preferred way to share the project :)

The drive link works, thanks 🙂

lawnjelly · 2021-05-19T19:12:11Z

Nice little discussion of this on twitter today too, and on when the situation changed on this (looks like around 2015/2016?):
https://twitter.com/promit_roy/status/1394847846759911427

lawnjelly · 2021-05-20T07:40:26Z

Just been trying the PR on my desktop (Linux Mint 20, Intel HD Graphics 630) and seeing if there's any difference.

With this quick contrived test I can get 129fps without splitting and 126fps with splitting, but I doubt there is much in it in practice. My test was procedurally generating a mesh with 1000000 triangles (3000000 unique verts) with uv, uv2, color, normal and tangent, and very small screen size and identical small triangle size. I don't even know if it uses that many verts or limits to 65535 in fact .. it could do in GLES3 though.

Be interesting to try on more hardware these kinds of tests, especially mobile.
SplitVertBufferTest.zip

The-O-King · 2021-05-24T22:00:16Z

I got a chance today to run your test project (really elegant btw I love it haha) and on my laptop which has a RTX 2070 MAX-Q I got identical performance when using split vs unsplit vertex buffers (avg 708.66fps)

I was able to also capture some traces on my Pixel 4 and while average framerate was identical (avg 13.66fps) I did see a drop in vertex memory bandwidth in a frame (this benefit coming directly from bandwidth improvements during the binning phase I mentioned earlier) where for unsplit I saw peak and average bandwidth of 13.28GB/s and 3.7GB/s respectively, while on the split buffer test I saw peak and average bandwidth of 4.9GB/s and 2.8GB/s respectively.
The improvement seen in the average stems from the improvement during the binning phase of a frame, otherwise during the rendering phase of a frame we see that the vertex bandwidth is pretty much the same (avg 2GB/s) whether split or unsplit (which is to be expected)

I've included links for you to download the Android GPU Inspector trace files I've taken in case you were interested at looking at the data yourself! You just need to have Android GPU Inspector downloaded to take a peek :)

vertex buffer split
vertex buffer unsplit

lawnjelly · 2021-05-25T06:14:15Z

I was able to also capture some traces on my Pixel 4 and while average framerate was identical (avg 13.66fps)

That's quite interesting in itself. I deliberately used identical position triangles and scaled small so that hopefully they would end up in the same tile. The idea being that this would stress the tile based renderer the most as it didn't get the opportunity to spread the geometry over several tiles. But perhaps there is something we are missing and we could run a better test - I'm no expert on tiled renderers. I'm sure a hardware guy could tell us more.

The-O-King · 2021-06-09T14:38:50Z

Hello! Just wanted to bump this and see what the status is right now? Sorry it's been a while I got pulled into other work that's taken up my time

So what are the current concerns about this change? Are we still unsure about it's benefits? My current understanding is that on mobile we see the memory bandwidth reductions we were expecting (and implicitly that is battery life improvement as well) and on desktop performance was largely the same (at least on the hardware we've tested so far)

The-O-King · 2021-06-28T14:52:52Z

@clayjohn hello! Just wanted to bump this again, and see if there was anything I could do to help :) Thanks!

akien-mga · 2021-07-14T16:08:27Z

There's a couple new project settings added by this PR which should be included in the docs, see the diff here: https://github.com/godotengine/godot/pull/46574/checks?check_run_id=2559453240

You can generate the template with godot --doctool with a build of your PR, and then ideally fill the descriptions.

clayjohn

Approved, subject to addition of docs for the project settings.

The-O-King · 2021-07-16T18:21:16Z

Update the documentation with doctool so hopefully things are ready after this!

Thanks for taking a look everyone!

akien-mga · 2021-07-17T06:55:58Z

Could you rebase to squash the commits? See PR workflow for instructions.

The-O-King · 2021-07-19T16:23:22Z

Oh thank you for letting me know about that I hadn't seen that before, was able to squash everything, let me know if everything is now in order

akien-mga · 2021-07-19T18:17:27Z

doc/classes/ProjectSettings.xml

@@ -1187,6 +1187,9 @@
 		<member name="rendering/lossless_compression/webp_compression_level" type="int" setter="" getter="" default="2">
 			The default compression level for lossless WebP. Higher levels result in smaller files at the cost of compression speed. Decompression speed is mostly unaffected by the compression level. Supported values are 0 to 9. Note that compression levels above 6 are very slow and offer very little savings.
 		</member>
+		<member name="rendering/mesh_storage/split_stream" type="bool" setter="" getter="" default="false">
+			On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enable if targeting mobile devices. Requires manual reimport of meshes after toggling


Suggested change

On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enable if targeting mobile devices. Requires manual reimport of meshes after toggling

On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enabled if targeting mobile devices. Requires manual reimport of meshes after toggling.

Updated the word here!

You missed it in my diff but there's also a dot that should be added at the end of the sentence.

Oh shoot you're so right sorry about that I hope that takes care of it now!

I think you might have missed pushing that last amend?

Ah you're right I noticed in the logs that I was getting unresolved host errors in my repo for some reason

Implemented splitting of vertex positions and attributes in the vertex buffer Positions are sequential at the start of the buffer, followed by the additional attributes which are interleaved Made a project setting which enables/disabled the buffer formatting throughout the project Implemented in both GLES2 and GLES3 This improves performance particularly on tile-based GPUs as well as cache performance for something like shadow mapping which only needs position data Updated Docs and Project Setting

akien-mga · 2021-07-20T08:48:47Z

Thanks! And congrats for your first merged Godot contribution 🎉

The-O-King requested review from a team as code owners March 1, 2021 21:55

The-O-King force-pushed the split_stream_3.2 branch from 97974b2 to dee692c Compare March 1, 2021 22:26

The-O-King force-pushed the split_stream_3.2 branch from dee692c to dccfcbf Compare March 1, 2021 22:55

akien-mga added feature proposal topic:rendering labels Mar 2, 2021

akien-mga added this to the 3.2 milestone Mar 2, 2021

The-O-King mentioned this pull request Mar 8, 2021

[3.x] Implement Octahedral Map Normal/Tangent Attribute Compression #46800

Merged

Base automatically changed from 3.2 to 3.x March 16, 2021 11:11

aaronfranke modified the milestones: 3.2, 3.3, 3.4 Mar 16, 2021

aaronfranke changed the title ~~[3.2] Split Vertex Buffer Stream in Positions and Attributes, Provide UI Options~~ [3.4] Split Vertex Buffer Stream in Positions and Attributes, Provide UI Options Mar 18, 2021

The-O-King mentioned this pull request May 3, 2021

Compress Vertex Normal/Tangent Vectors Using Octahedral Mapping godotengine/godot-proposals#2395

Closed

The-O-King force-pushed the split_stream_3.2 branch from dccfcbf to d44a1c4 Compare May 10, 2021 20:53

The-O-King changed the title ~~[3.4] Split Vertex Buffer Stream in Positions and Attributes, Provide UI Options~~ [3.4] Split Vertex Buffer Stream in Positions and Attributes May 10, 2021

clayjohn reviewed May 11, 2021

View reviewed changes

servers/visual_server.cpp Outdated Show resolved Hide resolved

servers/visual_server.cpp Outdated Show resolved Hide resolved

servers/visual_server.cpp Outdated Show resolved Hide resolved

The-O-King force-pushed the split_stream_3.2 branch from d44a1c4 to 77b5792 Compare May 11, 2021 15:38

akien-mga requested a review from clayjohn July 14, 2021 16:08

clayjohn approved these changes Jul 15, 2021

View reviewed changes

The-O-King requested a review from a team as a code owner July 16, 2021 18:13

The-O-King force-pushed the split_stream_3.2 branch from af8c0b6 to 7bc9174 Compare July 19, 2021 16:20

akien-mga reviewed Jul 19, 2021

View reviewed changes

The-O-King force-pushed the split_stream_3.2 branch from 7bc9174 to 1d02f4d Compare July 19, 2021 20:04

The-O-King force-pushed the split_stream_3.2 branch from 1d02f4d to 7f8487a Compare July 20, 2021 08:01

akien-mga approved these changes Jul 20, 2021

View reviewed changes

akien-mga merged commit f131a77 into godotengine:3.x Jul 20, 2021

akien-mga mentioned this pull request Jul 20, 2021

Split Vertex Buffers Into Position and Interleaved Attribute Streams godotengine/godot-proposals#2350

Closed

The-O-King mentioned this pull request Aug 11, 2021

Fix undefined behaviour in visual server as reported by ubsan/asan. #51266

Closed

	On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enable if targeting mobile devices. Requires manual reimport of meshes after toggling
	On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enabled if targeting mobile devices. Requires manual reimport of meshes after toggling.

[3.4] Split Vertex Buffer Stream in Positions and Attributes #46574

[3.4] Split Vertex Buffer Stream in Positions and Attributes #46574

Conversation

The-O-King commented Mar 1, 2021 • edited Loading

fire commented Mar 1, 2021

The-O-King commented Mar 1, 2021

clayjohn commented May 6, 2021

lawnjelly commented May 7, 2021

The-O-King commented May 8, 2021

The-O-King commented May 10, 2021

Calinou commented May 10, 2021 • edited Loading

The-O-King commented May 10, 2021

clayjohn May 11, 2021

Choose a reason for hiding this comment

clayjohn commented May 11, 2021

The-O-King commented May 11, 2021

The-O-King commented May 12, 2021

The-O-King commented May 12, 2021

lawnjelly commented May 12, 2021

clayjohn commented May 18, 2021

Calinou commented May 18, 2021 • edited Loading

The-O-King commented May 19, 2021

Calinou commented May 19, 2021 • edited Loading

lawnjelly commented May 19, 2021

lawnjelly commented May 20, 2021 • edited Loading

The-O-King commented May 24, 2021

lawnjelly commented May 25, 2021

The-O-King commented Jun 9, 2021

The-O-King commented Jun 28, 2021

akien-mga commented Jul 14, 2021

clayjohn left a comment

Choose a reason for hiding this comment

The-O-King commented Jul 16, 2021

akien-mga commented Jul 17, 2021

The-O-King commented Jul 19, 2021

akien-mga Jul 19, 2021

Choose a reason for hiding this comment

The-O-King Jul 19, 2021

Choose a reason for hiding this comment

akien-mga Jul 19, 2021

Choose a reason for hiding this comment

The-O-King Jul 19, 2021

Choose a reason for hiding this comment

akien-mga Jul 20, 2021

Choose a reason for hiding this comment

The-O-King Jul 20, 2021

Choose a reason for hiding this comment

akien-mga commented Jul 20, 2021

The-O-King commented Mar 1, 2021 •

edited

Loading

Calinou commented May 10, 2021 •

edited

Loading

Calinou commented May 18, 2021 •

edited

Loading

Calinou commented May 19, 2021 •

edited

Loading

lawnjelly commented May 20, 2021 •

edited

Loading