Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meshlet software raster + start of cleanup #14623

Merged
merged 85 commits into from
Aug 26, 2024

Conversation

JMS55
Copy link
Contributor

@JMS55 JMS55 commented Aug 5, 2024

Objective

  • Faster meshlet rasterization path for small triangles
  • Avoid having to allocate and write out a triangle buffer
  • Refactor gpu_scene.rs

Solution

  • Replace the 32bit visbuffer texture with a 64bit visbuffer buffer, where the left 32 bits encode depth, and the right 32 bits encode the existing cluster + triangle IDs. Can't use 64bit textures, wgpu/naga doesn't support atomic ops on textures yet.
  • Instead of writing out a buffer of packed cluster + triangle IDs (per triangle) to raster, the culling pass now writes out a buffer of just cluster IDs (per cluster, so less memory allocated, cheaper to write out).
    • Clusters for software raster are allocated from the left side
    • Clusters for hardware raster are allocated in the same buffer, from the right side
    • The buffer size is fixed at MeshletPlugin build time, and should be set to a reasonable value for your scene (no warning on overflow, and no good way to determine what value you need outside of renderdoc - I plan to fix this in a future PR adding a meshlet stats overlay)
    • Currently I don't have a heuristic for software vs hardware raster selection for each cluster. The existing code is just a placeholder. I need to profile on a release scene and come up with a heuristic, probably in a future PR.
    • The culling shader is getting pretty hard to follow at this point, but I don't want to spend time improving it as the entire shader/pass is getting rewritten/replaced in the near future.
  • Software raster is a compute workgroup per-cluster. Each workgroup loads and transforms the <=64 vertices of the cluster, and then rasterizes the <=64 triangles of the cluster.
    • Two variants are implemented: Scanline for clusters with any larger triangles (still smaller than hardware is good at), and brute-force for very very tiny triangles
    • Once the shader determines that a pixel should be filled in, it does an atomicMax() on the visbuffer to store the results, copying how Nanite works
    • On devices with a low max workgroups per dispatch limit, an extra compute pass is inserted before software raster to convert from a 1d to 2d dispatch (I don't think 3d would ever be necessary).
    • I haven't implemented the top-left rule or subpixel precision yet, I'm leaving that for a future PR since I get usable results without it for now
    • Resources used: https://kristoffer-dyrkorn.github.io/triangle-rasterizer and chapters 6-8 of https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index
  • Hardware raster now spawns 64*3 vertex invocations per meshlet, instead of the actual meshlet vertex count. Extra invocations just early-exit.
    • While this is slower than the existing system, hardware draws should be rare now that software raster is usable, and it saves a ton of memory using the unified cluster ID buffer. This would be fixed if wgpu had support for mesh shaders.
    • Instead of writing to a color+depth attachment, the hardware raster pass also does the same atomic visbuffer writes that software raster uses.
    • We have to bind a dummy render target anyways, as wgpu doesn't currently support render passes without any attachments
    • Material IDs are no longer written out during the main rasterization passes.
    • If we had async compute queues, we could overlap the software and hardware raster passes.
  • New material and depth resolve passes run at the end of the visbuffer node, and write out view depth and material ID depth textures

Misc changes

  • Fixed cluster culling importing, but never actually using the previous view uniforms when doing occlusion culling
  • Fixed incorrectly adding the LOD error twice when building the meshlet mesh
  • Splitup gpu_scene module into meshlet_mesh_manager, instance_manager, and resource_manager
    • resource_manager is still too complex and inefficient (extract and prepare are way too expensive). I plan on improving this in a future PR, but for now ResourceManager is mostly a 1:1 port of the leftover MeshletGpuScene bits.
  • Material draw passes have been renamed to the more accurate material shade pass, as well as some other misc renaming (in the future, these will be compute shaders even, and not actual draw calls)

Migration Guide

  • TBD (ask me at the end of the release for meshlet changes as a whole)

@JMS55 JMS55 requested a review from IceSentry August 19, 2024 19:59
var max_x = u32(ceil(max3(vertex_0.x, vertex_1.x, vertex_2.x)));
var max_y = u32(ceil(max3(vertex_0.y, vertex_1.y, vertex_2.y)));
max_x = min(max_x, u32(view.viewport.z) - 1u);
max_y = min(max_y, u32(view.viewport.w) - 1u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you also need to min_x = max(0, min_x) etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iirc the u32(foo) should turn all negatives into 0.

// Scanline setup
let edge_012 = -w_x;
let open_edge = edge_012 < vec3(0.0);
let inverse_edge_012 = select(1.0 / edge_012, vec3(1e8), edge_012 == vec3(0.0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 1e8?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idk, it's what the nanite slides do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to understand what this is doing, but never quite figured it out.


// Iterate scanline X interval
for (var x = x0; x <= x1; x++) {
// Check if point at pixel is within triangle (TODO: this shouldn't be needed, but there's bugs without it)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's annoying, i'll see if i can figure this out but dont block merge on this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might just be needed on the first and last pixel (x0, x1), and can be skipped for [x0 + 1, x1 - 1]

@@ -294,6 +289,7 @@ fn simplify_meshlet_groups(
let target_error = target_error_relative * mesh_scale;

// Simplify the group to ~50% triangle count
// TODO: Simplify using vertex attributes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this about the bevy mesh api or more specific to meshlets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specific to meshlets. Simplification (i.e. the LOD building) only accounts for position atm, and not things like UVs/normals. See zeux/meshoptimizer#158.


/// Manages data for each entity with a [`MeshletMesh`].
#[derive(Resource)]
pub struct InstanceManager {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's fine with the name space, but that feels a bit generic? Why not something like VirtualGeometryMeshManager? To be clear, this isn't a blocker or anything, I'm just curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh I didn't want to get hung up on naming. We can change it later.


/// Manages uploading [`MeshletMesh`] asset data to the GPU.
#[derive(Resource)]
pub struct MeshletMeshManager {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should have Gpu somewhere in the name to clarify it's puprose?

&BindGroupLayoutEntries::sequential(
ShaderStages::COMPUTE,
(
storage_buffer_read_only_sized(false, None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of bindings, but that does feel a lot easier to read then when we were still using the raw structs 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeahhh I want automated bind groups in the render graph for a reason... Arguably this explicitness is better for perf, but it can't be that much slower to hash-and-cache, and this sucks to write and tweak.

Copy link
Contributor

@IceSentry IceSentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm not seeing anything obviously wrong. I confirmed that the code doesn't break anything else. The various *Manager structs are a bit generically named, but that doesn't really matter for now.

I get a clippy lint locally but it's in a crate that isn't modified in this PR so not sure what's up with that. Other than that LGTM.

@IceSentry IceSentry added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Aug 24, 2024
@alice-i-cecile alice-i-cecile added this pull request to the merge queue Aug 26, 2024
Merged via the queue into bevyengine:main with commit 6cc96f4 Aug 26, 2024
27 checks passed
@mockersf
Copy link
Member

mockersf commented Aug 27, 2024

This PR maybe made things faster, but it also greatly reduces the range of hardware where it works. meshlets used to work on macOS or on Vulkan with software renderer, not anymore. I don't know about DX12 but there's a comment saying it's now Vulkan only with a recent GPU.

Is it worth it making it faster if it won't run anymore on hardware where it would be useful for it to be faster?

@JMS55
Copy link
Contributor Author

JMS55 commented Aug 27, 2024

I'm not open to supporting GPUs without 64bit atomics (meaning no software raster).

Besides being an insane amount of extra code to support (pretty much an entire extra copy of the codebase), there's no point in telling artists to design their scenes around this feature for pixel level geometry detail, and then have older users try to use it, completely fall over due to lack of software raster + low vram, and then blame the game developers for not optimizing their game when the developers can't do anything besides making an entire second copy of the scene designed for working without meshlets. This is not a feature meant for older platforms, and the only reason it ran before was a temporary setup so that I could develop the rest of the feature without waiting for wgpu to implement 64bit atomics.

It should run on DX12 (SM 6.6+ iirc) and Metal (M2+ iirc), except wgpu/naga is bugged on DX12/Metal with these shaders.

@JMS55
Copy link
Contributor Author

JMS55 commented Aug 28, 2024

For my future reference, another good SW raster tutorial: https://web.archive.org/web/20050408192410/http://sw-shader.sourceforge.net/rasterizer.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen D-Complex Quite challenging from either a design or technical perspective. Ask for help! S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants