Meshlet software raster + start of cleanup (#14623)

# Objective

- Faster meshlet rasterization path for small triangles
- Avoid having to allocate and write out a triangle buffer
- Refactor gpu_scene.rs

## Solution

- Replace the 32bit visbuffer texture with a 64bit visbuffer buffer, where the left 32 bits encode depth, and the right 32 bits encode the existing cluster + triangle IDs. Can't use 64bit textures, wgpu/naga doesn't support atomic ops on textures yet.
- Instead of writing out a buffer of packed cluster + triangle IDs (per triangle) to raster, the culling pass now writes out a buffer of just cluster IDs (per cluster, so less memory allocated, cheaper to write out).
  - Clusters for software raster are allocated from the left side
  - Clusters for hardware raster are allocated in the same buffer, from the right side
  - The buffer size is fixed at MeshletPlugin build time, and should be set to a reasonable value for your scene (no warning on overflow, and no good way to determine what value you need outside of renderdoc - I plan to fix this in a future PR adding a meshlet stats overlay)
  - Currently I don't have a heuristic for software vs hardware raster selection for each cluster. The existing code is just a placeholder. I need to profile on a release scene and come up with a heuristic, probably in a future PR.
  - The culling shader is getting pretty hard to follow at this point, but I don't want to spend time improving it as the entire shader/pass is getting rewritten/replaced in the near future.
- Software raster is a compute workgroup per-cluster. Each workgroup loads and transforms the <=64 vertices of the cluster, and then rasterizes the <=64 triangles of the cluster.
  - Two variants are implemented: Scanline for clusters with any larger triangles (still smaller than hardware is good at), and brute-force for very very tiny triangles
  - Once the shader determines that a pixel should be filled in, it does an atomicMax() on the visbuffer to store the results, copying how Nanite works
  - On devices with a low max workgroups per dispatch limit, an extra compute pass is inserted before software raster to convert from a 1d to 2d dispatch (I don't think 3d would ever be necessary).
  - I haven't implemented the top-left rule or subpixel precision yet, I'm leaving that for a future PR since I get usable results without it for now
  - Resources used: https://kristoffer-dyrkorn.github.io/triangle-rasterizer and chapters 6-8 of https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index
- Hardware raster now spawns 64*3 vertex invocations per meshlet, instead of the actual meshlet vertex count. Extra invocations just early-exit.
  - While this is slower than the existing system, hardware draws should be rare now that software raster is usable, and it saves a ton of memory using the unified cluster ID buffer. This would be fixed if wgpu had support for mesh shaders.
  - Instead of writing to a color+depth attachment, the hardware raster pass also does the same atomic visbuffer writes that software raster uses.
    - We have to bind a dummy render target anyways, as wgpu doesn't currently support render passes without any attachments
- Material IDs are no longer written out during the main rasterization passes.
  - If we had async compute queues, we could overlap the software and hardware raster passes.
  - New material and depth resolve passes run at the end of the visbuffer node, and write out view depth and material ID depth textures

### Misc changes

- Fixed cluster culling importing, but never actually using the previous view uniforms when doing occlusion culling
- Fixed incorrectly adding the LOD error twice when building the meshlet mesh
- Split up gpu_scene module into meshlet_mesh_manager, instance_manager, and resource_manager
  - resource_manager is still too complex and inefficient (extract and prepare are way too expensive). I plan on improving this in a future PR, but for now ResourceManager is mostly a 1:1 port of the leftover MeshletGpuScene bits.
- Material draw passes have been renamed to the more accurate material shade pass, as well as some other misc renaming (in the future, these will be compute shaders even, and not actual draw calls)

---

## Migration Guide

- TBD (ask me at the end of the release for meshlet changes as a whole)

---------

Co-authored-by: vero <email@atlasdostal.com>
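
The 64bit visbuffer layout and the atomicMax() write described under Solution can be modeled on the CPU. What follows is a rough sketch, not the shader code from this commit: `VisBuffer`, `pack_visbuffer`, and `unpack_visbuffer` are hypothetical names, the 6-bit triangle ID is an assumption taken from the `cluster_id << 6u` packing in the removed culling code, and depth is assumed to be non-negative reverse-Z (larger means closer).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Packs one visbuffer sample: depth bits in the upper 32 bits, cluster +
/// triangle ID in the lower 32 bits.
/// Assumption: depth is non-negative and reverse-Z (larger = closer), so
/// comparing raw f32 bit patterns as integers orders samples by depth.
fn pack_visbuffer(depth: f32, cluster_id: u32, triangle_id: u32) -> u64 {
    // Assumption: 6 bits of triangle ID (<= 64 triangles per cluster).
    let ids = (cluster_id << 6) | (triangle_id & 0x3F);
    ((depth.to_bits() as u64) << 32) | ids as u64
}

/// Splits a packed sample back into (depth, cluster_id, triangle_id).
fn unpack_visbuffer(packed: u64) -> (f32, u32, u32) {
    let depth = f32::from_bits((packed >> 32) as u32);
    let ids = packed as u32;
    (depth, ids >> 6, ids & 0x3F)
}

/// Hypothetical CPU stand-in for the GPU visbuffer storage buffer.
struct VisBuffer {
    width: usize,
    pixels: Vec<AtomicU64>,
}

impl VisBuffer {
    /// Creates a buffer cleared to 0 (depth 0.0 is the far plane in reverse-Z).
    fn new(width: usize, height: usize) -> Self {
        let pixels = (0..width * height).map(|_| AtomicU64::new(0)).collect();
        Self { width, pixels }
    }

    /// Depth test and write in a single step: the sample with the largest
    /// depth wins, and its IDs travel along in the low bits.
    fn write(&self, x: usize, y: usize, packed: u64) {
        self.pixels[y * self.width + x].fetch_max(packed, Ordering::Relaxed);
    }

    /// Reads back the packed sample stored for a pixel.
    fn read(&self, x: usize, y: usize) -> u64 {
        self.pixels[y * self.width + x].load(Ordering::Relaxed)
    }
}
```

Because depth sits in the high bits, a single integer max both performs the depth test and carries the winning IDs, which is why both raster paths can share the same write.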
bevyengine · Aug 26, 2024 · 6cc96f4
1 parent 7bb76ab
Showing 26 changed files with 2,247 additions and 1,433 deletions.
diff --git a/crates/bevy_pbr/src/material.rs b/crates/bevy_pbr/src/material.rs
@@ -1,7 +1,7 @@
 #[cfg(feature = "meshlet")]
 use crate::meshlet::{
     prepare_material_meshlet_meshes_main_opaque_pass, queue_material_meshlet_meshes,
-    MeshletGpuScene,
+    InstanceManager,
 };
 use crate::*;
 use bevy_asset::{Asset, AssetId, AssetServer};
@@ -283,7 +283,7 @@ where
                 Render,
                 queue_material_meshlet_meshes::<M>
                     .in_set(RenderSet::QueueMeshes)
-                    .run_if(resource_exists::<MeshletGpuScene>),
+                    .run_if(resource_exists::<InstanceManager>),
             );
 
             #[cfg(feature = "meshlet")]
@@ -293,7 +293,7 @@ where
                     .in_set(RenderSet::QueueMeshes)
                     .after(prepare_assets::<PreparedMaterial<M>>)
                     .before(queue_material_meshlet_meshes::<M>)
-                    .run_if(resource_exists::<MeshletGpuScene>),
+                    .run_if(resource_exists::<InstanceManager>),
             );
         }
 

diff --git a/crates/bevy_pbr/src/meshlet/asset.rs b/crates/bevy_pbr/src/meshlet/asset.rs
@@ -35,8 +35,6 @@ pub const MESHLET_MESH_ASSET_VERSION: u64 = 1;
 /// See also [`super::MaterialMeshletMeshBundle`] and [`super::MeshletPlugin`].
 #[derive(Asset, TypePath, Clone)]
 pub struct MeshletMesh {
-    /// The total amount of triangles summed across all LOD 0 meshlets in the mesh.
-    pub(crate) worst_case_meshlet_triangles: u64,
     /// Raw vertex data bytes for the overall mesh.
     pub(crate) vertex_data: Arc<[u8]>,
     /// Indices into `vertex_data`.
@@ -57,6 +55,8 @@ pub struct Meshlet {
     pub start_vertex_id: u32,
     /// The offset within the parent mesh's [`MeshletMesh::indices`] buffer where the indices for this meshlet begin.
     pub start_index_id: u32,
+    /// The amount of vertices in this meshlet.
+    pub vertex_count: u32,
     /// The amount of triangles in this meshlet.
     pub triangle_count: u32,
 }
@@ -107,9 +107,6 @@ impl AssetSaver for MeshletMeshSaverLoader {
             .await?;
 
         // Compress and write asset data
-        writer
-            .write_all(&asset.worst_case_meshlet_triangles.to_le_bytes())
-            .await?;
         let mut writer = FrameEncoder::new(AsyncWriteSyncAdapter(writer));
         write_slice(&asset.vertex_data, &mut writer)?;
         write_slice(&asset.vertex_ids, &mut writer)?;
@@ -146,7 +143,6 @@ impl AssetLoader for MeshletMeshSaverLoader {
         }
 
         // Load and decompress asset data
-        let worst_case_meshlet_triangles = async_read_u64(reader).await?;
         let reader = &mut FrameDecoder::new(AsyncReadSyncAdapter(reader));
         let vertex_data = read_slice(reader)?;
         let vertex_ids = read_slice(reader)?;
@@ -155,7 +151,6 @@ impl AssetLoader for MeshletMeshSaverLoader {
         let bounding_spheres = read_slice(reader)?;
 
         Ok(MeshletMesh {
-            worst_case_meshlet_triangles,
             vertex_data,
             vertex_ids,
             indices,

diff --git a/crates/bevy_pbr/src/meshlet/copy_material_depth.wgsl b/crates/bevy_pbr/src/meshlet/copy_material_depth.wgsl
diff --git a/crates/bevy_pbr/src/meshlet/cull_clusters.wgsl b/crates/bevy_pbr/src/meshlet/cull_clusters.wgsl
@@ -9,9 +9,10 @@
     previous_view,
     should_cull_instance,
     cluster_is_second_pass_candidate,
-    meshlets,
-    draw_indirect_args,
-    draw_triangle_buffer,
+    meshlet_software_raster_indirect_args,
+    meshlet_hardware_raster_indirect_args,
+    meshlet_raster_clusters,
+    meshlet_raster_cluster_rightmost_slot,
 }
 #import bevy_render::maths::affine3_to_square
 
@@ -25,10 +26,10 @@
 fn cull_clusters(
     @builtin(workgroup_id) workgroup_id: vec3<u32>,
     @builtin(num_workgroups) num_workgroups: vec3<u32>,
-    @builtin(local_invocation_id) local_invocation_id: vec3<u32>,
+    @builtin(local_invocation_index) local_invocation_index: u32,
 ) {
     // Calculate the cluster ID for this thread
-    let cluster_id = local_invocation_id.x + 128u * dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
+    let cluster_id = local_invocation_index + 128u * dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
     if cluster_id >= arrayLength(&meshlet_cluster_meshlet_ids) { return; }
 
 #ifdef MESHLET_SECOND_CULLING_PASS
@@ -47,8 +48,8 @@ fn cull_clusters(
     let world_from_local = affine3_to_square(instance_uniform.world_from_local);
     let world_scale = max(length(world_from_local[0]), max(length(world_from_local[1]), length(world_from_local[2])));
     let bounding_spheres = meshlet_bounding_spheres[meshlet_id];
-    var culling_bounding_sphere_center = world_from_local * vec4(bounding_spheres.self_culling.center, 1.0);
-    var culling_bounding_sphere_radius = world_scale * bounding_spheres.self_culling.radius;
+    let culling_bounding_sphere_center = world_from_local * vec4(bounding_spheres.self_culling.center, 1.0);
+    let culling_bounding_sphere_radius = world_scale * bounding_spheres.self_culling.radius;
 
 #ifdef MESHLET_FIRST_CULLING_PASS
     // Frustum culling
@@ -59,17 +60,17 @@ fn cull_clusters(
         }
     }
 
-    // Calculate view-space LOD bounding sphere for the meshlet
+    // Calculate view-space LOD bounding sphere for the cluster
     let lod_bounding_sphere_center = world_from_local * vec4(bounding_spheres.self_lod.center, 1.0);
     let lod_bounding_sphere_radius = world_scale * bounding_spheres.self_lod.radius;
     let lod_bounding_sphere_center_view_space = (view.view_from_world * vec4(lod_bounding_sphere_center.xyz, 1.0)).xyz;
 
-    // Calculate view-space LOD bounding sphere for the meshlet's parent
+    // Calculate view-space LOD bounding sphere for the cluster's parent
     let parent_lod_bounding_sphere_center = world_from_local * vec4(bounding_spheres.parent_lod.center, 1.0);
     let parent_lod_bounding_sphere_radius = world_scale * bounding_spheres.parent_lod.radius;
     let parent_lod_bounding_sphere_center_view_space = (view.view_from_world * vec4(parent_lod_bounding_sphere_center.xyz, 1.0)).xyz;
 
-    // Check LOD cut (meshlet error imperceptible, and parent error not imperceptible)
+    // Check LOD cut (cluster error imperceptible, and parent error not imperceptible)
     let lod_is_ok = lod_error_is_imperceptible(lod_bounding_sphere_center_view_space, lod_bounding_sphere_radius);
     let parent_lod_is_ok = lod_error_is_imperceptible(parent_lod_bounding_sphere_center_view_space, parent_lod_bounding_sphere_radius);
     if !lod_is_ok || parent_lod_is_ok { return; }
@@ -79,16 +80,20 @@ fn cull_clusters(
 #ifdef MESHLET_FIRST_CULLING_PASS
     let previous_world_from_local = affine3_to_square(instance_uniform.previous_world_from_local);
     let previous_world_from_local_scale = max(length(previous_world_from_local[0]), max(length(previous_world_from_local[1]), length(previous_world_from_local[2])));
-    culling_bounding_sphere_center = previous_world_from_local * vec4(bounding_spheres.self_culling.center, 1.0);
-    culling_bounding_sphere_radius = previous_world_from_local_scale * bounding_spheres.self_culling.radius;
+    let occlusion_culling_bounding_sphere_center = previous_world_from_local * vec4(bounding_spheres.self_culling.center, 1.0);
+    let occlusion_culling_bounding_sphere_radius = previous_world_from_local_scale * bounding_spheres.self_culling.radius;
+    let occlusion_culling_bounding_sphere_center_view_space = (previous_view.view_from_world * vec4(occlusion_culling_bounding_sphere_center.xyz, 1.0)).xyz;
+#else
+    let occlusion_culling_bounding_sphere_center = culling_bounding_sphere_center;
+    let occlusion_culling_bounding_sphere_radius = culling_bounding_sphere_radius;
+    let occlusion_culling_bounding_sphere_center_view_space = (view.view_from_world * vec4(occlusion_culling_bounding_sphere_center.xyz, 1.0)).xyz;
 #endif
-    let culling_bounding_sphere_center_view_space = (view.view_from_world * vec4(culling_bounding_sphere_center.xyz, 1.0)).xyz;
 
-    let aabb = project_view_space_sphere_to_screen_space_aabb(culling_bounding_sphere_center_view_space, culling_bounding_sphere_radius);
+    var aabb = project_view_space_sphere_to_screen_space_aabb(occlusion_culling_bounding_sphere_center_view_space, occlusion_culling_bounding_sphere_radius);
     let depth_pyramid_size_mip_0 = vec2<f32>(textureDimensions(depth_pyramid, 0));
-    let width = (aabb.z - aabb.x) * depth_pyramid_size_mip_0.x;
-    let height = (aabb.w - aabb.y) * depth_pyramid_size_mip_0.y;
-    let depth_level = max(0, i32(ceil(log2(max(width, height))))); // TODO: Naga doesn't like this being a u32
+    var aabb_width_pixels = (aabb.z - aabb.x) * depth_pyramid_size_mip_0.x;
+    var aabb_height_pixels = (aabb.w - aabb.y) * depth_pyramid_size_mip_0.y;
+    let depth_level = max(0, i32(ceil(log2(max(aabb_width_pixels, aabb_height_pixels))))); // TODO: Naga doesn't like this being a u32
     let depth_pyramid_size = vec2<f32>(textureDimensions(depth_pyramid, depth_level));
     let aabb_top_left = vec2<u32>(aabb.xy * depth_pyramid_size);
 
@@ -102,11 +107,11 @@ fn cull_clusters(
     var cluster_visible: bool;
     if view.clip_from_view[3][3] == 1.0 {
         // Orthographic
-        let sphere_depth = view.clip_from_view[3][2] + (culling_bounding_sphere_center_view_space.z + culling_bounding_sphere_radius) * view.clip_from_view[2][2];
+        let sphere_depth = view.clip_from_view[3][2] + (occlusion_culling_bounding_sphere_center_view_space.z + occlusion_culling_bounding_sphere_radius) * view.clip_from_view[2][2];
         cluster_visible = sphere_depth >= occluder_depth;
     } else {
         // Perspective
-        let sphere_depth = -view.clip_from_view[3][2] / (culling_bounding_sphere_center_view_space.z + culling_bounding_sphere_radius);
+        let sphere_depth = -view.clip_from_view[3][2] / (occlusion_culling_bounding_sphere_center_view_space.z + occlusion_culling_bounding_sphere_radius);
         cluster_visible = sphere_depth >= occluder_depth;
     }
 
@@ -118,15 +123,29 @@ fn cull_clusters(
     }
 #endif
 
-    // Append a list of this cluster's triangles to draw if not culled
-    if cluster_visible {
-        let meshlet_triangle_count = meshlets[meshlet_id].triangle_count;
-        let buffer_start = atomicAdd(&draw_indirect_args.vertex_count, meshlet_triangle_count * 3u) / 3u;
-        let cluster_id_packed = cluster_id << 6u;
-        for (var triangle_id = 0u; triangle_id < meshlet_triangle_count; triangle_id++) {
-            draw_triangle_buffer[buffer_start + triangle_id] = cluster_id_packed | triangle_id;
-        }
+    // Cluster would be occluded if drawn, so don't setup a draw for it
+    if !cluster_visible { return; }
+
+    // Check how big the cluster is in screen space
+#ifdef MESHLET_FIRST_CULLING_PASS
+    let culling_bounding_sphere_center_view_space = (view.view_from_world * vec4(culling_bounding_sphere_center.xyz, 1.0)).xyz;
+    aabb = project_view_space_sphere_to_screen_space_aabb(culling_bounding_sphere_center_view_space, culling_bounding_sphere_radius);
+    aabb_width_pixels = (aabb.z - aabb.x) * view.viewport.z;
+    aabb_height_pixels = (aabb.w - aabb.y) * view.viewport.w;
+#endif
+    let cluster_is_small = all(vec2(aabb_width_pixels, aabb_height_pixels) < vec2(32.0)); // TODO: Nanite does something different. Come up with my own heuristic.
+
+    // TODO: Also check if needs depth clipping
+    var buffer_slot: u32;
+    if cluster_is_small {
+        // Append this cluster to the list for software rasterization
+        buffer_slot = atomicAdd(&meshlet_software_raster_indirect_args.x, 1u);
+    } else {
+        // Append this cluster to the list for hardware rasterization
+        buffer_slot = atomicAdd(&meshlet_hardware_raster_indirect_args.instance_count, 1u);
+        buffer_slot = meshlet_raster_cluster_rightmost_slot - buffer_slot;
     }
+    meshlet_raster_clusters[buffer_slot] = cluster_id;
 }
 
 // https://stackoverflow.com/questions/21648630/radius-of-projected-sphere-in-screen-space/21649403#21649403
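
The end of `cull_clusters` above fills one buffer from both ends: software raster clusters from the left, hardware raster clusters from the right. A minimal CPU sketch of that allocation scheme follows. `RasterClusters` and its methods are hypothetical names, the two counters stand in for the indirect args fields the shader increments, and the 32 pixel threshold is the placeholder heuristic from the diff. Like the shader, the sketch does not detect the two sides running into each other.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical CPU model of the shared raster cluster buffer.
struct RasterClusters {
    software_count: AtomicU32, // stands in for meshlet_software_raster_indirect_args.x
    hardware_count: AtomicU32, // stands in for meshlet_hardware_raster_indirect_args.instance_count
    slots: Vec<AtomicU32>,     // stands in for meshlet_raster_clusters
}

impl RasterClusters {
    /// The capacity is fixed up front, as it is at MeshletPlugin build time.
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self {
            software_count: AtomicU32::new(0),
            hardware_count: AtomicU32::new(0),
            slots: (0..capacity).map(|_| AtomicU32::new(0)).collect(),
        }
    }

    /// Appends a visible cluster and returns the slot it was written to.
    /// Overflow is not handled here: with too many clusters the two sides
    /// overwrite each other or the subtraction underflows.
    fn push(&self, cluster_id: u32, aabb_width_pixels: f32, aabb_height_pixels: f32) -> usize {
        let rightmost_slot = self.slots.len() as u32 - 1;
        // Placeholder heuristic: small on both axes means software raster.
        let cluster_is_small = aabb_width_pixels < 32.0 && aabb_height_pixels < 32.0;
        let slot = if cluster_is_small {
            self.software_count.fetch_add(1, Ordering::Relaxed)
        } else {
            rightmost_slot - self.hardware_count.fetch_add(1, Ordering::Relaxed)
        };
        self.slots[slot as usize].store(cluster_id, Ordering::Relaxed);
        slot as usize
    }

    /// Cluster IDs queued for software raster, read from the left.
    fn software_clusters(&self) -> Vec<u32> {
        let n = self.software_count.load(Ordering::Relaxed) as usize;
        self.slots[..n].iter().map(|s| s.load(Ordering::Relaxed)).collect()
    }

    /// Cluster IDs queued for hardware raster, read from the right.
    fn hardware_clusters(&self) -> Vec<u32> {
        let n = self.hardware_count.load(Ordering::Relaxed) as usize;
        self.slots.iter().rev().take(n).map(|s| s.load(Ordering::Relaxed)).collect()
    }
}
```

Each counter doubles as the dispatch or draw size for its raster pass, so no separate compaction step is needed after culling.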

diff --git a/crates/bevy_pbr/src/meshlet/downsample_depth.wgsl b/crates/bevy_pbr/src/meshlet/downsample_depth.wgsl
@@ -1,4 +1,8 @@
-@group(0) @binding(0) var mip_0: texture_depth_2d;
+#ifdef MESHLET_VISIBILITY_BUFFER_RASTER_PASS_OUTPUT
+@group(0) @binding(0) var<storage, read> mip_0: array<u64>; // Per pixel
+#else
+@group(0) @binding(0) var<storage, read> mip_0: array<u32>; // Per pixel
+#endif
 @group(0) @binding(1) var mip_1: texture_storage_2d<r32float, write>;
 @group(0) @binding(2) var mip_2: texture_storage_2d<r32float, write>;
 @group(0) @binding(3) var mip_3: texture_storage_2d<r32float, write>;
@@ -12,11 +16,16 @@
 @group(0) @binding(11) var mip_11: texture_storage_2d<r32float, write>;
 @group(0) @binding(12) var mip_12: texture_storage_2d<r32float, write>;
 @group(0) @binding(13) var samplr: sampler;
-var<push_constant> max_mip_level: u32;
+struct Constants { max_mip_level: u32, view_width: u32 }
+var<push_constant> constants: Constants;
 
 /// Generates a hierarchical depth buffer.
 /// Based on FidelityFX SPD v2.1 https://github.com/GPUOpen-LibrariesAndSDKs/FidelityFX-SDK/blob/d7531ae47d8b36a5d4025663e731a47a38be882f/sdk/include/FidelityFX/gpu/spd/ffx_spd.h#L528
 
+// TODO:
+// * Subgroup support
+// * True single pass downsampling
+
 var<workgroup> intermediate_memory: array<array<f32, 16>, 16>;
 
 @compute
@@ -70,7 +79,7 @@ fn downsample_mips_0_and_1(x: u32, y: u32, workgroup_id: vec2u, local_invocation
     v[3] = reduce_load_mip_0(tex);
     textureStore(mip_1, pix, vec4(v[3]));
 
-    if max_mip_level <= 1u { return; }
+    if constants.max_mip_level <= 1u { return; }
 
     for (var i = 0u; i < 4u; i++) {
         intermediate_memory[x][y] = v[i];
@@ -100,19 +109,19 @@ fn downsample_mips_0_and_1(x: u32, y: u32, workgroup_id: vec2u, local_invocation
 }
 
 fn downsample_mips_2_to_5(x: u32, y: u32, workgroup_id: vec2u, local_invocation_index: u32) {
-    if max_mip_level <= 2u { return; }
+    if constants.max_mip_level <= 2u { return; }
     workgroupBarrier();
     downsample_mip_2(x, y, workgroup_id, local_invocation_index);
 
-    if max_mip_level <= 3u { return; }
+    if constants.max_mip_level <= 3u { return; }
     workgroupBarrier();
     downsample_mip_3(x, y, workgroup_id, local_invocation_index);
 
-    if max_mip_level <= 4u { return; }
+    if constants.max_mip_level <= 4u { return; }
     workgroupBarrier();
     downsample_mip_4(x, y, workgroup_id, local_invocation_index);
 
-    if max_mip_level <= 5u { return; }
+    if constants.max_mip_level <= 5u { return; }
     workgroupBarrier();
     downsample_mip_5(workgroup_id, local_invocation_index);
 }
@@ -191,27 +200,27 @@ fn downsample_mips_6_and_7(x: u32, y: u32) {
     v[3] = reduce_load_mip_6(tex);
     textureStore(mip_7, pix, vec4(v[3]));
 
-    if max_mip_level <= 7u { return; }
+    if constants.max_mip_level <= 7u { return; }
 
     let vr = reduce_4(v);
     textureStore(mip_8, vec2(x, y), vec4(vr));
     intermediate_memory[x][y] = vr;
 }
 
 fn downsample_mips_8_to_11(x: u32, y: u32, local_invocation_index: u32) {
-    if max_mip_level <= 8u { return; }
+    if constants.max_mip_level <= 8u { return; }
     workgroupBarrier();
     downsample_mip_8(x, y, local_invocation_index);
 
-    if max_mip_level <= 9u { return; }
+    if constants.max_mip_level <= 9u { return; }
     workgroupBarrier();
     downsample_mip_9(x, y, local_invocation_index);
 
-    if max_mip_level <= 10u { return; }
+    if constants.max_mip_level <= 10u { return; }
     workgroupBarrier();
     downsample_mip_10(x, y, local_invocation_index);
 
-    if max_mip_level <= 11u { return; }
+    if constants.max_mip_level <= 11u { return; }
     workgroupBarrier();
     downsample_mip_11(local_invocation_index);
 }
@@ -275,8 +284,11 @@ fn remap_for_wave_reduction(a: u32) -> vec2u {
 }
 
 fn reduce_load_mip_0(tex: vec2u) -> f32 {
-    let uv = (vec2f(tex) + 0.5) / vec2f(textureDimensions(mip_0));
-    return reduce_4(textureGather(mip_0, samplr, uv));
+    let a = load_mip_0(tex.x, tex.y);
+    let b = load_mip_0(tex.x + 1u, tex.y);
+    let c = load_mip_0(tex.x, tex.y + 1u);
+    let d = load_mip_0(tex.x + 1u, tex.y + 1u);
+    return reduce_4(vec4(a, b, c, d));
 }
 
 fn reduce_load_mip_6(tex: vec2u) -> f32 {
@@ -288,6 +300,15 @@ fn reduce_load_mip_6(tex: vec2u) -> f32 {
     ));
 }
 
+fn load_mip_0(x: u32, y: u32) -> f32 {
+    let i = y * constants.view_width + x;
+#ifdef MESHLET_VISIBILITY_BUFFER_RASTER_PASS_OUTPUT
+    return bitcast<f32>(u32(mip_0[i] >> 32u));
+#else
+    return bitcast<f32>(mip_0[i]);
+#endif
+}
+
 fn reduce_4(v: vec4f) -> f32 {
     return min(min(v.x, v.y), min(v.z, v.w));
 }
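
With mip 0 now read from the visbuffer instead of a depth texture, `load_mip_0` recovers depth from the upper 32 bits of each packed sample and `reduce_load_mip_0` reduces a 2x2 block by hand. The sketch below mirrors that logic on the CPU for the 64-bit variant only. The slice parameter replaces the storage buffer binding, and the sketch assumes the 2x2 block lies fully inside the buffer, where the shader relies on the GPU's out-of-bounds handling.

```rust
/// Reads the depth of pixel (x, y) from a row-major buffer of packed 64-bit
/// visbuffer samples: depth is the f32 whose bits sit in the upper 32 bits.
fn load_mip_0(visbuffer: &[u64], view_width: u32, x: u32, y: u32) -> f32 {
    let i = (y * view_width + x) as usize;
    f32::from_bits((visbuffer[i] >> 32) as u32)
}

/// Keeps the smallest of four depths. With reverse-Z (an assumption here)
/// that is the farthest sample, which is the conservative choice for a
/// depth pyramid used in occlusion culling.
fn reduce_4(v: [f32; 4]) -> f32 {
    v[0].min(v[1]).min(v[2].min(v[3]))
}

/// Reduces the 2x2 block whose top-left pixel is (x, y) to a single depth.
/// Assumption: (x + 1, y + 1) is still inside the buffer.
fn reduce_load_mip_0(visbuffer: &[u64], view_width: u32, x: u32, y: u32) -> f32 {
    reduce_4([
        load_mip_0(visbuffer, view_width, x, y),
        load_mip_0(visbuffer, view_width, x + 1, y),
        load_mip_0(visbuffer, view_width, x, y + 1),
        load_mip_0(visbuffer, view_width, x + 1, y + 1),
    ])
}
```

The lower 32 bits (cluster + triangle IDs) are simply discarded by the shift, so the downsample pass never needs to know how the IDs are packed.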
diff --git a/crates/bevy_pbr/src/meshlet/fill_cluster_buffers.wgsl b/crates/bevy_pbr/src/meshlet/fill_cluster_buffers.wgsl
@@ -13,11 +13,11 @@
 fn fill_cluster_buffers(
     @builtin(workgroup_id) workgroup_id: vec3<u32>,
     @builtin(num_workgroups) num_workgroups: vec3<u32>,
-    @builtin(local_invocation_id) local_invocation_id: vec3<u32>
+    @builtin(local_invocation_index) local_invocation_index: u32,
 ) {
     // Calculate the cluster ID for this thread
-    let cluster_id = local_invocation_id.x + 128u * dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
-    if cluster_id >= cluster_count { return; }
+    let cluster_id = local_invocation_index + 128u * dot(workgroup_id, vec3(num_workgroups.x * num_workgroups.x, num_workgroups.x, 1u));
+    if cluster_id >= cluster_count { return; } // TODO: Could be an arrayLength?
 
     // Binary search to find the instance this cluster belongs to
     var left = 0u;