Change Entity::generation from u32 to NonZeroU32 for niche optimization #9907

Conversation

notverymoe
Contributor

@notverymoe notverymoe commented Sep 23, 2023

Objective

Discussion

Solution

  • Change Entity::generation from u32 to NonZeroU32 to allow for niche optimization (see the size sketch after this list).
    • The reason for changing generation rather than index is that the costs are only incurred on Entity free, instead of on Entity alloc.
    • There was some concern with using generation bits, due to a desire to introduce flags. That concern applied more to the original retirement approach; in reality, even if generations were reduced to 24 bits we would still have 16 million generations available before wrapping, and current ideas indicate we would use closer to 4 bits for flags.
    • Another concern was the representation of relationships, where NonZeroU32 prevents us from using the full address space. Talking with Joy, it seems unlikely to be an issue: the majority of these entity references will be low-index entries (i.e. ChildOf, Owes), which can be fast lookups, and the remainder of the range can use slower lookups to map to the address space.
    • It has the additional benefit of being less visible to most users, since generation is only ever really set through from_bits-style methods.
  • EntityMeta was changed to match
  • On free, generation now explicitly wraps:
    • Originally, generation would panic in debug mode and wrap in release mode due to using regular ops.
    • The first attempt at this PR changed the behavior to "retire" slots and remove them from use when generations overflowed. This change was controversial, and likely needs a proper RFC/discussion.
    • Wrapping matches current release behaviour, and should therefore be less controversial.
    • Wrapping also migrates more easily to the retirement approach: users likely to exhaust the exorbitant supply of generations will code defensively against aliasing, and that defensive code is less likely to break than code assuming generations don't wrap.
    • We use some unsafe code here when wrapping generations, to avoid a branch on NonZeroU32 construction. It's guaranteed safe due to how we perform the wrapping, and it results in significantly smaller assembly.
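
To make the niche concrete, here is a minimal sketch (stand-in structs, not Bevy's actual definitions): with a NonZeroU32 generation the compiler can use an all-zero generation as the None representation, so Option<Entity> costs nothing extra, whereas a plain u32 generation forces a separate discriminant.

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

// Stand-in for the old layout: no forbidden bit pattern, so Option<_> needs
// extra space for its discriminant.
#[allow(dead_code)]
struct OldEntity {
    generation: u32,
    index: u32,
}

// Stand-in for the new layout: generation can never be 0, so the compiler can
// represent None as an all-zero generation (the "niche optimization").
#[allow(dead_code)]
struct NewEntity {
    generation: NonZeroU32,
    index: u32,
}

fn main() {
    assert_eq!(size_of::<OldEntity>(), 8);
    assert_eq!(size_of::<Option<OldEntity>>(), 12); // discriminant + padding
    assert_eq!(size_of::<NewEntity>(), 8);
    assert_eq!(size_of::<Option<NewEntity>>(), 8); // niche: no extra space
}
```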

Migration

  • Previous bevy_scene serializations have a high likelihood of being broken, as they contain 0th generation entities.

Current Issues

  • Entities::reserve_generations and EntityMapper now wrap even in debug builds - although they technically already did in release mode, so this probably isn't a huge issue. It just depends on whether we need to change anything here.

@A-Walrus
Contributor

We might want to write a little migration script that increases the generation of every entity in a scene file by one

Contributor

@atlv24 atlv24 left a comment

appears sensible, should probably merge after #9797 though

@james7132 james7132 added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times C-Usability A simple quality-of-life change that makes Bevy easier to use S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help labels Sep 24, 2023
@notverymoe
Contributor Author

We might want to write a little migration script that increases the generation of every entity in a scene file by one

Do we have any prior art for this? Be good to look at before I give it a go

@notverymoe
Contributor Author

appears sensible, should probably merge after #9797 though

Seems like this'll need a rework after that lands, which is fine - I'll keep my eyes out for it. I'll look at doing the benchmarking after that's in and I fix things up.

@Kolsky

Kolsky commented Sep 25, 2023

The wrapping stuff

If you want to do the usual wrapping for addition of two integers, here is how (playground)

Proof that it works: adding two integers will result in exactly 0 or 1 overflows (2 × uN::MAX = (overflow_flag, uN::MAX - 1)). The case with 0 overflows is trivial (just add them normally). If overflow happens, the result would be off by one under normal wrapping. There are two extreme cases: if the result was uN::MIN, adding 1 makes it NonZeroUN::MIN; if the result was uN::MAX - 1, adding 1 makes it NonZeroUN::MAX. Thus, neither underflow nor overflow of the NonZero domain is possible. There's also a commented-out simple test for Miri, though it needs to be run locally.

Tl;dr: it's as simple as (x + y + overflow_flag)
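
A self-contained sketch of the trick (my naming; the linked playground has the original, plus the Miri test). The checked construction at the end just makes the safety argument explicit; the PR itself uses `NonZeroU32::new_unchecked` to avoid the branch entirely.

```rust
use std::num::NonZeroU32;

/// Wrapping add over the non-zero domain [1, u32::MAX]: fold the overflow
/// flag back in so the result skips 0.
fn nonzero_wrapping_add(lhs: NonZeroU32, rhs: u32) -> NonZeroU32 {
    let (sum, overflowed) = lhs.get().overflowing_add(rhs);
    // On overflow, the plain wrapped sum is off by one relative to wrapping
    // over [1, u32::MAX]; adding the flag corrects it and cannot overflow
    // again (sum <= u32::MAX - 1 whenever overflowed is true).
    let ret = sum + overflowed as u32;
    // ret == 0 would require sum == 0 without overflow, i.e. lhs == 0,
    // which NonZeroU32 rules out.
    NonZeroU32::new(ret).expect("result is always non-zero")
}

fn main() {
    let max = NonZeroU32::new(u32::MAX).unwrap();
    assert_eq!(nonzero_wrapping_add(max, 1).get(), 1); // wraps past 0
    assert_eq!(nonzero_wrapping_add(NonZeroU32::new(1).unwrap(), 41).get(), 42);
}
```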

@notverymoe
Contributor Author

@Kolsky Oh, that's really excellent :O Thanks for that!

@notverymoe
Contributor Author

The optimized assembly is also really clean, nice!

@notverymoe
Contributor Author

notverymoe commented Sep 30, 2023

Still waiting on that other PR to merge, but I ran the benchmarks with 857fb9c:
https://gist.github.com/notverymoe/52b3d4605ccac8aaead2a730387f6e9d

Cut-down table excluding changes under 5%, grepped for "entit":

group                                           main_ecs                niche_ecs
-----                                           --------                ---------
busy_systems/01x_entities_03_systems            1.00     23.1±0.83µs    1.06     24.4±2.05µs
busy_systems/01x_entities_06_systems            1.12     48.0±9.59µs    1.00     42.7±9.09µs
busy_systems/01x_entities_09_systems            1.26    79.9±19.15µs    1.00     63.7±8.43µs
busy_systems/01x_entities_12_systems            1.38   122.1±15.87µs    1.00    88.3±15.01µs
busy_systems/01x_entities_15_systems            1.00   105.9±26.01µs    1.10   116.2±23.13µs
busy_systems/02x_entities_06_systems            1.14    84.9±20.00µs    1.00     74.4±9.35µs
busy_systems/02x_entities_09_systems            1.00   104.0±15.44µs    1.23   128.0±11.44µs
busy_systems/02x_entities_15_systems            1.15   230.6±32.33µs    1.00   200.9±44.35µs
busy_systems/03x_entities_06_systems            1.00   116.5±27.66µs    1.13   132.0±23.04µs
busy_systems/03x_entities_09_systems            1.00   168.3±40.31µs    1.11   186.2±40.31µs
busy_systems/03x_entities_12_systems            1.00   200.4±38.06µs    1.12   224.2±43.29µs
busy_systems/04x_entities_03_systems            1.20    77.4±13.84µs    1.00     64.8±7.06µs
busy_systems/04x_entities_12_systems            1.21   291.6±56.49µs    1.00   240.2±40.80µs
busy_systems/05x_entities_03_systems            1.00    80.4±13.44µs    1.06    85.0±14.62µs
contrived/01x_entities_03_systems               1.00     18.1±4.14µs    1.28     23.2±6.56µs
contrived/01x_entities_06_systems               1.00     29.1±6.59µs    1.17     34.1±6.44µs
contrived/01x_entities_12_systems               1.23    64.4±14.72µs    1.00    52.2±12.26µs
contrived/02x_entities_03_systems               1.19     31.4±6.97µs    1.00     26.4±6.17µs
contrived/02x_entities_06_systems               1.00     37.0±5.99µs    1.21    44.7±14.30µs
contrived/02x_entities_09_systems               1.00     52.3±9.52µs    1.19    62.3±18.32µs
contrived/02x_entities_12_systems               1.32    92.1±23.38µs    1.00    69.8±12.67µs
contrived/03x_entities_03_systems               1.08     39.6±9.53µs    1.00     36.5±6.89µs
contrived/03x_entities_09_systems               1.18    85.5±22.96µs    1.00    72.4±16.48µs
contrived/03x_entities_12_systems               1.00    90.6±18.26µs    1.09    99.1±25.23µs
contrived/04x_entities_09_systems               1.00   102.1±26.46µs    1.19   121.4±24.96µs
contrived/04x_entities_12_systems               1.23   141.9±31.90µs    1.00   115.5±21.08µs
contrived/05x_entities_03_systems               1.52     59.9±7.86µs    1.00     39.5±7.64µs
contrived/05x_entities_06_systems               1.00    72.0±13.10µs    1.20    86.6±18.86µs
contrived/05x_entities_12_systems               1.09   172.2±44.12µs    1.00   158.2±31.92µs
spawn_world/10000_entities                      1.00   473.3±42.80µs    1.10   520.0±84.80µs
spawn_world/1000_entities                       1.00     47.3±4.68µs    1.08     51.0±9.35µs
spawn_world/100_entities                        1.00      4.8±0.52µs    1.06      5.1±0.93µs
world_query_for_each/50000_entities_sparse      1.05     43.1±0.20µs    1.00     41.0±0.15µs
world_query_get/50000_entities_table_wide       1.00    119.4±0.39µs    1.12    133.3±0.64µs
world_query_iter/50000_entities_sparse          1.00     50.4±0.34µs    1.12     56.2±1.15µs

@notverymoe
Contributor Author

notverymoe commented Sep 30, 2023

I'm not too familiar with those numbers, but they seem rather inconsistent between different magnitudes of test:
AMD 5950X, 32GiB ram, Running EndeavourOS with Linux 6.5.4, rustc 1.72.1

I shut everything down on the machine except the terminal running the test.

@alice-i-cecile
Member

Yeah, the benchmarks are definitely a bit noisy, both here and in general 🤔 I can't make clear sense of this unfortunately.

@notverymoe
Contributor Author

Elabajaba on Discord pointed out that since I'm on a Zen 3 CPU, I'm probably getting different boosts on every core. I'll try disabling that and running the tests again, to see if I get something more stable and less noisy.

@notverymoe
Contributor Author

notverymoe commented Oct 12, 2023

Yep, that was it. I disabled Precision Boost and locked my CPU scaling to 2.2GHz on Linux. Much more stable, not as much swing between tests, still a little noisy. But I think maybe I need to run a few more tests, because the results are a little strange in places (30% gain on events_iter??)

Ran against bb13d06 on main, and against bb13d06 merged into the branch, with a threshold of 5%:

group                                           main                    niche_ecs
-----                                           ------                  -----------
added_archetypes/archetype_count/100            1.35     59.2±1.25µs    1.00     43.8±0.33µs
added_archetypes/archetype_count/1000           1.14   1059.4±5.62µs    1.00    930.6±3.45µs
added_archetypes/archetype_count/10000          1.18     12.8±0.53ms    1.00     10.9±0.27ms
added_archetypes/archetype_count/200            1.24    108.4±2.56µs    1.00     87.4±4.70µs
added_archetypes/archetype_count/2000           1.13      2.2±0.02ms    1.00   1957.9±9.80µs
added_archetypes/archetype_count/500            1.22   413.1±29.71µs    1.00   338.7±22.49µs
added_archetypes/archetype_count/5000           1.16      5.9±0.03ms    1.00      5.1±0.02ms
build_schedule/1000_schedule                    1.08       3.6±0.02s    1.00       3.3±0.04s
build_schedule/1000_schedule_noconstraints      1.20     32.0±0.27ms    1.00     26.5±0.35ms
build_schedule/100_schedule                     1.09     22.0±0.06ms    1.00     20.3±0.04ms
build_schedule/500_schedule                     1.09   687.7±10.20ms    1.00    633.1±7.57ms
busy_systems/01x_entities_06_systems            1.00     72.9±4.26µs    1.08     78.4±4.59µs
busy_systems/01x_entities_12_systems            1.00    131.8±3.45µs    1.05    138.9±3.91µs
busy_systems/01x_entities_15_systems            1.00    164.4±4.84µs    1.06    173.7±7.20µs
busy_systems/02x_entities_06_systems            1.00    120.8±2.33µs    1.06    127.5±3.24µs
busy_systems/05x_entities_03_systems            1.00    139.9±1.11µs    1.13    158.4±3.67µs
busy_systems/05x_entities_06_systems            1.00    256.1±1.77µs    1.16    296.3±1.77µs
busy_systems/05x_entities_09_systems            1.00    379.2±2.08µs    1.15    437.7±6.34µs
busy_systems/05x_entities_12_systems            1.00    502.6±6.63µs    1.15    575.8±7.26µs
busy_systems/05x_entities_15_systems            1.00    622.2±8.26µs    1.15    713.9±8.03µs
contrived/01x_entities_06_systems               1.12     58.7±4.38µs    1.00     52.4±2.42µs
contrived/01x_entities_15_systems               1.08    117.7±9.58µs    1.00    109.3±8.53µs
contrived/03x_entities_03_systems               1.00     54.0±3.66µs    1.11     60.2±4.75µs
contrived/03x_entities_09_systems               1.08   134.5±10.70µs    1.00    124.6±3.17µs
contrived/04x_entities_06_systems               1.06    113.1±5.34µs    1.00    106.6±3.01µs
contrived/05x_entities_09_systems               1.06   192.8±11.52µs    1.00    182.3±7.92µs
empty_systems/002_systems                       1.25     18.5±1.74µs    1.00     14.8±0.36µs
empty_systems/025_systems                       1.05     34.8±2.43µs    1.00     33.1±1.79µs
empty_systems/030_systems                       1.00     37.5±1.44µs    1.07     40.3±5.48µs
empty_systems/035_systems                       1.00     42.7±5.36µs    1.25    53.6±11.26µs
empty_systems/040_systems                       1.33    64.1±12.02µs    1.00     48.0±7.18µs
empty_systems/045_systems                       1.33    65.8±13.64µs    1.00     49.5±0.81µs
empty_systems/050_systems                       1.12    69.1±12.46µs    1.00    61.8±11.82µs
empty_systems/055_systems                       1.13    78.0±16.51µs    1.00    69.3±11.69µs
empty_systems/060_systems                       1.08    80.5±13.07µs    1.00     74.5±9.37µs
empty_systems/065_systems                       1.12    86.8±13.46µs    1.00     77.7±9.74µs
empty_systems/070_systems                       1.00    95.0±15.66µs    1.07   101.4±13.63µs
empty_systems/075_systems                       1.09    97.3±14.35µs    1.00    89.4±11.86µs
empty_systems/080_systems                       1.11   114.7±16.54µs    1.00   103.2±19.28µs
empty_systems/085_systems                       1.00     97.7±7.42µs    1.17   114.0±18.74µs
events_iter/size_16_events_100                  1.45    146.6±0.13ns    1.00    100.8±0.51ns
events_iter/size_16_events_1000                 1.50   1380.8±5.29ns    1.00    920.5±0.44ns
events_iter/size_16_events_10000                1.50     13.7±0.03µs    1.00      9.1±0.00µs
events_iter/size_16_events_50000                1.50     68.4±0.25µs    1.00     45.7±0.50µs
events_iter/size_4_events_100                   1.46    146.7±0.17ns    1.00    100.8±0.53ns
events_iter/size_4_events_1000                  1.50  1381.9±30.73ns    1.00    920.4±0.59ns
events_iter/size_4_events_10000                 1.50     13.7±0.05µs    1.00      9.1±0.01µs
events_iter/size_4_events_50000                 1.50     68.6±0.11µs    1.00     45.6±0.14µs
events_send/size_16_events_100                  1.00    238.7±0.42ns    1.14    271.7±0.57ns
events_send/size_512_events_100                 1.00      2.6±0.00µs    1.06      2.8±0.01µs
iter_fragmented/base                            1.00   799.9±24.48ns    1.29  1030.2±24.57ns
iter_fragmented_sparse/wide                     1.13     85.2±6.57ns    1.00    75.5±18.44ns
iter_simple/system                              1.64     22.9±0.03µs    1.00     14.0±0.07µs
iter_simple/wide                                1.00     65.1±0.88µs    1.13     73.8±4.25µs
query_get/50000_entities_table                  1.08    476.5±1.61µs    1.00    440.5±4.31µs
query_get_component/50000_entities_table        1.07   1039.1±7.86µs    1.00    966.6±8.49µs
query_get_many_2/50000_calls_table              1.07    835.6±7.38µs    1.00    780.8±1.99µs
run_condition/no/021_systems                    1.00     14.9±0.15µs    1.08     16.1±1.63µs
run_condition/no/026_systems                    1.00     15.7±1.47µs    1.10     17.3±1.16µs
run_condition/no/041_systems                    1.00     16.5±0.90µs    1.06     17.5±1.65µs
run_condition/no/066_systems                    1.00     17.2±1.37µs    1.14     19.7±0.58µs
run_condition/no/071_systems                    1.08     20.1±0.63µs    1.00     18.6±1.81µs
run_condition/no/076_systems                    1.00     17.1±0.64µs    1.23     21.1±2.06µs
run_condition/no/081_systems                    1.00     19.0±1.54µs    1.06     20.2±1.68µs
run_condition/no/086_systems                    1.00     20.0±1.60µs    1.06     21.3±0.41µs
run_condition/no/096_systems                    1.21     22.3±0.73µs    1.00     18.4±0.67µs
run_condition/yes/016_systems                   1.13     31.1±4.65µs    1.00     27.6±2.89µs
run_condition/yes/031_systems                   1.00     39.3±5.34µs    1.07     42.0±7.37µs
run_condition/yes/036_systems                   1.05     45.1±3.84µs    1.00     42.8±2.24µs
run_condition/yes/051_systems                   1.00     63.0±7.82µs    1.05    66.4±12.62µs
run_condition/yes/056_systems                   1.00    70.1±10.10µs    1.15    80.7±14.06µs
run_condition/yes/061_systems                   1.00     74.6±8.93µs    1.07    80.1±15.08µs
run_condition/yes/066_systems                   1.15   104.7±15.10µs    1.00    90.7±16.92µs
run_condition/yes/071_systems                   1.00    89.3±11.59µs    1.13   100.6±22.13µs
run_condition/yes/076_systems                   1.00     89.4±9.15µs    1.12   100.1±14.48µs
run_condition/yes/081_systems                   1.06    99.5±11.61µs    1.00     93.9±8.29µs
run_condition/yes/086_systems                   1.40   142.7±14.29µs    1.00   102.3±10.79µs
run_condition/yes/091_systems                   1.12   150.2±16.58µs    1.00   133.8±20.46µs
run_condition/yes/096_systems                   1.00   126.7±17.67µs    1.19   150.3±17.23µs
run_condition/yes_using_query/016_systems       1.07     28.6±2.33µs    1.00     26.8±0.42µs
run_condition/yes_using_query/031_systems       1.00     40.6±6.55µs    1.07     43.7±7.94µs
run_condition/yes_using_query/036_systems       1.11     47.2±7.30µs    1.00     42.7±0.89µs
run_condition/yes_using_query/046_systems       1.12    64.8±13.27µs    1.00     58.0±6.96µs
run_condition/yes_using_query/051_systems       1.00     64.4±7.90µs    1.07    69.0±12.70µs
run_condition/yes_using_query/061_systems       1.00    82.2±11.63µs    1.06    87.5±14.79µs
run_condition/yes_using_query/066_systems       1.08    88.3±13.49µs    1.00     81.7±9.15µs
run_condition/yes_using_query/071_systems       1.00    91.4±12.91µs    1.08    98.4±11.83µs
run_condition/yes_using_query/081_systems       1.00   105.3±19.28µs    1.07   112.2±14.51µs
run_condition/yes_using_query/086_systems       1.29   130.5±22.39µs    1.00    101.2±7.04µs
run_condition/yes_using_query/091_systems       1.15   126.8±18.31µs    1.00   110.2±12.02µs
run_condition/yes_using_query/101_systems       1.08   138.4±18.16µs    1.00   128.7±21.49µs
run_condition/yes_using_resource/036_systems    1.00     49.7±9.44µs    1.12    55.6±11.58µs
run_condition/yes_using_resource/046_systems    1.00     58.2±7.93µs    1.15    67.0±13.45µs
run_condition/yes_using_resource/051_systems    1.14    67.9±11.02µs    1.00     59.7±5.49µs
run_condition/yes_using_resource/056_systems    1.00    70.1±10.68µs    1.16    81.1±14.84µs
run_condition/yes_using_resource/061_systems    1.00     72.3±6.30µs    1.10    79.4±11.07µs
run_condition/yes_using_resource/066_systems    1.08    93.1±16.91µs    1.00    86.1±12.60µs
run_condition/yes_using_resource/076_systems    1.14    117.5±9.96µs    1.00   103.3±18.74µs
run_condition/yes_using_resource/081_systems    1.06   118.9±17.52µs    1.00   112.4±14.63µs
run_condition/yes_using_resource/086_systems    1.15   124.9±17.57µs    1.00   108.7±16.91µs
run_condition/yes_using_resource/096_systems    1.00   119.5±15.52µs    1.17   139.9±18.14µs
sized_commands_0_bytes/2000_commands            1.17      7.5±0.01µs    1.00      6.4±0.00µs
sized_commands_0_bytes/4000_commands            1.14     14.6±0.01µs    1.00     12.8±0.01µs
sized_commands_0_bytes/6000_commands            1.16     22.3±0.60µs    1.00     19.3±0.02µs
sized_commands_0_bytes/8000_commands            1.14     29.6±0.03µs    1.00     25.9±0.23µs
world_query_get/50000_entities_sparse_wide      1.00    389.1±0.51µs    1.07   415.6±40.24µs
world_query_iter/50000_entities_table           1.34     91.3±0.19µs    1.00     68.3±0.03µs

Seems like gains on added_archetypes, contrived, and events_iter, and losses on busy_systems and events_send. Definitely lots of noise on run_condition; it swings back and forth from test to test.

@alice-i-cecile alice-i-cecile removed the S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help label Oct 12, 2023
@alice-i-cecile alice-i-cecile added this to the 0.13 milestone Oct 12, 2023
@scottmcm scottmcm mentioned this pull request Nov 12, 2023
@@ -147,7 +149,7 @@ mod tests {
let mut world = World::new();
let mut mapper = EntityMapper::new(&mut map, &mut world);

- let mapped_ent = Entity::new(FIRST_IDX, 0);
+ let mapped_ent = Entity::new(FIRST_IDX, 1).unwrap();
Contributor

Looks like this file could just use from_raw instead of the cfg(test)-only function? And that'd save some unwrap clutter too:

Suggested change
- let mapped_ent = Entity::new(FIRST_IDX, 1).unwrap();
+ let mapped_ent = Entity::from_raw(FIRST_IDX);

Contributor Author

Excellent point, I can also make this change in the lib.rs tests in a few places.

@@ -156,7 +158,9 @@ mod tests {
"should persist the allocated mapping from the previous line"
);
assert_eq!(
- mapper.get_or_reserve(Entity::new(SECOND_IDX, 0)).index(),
+ mapper
+     .get_or_reserve(Entity::new(SECOND_IDX, 1).unwrap())
Contributor

Suggested change
- .get_or_reserve(Entity::new(SECOND_IDX, 1).unwrap())
+ .get_or_reserve(Entity::from_raw(SECOND_IDX))

@@ -191,7 +196,7 @@ impl Entity {
pub const fn from_raw(index: u32) -> Entity {
Entity {
index,
- generation: 0,
+ generation: NonZeroU32::MIN,
Contributor

Note that this function's rustdoc says "and a generation of 0", so that probably needs to be updated.

@@ -174,7 +178,7 @@ mod tests {
let mut world = World::new();

let dead_ref = EntityMapper::world_scope(&mut map, &mut world, |_, mapper| {
- mapper.get_or_reserve(Entity::new(0, 0))
+ mapper.get_or_reserve(Entity::new(0, 1).unwrap())
Contributor

Suggested change
- mapper.get_or_reserve(Entity::new(0, 1).unwrap())
+ mapper.get_or_reserve(Entity::from_raw(0))

github-merge-queue bot pushed a commit that referenced this pull request Nov 14, 2023
(This is my first PR here, so I've probably missed some things. Please
let me know what else I should do to help you as a reviewer!)

# Objective

Due to rust-lang/rust#117800, the `derive`'d
`PartialEq::eq` on `Entity` isn't as good as it could be. Since that's
used in hashtable lookup, let's improve it.

## Solution

The derived `PartialEq::eq` short-circuits if the generation doesn't
match. However, having a branch there is sub-optimal, especially on
64-bit systems like x64 that could just load the whole `Entity` in one
load anyway.

Due to complications around `poison` in LLVM and the exact details of
what unsafe code is allowed to do with references in Rust
(rust-lang/unsafe-code-guidelines#346), LLVM
isn't allowed to completely remove the short-circuiting. `&Entity` is
marked `dereferenceable(8)` so LLVM knows it's allowed to *load* all 8
bytes -- and does so -- but it has to assume that the `index` might be
undef/poison if the `generation` doesn't match, and thus while it finds
a way to do it without needing a branch, it has to do something slightly
more complicated than optimal to combine the results. (LLVM is allowed
to change non-short-circuiting code to use branches, but not the other
way around.)

Here's a link showing the codegen today:
<https://rust.godbolt.org/z/9WzjxrY7c>
```rust
#[no_mangle]
pub fn demo_eq_ref(a: &Entity, b: &Entity) -> bool {
    a == b
}
```
ends up generating the following assembly:
```asm
demo_eq_ref:
        movq    xmm0, qword ptr [rdi]
        movq    xmm1, qword ptr [rsi]
        pcmpeqd xmm1, xmm0
        pshufd  xmm0, xmm1, 80
        movmskpd        eax, xmm0
        cmp     eax, 3
        sete    al
        ret
```
(It's usually not this bad in real uses after inlining and LTO, but it
makes a strong demo.)

This PR manually implements `PartialEq::eq` *without* short-circuiting,
and because that tells LLVM that neither the generations nor the index
can be poison, it doesn't need to be so careful and can generate the
"just compare the two 64-bit values" code you'd have probably already
expected:
```asm
demo_eq_ref:
        mov     rax, qword ptr [rsi]
        cmp     qword ptr [rdi], rax
        sete    al
        ret
```
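
As a minimal sketch (not necessarily the exact impl in that PR), a non-short-circuiting comparison of this shape is enough to get the branch-free codegen above:

```rust
#[derive(Clone, Copy, Eq)]
pub struct Entity {
    generation: u32,
    index: u32,
}

impl PartialEq for Entity {
    #[inline]
    fn eq(&self, other: &Self) -> bool {
        // `&` evaluates both comparisons unconditionally, unlike the derive's
        // `&&`, so LLVM may treat the whole thing as one 64-bit compare.
        (self.generation == other.generation) & (self.index == other.index)
    }
}

fn main() {
    let a = Entity { generation: 1, index: 7 };
    let b = Entity { generation: 1, index: 7 };
    assert!(a == b);
}
```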

Since this doesn't change the representation of `Entity`, if it's
instead passed by *value*, then each `Entity` is two `u32` registers,
and the old and the new code do exactly the same thing. (Other
approaches, like changing `Entity` to be `[u32; 2]` or `u64`, affect
this case.)

This should hopefully merge easily with changes like
#9907 that also want to change
`Entity`.

## Benchmarks

I'm not super-confident that I got my machine fully consistent for
benchmarking, but whether I run the old or the new one first I get
reasonably consistent results.

Here's a fairly typical example of the benchmarks I added in this PR:

![image](https://github.com/bevyengine/bevy/assets/18526288/24226308-4616-4082-b0ff-88fc06285ef1)

Building the sets seems to be basically the same. It's usually reported
as noise, but sometimes I see a few percent slower or faster.

But lookup hits in particular -- since a hit checks that the key is
equal -- consistently shows around 10% improvement.

`cargo run --example many_cubes --features bevy/trace_tracy --release --
--benchmark` showed as slightly faster with this change, though if I had
to bet I'd probably say it's more noise than meaningful (but at least
it's not worse either):

![image](https://github.com/bevyengine/bevy/assets/18526288/58bb8c96-9c45-487f-a5ab-544bbfe9fba0)

This is my first PR here -- and my first time running Tracy -- so please
let me know what else I should run, or run things on your own more
reliable machines to double-check.

---

## Changelog

(probably not worth including)

Changed: micro-optimized `Entity::eq` to help LLVM slightly.

## Migration Guide

(I really hope nobody was using this on uninitialized entities where
sufficiently tortured `unsafe` could technically notice that this
has changed.)
@scottmcm
Contributor

30% gain on events_iter??

While I have no idea if that change is meaningful, note that in a microbenchmark dealing with iterators where the item type is Entity, a fairly substantial change isn't impossible. If it's dealing with .next() calls that are returning Option<Entity>, today that's an ABI like

define void @next1(ptr noalias nocapture noundef writeonly sret(%"core::option::Option<Entity1>") align 4 dereferenceable(12) %_0, ptr noalias nocapture noundef readonly align 4 dereferenceable(4) %it) unnamed_addr #0 {

where

%"core::option::Option<Entity1>" = type { i32, [2 x i32] }

vs with the niche it's a simple pair that can be returned directly (without going through stack)

define { i32, i32 } @next2(ptr noalias nocapture noundef readonly align 4 dereferenceable(4) %it) unnamed_addr #1 {

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=1019a1da83891162de90a4472f1c1b47

Hopefully inlining and LTO and such would remove the effects of those differences, but sometimes zero-cost abstractions aren't ☹️

(2-variant enums are sometimes worse than one would like -- see rust-lang/rust#85133 (comment) -- but once niched that stops happening.)

Comment on lines +915 to +916
let (lo, hi) = lhs.get().overflowing_add(rhs);
let ret = lo + hi as u32;
Contributor

Ohh, clever! This codegens really elegantly; nice work 👍

Contributor Author

@Kolsky up above came up with it:
#9907 (comment)

I was really surprised at its elegance; I'm definitely going to be using it in one or two places in my personal projects now that I know about it. I think, unfortunately, given my reading of:
#9797 (comment)

it means we won't be able to keep it (I think)

Contributor

Actually, I think we will be able to do something similar, but we'd need to wrap on however many bits we have available for the generation segment. So at the moment, that would be 31 bits. You'd have to do something like:

pub const fn nonzero_wrapping_high_increment(value: NonZeroU32) -> NonZeroU32 {
    let next_value = value.get().wrapping_add(1);
    // Mask the overflow bit
    let overflowed = (next_value & 0x8000_0000) >> 31;

    // Remove the overflow bit from the next value, but then add to it
    unsafe { NonZeroU32::new_unchecked((next_value & 0x7FFF_FFFF) + overflowed) }
}

As long as we know we are only incrementing by one each time, it should still output fairly terse asm (https://rust.godbolt.org/z/PnYTPfGb6). Basically the same principle as before, just applied to wrapping on 31 bits (or however many bits we need later).
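
For instance (assuming the nonzero_wrapping_high_increment sketch above is in scope), the 31-bit wrap lands on 1 rather than 0:

```rust
use std::num::NonZeroU32;

fn main() {
    // Top of the 31-bit range wraps back to 1, never 0.
    let top = NonZeroU32::new(0x7FFF_FFFF).unwrap();
    assert_eq!(nonzero_wrapping_high_increment(top).get(), 1);
    // An ordinary step just increments.
    assert_eq!(nonzero_wrapping_high_increment(NonZeroU32::new(5).unwrap()).get(), 6);
}
```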

Contributor Author

@notverymoe notverymoe Nov 22, 2023

Oh I see, that's interesting 🤔 We can definitely use that for the regular increment; it was just nice that this version worked for both slot incrementing and Entities::reserve_generations (which requires an arbitrary increment, but honestly might not even be correct).

Contributor Author

Tried a version with checked_add; it has a similar instruction count, but is probably more expensive.

@JoJoJet JoJoJet self-requested a review December 13, 2023 18:34
AxiomaticSemantics

This comment was marked as outdated.

Contributor

@maniwani maniwani left a comment

This looks good to me. I don't really have anything to add since we discussed most concerns in the linked Discord discussion. It's cool that we even avoided adding a branch to free.

@alice-i-cecile alice-i-cecile added this pull request to the merge queue Jan 8, 2024
@alice-i-cecile alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Jan 8, 2024
Merged via the queue into bevyengine:main with commit b257fff Jan 8, 2024
23 checks passed
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jan 22, 2024
Since #9907 the generation starts at `1` instead of `0` so
`Entity::to_bits` now returns `4294967296` (i.e. `u32::MAX + 1`) as the
lowest number instead of `0`.

Without this change scene loading fails with this error message:
`ERROR bevy_asset::server: Failed to load asset
'scenes/load_scene_example.scn.ron' with asset loader
'bevy_scene::scene_loader::SceneLoader': Could not parse RON: 8:6:
Invalid generation bits`
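
For reference, a sketch of the arithmetic (assuming the usual "generation in the high 32 bits, index in the low 32 bits" packing used by to_bits):

```rust
fn to_bits(generation: u32, index: u32) -> u64 {
    ((generation as u64) << 32) | index as u64
}

fn main() {
    // With generations starting at 1, the smallest possible value is
    // u32::MAX + 1 = 4294967296 rather than 0.
    assert_eq!(to_bits(1, 0), u32::MAX as u64 + 1);
    assert_eq!(to_bits(1, 0), 4_294_967_296);
}
```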
@Testare
Contributor

Testare commented Feb 19, 2024

Couldn't find information on this in the migration guide

@alice-i-cecile alice-i-cecile added the C-Breaking-Change A breaking change to Bevy's public API that needs to be noted in a migration guide label Feb 19, 2024
Contributor

It looks like your PR is a breaking change, but you didn't provide a migration guide.

Could you add some context on what users should update when this change gets released in a new version of Bevy?
It will be used to help write the migration guide for the version. Putting it after a ## Migration Guide header will help it get automatically picked up by our tooling.

@notverymoe
Contributor Author

I'm deeply sorry for the inconvenience I've caused with this. I'll look into some solutions.

github-merge-queue bot pushed a commit that referenced this pull request Mar 3, 2024
# Objective
Adoption of #2104 and #11843. The `Option<usize>` wastes 3-7 bytes of
memory per potential entry, and represents a scaling memory overhead as
the ID space grows.

The goal of this PR is to reduce memory usage without significantly
impacting common use cases.

Co-Authored By: @NathanSWard 
Co-Authored By: @tygyh 

## Solution
Replace `usize` in `SparseSet`'s sparse array with
`nonmax::NonMaxUsize`. NonMaxUsize wraps a NonZeroUsize, and applies a
bitwise NOT to the value when accessing it. This allows the compiler to
niche the value and eliminate the extra padding used for the `Option`
inside the sparse array, while moving the niche value from 0 to
usize::MAX instead.

Checking the [diff in x86 generated
assembly](james7132/bevy_asm_tests@6e4da65),
this change actually results in fewer instructions generated. One
potential downside is that it seems to have moved a load before a
branch, which means we may be incurring a cache miss even if the element
is not there.

Note: unlike #2104 and #11843, this PR only targets the metadata stores
for the ECS and not the component storage itself. Due to #9907 targeting
`Entity::generation` instead of `Entity::index`, `ComponentSparseSet`
storing only up to `u32::MAX` elements would become a correctness issue.

This will come with a cost when inserting items into the SparseSet, as
now there is a potential for a panic. These costs are really only
incurred when constructing a new Table, Archetype, or Resource that has
never been seen before by the World. All of these operations are fairly cold
and not on any particular hotpath, even for command application.
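
A sketch of the idea using only std (a hand-rolled stand-in for nonmax::NonMaxUsize, not the crate's actual implementation): storing the bitwise NOT inside a NonZeroUsize makes usize::MAX the one forbidden value, and the compiler can then niche Option<_> around it.

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

/// Stand-in for nonmax::NonMaxUsize: stores `!value`, so the only value that
/// cannot be represented is usize::MAX (whose NOT would be 0).
#[derive(Clone, Copy)]
struct NonMaxUsize(NonZeroUsize);

impl NonMaxUsize {
    fn new(value: usize) -> Option<Self> {
        NonZeroUsize::new(!value).map(Self)
    }
    fn get(self) -> usize {
        !self.0.get()
    }
}

fn main() {
    // The niche removes the Option discriminant entirely.
    assert_eq!(size_of::<Option<NonMaxUsize>>(), size_of::<usize>());
    assert_eq!(size_of::<Option<usize>>(), 2 * size_of::<usize>());

    // Round-trip through the NOT encoding.
    let idx = NonMaxUsize::new(42).unwrap();
    assert_eq!(idx.get(), 42);
    assert!(NonMaxUsize::new(usize::MAX).is_none());
}
```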

---

## Changelog
Changed: `SparseSet` now can only store up to `usize::MAX - 1` elements
instead of `usize::MAX`.
Changed: `SparseSet` now uses 33-50% less memory overhead per stored
item.
spectria-limina pushed a commit to spectria-limina/bevy that referenced this pull request Mar 9, 2024