Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize octree and fix leak #41123

Merged
merged 1 commit into from
Sep 30, 2020
Merged

Conversation

lawnjelly
Copy link
Member

@lawnjelly lawnjelly commented Aug 8, 2020

Octant Limit

Prevents adding new octants until a limiting number of elements have been added to the current octant. This enables balancing the benefits of brute force against the benefits of spatial partitioning. The limit can be set per octree.

Project settings are added for the rendering octree only to set the best balance per project depending on number of tests per frame / tick, and the amount of editing of the octree.

Octant Leak

Fixes octants being leaked when removing elements.
This is mirroring @naithar fix for the leak #41122, so it can be self contained.
On suggestion from @clayjohn I have updated the leak fix to be similar to the other PR, and added naithar as co-author in case this ends up getting merged together, to make things easier for Akien.

Cached linear local_vectors

(second commit) Instead of using linked lists at runtime this maintains a dirty flag per octant and builds on demand a linear list of Element pointers and AABBs which can be used for quicker cull tests. This can lead to significantly better test performance.

This is offered as a second version of the octree, Octree_CL. It is only used for the rendering octree currently, as it has only been shown to offer performance increases there so far.

Background

As discussed with reduz, this prevents adding new octants until a limiting number of elements have been added to the current octant, and is probably a better option than directly limiting the depth of the tree. This enables balancing the benefits of brute force against the benefits of spatial partitioning. The limit can be set per octree.

Preliminary tests indicate that for visibility culling brute force is significantly faster than using the octree as is.

The implementation so far is very simple, it doesn't split octants until the limit is reached. This means that on splitting, some of the elements in the parent octant may have been better off in the children. I have written code to rebalance the tree on splitting, but it isn't significantly faster so is left out at this stage.

Scenes which particularly benefit are those containing large numbers of objects, especially large number of moving objects / created / erased. It can also significantly speed up cull tests with the cached lists. Most 3d games will see a modest improvement, but some may gain a 2x improvement in fps or more. This is highly dependent on to what degree octree is a bottleneck in particular game.

Fixes #38142
Fixes #41480
Related to #40059

Notes

  • The PR is less extensive than it looks in terms of line numbers, because it moves most of the existing octree into a separate file for multiple inclusion, and the addition of local_vector from Godot 4.0.
  • This is made against 3.2, there will be a bit of porting to get it to work with master because of changes to geometry. Also there may be further modifications to the version in the 4.x branch in future as we move to other spatial partitioning schemes.
  • It also adds some basic debugging facilities to the octree (TOOLS only).
  • I've experimentally tried a system to automatically set the limit depending on the use case - but it is a very difficult problem. For instance some frames may do more tests than others, or a lot of tests on e.g. level startup. I've tried maintaining a running average but the more I think about it the more I tend to think we should have this as a project setting, set to a reasonable value, and then tweakable for better performance. I'm aware of the project settings bloat problem but I can't see an easy better way around this, so I'm going to add this to the PR.

Local Vector

This PR introduces local_vector.h from Godot 4.x. This might be better as a separate commit so the octree changes could be reverted without the local_vector being reverted, if something else is relying on local_vector. Note that PouleyKetchoupp's PR #40313 also contains local_vector.

We could potentially merge local_vector as a separate commit first, then me and Pouley could modify our PRs. Leave this up to Akien though as to whatever works best.

Demo

To try it out, load this project, and try changing the project settings rendering/quality/spatial_partitioning/render_tree_balance.

  • At a value of 0.0 I get 26fps, and at 1.0 I get 240fps.
  • For reference, with the old octree I get somewhere around 23fps.
  • The relative speedup in this example depends how many can be culled, when all are culled the speedup is higher, when all are showing, less so, because the bottlenecks are elsewhere.
  • In most cases for rendering, shifting the balance towards 1.0 (in favour of octree editing rather than tests) will give best performance. Tests only start to become the bottleneck when you are doing lots of them (100s or 1000s per frame), in which case shifting towards 0.0 can help (and the benefits of the linear cache start to really show).
  • A balance of 0.0 is closest to the old octree behaviour. However it should still be faster due to linear caching which is always on.

octree_test.zip

@pouleyKetchoupp pouleyKetchoupp added this to the 3.2 milestone Aug 9, 2020
@lawnjelly lawnjelly changed the title [WIP] Fix octree leak and optimize octree [WIP] Optimize octree and fix leak Aug 10, 2020
@lawnjelly lawnjelly force-pushed the octree_fix branch 4 times, most recently from 19a9239 to 99443b4 Compare August 10, 2020 19:35
@lawnjelly
Copy link
Member Author

@clayjohn Just for reference, I've done some profiling of the godot physics octree and it looks like the lack of improvement with larger octant limits may be due to the pairing, which I haven't altered (and am not planning on changing).

For this reason I'm now thinking in terms of removing the project setting for the physics octree and just sticking with old version functionality for the physics octree and probably the visibility notifier octree (for safety), as the performance improvements are probably limited to the rendering octree.

I'm also trying to get some non-ugly compile time switching of whether to use the cached lists too (yes for rendering, no for physics as it will save memory and makes little difference to performance).

@clayjohn
Copy link
Member

That sounds like a pragmatic approach. Enable the optimization only for where we know it helps in all cases. That way we can reduce the risk of the change as much as possible.

@clayjohn
Copy link
Member

I tested out this PR a little bit this evening. Unfortunately, I spotted a regression.

In the TPS demo, just moving the character at full speed causes the character to disappear until they are moved back to the original position, and then if you walk slowly they will not disappear. I am guess something is getting lost when objects move too fast.

The performance was slightly better with world/3d/spatial_partitioning/render_tree_balance set to 1.0. Although, I didn't expect much of an improvement as there are not very many moving objects. And not much is being created or destroyed.

@lawnjelly
Copy link
Member Author

lawnjelly commented Aug 12, 2020

I tested out this PR a little bit this evening. Unfortunately, I spotted a regression.
In the TPS demo, just moving the character at full speed causes the character to disappear until they are moved back to the original position, and then if you walk slowly they will not disappear. I am guess something is getting lost when objects move too fast.

Ah that's not good I'll try and track that down. Possibly somewhere I'm not refreshing the cached list. The limit per octant should not (in an obvious way at least) be able to introduce regressions, unless there is an existing bug in e.g. the pairing.

There is actually a small existing bug in the PR in the brackets which I've noticed while trying to come up with an elegant refactor to allow optionally removing the cached lists. I doubt it is causing the regression but I'll know soon.

The performance was slightly better with world/3d/spatial_partitioning/render_tree_balance set to 1.0. Although, I didn't expect much of an improvement as there are not very many moving objects. And not much is being created or destroyed.

Yes I think despite the hype in vblanco's thread #23998, optimizing the octree seems to only give a large overall improvement in certain scenarios - like you say lots of moving objects (say an asteroid type game). The octree is often running significantly faster, but when everything else is also slowing things down, it isn't always that much of a bottleneck in the overall frame time.

I don't think this is a good reason not to sort these problems though. I get the impression there are multiple key bottlenecks in Godot currently (including things like octree, gdscript, variants, nodes, threading overheads using things like visual server / physics, GPU fill rate, occlusion) and they will all need to be improved in order to get great performance overall. Improving any one of them will give good improvements in specific benchmarks, but won't massively improve an 'average' game, only their cumulative effect.

Edit: Ok there are now 2 versions of octree, one with the cached list. I had to spend a bit of time coming up with a non ugly way of turning off the cached lists at compile time. Usually this kind of thing can be done purely with templates (and may be able to be in a way I haven't figured out) but here I've just used macros. I did consider using some template methods such as CRTP but that could have made the code hard to read, it should be easily understandable using the macro approach.

I still have to find this regression, being difficult because I've had limited success installing the tps demo project so far.

@lawnjelly lawnjelly force-pushed the octree_fix branch 3 times, most recently from 747c8ac to 23ee913 Compare August 12, 2020 13:24
@Calinou
Copy link
Member

Calinou commented Aug 12, 2020

If anyone needs the tps-demo, you can download it from here (link valid until 2020-09-13): https://0x0.st/iY8e.zip

@lawnjelly
Copy link
Member Author

lawnjelly commented Aug 12, 2020

Ok, regression hopefully fixed. I missed out setting the dirty flag on the octant in Move(). I've put in an optional debug check to catch any of these cases in future (not enabled by default as it slows performance).

Edit: Ah no finally got TPS demo working and the player is still disappearing sometimes. I'll see if I can find out what is causing it. It's something in the cached lists, when I turn them off it works fine. So it is probably another simple fix with them getting dirty. FPS goes from 13 -> 17 with cached lists on for me so well worth having them working (presuming that it isn't because it is rendering less).

Ah got it, it is moving the elements is changing the AABB, so all the owner octants need to get made dirty.

@lawnjelly lawnjelly force-pushed the octree_fix branch 2 times, most recently from 5a024bb to 7dd5d73 Compare August 12, 2020 15:31
@clayjohn
Copy link
Member

I tested the TPS demo on the latest commit and it appears to work great! :)

@lawnjelly lawnjelly changed the title [WIP] Optimize octree and fix leak Optimize octree and fix leak Aug 18, 2020
Prevents adding new octants until a limiting number of elements have been added to the current octant. This enables balancing the benefits of brute force against the benefits of spatial partitioning. The limit can be set per octree.

Project settings are added for rendering octree to set the best balance per project depending on number of tests per frame / tick, and the amount of editing of the octree.

Fixes octants being leaked when removing elements.

Optimize octree with cached linear lists

Storing elements in octants using linked lists is efficient for housekeeping but very slow for testing. This optimization stores additional local_vectors with Element pointers and AABBs which are cached and only updated when a dirty flag is set on the octant.

This is selectable with 2 versions of Octree : Octree and Octree_CL, Octree being the old behaviour. At present the cached list version is only used for the visual server octree (rendering) as it has only been demonstrated to be faster there so far.

This uses slightly more memory (probably a few kb in most cases) but can be significantly faster during testing (culling etc).

Co-authored-by: Sergey Minakov <naithar@icloud.com>
@lawnjelly
Copy link
Member Author

For future reference:

Although I think this PR should be merged as is, based on some discussion about this maybe fixing #38142, in the future it may be worth investigating whether tiny objects can also cause problems due to runaway octants in the visibility notifier and godot physics octrees.

There are now 2 versions of the octree, the old one and the new optimized one, and the new one is only applied in the visual server octree, where it was shown to give performance benefit - the visibility notifier and the godot physics still use the old one.

However if it turns out the other octrees are susceptible to the runaway behaviour (I hesitate to call it bug, but it is an unwanted feature) we should look at solving this, either by swapping to the new octree, or modifying the old one to guard against this without affecting performance, perhaps with a simple depth limit.

Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell this looks good. I like the approach to splitting the octree implementations. But this is really going to need @reduz' approval before merging. Hopefully we can steal his attention for a bit once he finishes with particles.

@@ -1168,6 +1168,10 @@
<member name="rendering/quality/shadows/filter_mode.mobile" type="int" setter="" getter="" default="0">
Lower-end override for [member rendering/quality/shadows/filter_mode] on mobile devices, due to performance concerns or driver support.
</member>
<member name="rendering/quality/spatial_partitioning/render_tree_balance" type="float" setter="" getter="" default="0.17">
The rendering octree balance can be changed to favor smaller ([code]0[/code]), or larger ([code]1[/code]) branches.
Larger branches can increase performance significantly in some projects.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the downside about having large branches?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the downside about having large branches?

It is a complicated relationship from what I could see. On a very basic level, having larger branches is quicker for editing (moving objects around, inserting, erasing), but can be slower in certain types of test (particularly those in physics).

However, the difference due to editing is a one off cost within a frame (or tick), whereas the cost for testing depends how many tests are done.

For instance for visibility, it most situations I have tested there is benefit from having larger branches. It could be argued to have the default set higher, however at the moment I am tending towards a setting which can give a performance increase when tweaked, rather than one that can remove a performance drop when tweaked.

For visibility, in a lot of cases there may only be one or a few tests performed on the octree, hence the balance for editing speed is usually better. However, if you were making a game that made a large number of visibility tests the balance could shift the other way.

For godot physics on the other hand, with a large number of objects (thousands) and thus many thousands of tests, smaller branches always seemed to be better. Hence removing the new tree from there unless evidence shows it can lead to better performance.

@akien-mga
Copy link
Member

Alright, time to merge and get this battle tested in 3.2.4 beta 1 :)

@reduz says the changes seem fine but he's not so familiar with this code anymore to give an in-depth review, yet he trusts the extensive work done by @lawnjelly and @naithar.

So let's merge and see what happens, we should have plenty of time until 3.2.4-stable to catch potential regressions, if any.

@akien-mga akien-mga merged commit 66cbcc1 into godotengine:3.2 Sep 30, 2020
@akien-mga
Copy link
Member

Thanks hugely!

@jamie-pate
Copy link
Contributor

I was hopeful this would fix my bottleneck, but I think it's related to 'pairing' which you mentioned is left out.
image
Unfortunately I don't get what 'pairing' means in this context or why it's running so slow :)

image

@lawnjelly
Copy link
Member Author

The pairing is used for e.g. linking a light with objects, and the code for it is quite convoluted, and would need quite a bit more time for me personally to invest as I didn't write it (and I have many higher priority tasks). It is known to be inefficient, that's why it is planned to be removed in 4.x, with a bit of a paradigm shift.

For 3.x the emphasis is on compatibility and not breaking existing functionality, so it is difficult to remove pairing, it would either need to be made more efficient as is or rewritten in a 'black box' fashion. If someone else has time to spend on this I'm sure it would be welcome.

Incidentally make sure to try changing the rendering/quality/spatial_partitioning/render_tree_balance project setting if you are just using defaults, it may have at least some effect on your performance.

@jamie-pate
Copy link
Contributor

I'm still getting the 30ms frame times when setting the visibility of these lights. I have a workaround in mind where I will use a proxy script to relocate the actual light nodes to the root so they do not have to have their visibility toggled when the geometry is hidden. Hopefully that will smooth it over until Godot 4

@lawnjelly
Copy link
Member Author

If you create an issue with a minimum reproduction project, it is more likely to get looked at (I might be able to profile it and find something obvious for example).

@jamie-pate
Copy link
Contributor

jamie-pate commented Oct 4, 2020

Done #42563

I'm be curious at what your method is for profiling? I've only hacked in some statistics to the object visibility calls. I guess with it toggling every frame any old profiling will do? (I was trying to catch stuff that only happens every once in a while, which is harder)
object-profiling.diff.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants