Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10599: Improve LogMergePolicy's handling of maxMergeSize. #935

Merged
merged 4 commits into from
Jun 9, 2022

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented May 31, 2022

With this change, segments are more likely to be considered for merging until
they reach the max merge size. Before this change, LogMergePolicy would exclude
an entire window of mergeFactor segments from merging if this window had a
too large segment and other segments were on the same tier.

With this change, segments are more likely to be considered for merging until
they reach the max merge size. Before this change, LogMergePolicy would exclude
an entire window of `mergeFactor` segments from merging if this window had a
too large segment and other segments were on the same tier.
Copy link
Contributor

@zhaih zhaih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This PR looks reasonable to me, but I'm a bit confused by existing logic

if (anyMerging || end - start <= 1) {
// skip: there is an ongoing merge at the current level or the computed merge has a single
// segment and this merge policy doesn't do singleton merges
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically for a level which meets the merge factor we're merging every segments slice that is not exceed the max size specified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

@@ -568,23 +568,41 @@ public MergeSpecification findMerges(
// Finally, record all merges that are viable at this level:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by the level quantization process above actually:

  1. From L527 it determines the max level, and then a level bottom based on that
  2. Then it find the right boundary by search backwards for the first qualified segment in L556-

This seems assuming the levels are sorted, but I can't find the sorting anywhere. Or otherwise how could it guarantee that the level decided in range of [start,end) won't contain segments that have lower level than levelBottom?

Sorry the question is not quite related to the change itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how could it guarantee that the level decided in range of [start,end) won't contain segments that have lower level than levelBottom?

LogMergePolicy doesn't try to provide this guarantee. It's actually important it doesn't try to provide this guarantee, otherwise it could end up with lots of unmerged segments. For instance imagine that you have 9 segments (S1..S9) on level 10 then one segment (S10) on level 9, one more segment on level 10 (S11) and then potentially other segments.
If LogMergePolicy refused to merge segments that are on a lower level then it could never merge together segments S1..S10. This is because segment S10 can only be merged with segments that are on a higher level because both the previous and the next segment are on a higher level, and LogMergePolicy only merges adjacent segments.

This is a downside of LogMergePolicy compared to TieredMergePolicy: because it only performs merges of adjacent segments, it sometimes has to return merges where not all segments are on the same level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks!

@jpountz jpountz merged commit b1eddec into apache:main Jun 9, 2022
@jpountz jpountz deleted the log_mp_max_merge_size branch June 9, 2022 15:44
@jpountz
Copy link
Contributor Author

jpountz commented Jun 9, 2022

Thanks @zhaih !

jpountz added a commit that referenced this pull request Jun 9, 2022
With this change, segments are more likely to be considered for merging until
they reach the max merge size. Before this change, LogMergePolicy would exclude
an entire window of `mergeFactor` segments from merging if this window had a
too large segment and other segments were on the same tier.
shaie pushed a commit to mdmarshmallow/lucene that referenced this pull request Jun 22, 2022
…che#935)

With this change, segments are more likely to be considered for merging until
they reach the max merge size. Before this change, LogMergePolicy would exclude
an entire window of `mergeFactor` segments from merging if this window had a
too large segment and other segments were on the same tier.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants