bugfix: native histogram: exemplars index out of range #1608

Member

@krajorama krajorama commented Aug 31, 2024

Fixes: #1607
Fixes: #1605

fix: native histogram: Simplify and fix addExemplar

mdIdx was redundant when len(exemplars) > 1, so I got rid of it; rIdx
is enough.

Don't compare the timestamp of the incoming exemplar to the timestamp
of the minimal-distance exemplar. Most of the time the incoming
exemplar will be newer. And if not, the previous code just replaced
the exemplar one index after the minimal-distance exemplar, which had
an index-out-of-range bug and was essentially random anyway.
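
To make the description above concrete, here is a minimal sketch of choosing rIdx once the exemplar slice is full. It is not the code from this PR: `pickRemovalIndex` is a hypothetical helper, it works on bare float64 values instead of `*dto.Exemplar`, it ignores the TTL check and timestamps entirely, and dropping the smaller member of an existing closest pair is an assumption made for illustration. What it does show is that a single index (rIdx) suffices and that both neighbour lookups around the insertion point are bounds-guarded.

```go
package main

import (
	"fmt"
	"math"
)

// pickRemovalIndex is a hypothetical helper, not part of client_golang.
// values must be sorted ascending and positive; nIdx is the index at which
// the new value v would be inserted (0 <= nIdx <= len(values)).
// It returns the index of the existing value to drop: one member of the
// closest pair in log space, where the pair may include v itself.
func pickRemovalIndex(values []float64, v float64, nIdx int) int {
	rIdx := 0
	md := math.Inf(1) // smallest log-distance seen so far

	// Closest pair among the existing values; drop the smaller member
	// (an assumption of this sketch, timestamps are not modelled).
	for i := 1; i < len(values); i++ {
		if d := math.Abs(math.Log(values[i]) - math.Log(values[i-1])); d < md {
			md, rIdx = d, i-1
		}
	}

	vlog := math.Log(v)
	// Pair (left neighbour, v): only exists if v is not inserted at the front.
	if nIdx > 0 {
		if d := math.Abs(vlog - math.Log(values[nIdx-1])); d < md {
			md, rIdx = d, nIdx-1
		}
	}
	// Pair (v, right neighbour): only exists if v is not inserted at the end.
	if nIdx < len(values) {
		if d := math.Abs(math.Log(values[nIdx]) - vlog); d < md {
			rIdx = nIdx
		}
	}
	return rIdx
}

func main() {
	values := []float64{1, 10, 11, 50, 100}
	// 101 sorts after every existing value, so nIdx == len(values).
	fmt.Println(pickRemovalIndex(values, 101, len(values))) // prints 4
}
```

With these inputs the new value 101 lands at nIdx == len(values), its left neighbour 100 forms the closest pair, and the function returns 4 without ever reading values[nIdx].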

Contains an unoptimized fix for #1605; see the discussion and fix in #1609.

panic: runtime error: index out of range [10] with length 10

goroutine 10 [running]:
github.com/prometheus/client_golang/prometheus.(*nativeExemplars).addExemplar(0xc0002307c8, 0xc00055f4f0)
	/home/krajo/go/github.com/krajorama/client_golang/prometheus/histogram.go:1791 +0x1969
github.com/prometheus/client_golang/prometheus.(*histogram).updateExemplar(0xc000230700, 0x3ff22643022060a2, 0x0, 0xc00054fcb0)
	/home/krajo/go/github.com/krajorama/client_golang/prometheus/histogram.go:1140 +0x117
github.com/prometheus/client_golang/prometheus.(*histogram).ObserveWithExemplar(0xb64180?, 0xc00054fcb0?, 0xbe246f?)
	/home/krajo/go/github.com/krajorama/client_golang/prometheus/histogram.go:770 +0x6a
github.com/prometheus/client_golang/prometheus.TestNativeHistogramConcurrency.func1.2({0xc0005d0000, 0x46e4, 0x420820?})
	/home/krajo/go/github.com/krajorama/client_golang/prometheus/histogram_test.go:1058 +0x17b
created by github.com/prometheus/client_golang/prometheus.TestNativeHistogramConcurrency.func1 in goroutine 21
	/home/krajo/go/github.com/krajorama/client_golang/prometheus/histogram_test.go:1050 +0x325
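
For readers unfamiliar with the panic message: Go reports the offending index and the slice length, so `[10] with length 10` means something indexed one past the last element of a 10-element slice. The following standalone sketch reproduces that class of failure and shows a guarded variant. `replaceAfter` and `replaceAfterChecked` are hypothetical names for illustration, not functions from client_golang.

```go
package main

import "fmt"

// replaceAfter mimics the shape of the bug: it overwrites the element one
// position after idx without checking that idx+1 is still inside the slice.
// The panic is recovered and returned as an error so it can be inspected.
func replaceAfter(s []int, idx, v int) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	s[idx+1] = v
	return nil
}

// replaceAfterChecked refuses to write outside the slice and reports
// whether the write happened.
func replaceAfterChecked(s []int, idx, v int) bool {
	if i := idx + 1; i < 0 || i >= len(s) {
		return false
	}
	s[idx+1] = v
	return true
}

func main() {
	s := make([]int, 10)
	// idx 9 is the last valid index, so idx+1 == 10 is out of range.
	fmt.Println(replaceAfter(s, 9, 42))
	fmt.Println(replaceAfterChecked(s, 9, 42))
}
```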

@krajorama krajorama marked this pull request as ready for review August 31, 2024 09:56
@krajorama krajorama requested a review from beorn7 August 31, 2024 09:57
@krajorama
Member Author

cc @fatsheep9146

Member

@beorn7 beorn7 left a comment

LGTM, but @fatsheep9146 might have an opinion here, too.

Also, leaving it to the maintainers @ArthurSens @kakkoyun @bwplotka to make the final call.

prometheus/histogram.go (outdated, resolved)
Member

@ArthurSens ArthurSens left a comment

LGTM

@bwplotka
Member

bwplotka commented Sep 2, 2024

Does this need a minor 1.20 release? Or are we OK with 1.21 in a few months or so?

@krajorama
Member Author

Does this need a minor 1.20 release? Or are we OK with 1.21 in a few months or so?

I'd love to get a minor release so we can do a clean dependency update in Mimir.

@ArthurSens
Member

ArthurSens commented Sep 2, 2024

I'll keep this open for another day or so just in case Bartek or Kemal want to give another review. I can create another release, but then we need to point this PR to the release-1.20 branch

Contributor

@fatsheep9146 fatsheep9146 left a comment

Thanks for picking up this bug!
This fix LGTM.

Member

@bwplotka bwplotka left a comment

Some readability suggestions, plus some questions that are likely due to my lack of context on the details. Also, please rebase on release-1.20 so this can go into a minor patch release. Otherwise LGTM!

prometheus/histogram.go (resolved)
@@ -1764,23 +1776,22 @@ func (n *nativeExemplars) addExemplar(e *dto.Exemplar) {
if nIdx > 0 {
diff := math.Abs(elog - math.Log(n.exemplars[nIdx-1].GetValue()))
if diff < md {
// The closest exemplar pair is this: |e.Value - n.exemplars[nIdx-1].Value| is minimal.
Member

Something is off here. Do you mean...

Suggested change
// The closest exemplar pair is this: |e.Value - n.exemplars[nIdx-1].Value| is minimal.
// The smaller exemplar is closer (n.exemplars[nIdx-1].Value), replace that one.

if n.exemplars[nIdx].Timestamp.AsTime().Before(e.Timestamp.AsTime()) {
mdIdx = nIdx
}
// The closest exemplar pair is this: |n.exemplars[nIdx].Value - e.Value| is minimal.
Member

ditto

if n.exemplars[nIdx-1].Timestamp.AsTime().Before(e.Timestamp.AsTime()) {
mdIdx = nIdx - 1
}
rIdx = nIdx - 1
Member

Aren't we forgetting to choose the larger exemplar (rIdx = nIdx) for the diff > md case? Why do we have to recalculate another diff below?

@@ -1764,23 +1776,22 @@ func (n *nativeExemplars) addExemplar(e *dto.Exemplar) {
if nIdx > 0 {
Member

This is odd. Why do we need this? We just set it to len if it's -1, plus we know len is > 0 because we are in the logic that has to remove something. Is it because nIdx can be zero, which means we have to put the exemplar up front?

Then let's fix the comment on line 1766; it's wrong:

// Here, we have the following relationships:
// n.exemplars[nIdx-1].Value < e.Value <= n.exemplars[nIdx].Value

if n.exemplars[nIdx-1].Timestamp.AsTime().Before(e.Timestamp.AsTime()) {
mdIdx = nIdx - 1
}
rIdx = nIdx - 1
}
}
if nIdx < len(n.exemplars) {
Member

I guess this is then guarding the case where nIdx is past the last element? Then again, the comment on line 1766 is kind of wrong, or at least misleading.
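
A small sketch of the index semantics being discussed here, assuming the exemplars are kept sorted by value. `neighbours` is a hypothetical helper over plain float64 values, and it uses `sort.SearchFloat64s` from the standard library in place of the loop in `addExemplar`. It returns the smallest nIdx with values[nIdx] >= v, which matches the relationship `n.exemplars[nIdx-1].Value < e.Value <= n.exemplars[nIdx].Value` whenever both neighbours exist, and it shows why each side needs its own guard.

```go
package main

import (
	"fmt"
	"sort"
)

// neighbours returns the insertion index for v in the ascending slice values,
// plus pointers to the values on either side of that position when they exist.
func neighbours(values []float64, v float64) (nIdx int, left, right *float64) {
	// Smallest index with values[nIdx] >= v; len(values) if there is none.
	nIdx = sort.SearchFloat64s(values, v)
	if nIdx > 0 { // v is not inserted at the front, so a left neighbour exists
		left = &values[nIdx-1]
	}
	if nIdx < len(values) { // v is not inserted at the end, so a right neighbour exists
		right = &values[nIdx]
	}
	return nIdx, left, right
}

func main() {
	values := []float64{1, 2, 4}
	for _, v := range []float64{0.5, 3, 9} {
		nIdx, left, right := neighbours(values, v)
		fmt.Println(v, nIdx, left != nil, right != nil)
	}
}
```

For 0.5 the index is 0 and there is no left neighbour; for 9 the index is len(values) and there is no right neighbour, which is the case the `nIdx < len(n.exemplars)` check protects.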

@krajorama
Member Author

@bwplotka I've updated the comments in the code with more explanations and my understanding of the code :)

@bwplotka bwplotka changed the base branch from main to release-1.20 September 4, 2024 18:37
@bwplotka
Member

bwplotka commented Sep 4, 2024

I forced this PR to target the release-1.20 branch, which brought in new commits from main. Do you mind rebasing your commits onto the release-1.20 branch? (:

Member

@bwplotka bwplotka left a comment

Thanks! (just rebase)

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama force-pushed the index-out-of-range-native-histogram-exemplar branch from d82721b to d6b8c89 Compare September 4, 2024 19:10
@krajorama
Member Author

Thanks! (just rebase)

Rebased and PR re-targeted.

@krajorama krajorama merged commit 6e9914d into prometheus:release-1.20 Sep 4, 2024
8 checks passed
This pull request was closed.
Successfully merging this pull request may close these issues.

Index out of range error in native histogram
Data race in native histogram
5 participants