fix[mimirtool]: order samples when performing export #5700
Conversation
Hi, you say that committing a block sets the new minimum timestamp to the lowest ingested timestamp. Can you point to the place where this happens?
The block gets committed here: https://github.com/grafana/mimir/blob/main/pkg/mimirtool/backfill/backfill.go#L138. This in turn calls this massive function https://github.com/prometheus/prometheus/blob/651b5a049a3d2ca877f8cf960a8add1a366f6da0/tsdb/head_append.go#L832, which commits writes to the WAL and keeps track of the minimum timestamp. At the end, this value gets used to create the new head: https://github.com/prometheus/prometheus/blob/main/tsdb/head_append.go#L1083
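The minimum-timestamp bookkeeping during a commit can be illustrated with a simplified sketch. This is a toy model, not the actual Prometheus code; the `sample` type and `commit` function are made up for illustration:

```go
package main

import (
	"fmt"
	"math"
)

// sample is a simplified stand-in for an appended TSDB sample.
type sample struct {
	ts    int64 // milliseconds since epoch
	value float64
}

// commit mimics how committing an appender tracks the smallest
// timestamp seen, which is later used as the new head's min time.
func commit(samples []sample) (minTime int64) {
	minTime = math.MaxInt64
	for _, s := range samples {
		if s.ts < minTime {
			minTime = s.ts
		}
	}
	return minTime
}

func main() {
	batch := []sample{{ts: 7_200_000, value: 1}, {ts: 60_000, value: 2}, {ts: 3_600_000, value: 3}}
	fmt.Println(commit(batch)) // smallest timestamp in the batch: 60000
}
```

The key point is that whatever minimum the commit happens to observe becomes the floor for subsequent appends.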
I think I see the problem now. It's not when cutting new blocks; rather, it's when committing some appended samples.
This only commits an appender, it doesn't cut a new block. Instead, this is the place which triggers a block upload: mimir/pkg/mimirtool/backfill/backfill.go, line 150 (at eb754c6)
So what I suspect is happening is that the first series in the remote-read response has a very late timestamp. A flush is triggered, which sets the min-time of the head. The next series, whose timestamps are closer to the requested min time, then fail to be appended. Am I on the right track? It would be interesting to have this reproduced with a unit test. Do you think that's doable?
Yeah, sorry for mixing up the terms, still new to the codebase 😅 I'll try to build a unit test and ping you once it's done!
Alright, I've managed to condense the Mimir data I was getting from my instance into a test case. To reproduce the issue in the old implementation, simply switch out `newOrderedIterator` for `newTimeSeriesIterator` in the test. While creating this test case, I encountered some very strange conditions for this bug. The issue does not only depend on the number of samples/their timestamps but also on the value of the start/end time and the interval between sample timestamps. The amount of samples per block also affects this. I'm now even more unsure what is happening here 😅, all I know is that the issue can be fixed by ordering the samples.
I think I understand now why this problem is happening. The TSDB head (in-memory series storage) only allows appending samples up to 1 hour before the most recent sample. In this test we append by series (if we use the original `newTimeSeriesIterator`). Each group of 1000 committed samples is bound to a 2-hour period. This allows appending samples of one series which are 2h ahead of the samples of the next series. One easy way to fix this is to reduce the period to 1 hour. This way we won't commit samples which are more than 1h ahead of other samples we have. The period is defined by the `tsUpper` calculation in `CreateBlocks`. The alternative is the ordered iterator that you added. I think I prefer changing the append period over the iterator because it's simpler and you have already added tests for the change - TestEarlyCommit. What's your take?
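The interplay between the head's 1-hour out-of-order limit and the per-window, series-by-series appending can be modeled with a small self-contained sketch. This is a toy model, not the real TSDB: `head`, `backfill`, and the series layout are invented to show why a 2h append window can fail while a 1h window cannot:

```go
package main

import (
	"errors"
	"fmt"
)

const hourMs = 60 * 60 * 1000

var errOutOfOrder = errors.New("out of order sample")

// head models (very roughly) the TSDB head's rule: a sample may not
// be more than 1h older than the newest sample committed so far.
type head struct{ maxTime int64 }

func (h *head) append(ts int64) error {
	if h.maxTime != 0 && ts < h.maxTime-hourMs {
		return errOutOfOrder
	}
	if ts > h.maxTime {
		h.maxTime = ts
	}
	return nil
}

// backfill appends two series, one after the other, restricted to
// successive windows of width windowMs (mimicking how CreateBlocks
// walks the time range with a per-iteration upper bound).
func backfill(windowMs int64) error {
	h := &head{}
	seriesStarts := []int64{1, 1} // both series start at t=1ms
	const step = 60_000           // one sample per minute
	const end = 4 * hourMs
	for t := int64(0); t < end; t += windowMs {
		for _, start := range seriesStarts {
			for ts := start; ts < end; ts += step {
				if ts < t || ts >= t+windowMs {
					continue // outside the current window
				}
				if err := h.append(ts); err != nil {
					return err
				}
			}
		}
	}
	return nil
}

func main() {
	fmt.Println(backfill(2 * hourMs)) // 2h window: second series falls >1h behind, fails
	fmt.Println(backfill(hourMs))     // 1h window: skew never exceeds the limit, succeeds
}
```

With a 2h window, the first series is appended all the way to the window's end before the second series' first sample (near the window's start) is attempted, so the gap exceeds 1h; halving the window caps that gap at exactly the head's limit.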
I've tried your suggestion of setting the period to 1 hour.
I had overlooked all the usages of `blockDuration`. This is the change I had in mind:

diff --git a/pkg/mimirtool/backfill/backfill.go b/pkg/mimirtool/backfill/backfill.go
index c3320c9b3..032540b21 100644
--- a/pkg/mimirtool/backfill/backfill.go
+++ b/pkg/mimirtool/backfill/backfill.go
@@ -90,7 +90,7 @@ func CreateBlocks(input IteratorCreator, mint, maxt int64, maxSamplesInAppender
var wroteHeader bool
- for t := mint; t <= maxt; t = t + blockDuration {
+ for t := mint; t <= maxt; t += blockDuration / 2 {
err := func() error {
w, err := tsdb.NewBlockWriter(log.NewNopLogger(), outputDir, blockDuration)
if err != nil {
@@ -106,7 +106,7 @@ func CreateBlocks(input IteratorCreator, mint, maxt int64, maxSamplesInAppender
ctx := context.Background()
app := w.Appender(ctx)
i := input()
- tsUpper := t + blockDuration
+ tsUpper := t + blockDuration/2
samplesCount := 0
for {
err := i.Next()
diff --git a/pkg/mimirtool/commands/remote_read_test.go b/pkg/mimirtool/commands/remote_read_test.go
index 433bf8011..9371157a0 100644
--- a/pkg/mimirtool/commands/remote_read_test.go
+++ b/pkg/mimirtool/commands/remote_read_test.go
@@ -10,10 +10,11 @@ import (
"io"
"testing"
- "github.com/grafana/mimir/pkg/mimirtool/backfill"
"github.com/pkg/errors"
"github.com/prometheus/prometheus/prompb"
"github.com/stretchr/testify/assert"
+
+ "github.com/grafana/mimir/pkg/mimirtool/backfill"
)
func TestTimeSeriesIterator(t *testing.T) {
@@ -154,9 +155,9 @@ func TestTimeSeriesIterator(t *testing.T) {
}
func TestEarlyCommit(t *testing.T) {
- maxSamplesPerBlock := 1000
- series := 9
- samples := 140
+ maxSamplesPerBlock := 10
+ series := 18
+ samples := 1400
start := int64(1691408300722)
inc := int64(60000)
@@ -181,7 +182,7 @@ func TestEarlyCommit(t *testing.T) {
ts[i] = s
}
iterator := func() backfill.Iterator {
- return newOrderedIterator(ts)
+ return newTimeSeriesIterator(ts)
}
err := backfill.CreateBlocks(iterator, start, end, maxSamplesPerBlock, t.TempDir(), true, io.Discard)
 	assert.NoError(t, err)
That makes sense! I've updated this PR to only include these changes. Thanks for the detailed explanation - I've learned a lot! |
LGTM, I left only two minor comments about the test. Happy to merge the PR afterwards
The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase |
Alright, should be resolved now! I hope I understood the changelog instructions correctly 😬 |
What this PR does
When iterating over the samples returned by the remote read endpoint, the samples are grouped by series rather than ordered globally by timestamp.
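A hypothetical illustration of such a response (series names and timestamps are made up): all samples of one series arrive before the next series, so iterating in response order can jump backwards in time.

```go
package main

import "fmt"

type sample struct {
	series string
	ts     int64
}

func main() {
	// Remote read returns all samples of one series before the next:
	resp := []sample{
		{"up{instance=\"a\"}", 1000}, {"up{instance=\"a\"}", 2000}, {"up{instance=\"a\"}", 3000},
		{"up{instance=\"b\"}", 500}, {"up{instance=\"b\"}", 1500},
	}
	// Scanning in response order reveals the backwards jump between series:
	for i := 1; i < len(resp); i++ {
		if resp[i].ts < resp[i-1].ts {
			fmt.Printf("time goes backwards at index %d: %d -> %d\n", i, resp[i-1].ts, resp[i].ts)
		}
	}
}
```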
As long as all samples fall into the same block timespan (default is 2h and not configurable), this does not cause any issues. The same is the case if all series start at the same timestamp.
However, if a block gets committed, the new minimum timestamp gets set to the lowest timestamp of the samples ingested so far. If a series with an earlier starting timestamp exists later in the remote read response, this causes mimirtool to crash with an out-of-order error.
This PR fixes this by using a heap-based iterator that yields the samples ordered by timestamp value.
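The idea behind such an iterator can be sketched as a k-way merge over per-series, per-timestamp-sorted sample streams using `container/heap`. This is a minimal standalone sketch, not the actual `newOrderedIterator` implementation; the `sample` and `cursor` types are invented for illustration:

```go
package main

import (
	"container/heap"
	"fmt"
)

type sample struct {
	ts    int64
	value float64
}

// cursor tracks one series' position; each series' samples are
// assumed to be already sorted by timestamp.
type cursor struct {
	samples []sample
	pos     int
}

// sampleHeap is a min-heap of cursors ordered by their current timestamp.
type sampleHeap []*cursor

func (h sampleHeap) Len() int { return len(h) }
func (h sampleHeap) Less(i, j int) bool {
	return h[i].samples[h[i].pos].ts < h[j].samples[h[j].pos].ts
}
func (h sampleHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *sampleHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *sampleHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeByTimestamp yields all samples in global timestamp order.
func mergeByTimestamp(series [][]sample) []sample {
	h := &sampleHeap{}
	for _, s := range series {
		if len(s) > 0 {
			heap.Push(h, &cursor{samples: s})
		}
	}
	var out []sample
	for h.Len() > 0 {
		c := heap.Pop(h).(*cursor)
		out = append(out, c.samples[c.pos])
		c.pos++
		if c.pos < len(c.samples) {
			heap.Push(h, c) // re-insert with the cursor's next sample
		}
	}
	return out
}

func main() {
	series := [][]sample{
		{{3000, 1}, {4000, 2}},
		{{1000, 3}, {3500, 4}},
	}
	for _, s := range mergeByTimestamp(series) {
		fmt.Println(s.ts) // 1000, 3000, 3500, 4000
	}
}
```

Because the heap always exposes the smallest pending timestamp across all series, no committed sample can run more than the head's out-of-order window ahead of a sample that is yet to be appended.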
Checklist
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]