Add strong read consistency support in ingester for experimental ingest storage #7030

pracucci · 2024-01-02T16:50:20Z

What this PR does

This is another step forward in the experimental ingest storage conditional read-after-write implementation. In this PR I'm introducing a per-tenant config option to set the default read consistency guarantee (used only when ingest storage is enabled). When read consistency is set to strong, the ingester waits until all the data written up to that point has been replayed from Kafka before executing a query.

Notes:

The per-query conditional support will be upstreamed later by @dimitarvdimitrov (he did the work).
In the ingester unit tests I've picked a simple path and only test QueryStream(). I suggest not testing every function for now because we may change the strong consistency implementation later, moving it from the ingester to an upper layer in the read path.
Some tests on runtime config / limits have changed because we need to use the Limits initialised with defaults (as Mimir's real logic does) and not the zero value. Otherwise, the new validation I've introduced fails.

Which issue(s) this PR fixes or relates to

N/A

Checklist

Tests updated.
Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
about-versioning.md updated with experimental features.

dimitarvdimitrov

only a few comments on the testing approaches

dimitarvdimitrov · 2024-01-03T11:00:29Z

pkg/ingester/ingester_ingest_storage_test.go

+			}()
+
+			// Wait some time and then unblock the Fetch.
+			time.Sleep(time.Second)


is it possible to do this without a Sleep? Sleeps are usually flaky: it might end up just not doing what we need (and it also slows down the test).

If my understanding is correct, we want to wait so that the Fetch fails a couple of times before we unblock it. So we want to test resiliency during startup. Can we just close a channel in the ControlKey once we get the first error? Or perhaps count the number of invocations and unblock the fetch only after the first few invocations have failed?

I start from the assumption that "every sleep in a test is bad" is not the right approach (similarly to "any use of panic() in the code is bad"... well, it depends). There are times when a sleep is good enough to get what you want, it's reliable when the sleep time is large enough, and sometimes it's 10x easier than the alternatives. That being said...

We want to unblock the Fetch while the query is waiting on Ingester.enforceReadConsistency(). If we do as you say and fail the first few Fetch requests, it's not much different from a sleep, because Fetch (or ListOffsets, the other thing we could fail) is issued in a loop that is completely independent of the query path.

A better approach may be to unblock the Fetch right before calling runTestQuery(). There's still no guarantee that we would get execution in the order we want (in the end, the Fetch may run before the actual query is waiting for consistency), so to make it more reliable we may also have to add a short sleep (oh, welcome back sleep!), or we could ignore the problem, assuming that in practice it is very unlikely to happen.

A better approach may be to unblock the Fetch right before calling runTestQuery()

Done here: 79d0a19. Even if there's no strong guarantee that the query will wait for read consistency before we unblock the fetch, I think that in practice it's nearly impossible to get the execution in the wrong order because of the ListOffsets fetched in a loop.

fair enough, thanks for the explanation. Now that you put it like that the sleep makes slightly more sense than the closed channel before running the query. Both are ok, so happy to keep it as-is

dimitarvdimitrov · 2024-01-03T11:05:37Z

pkg/ingester/ingester_ingest_storage_test.go

+	"github.com/grafana/mimir/pkg/util/validation"
+)
+
+func TestIngester_QueryStream_IngestStorageReadConsistency(t *testing.T) {


this test looks a lot like an end-to-end/integration test rather than a unit test. Do you want to move it to e2e instead? I'd even go as far as running the distributor and writing to it instead of writing to Kafka. I can help with this change in a follow-up PR if you want

it would also remove the need to share ConsumerGroup, which IMO is an implementation detail of the storage/ingest package

this test looks a lot like an end-to-end/integration test rather than a unit test. Do you want to move it to e2e instead?

The way I see it, an integration test shouldn't mock any component, while this test requires mocking the Kafka cluster, which is why I think it's appropriate for a unit test (also, we do pretty much the same kind of testing in most of the functions of ingester_test.go).

it would also remove the need to share ConsumerGroup, which IMO is an implementation detail of the storage/ingest package

I agree on this. I was also in doubt. For now I've hardcoded the consumer group name in the testkafka package. We can worry about it when we need multiple ones: c6af570

this test looks a lot like an end-to-end/integration test rather than a unit test. Do you want to move it to e2e instead?

The way I see it, an integration test shouldn't mock any component, while this test requires mocking the Kafka cluster, which is why I think it's appropriate for a unit test (also, we do pretty much the same kind of testing in most of the functions of ingester_test.go).

ok, I'm not feeling super strong about this. And testing is usually more conventions-based, so let's keep as-is

dimitarvdimitrov · 2024-01-03T11:20:47Z

pkg/mimir/runtime_config_test.go

 }

 func TestRuntimeConfigLoader_ShouldLoadEmptyFile(t *testing.T) {
+	validation.SetDefaultLimitsForYAMLUnmarshalling(getDefaultLimits())


you can add a TestMain() and call this once for the whole package as opposed to calling it in every test. Something along the lines of

func TestMain(m *testing.M) {
	validation.SetDefaultLimitsForYAMLUnmarshalling(getDefaultLimits())
	m.Run()
}

same applies to pkg/util/validation/limits_test.go

Yes, we can do it. The tricky thing is that some tests override the default limits, so we have to remember to restore the default limits in the test cleanup: 3156fcf
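
The "override the defaults, then restore them in the test cleanup" pattern mentioned here can be sketched in isolation. Everything in this block is hypothetical: `limits`, `defaultLimits`, `setDefaultLimits` and `overrideDefaultLimits` are stand-ins and not the real `validation` package API. The `cleanuper` interface is the subset of `*testing.T` that the helper needs, which lets the sketch run without the test framework. Unlike the PR, which resets to `getDefaultLimits()`, this sketch restores whatever value was set before the override.

```go
package main

// limits is a stand-in for a limits struct; the field name is hypothetical.
type limits struct {
	MaxSeries int
}

// defaultLimits plays the role of the package-level defaults that are applied
// when unmarshalling per-tenant overrides.
var defaultLimits limits

func setDefaultLimits(l limits) { defaultLimits = l }

// cleanuper is the subset of *testing.T used by overrideDefaultLimits.
type cleanuper interface {
	Cleanup(func())
}

// overrideDefaultLimits sets temporary defaults and registers a cleanup that
// restores the previous ones, keeping the override and its reset side by side.
func overrideDefaultLimits(t cleanuper, l limits) {
	prev := defaultLimits
	t.Cleanup(func() { setDefaultLimits(prev) })
	setDefaultLimits(l)
}

// fakeT is a tiny stand-in for *testing.T so the sketch runs outside "go test".
type fakeT struct{ cleanups []func() }

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

// runCleanups runs the registered functions in last-in, first-out order,
// mirroring how testing.T runs cleanups.
func (f *fakeT) runCleanups() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}
```

In a real test the helper would receive the `*testing.T` directly, and the package-wide defaults would be set once in TestMain() as suggested above.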

dimitarvdimitrov

LGTM

dimitarvdimitrov · 2024-01-04T09:28:12Z

pkg/util/validation/limits_test.go

@@ -776,11 +813,16 @@ func TestExtensions(t *testing.T) {
 	})

 	t.Run("default limits does not interfere with tenants extensions", func(t *testing.T) {
+		// Reset the default limits at the end of the test.
+		t.Cleanup(func() {
+			SetDefaultLimitsForYAMLUnmarshalling(getDefaultLimits())


I left this comment on the individual commit too, but it didn't show up on the PR

Should we move this closer to the other SetDefaultLimitsForYAMLUnmarshalling so it's easier to spot what's being cleaned up? Currently they're 10 lines apart

I think it's a good idea. Done here: 0077fd1

pracucci force-pushed the add-read-after-write-support-to-ingesters branch from c80608a to 728e961 Compare January 3, 2024 10:56

pracucci marked this pull request as ready for review January 3, 2024 10:56

pracucci requested a review from a team as a code owner January 3, 2024 10:56

pracucci requested a review from dimitarvdimitrov January 3, 2024 10:57

dimitarvdimitrov reviewed Jan 3, 2024

dimitarvdimitrov approved these changes Jan 4, 2024

dimitarvdimitrov reviewed Jan 4, 2024

pracucci added 8 commits January 4, 2024 11:34

Add strong read consistency support in ingester for experimental ingest storage

b662dbd

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Fix linter issue

9449e4c

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Fix tests

fd6bcd2

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Remove superfluous defer() call

e08673b

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Init default limits in TestMain()

659dfc1

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Do not export ingest.ConsumerGroup

0b2f71e

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Avoid the usage of sleep in TestIngester_QueryStream_IngestStorageReadConsistency

3d36c4a

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Move default limits resetting close to where we override it

0077fd1

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pracucci force-pushed the add-read-after-write-support-to-ingesters branch from 79d0a19 to 0077fd1 Compare January 4, 2024 10:37

pracucci merged commit 7cb75d7 into main Jan 4, 2024
28 checks passed

pracucci deleted the add-read-after-write-support-to-ingesters branch January 4, 2024 10:53

dimitarvdimitrov mentioned this pull request Jan 10, 2024

ingest storage: per-query X-Read-Consistency HTTP header #7091

Merged
