
proxy: Make sure there is no race condition between sender and recv of resp channel #745

Merged: 1 commit into master from fix-flakiness-proxy on Jan 21, 2019

Conversation

@bwplotka (Member) commented on Jan 16, 2019:

This fixes a potential race condition. Thanks to @jojohappy who highlighted the problem in this PR: #744

Sorry for creating a duplicate PR, but I found other race conditions as well (e.g. what if the sending stream returns an error?), and this fixes CI and edge cases for users, so it's quite urgent.

Particular race conditions and known issues this PR fixes:

  • The streamed series set starts sending to the buffered channel before we even start reading. When one of the stores returns an error and wants to send a warning, it is blocked forever. (@jojohappy's finding)
  • When an error happens during stream.Recv() at stream start, we were only warning, despite the partial-response-disable flag.
  • When the client stream is broken we end the response channel reader loop. This means that if a store still sends us data and the channel buffer is full, we are locked forever! (See the toy sketch below.)
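
All three boil down to the same shape: goroutines keep writing into a buffered response channel that nobody drains anymore. A toy, self-contained sketch of that deadlock shape (illustrative only, not the Thanos proxy code; all names are made up):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Buffered response channel, like the per-query channel in the proxy.
	respCh := make(chan string, 1)

	done := make(chan struct{})
	go func() {
		defer close(done)
		for i := 0; i < 3; i++ {
			// Blocks forever once the buffer is full and the reader is gone.
			respCh <- fmt.Sprintf("series %d", i)
		}
	}()

	// Reader gives up after the first response (e.g. the client stream broke).
	fmt.Println(<-respCh)

	select {
	case <-done:
		fmt.Println("sender finished")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("sender is stuck: nothing drains respCh anymore")
	}
}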

@domgreen you were fixing something similar. This should fix all cases. More unit tests to be added tomorrow.

PTAL @jojohappy

Signed-off-by: Bartek Plotka bwplotka@gmail.com

@bwplotka bwplotka changed the title proxy: Make sure all go routines are schedules using single errgroup. proxy: Make sure there is no race condition between sender and recv of resp channel Jan 16, 2019
@jojohappy (Member) left a comment:

👍 LGTM! But I have a question.

BTW, it should be added to the CHANGELOG! 😝

Two resolved review threads on pkg/store/proxy.go (outdated).
This fixes a potential race condition. A bit more efficient version of #744.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
var storeDebugMsgs []string
// Allow to buffer max 10 series response.
// Each might be quite large (multi chunk long series given by sidecar).
respSender, respRecv, closeFn = newRespCh(gctx, 10)
Member left a comment:
Why hardcode 10?

Is that something that could impact performance (in either a good or bad way)? What if Thanos is running on big machines; might people want to be able to tune that parameter?

@bwplotka (Member, Author) replied:
Good question.

This is a tradeoff between buffering memory and how long the connection stays open.

The problem right now is that each message can be a bit large, as we send the whole series from a sidecar. Imagine a scrape interval of 5 (kind of a worst-case scenario), a 6w query, and a single series: a message might be up to 2MB, so a single query will buffer up to 20MB in that worst case. Imagine multiple queries like this, etc. (: So keeping it low makes more sense.

In terms of exposing control over this, we can do it later, but we need to make sure it is easy to understand and check whether there is anything else that limits/extends that buffer (e.g. gRPC itself on both ends).
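
For illustration, a minimal sketch of the context-aware buffered response channel idea discussed here; the newRespCh signature and the storepb helpers below are assumptions based on the diff above, not the actual implementation:

// Sketch only: a sender that never blocks forever, because it also watches
// ctx.Done(), plus a close function for the sending side.
func newRespCh(ctx context.Context, buffer int) (send func(*storepb.SeriesResponse), recv <-chan *storepb.SeriesResponse, closeFn func()) {
	respCh := make(chan *storepb.SeriesResponse, buffer)
	send = func(r *storepb.SeriesResponse) {
		select {
		case <-ctx.Done(): // request cancelled or receiver gone: drop instead of blocking
		case respCh <- r:
		}
	}
	return send, respCh, func() { close(respCh) }
}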

@adrien-f (Member) commented:
I'm not an expert in goroutines, so I'm afraid I won't be able to give a good review on this, but I still found a comment to make 😄

The logic seems sound; do those changes deserve extra testing? Race conditions are hard to test, though.

@bwplotka (Member, Author) commented:
Yeah - we have quite a lot of tests for this; I also ran it with the -race flag and there seems to be no issue. In fact we could repro it quite a lot (even locally), so tests are there - just not deterministic because... race conditions. I feel like there is no value in adding more tests.

OK, I will merge this PR today if no objections arise - it's quite urgent. @domgreen @improbable-ludwik

@domgreen (Contributor) left a comment:
LGTM - still think we could easily end up OOMing with a large set of stores ... however, this would at least restrict it.

@@ -171,48 +202,74 @@ func (s *ProxyStore) Series(r *storepb.SeriesRequest, srv storepb.Store_SeriesSe

}

type warnSender interface {
Contributor left a comment:
Nit - I would move the declaration of this interface up near the ctxRespSender, which satisfies this interface.

@bwplotka (Member, Author) replied:
sooo I think it can be in both places, but the pattern is to have the consumers of an interface define the interface, so I vote for leaving it near the consumer.

Example references to this pattern: https://stackoverflow.com/questions/53381694/where-should-we-define-go-interface-in-multiple-consumer-scenario-what-about-i#comment93639200_53381694 https://twitter.com/davecheney/status/942593128355192832?lang=en
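
For readers unfamiliar with that pattern, a minimal sketch of a consumer-defined interface; warnSender matches the diff below, but the helper around it is hypothetical, not the actual proxy code:

// The consumer declares only the behaviour it needs; any concrete sender
// (e.g. a channel-backed ctxRespSender) satisfies it implicitly.
type warnSender interface {
	send(*storepb.SeriesResponse)
}

// Hypothetical consumer: it depends on warnSender alone, which is why the
// interface is declared next to it rather than next to the implementation.
func warnIfEmpty(s warnSender, matched int) {
	if matched == 0 {
		s.send(storepb.NewWarnSeriesResponse(errors.New("no store matched for this query")))
	}
}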


	if len(seriesSet) == 0 {
		// This is indicates that configured StoreAPIs are not the ones end user expects
		err := errors.New("No store matched for this query")
		level.Warn(s.logger).Log("err", err, "stores", strings.Join(storeDebugMsgs, ";"))
-		respCh <- storepb.NewWarnSeriesResponse(err)
+		respSender.send(storepb.NewWarnSeriesResponse(err))
		return nil
	}

mergedSet := storepb.MergeSeriesSets(seriesSet...)
for mergedSet.Next() {
var series storepb.Series
series.Labels, series.Chunks = mergedSet.At()
Contributor left a comment:
Not needed in this PR but would be good to add documentation to the interface of the iterator methods ... Next() At() etc.

@bwplotka (Member, Author) replied:
yes 👍 This came from the TSDB, I think. Worth contributing such docs there.. or actually hosting this in Thanos is important as well, I think, since we use it heavily. Where to put it, though? Maybe a blog post?
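
For reference, the kind of documentation being asked for, sketched on a SeriesSet-style iterator; the exact method signatures are assumptions based on the mergedSet usage in the diff above:

// SeriesSet iterates over a sorted set of series (sketch).
type SeriesSet interface {
	// Next advances the iterator. It returns false when the set is
	// exhausted or an error occurred; callers should then check Err.
	Next() bool
	// At returns the labels and chunks of the current series. It is only
	// valid after a call to Next that returned true.
	At() ([]storepb.Label, []storepb.AggrChunk)
	// Err returns the error that stopped iteration, if any.
	Err() error
}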

closeFn()
}()

for _, st := range stores {
Contributor left a comment:
if there are a large number of stores, e.g. 100+, we would end up spinning up 100+ goroutines that would then all go and fetch 10 series from their store ... does this also risk OOMing us? That could then be 1000 series in memory on the querier as we are merging.

@bwplotka (Member, Author) replied:
hm.... goroutines are not a problem; there can be thousands of those.

But you are right.. the 10-item buffer for each store plus 10 for the global channel makes it more problematic for a large number of stores. I wonder if we should open a ticket to discuss potential optimizations for those. Having a pool of goroutines could work.
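
A rough sketch of that "pool of goroutines" idea: cap how many stores are queried at once with a semaphore, so only a bounded number of per-store buffers is in flight at any time. Store, Fetch, fetchAll, and maxConcurrent are hypothetical names, not Thanos code:

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Store is a stand-in for a per-store client (hypothetical).
type Store interface {
	Fetch(ctx context.Context) error
}

func fetchAll(ctx context.Context, stores []Store, maxConcurrent int) error {
	g, gctx := errgroup.WithContext(ctx)
	sem := make(chan struct{}, maxConcurrent) // counting semaphore

	for _, st := range stores {
		st := st // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			// Acquire a slot, or bail out if the request is cancelled.
			select {
			case sem <- struct{}{}:
			case <-gctx.Done():
				return gctx.Err()
			}
			defer func() { <-sem }()
			return st.Fetch(gctx)
		})
	}
	return g.Wait()
}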

@bwplotka (Member, Author) commented:
Merging if there are no other objections. @domgreen

@bwplotka bwplotka merged commit 705d6a2 into master Jan 21, 2019
@bwplotka bwplotka deleted the fix-flakiness-proxy branch January 21, 2019 16:40