Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to close BlockSeriesClient cause store-gateway deadlock #6085

Closed
alanprot opened this issue Jan 30, 2023 · 0 comments · Fixed by #6086
Closed

Failure to close BlockSeriesClient cause store-gateway deadlock #6085

alanprot opened this issue Jan 30, 2023 · 0 comments · Fixed by #6086

Comments

@alanprot
Copy link
Contributor

alanprot commented Jan 30, 2023

What happened:
Seems that since this change we can have a dead lock when syncing blocks on store gateway.

During the sync process we may try to remove block and in order to do it safely we wait all the reads on that block to finish. See:

if err := b.Close(); err != nil {

thanos/pkg/store/bucket.go

Lines 2023 to 2026 in 48e2806

func (b *bucketBlock) Close() error {
b.pendingReaders.Wait()
return b.indexHeaderReader.Close()
}

The pendingReaders wg is incremented when we create a new bucket client here:

thanos/pkg/store/bucket.go

Lines 873 to 882 in 48e2806

chunkr = b.chunkReader()
}
return &blockSeriesClient{
ctx: ctx,
logger: logger,
extLset: b.extLset,
mint: req.MinTime,
maxt: req.MaxTime,
indexr: b.indexReader(),

thanos/pkg/store/bucket.go

Lines 1995 to 1998 in 48e2806

func (b *bucketBlock) indexReader() *bucketIndexReader {
b.pendingReaders.Add(1)
return newBucketIndexReader(b)
}

And decremented on the close method:

thanos/pkg/store/bucket.go

Lines 897 to 903 in 48e2806

func (b *blockSeriesClient) Close() {
if !b.skipChunks {
runutil.CloseWithLogOnErr(b.logger, b.chunkr, "series block")
}
runutil.CloseWithLogOnErr(b.logger, b.indexr, "series block")
}

thanos/pkg/store/bucket.go

Lines 2641 to 2644 in 48e2806

func (r *bucketIndexReader) Close() error {
r.block.pendingReaders.Done()
return nil
}

This seems to mean that if we create a newBlockSeriesClient but not call close, the blocks' pendingReaders wg will never be decremented causing the deadlock.

This seems to be the case at least in 2 places:

blockClient := newBlockSeriesClient(newCtx, s.logger, b, seriesReq, nil, bytesLimiter, nil, true, SeriesBatchSize, s.metrics.chunkFetchDuration)

and

blockClient := newBlockSeriesClient(newCtx, s.logger, b, seriesReq, nil, bytesLimiter, nil, true, SeriesBatchSize, s.metrics.chunkFetchDuration)

What you expected to happen:
pendingReaders wg should be decremented after the read is done.

How to reproduce it (as minimally and precisely as possible):
Call GetLabels on a block and wait it to be removed from store gateway.

Full logs to relevant components:

Go routine that shows the deadlock:

1 @ 0x43e7f6 0x44f75e 0x44f735 0x46bc05 0x47bcf2 0x1aad44a 0x1a9f7c8 0x1a9d573 0x1b86a32 0x1b878c4 0x470061
#   0x46bc04    sync.runtime_Semacquire+0x24                                                        GoLang/GoLang-1.x.119322.0/AL2_x86_64/DEV.STD.PTHREAD/build/lib/src/runtime/sema.go:62
#   0x47bcf1    sync.(*WaitGroup).Wait+0x51                                                     GoLang/GoLang-1.x.119322.0/AL2_x86_64/DEV.STD.PTHREAD/build/lib/src/sync/waitgroup.go:139
#   0x1aad449   Cortex/vendor/github.com/thanos-io/thanos/pkg/store.(*bucketBlock).Close+0x29               Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go:2006
#   0x1a9f7c7   Cortex/vendor/github.com/thanos-io/thanos/pkg/store.(*BucketStore).removeBlock+0x107            Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go:721
#   0x1a9d572   Cortex/vendor/github.com/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks+0x3f2             Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go:564
#   0x1b86a31   Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway.(*BucketStores).SyncBlocks.func1+0x31    Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway/bucket_stores.go:157
#   0x1b878c3   Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway.(*BucketStores).syncUsersBlocks.func2+0x143  Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway/bucket_stores.go:228

1 @ 0x43e7f6 0x44f75e 0x44f735 0x46bc05 0x47bcf2 0x1b87126 0x1b86beb 0x1b86aaa 0x1b8e6d6 0x1b8e233 0x1663628 0x470061
#   0x46bc04    sync.runtime_Semacquire+0x24                                                            GoLang/GoLang-1.x.119322.0/AL2_x86_64/DEV.STD.PTHREAD/build/lib/src/runtime/sema.go:62
#   0x47bcf1    sync.(*WaitGroup).Wait+0x51                                                         GoLang/GoLang-1.x.119322.0/AL2_x86_64/DEV.STD.PTHREAD/build/lib/src/sync/waitgroup.go:139
#   0x1b87125   Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway.(*BucketStores).syncUsersBlocks+0x4a5        Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway/bucket_stores.go:267
#   0x1b86bea   Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway.(*BucketStores).syncUsersBlocksWithRetries+0x10a Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway/bucket_stores.go:170
#   0x1b86aa9   Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway.(*BucketStores).SyncBlocks+0x29          Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway/bucket_stores.go:156
#   0x1b8e6d5   Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway.(*StoreGateway).syncStores+0x1f5         Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway/gateway.go:325
#   0x1b8e232   Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway.(*StoreGateway).running+0x412            Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/cortexproject/cortex/pkg/storegateway/gateway.go:296
#   0x1663627   Cortex/vendor/github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main+0x1e7              Cortex/Cortex-1.0.343.0/AL2_x86_64/DEV.STD.PTHREAD/build/gopath/src/Cortex/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant