[r266] Use BucketIndexBlocksFinder instead of BucketScanBlocksFinder #6790

Merged
1 commit merged on Dec 1, 2023
8 changes: 0 additions & 8 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -522,14 +522,6 @@ How to **fix** it:
- Set the shard size of one or more tenants to `0`; this will shard the given tenant's rule groups across all ingesters.
- Decrease the total number of ruler replicas by the number of idle replicas.

### MimirQuerierHasNotScanTheBucket

This alert fires when a Mimir querier is not successfully scanning blocks in the storage (bucket). A querier is expected to periodically iterate the bucket to find new and deleted blocks (defaults to every 5m), and if it has not successfully synced the bucket for a long time, it may end up querying only a subset of blocks, thus leading to potentially partial results.

How to **investigate**:

- Look for any scan error in the querier logs (i.e. networking or rate limiting issues)

### MimirStoreGatewayHasNotSyncTheBucket

This alert fires when a Mimir store-gateway is not successfully scanning blocks in the storage (bucket). A store-gateway is expected to periodically iterate the bucket to find new and deleted blocks (defaults to every 5m), and if it has not successfully synced the bucket for a long time, it may end up querying only a subset of blocks, thus leading to potentially partial results.
@@ -11,9 +11,6 @@ weight: 50

The bucket index is a per-tenant file that contains the list of blocks and block deletion marks in the storage. The bucket index is stored in the backend object storage, is periodically updated by the compactor, and used by queriers, store-gateways, and rulers (in [internal]({{< relref "../components/ruler#internal" >}}) operational mode) to discover blocks in the storage.

The bucket index is enabled by default, but is optional. It can be disabled via `-blocks-storage.bucket-store.bucket-index.enabled=false` (or its respective YAML configuration option).
Disabling the bucket index is not recommended.

## Benefits

The [querier]({{< relref "../components/querier" >}}), [store-gateway]({{< relref "../components/store-gateway" >}}) and [ruler]({{< relref "../components/ruler" >}}) must have an almost[^1] up-to-date view of the storage bucket, in order to find the right blocks to look up at query time (querier) and to load a block's [index-header]({{< relref "../binary-index-header" >}}) (store-gateway).
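As background for reviewers, the bucket index described above is essentially a per-tenant list of blocks plus deletion marks that readers filter by time range. A minimal sketch of that idea in Go, using hypothetical types for illustration rather than the actual `bucketindex` package:

```go
package main

import "fmt"

// Block is a simplified stand-in for a bucket index entry (illustrative only).
type Block struct {
	ID      string
	MinTime int64 // minimum sample timestamp in the block (ms)
	MaxTime int64 // maximum sample timestamp in the block (ms)
}

// BucketIndex is a simplified per-tenant index: the list of blocks plus deletion marks.
type BucketIndex struct {
	Blocks        []Block
	DeletionMarks map[string]bool // block IDs marked for deletion
}

// blocksForRange returns the non-deleted blocks overlapping [minT, maxT).
func (idx *BucketIndex) blocksForRange(minT, maxT int64) []Block {
	var out []Block
	for _, b := range idx.Blocks {
		if idx.DeletionMarks[b.ID] {
			continue // skip blocks that are pending deletion
		}
		if b.MinTime < maxT && b.MaxTime >= minT {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	idx := BucketIndex{
		Blocks: []Block{
			{ID: "01A", MinTime: 0, MaxTime: 1000},
			{ID: "01B", MinTime: 1000, MaxTime: 2000},
		},
		DeletionMarks: map[string]bool{"01A": true},
	}
	fmt.Println(idx.blocksForRange(500, 1500)) // only 01B is returned; 01A is marked for deletion
}
```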
@@ -16,28 +16,10 @@ The querier uses the [store-gateway]({{< relref "./store-gateway" >}}) component

## How it works

To find the correct blocks to look up at query time, the querier requires an almost up-to-date view of the bucket in long-term storage. The querier performs one of the following actions to ensure that the bucket view is updated:

1. Periodically download the [bucket index]({{< relref "../bucket-index" >}}) (default)
2. Periodically scan the bucket

Queriers do not need any content from blocks except their metadata, which includes the minimum and maximum timestamp of samples within the block.

### Bucket index enabled (default)

Queriers lazily download the bucket index when they receive the first query for a given tenant. The querier caches the bucket index in memory and periodically keeps it up-to-date.
To find the correct blocks to look up at query time, queriers lazily download the bucket index when they receive the first query for a given tenant. The querier caches the bucket index in memory and periodically keeps it up-to-date.

The bucket index contains a list of blocks and block deletion marks of a tenant. The querier later uses the list of blocks and block deletion marks to locate the set of blocks that need to be queried for the given query.

When the querier runs with the bucket index enabled, the querier startup time and the volume of API calls to object storage are reduced.
We recommend that you keep the bucket index enabled.
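The enabled path described above boils down to a per-tenant lazy load plus a periodic refresh. A rough sketch of that pattern, with hypothetical names and a plain string standing in for the index (not the real querier code, which is not shown in this diff):

```go
package main

import (
	"sync"
	"time"
)

// fetchIndex stands in for downloading a tenant's bucket index from object storage.
type fetchIndex func(tenant string) (string, error)

// indexCache lazily loads a per-tenant index on first use and refreshes it periodically.
type indexCache struct {
	mu      sync.Mutex
	entries map[string]string
	fetch   fetchIndex
}

func newIndexCache(fetch fetchIndex, refresh time.Duration) *indexCache {
	c := &indexCache{entries: map[string]string{}, fetch: fetch}
	// Background refresh keeps already-loaded indexes up to date.
	go func() {
		for range time.Tick(refresh) {
			c.mu.Lock()
			for tenant := range c.entries {
				if idx, err := c.fetch(tenant); err == nil {
					c.entries[tenant] = idx
				}
			}
			c.mu.Unlock()
		}
	}()
	return c
}

// get returns the cached index, downloading it on the first query for a tenant.
func (c *indexCache) get(tenant string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if idx, ok := c.entries[tenant]; ok {
		return idx, nil
	}
	idx, err := c.fetch(tenant)
	if err != nil {
		return "", err
	}
	c.entries[tenant] = idx
	return idx, nil
}

func main() {
	cache := newIndexCache(func(tenant string) (string, error) {
		return "bucket-index for " + tenant, nil // pretend download from the bucket
	}, time.Minute)
	_, _ = cache.get("user-1") // first query for user-1 triggers the download
}
```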

### Bucket index disabled

When [bucket index]({{< relref "../bucket-index" >}}) is disabled, queriers iterate over the storage bucket to discover blocks for all tenants and download the `meta.json` of each block. During this initial bucket scanning phase, a querier cannot process incoming queries and its `/ready` readiness probe endpoint will not return the HTTP status code `200`.

When running, queriers periodically iterate over the storage bucket to discover new tenants and recently uploaded blocks.

### Anatomy of a query request

When a querier receives a query range request, the request contains the following parameters:
21 changes: 12 additions & 9 deletions integration/querier_test.go
@@ -139,7 +139,6 @@ func testQuerierWithBlocksStorageRunningInMicroservicesMode(t *testing.T, stream
require.NoError(t, ingester.WaitSumMetrics(e2e.Equals(2), "cortex_ingester_memory_series_removed_total"))

// Start the compactor to have the bucket index created before querying.
// This is only required for tests using the bucket index, but doesn't hurt doing it for all of them.
compactor := e2emimir.NewCompactor("compactor", consul.NetworkHTTPEndpoint(), commonFlags)
require.NoError(t, s.StartAndWaitReady(compactor))

@@ -187,9 +186,6 @@ func testQuerierWithBlocksStorageRunningInMicroservicesMode(t *testing.T, stream
// the store-gateway ring if blocks sharding is enabled.
require.NoError(t, querier.WaitSumMetrics(e2e.Equals(float64(512+(512*storeGateways.NumInstances()))), "cortex_ring_tokens_total"))

// Wait until the querier has discovered the uploaded blocks.
require.NoError(t, querier.WaitSumMetrics(e2e.Equals(2), "cortex_blocks_meta_synced"))

// Wait until the store-gateway has synced the newly uploaded blocks. When sharding is enabled
// we don't know which store-gateway instance will sync the blocks, so we need to wait on
// metrics extracted from all instances.
@@ -235,6 +231,9 @@ func testQuerierWithBlocksStorageRunningInMicroservicesMode(t *testing.T, stream
// thanos_store_index_cache_requests_total: ExpandedPostings: 5, Postings: 2, Series: 2
instantQueriesCount++

// Make sure the querier is using the bucket index blocks finder.
require.NoError(t, querier.WaitSumMetrics(e2e.Greater(0), "cortex_bucket_index_loads_total"))

comparingFunction := e2e.Equals
if streamingEnabled {
// Some metrics can be higher when streaming is enabled. The exact number is not deterministic in every case.
@@ -439,15 +438,15 @@ func TestQuerierWithBlocksStorageRunningInSingleBinaryMode(t *testing.T) {
require.NoError(t, cluster.WaitSumMetrics(e2e.Equals(float64(3*cluster.NumInstances())), "cortex_ingester_memory_series_created_total"))
require.NoError(t, cluster.WaitSumMetrics(e2e.Equals(float64(2*cluster.NumInstances())), "cortex_ingester_memory_series_removed_total"))

// Wait until the querier has discovered the uploaded blocks (discovered both by the querier and store-gateway).
require.NoError(t, cluster.WaitSumMetricsWithOptions(e2e.Equals(float64(2*cluster.NumInstances()*2)), []string{"cortex_blocks_meta_synced"}, e2e.WithLabelMatchers(
labels.MustNewMatcher(labels.MatchEqual, "component", "querier"))))

// Wait until the store-gateway has synced the newly uploaded blocks. The number of blocks loaded
// may be greater than expected if the compactor is running (blocks may have been compacted).
const shippedBlocks = 2
require.NoError(t, cluster.WaitSumMetrics(e2e.GreaterOrEqual(float64(shippedBlocks*seriesReplicationFactor)), "cortex_bucket_store_blocks_loaded"))

// Start the compactor to have the bucket index created before querying.
compactor := e2emimir.NewCompactor("compactor", consul.NetworkHTTPEndpoint(), flags)
require.NoError(t, s.StartAndWaitReady(compactor))

var expectedCacheRequests int

// Query back the series (1 only in the storage, 1 only in the ingesters, 1 on both).
@@ -822,9 +821,13 @@ func TestQuerierWithBlocksStorageOnMissingBlocksFromStorage(t *testing.T) {
require.NoError(t, querier.WaitSumMetrics(e2e.Equals(512*2), "cortex_ring_tokens_total"))
require.NoError(t, storeGateway.WaitSumMetrics(e2e.Equals(512), "cortex_ring_tokens_total"))

// Start the compactor to have the bucket index created before querying.
compactor := e2emimir.NewCompactor("compactor", consul.NetworkHTTPEndpoint(), flags)
require.NoError(t, s.StartAndWaitReady(compactor))

// Wait until the blocks are old enough for consistency check
// 1 sync on startup, 3 to go over the consistency check limit explained above
require.NoError(t, querier.WaitSumMetrics(e2e.GreaterOrEqual(1+3), "cortex_blocks_meta_syncs_total"))
require.NoError(t, storeGateway.WaitSumMetrics(e2e.GreaterOrEqual(1+3), "cortex_blocks_meta_syncs_total"))

// Query back the series.
c, err = e2emimir.NewClient("", querier.HTTPEndpoint(), "", "", "user-1")
5 changes: 3 additions & 2 deletions integration/store_gateway_limits_hit_test.go
@@ -106,9 +106,10 @@ func Test_MaxSeriesAndChunksPerQueryLimitHit(t *testing.T) {
// they discovered the blocks in the storage.
querier := e2emimir.NewQuerier("querier", consul.NetworkHTTPEndpoint(), mergeFlags(flags, testData.additionalQuerierFlags))
storeGateway := e2emimir.NewStoreGateway("store-gateway", consul.NetworkHTTPEndpoint(), mergeFlags(flags, testData.additionalStoreGatewayFlags))
require.NoError(t, scenario.StartAndWaitReady(querier, storeGateway))
compactor := e2emimir.NewCompactor("compactor", consul.NetworkHTTPEndpoint(), flags)
require.NoError(t, scenario.StartAndWaitReady(querier, storeGateway, compactor))
t.Cleanup(func() {
require.NoError(t, scenario.Stop(querier, storeGateway))
require.NoError(t, scenario.Stop(querier, storeGateway, compactor))
})

client, err = e2emimir.NewClient("", querier.HTTPEndpoint(), "", "", "test")
@@ -772,19 +772,6 @@ spec:
for: 3m
labels:
severity: critical
- alert: MimirQuerierHasNotScanTheBucket
annotations:
message: Mimir Querier {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
}} has not successfully scanned the bucket since {{ $value | humanizeDuration
}}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirquerierhasnotscanthebucket
expr: |
(time() - cortex_querier_blocks_last_successful_scan_timestamp_seconds > 60 * 30)
and
cortex_querier_blocks_last_successful_scan_timestamp_seconds > 0
for: 5m
labels:
severity: critical
- alert: MimirStoreGatewayHasNotSyncTheBucket
annotations:
message: Mimir store-gateway {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
13 changes: 0 additions & 13 deletions operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -746,19 +746,6 @@ groups:
for: 3m
labels:
severity: critical
- alert: MimirQuerierHasNotScanTheBucket
annotations:
message: Mimir Querier {{ $labels.instance }} in {{ $labels.cluster }}/{{ $labels.namespace
}} has not successfully scanned the bucket since {{ $value | humanizeDuration
}}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirquerierhasnotscanthebucket
expr: |
(time() - cortex_querier_blocks_last_successful_scan_timestamp_seconds > 60 * 30)
and
cortex_querier_blocks_last_successful_scan_timestamp_seconds > 0
for: 5m
labels:
severity: critical
- alert: MimirStoreGatewayHasNotSyncTheBucket
annotations:
message: Mimir store-gateway {{ $labels.instance }} in {{ $labels.cluster }}/{{
13 changes: 0 additions & 13 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -760,19 +760,6 @@ groups:
for: 3m
labels:
severity: critical
- alert: MimirQuerierHasNotScanTheBucket
annotations:
message: Mimir Querier {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
}} has not successfully scanned the bucket since {{ $value | humanizeDuration
}}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirquerierhasnotscanthebucket
expr: |
(time() - cortex_querier_blocks_last_successful_scan_timestamp_seconds > 60 * 30)
and
cortex_querier_blocks_last_successful_scan_timestamp_seconds > 0
for: 5m
labels:
severity: critical
- alert: MimirStoreGatewayHasNotSyncTheBucket
annotations:
message: Mimir store-gateway {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
16 changes: 0 additions & 16 deletions operations/mimir-mixin/alerts/blocks.libsonnet
@@ -183,22 +183,6 @@
message: '%(product)s Ingester %(alert_instance_variable)s in %(alert_aggregation_variables)s is failing to write to TSDB WAL.' % $._config,
},
},
{
// Alert if the querier is not successfully scanning the bucket.
alert: $.alertName('QuerierHasNotScanTheBucket'),
'for': '5m',
expr: |||
(time() - cortex_querier_blocks_last_successful_scan_timestamp_seconds > 60 * 30)
and
cortex_querier_blocks_last_successful_scan_timestamp_seconds > 0
|||,
labels: {
severity: 'critical',
},
annotations: {
message: '%(product)s Querier %(alert_instance_variable)s in %(alert_aggregation_variables)s has not successfully scanned the bucket since {{ $value | humanizeDuration }}.' % $._config,
},
},
{
// Alert if the store-gateway is not successfully syncing the bucket.
alert: $.alertName('StoreGatewayHasNotSyncTheBucket'),
1 change: 1 addition & 0 deletions pkg/querier/blocks_finder_bucket_index.go
@@ -24,6 +24,7 @@ import (

var (
errBucketIndexBlocksFinderNotRunning = errors.New("bucket index blocks finder is not running")
errInvalidBlocksRange = errors.New("invalid blocks time range")
)

type BucketIndexBlocksFinderConfig struct {
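The newly added `errInvalidBlocksRange` suggests the bucket index blocks finder rejects malformed query time ranges up front. A hedged sketch of that kind of guard, using a hypothetical `validateRange` helper rather than the actual `GetBlocks` implementation (which is outside this hunk):

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidBlocksRange = errors.New("invalid blocks time range")

// validateRange is an illustrative guard: a finder would reject a query whose
// time range is reversed before consulting the bucket index.
func validateRange(minT, maxT int64) error {
	if minT > maxT {
		return fmt.Errorf("%w: minT=%d maxT=%d", errInvalidBlocksRange, minT, maxT)
	}
	return nil
}

func main() {
	if err := validateRange(2000, 1000); err != nil {
		fmt.Println(err) // invalid blocks time range: minT=2000 maxT=1000
	}
}
```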