Query: add query metrics to calls going through the Store API #5741

Merged 37 commits on Oct 18, 2022.
Changes from 34 commits.

Commits:
595b7e2
Implement granular query performance metrics for Thanos Query
douglascamata Sep 16, 2022
ff5a716
Merge branch 'main' of github.com:thanos-io/thanos into add-path-cont…
douglascamata Sep 28, 2022
f6e2511
Fix some linter warnings
douglascamata Sep 29, 2022
193d76f
Remove useless logs
douglascamata Sep 29, 2022
86f56e0
Refactor query tests
douglascamata Sep 29, 2022
f0c1a22
Fix long function definition (newQuerier)
douglascamata Sep 29, 2022
6803e4a
Remove TODO comment
douglascamata Sep 29, 2022
3b3bc0c
Fix query tests
douglascamata Sep 29, 2022
2e8778d
Reformat query docs
douglascamata Sep 29, 2022
d218f82
Merge branch 'main' of github.com:thanos-io/thanos into query-store-m…
douglascamata Sep 29, 2022
a738407
Remove useless return
douglascamata Sep 29, 2022
9057263
Put back old query docs
douglascamata Sep 29, 2022
f0567a9
Update query docs again
douglascamata Sep 29, 2022
792487f
Fix e2e env name
douglascamata Sep 29, 2022
d373d28
Retrigger CI
douglascamata Sep 29, 2022
3470ded
Add missing copyright notice.
douglascamata Sep 30, 2022
63d45e8
Retrigger CI
douglascamata Sep 30, 2022
2a4241d
Retrigger CI
douglascamata Sep 30, 2022
e04acb2
Bump wait time to twice scrape interval
douglascamata Sep 30, 2022
8e999ba
Retrigger CI
douglascamata Sep 30, 2022
6a097f0
Attempt to fix randomly failing test
douglascamata Oct 3, 2022
6131510
Checking more metrics to ensure the store is ready
douglascamata Oct 3, 2022
8bc38bb
Clean up test
douglascamata Oct 3, 2022
ffcc2d4
Do not record store api metrics when didn't touch series or samples
douglascamata Oct 3, 2022
6a13a7f
Retrigger CI
douglascamata Oct 3, 2022
31a9db8
Also skip store api metrics on zero chunks touched
douglascamata Oct 3, 2022
5568a14
Update changelog
douglascamata Oct 3, 2022
5888877
Merge branch 'main' of github.com:thanos-io/thanos into query-store-m…
douglascamata Oct 3, 2022
dfc1cf9
Fix broken changelog after merge
douglascamata Oct 4, 2022
dd1c104
Remove extra empty line
douglascamata Oct 4, 2022
fed8bf2
Refactor names and (un)exported types and fields
douglascamata Oct 4, 2022
8f6a245
Start listing metrics exported by Thanos Query
douglascamata Oct 4, 2022
65e94ee
Rename pkg/store/metrics -> pkg/store/telemetry
douglascamata Oct 4, 2022
8f1b75b
Merge branch 'main' of github.com:thanos-io/thanos into query-store-m…
douglascamata Oct 11, 2022
d6125e7
Merge branch 'main' of github.com:thanos-io/thanos into query-store-m…
douglascamata Oct 14, 2022
0951370
Get rid of the pkg/store/telemetry package
douglascamata Oct 17, 2022
5eda104
Merge branch 'main' into query-store-metrics
matej-g Oct 17, 2022
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -29,6 +29,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#5734](https://github.com/thanos-io/thanos/pull/5734) Store: Support disable block viewer UI.
- [#5411](https://github.com/thanos-io/thanos/pull/5411) Tracing: Add OpenTelemetry Protocol exporter.
- [#5779](https://github.com/thanos-io/thanos/pull/5779) Objstore: Support specifying S3 storage class.
- [#5741](https://github.com/thanos-io/thanos/pull/5741) Query: add metrics on how much data is being selected by downstream Store APIs.

### Changed

20 changes: 19 additions & 1 deletion cmd/thanos/query.go
@@ -25,6 +25,9 @@ import (
"github.com/prometheus/prometheus/discovery/targetgroup"
"github.com/prometheus/prometheus/model/labels"
"github.com/prometheus/prometheus/promql"
"github.com/thanos-io/thanos/pkg/store/telemetry"
"google.golang.org/grpc"

v1 "github.com/prometheus/prometheus/web/api/v1"
"github.com/thanos-community/promql-engine/engine"
apiv1 "github.com/thanos-io/thanos/pkg/api/query"
@@ -54,7 +57,6 @@ import (
"github.com/thanos-io/thanos/pkg/targets"
"github.com/thanos-io/thanos/pkg/tls"
"github.com/thanos-io/thanos/pkg/ui"
"google.golang.org/grpc"
)

const (
@@ -194,6 +196,10 @@ func registerQuery(app *extkingpin.App) {
alertQueryURL := cmd.Flag("alert.query-url", "The external Thanos Query URL that would be set in all alerts 'Source' field.").String()
grpcProxyStrategy := cmd.Flag("grpc.proxy-strategy", "Strategy to use when proxying Series requests to leaf nodes. Hidden and only used for testing, will be removed after lazy becomes the default.").Default(string(store.EagerRetrieval)).Hidden().Enum(string(store.EagerRetrieval), string(store.LazyRetrieval))

queryTelemetryDurationQuantiles := cmd.Flag("query.telemetry.request-duration-seconds-quantiles", "The quantiles for exporting metrics about the request duration quantiles.").Default("0.1", "0.25", "0.75", "1.25", "1.75", "2.5", "3", "5", "10").Float64List()
queryTelemetrySamplesQuantiles := cmd.Flag("query.telemetry.request-samples-quantiles", "The quantiles for exporting metrics about the samples count quantiles.").Default("100", "1000", "10000", "100000", "1000000").Int64List()
queryTelemetrySeriesQuantiles := cmd.Flag("query.telemetry.request-series-seconds-quantiles", "The quantiles for exporting metrics about the series count quantiles.").Default("10", "100", "1000", "10000", "100000").Int64List()

cmd.Setup(func(g *run.Group, logger log.Logger, reg *prometheus.Registry, tracer opentracing.Tracer, _ <-chan struct{}, _ bool) error {
selectorLset, err := parseFlagLabels(*selectorLabels)
if err != nil {
@@ -305,6 +311,9 @@ func registerQuery(app *extkingpin.App) {
*alertQueryURL,
*grpcProxyStrategy,
component.Query,
*queryTelemetryDurationQuantiles,
*queryTelemetrySamplesQuantiles,
*queryTelemetrySeriesQuantiles,
promqlEngineType(*promqlEngine),
)
})
@@ -377,6 +386,9 @@ func runQuery(
alertQueryURL string,
grpcProxyStrategy string,
comp component.Component,
queryTelemetryDurationQuantiles []float64,
queryTelemetrySamplesQuantiles []int64,
queryTelemetrySeriesQuantiles []int64,
promqlEngine promqlEngineType,
) error {
if alertQueryURL == "" {
@@ -680,6 +692,12 @@ func runQuery(
extprom.WrapRegistererWithPrefix("thanos_query_concurrent_", reg),
maxConcurrentQueries,
),
telemetry.NewSeriesStatsAggregator(
reg,
queryTelemetryDurationQuantiles,
queryTelemetrySamplesQuantiles,
queryTelemetrySeriesQuantiles,
),
reg,
)

19 changes: 19 additions & 0 deletions docs/components/query.md
@@ -378,6 +378,15 @@ Flags:
be able to query without deduplication using
'dedup=false' parameter. Data includes time
series, recording rules, and alerting rules.
--query.telemetry.request-duration-seconds-quantiles=0.1... ...
The quantiles for exporting metrics about the
request duration quantiles.
--query.telemetry.request-samples-quantiles=100... ...
The quantiles for exporting metrics about the
samples count quantiles.
--query.telemetry.request-series-seconds-quantiles=10... ...
The quantiles for exporting metrics about the
series count quantiles.
--query.timeout=2m Maximum time to process query by query node.
--request.logging-config=<content>
Alternative to 'request.logging-config-file'
@@ -460,3 +469,13 @@ Flags:
of Prometheus.

```

## Exported metrics

Thanos Query also exports metrics about its own performance. You can find a list with these metrics below.

**Disclaimer**: this list is incomplete. The remaining metrics will be added over time.

| Name | Type | Labels | Description |
|-----------------------------------------|-----------|-----------------------|-------------------------------------------------------------------------------------------------------------------|
| thanos_store_api_query_duration_seconds | Histogram | samples_le, series_le | Duration of the Thanos Store API select phase for a query according to the amount of samples and series selected. |
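
The `samples_le` and `series_le` labels in the table above suggest that each query duration is filed under the smallest configured threshold that its sample and series counts fit into. The following is a minimal, standard-library-only sketch of that bucketing idea. The names `findBucket`, `labelKey`, and `durations` are hypothetical and not taken from the PR, and the `+Inf` fallback for values above every threshold is an assumption. The thresholds reuse the flag defaults shown in `cmd/thanos/query.go`.

```go
package main

import (
	"fmt"
	"strconv"
)

// findBucket returns the smallest threshold that is >= value, formatted as a
// label value. It returns "+Inf" (an assumed fallback) when value exceeds
// every threshold. thresholds must be sorted in ascending order.
func findBucket(value int64, thresholds []int64) string {
	for _, t := range thresholds {
		if value <= t {
			return strconv.FormatInt(t, 10)
		}
	}
	return "+Inf"
}

// labelKey is a hypothetical stand-in for the (samples_le, series_le) label pair.
type labelKey struct {
	samplesLe string
	seriesLe  string
}

// durations is a toy replacement for a labelled histogram: it only keeps the
// raw durations observed for each label pair.
type durations map[labelKey][]float64

// observe files one query duration under the label pair derived from the
// number of samples and series the query selected.
func (d durations) observe(samples, series int64, seconds float64, samplesT, seriesT []int64) {
	k := labelKey{
		samplesLe: findBucket(samples, samplesT),
		seriesLe:  findBucket(series, seriesT),
	}
	d[k] = append(d[k], seconds)
}

func main() {
	// Defaults of the samples and series flags from the diff.
	samplesT := []int64{100, 1000, 10000, 100000, 1000000}
	seriesT := []int64{10, 100, 1000, 10000, 100000}

	d := durations{}
	d.observe(2500, 42, 0.3, samplesT, seriesT)   // filed under samples_le=10000, series_le=100
	d.observe(5000000, 7, 4.2, samplesT, seriesT) // filed under samples_le=+Inf, series_le=10

	for k, v := range d {
		fmt.Printf("samples_le=%s series_le=%s durations=%v\n", k.samplesLe, k.seriesLe, v)
	}
}
```

With this layout, a slow query that touched few samples and few series ends up in a different label pair than a slow query that touched millions of samples, which makes the two cases distinguishable when reading the histogram.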
2 changes: 2 additions & 0 deletions pkg/api/query/grpc.go
@@ -94,6 +94,7 @@ func (g *GRPCAPI) Query(request *querypb.QueryRequest, server querypb.Query_Quer
request.EnableQueryPushdown,
false,
request.ShardInfo,
query.NoopSeriesStatsReporter,
)
qry, err := g.queryEngine.NewInstantQuery(queryable, &promql.QueryOpts{LookbackDelta: lookbackDelta}, request.Query, ts)
if err != nil {
@@ -168,6 +169,7 @@ func (g *GRPCAPI) QueryRange(request *querypb.QueryRangeRequest, srv querypb.Que
request.EnableQueryPushdown,
false,
request.ShardInfo,
query.NoopSeriesStatsReporter,

why is this set to no-op here?

Contributor Author

Because this is an internally initiated request, not something started by a user.

)

startTime := time.Unix(request.StartTimeSeconds, 0)
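
The review exchange in `pkg/api/query/grpc.go` contrasts two kinds of reporter: a no-op one for internally initiated requests and a collecting one for user-initiated queries. A rough sketch of that pattern follows, assuming the reporter is a plain callback. The types `seriesStats` and `statsReporter` and the helpers `newCollectingReporter` and `runSelect` are hypothetical stand-ins, not the PR's `storepb.SeriesStatsCounter`, `query.NoopSeriesStatsReporter`, or `query.NewAggregateStatsReporter`, whose exact signatures are not shown in this diff. Concurrency safety is left out for brevity.

```go
package main

import "fmt"

// seriesStats is a hypothetical, trimmed-down stand-in for the per-request
// statistics a Store API select could report.
type seriesStats struct {
	Series, Chunks, Samples int
}

// statsReporter is an assumed callback shape for reporting statistics.
type statsReporter func(seriesStats)

// noopReporter discards everything it receives. Callers that should not
// contribute to the metrics pass this one.
var noopReporter statsReporter = func(seriesStats) {}

// newCollectingReporter returns a reporter that appends every reported value
// to the slice dst points to.
func newCollectingReporter(dst *[]seriesStats) statsReporter {
	return func(s seriesStats) { *dst = append(*dst, s) }
}

// runSelect simulates one select call that reports its statistics once.
func runSelect(report statsReporter) {
	report(seriesStats{Series: 3, Chunks: 12, Samples: 1440})
}

func main() {
	var collected []seriesStats

	runSelect(newCollectingReporter(&collected)) // user-initiated path: stats are kept
	runSelect(noopReporter)                      // internal path: stats are dropped

	fmt.Println(len(collected)) // prints 1
}
```

Passing the reporter as an argument keeps the select code identical on both paths. Only the caller decides whether the statistics are recorded.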
98 changes: 88 additions & 10 deletions pkg/api/query/v1.go
@@ -41,9 +41,9 @@ import (
"github.com/prometheus/prometheus/promql"
"github.com/prometheus/prometheus/promql/parser"
"github.com/prometheus/prometheus/storage"
v1 "github.com/prometheus/prometheus/web/api/v1"

"github.com/prometheus/prometheus/util/stats"
v1 "github.com/prometheus/prometheus/web/api/v1"
"github.com/thanos-io/thanos/pkg/store/telemetry"

"github.com/thanos-io/thanos/pkg/api"
"github.com/thanos-io/thanos/pkg/exemplars"
@@ -107,6 +107,13 @@ type QueryAPI struct {
defaultMetadataTimeRange time.Duration

queryRangeHist prometheus.Histogram

seriesStatsAggregator seriesQueryPerformanceMetricsAggregator
}

type seriesQueryPerformanceMetricsAggregator interface {
Aggregate(seriesStats storepb.SeriesStatsCounter)
Observe(duration float64)
}

// NewQueryAPI returns an initialized QueryAPI type.
@@ -134,8 +141,12 @@ func NewQueryAPI(
defaultMetadataTimeRange time.Duration,
disableCORS bool,
gate gate.Gate,
statsAggregator seriesQueryPerformanceMetricsAggregator,
reg *prometheus.Registry,
) *QueryAPI {
if statsAggregator == nil {
statsAggregator = &telemetry.NoopSeriesStatsAggregator{}
}
return &QueryAPI{
baseAPI: api.NewBaseAPI(logger, disableCORS, flagsMap),
logger: logger,
@@ -160,6 +171,7 @@ func NewQueryAPI(
defaultInstantQueryMaxSourceResolution: defaultInstantQueryMaxSourceResolution,
defaultMetadataTimeRange: defaultMetadataTimeRange,
disableCORS: disableCORS,
seriesStatsAggregator: statsAggregator,

queryRangeHist: promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
Name: "thanos_query_range_requested_timespan_duration_seconds",
@@ -396,7 +408,24 @@ func (qapi *QueryAPI) query(r *http.Request) (interface{}, []error, *api.ApiErro
span, ctx := tracing.StartSpan(ctx, "promql_instant_query")
defer span.Finish()

qry, err := qapi.queryEngine.NewInstantQuery(qapi.queryableCreate(enableDedup, replicaLabels, storeDebugMatchers, maxSourceResolution, enablePartialResponse, qapi.enableQueryPushdown, false, shardInfo), &promql.QueryOpts{LookbackDelta: lookbackDelta}, r.FormValue("query"), ts)
var seriesStats []storepb.SeriesStatsCounter
qry, err := qapi.queryEngine.NewInstantQuery(
qapi.queryableCreate(
enableDedup,
replicaLabels,
storeDebugMatchers,
maxSourceResolution,
enablePartialResponse,
qapi.enableQueryPushdown,
false,
shardInfo,
query.NewAggregateStatsReporter(&seriesStats),
),
&promql.QueryOpts{LookbackDelta: lookbackDelta},
r.FormValue("query"),
ts,
)

if err != nil {
return nil, nil, &api.ApiError{Typ: api.ErrorBadData, Err: err}, func() {}
}
@@ -409,6 +438,7 @@ func (qapi *QueryAPI) query(r *http.Request) (interface{}, []error, *api.ApiErro
}
defer qapi.gate.Done()

beforeRange := time.Now()
res := qry.Exec(ctx)
if res.Err != nil {
switch res.Err.(type) {
@@ -421,6 +451,10 @@ func (qapi *QueryAPI) query(r *http.Request) (interface{}, []error, *api.ApiErro
}
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: res.Err}, qry.Close
}
for i := range seriesStats {
qapi.seriesStatsAggregator.Aggregate(seriesStats[i])
}
qapi.seriesStatsAggregator.Observe(time.Since(beforeRange).Seconds())

// Optional stats field in response if parameter "stats" is not empty.
var qs stats.QueryStats
@@ -525,8 +559,19 @@ func (qapi *QueryAPI) queryRange(r *http.Request) (interface{}, []error, *api.Ap
span, ctx := tracing.StartSpan(ctx, "promql_range_query")
defer span.Finish()

var seriesStats []storepb.SeriesStatsCounter
qry, err := qapi.queryEngine.NewRangeQuery(
qapi.queryableCreate(enableDedup, replicaLabels, storeDebugMatchers, maxSourceResolution, enablePartialResponse, qapi.enableQueryPushdown, false, shardInfo),
qapi.queryableCreate(
enableDedup,
replicaLabels,
storeDebugMatchers,
maxSourceResolution,
enablePartialResponse,
qapi.enableQueryPushdown,
false,
shardInfo,
query.NewAggregateStatsReporter(&seriesStats),
),
&promql.QueryOpts{LookbackDelta: lookbackDelta},
r.FormValue("query"),
start,
@@ -545,6 +590,7 @@ func (qapi *QueryAPI) queryRange(r *http.Request) (interface{}, []error, *api.Ap
}
defer qapi.gate.Done()

beforeRange := time.Now()
res := qry.Exec(ctx)
if res.Err != nil {
switch res.Err.(type) {
@@ -555,6 +601,10 @@ func (qapi *QueryAPI) queryRange(r *http.Request) (interface{}, []error, *api.Ap
}
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: res.Err}, qry.Close
}
for i := range seriesStats {
qapi.seriesStatsAggregator.Aggregate(seriesStats[i])
}
qapi.seriesStatsAggregator.Observe(time.Since(beforeRange).Seconds())

// Optional stats field in response if parameter "stats" is not empty.
var qs stats.QueryStats
@@ -600,8 +650,17 @@ func (qapi *QueryAPI) labelValues(r *http.Request) (interface{}, []error, *api.A
matcherSets = append(matcherSets, matchers)
}

q, err := qapi.queryableCreate(true, nil, storeDebugMatchers, 0, enablePartialResponse, qapi.enableQueryPushdown, true, nil).
Querier(ctx, timestamp.FromTime(start), timestamp.FromTime(end))
q, err := qapi.queryableCreate(
true,
nil,
storeDebugMatchers,
0,
enablePartialResponse,
qapi.enableQueryPushdown,
true,
nil,
query.NoopSeriesStatsReporter,
).Querier(ctx, timestamp.FromTime(start), timestamp.FromTime(end))
if err != nil {
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: err}, func() {}
}
@@ -687,8 +746,18 @@ func (qapi *QueryAPI) series(r *http.Request) (interface{}, []error, *api.ApiErr
return nil, nil, apiErr, func() {}
}

q, err := qapi.queryableCreate(enableDedup, replicaLabels, storeDebugMatchers, math.MaxInt64, enablePartialResponse, qapi.enableQueryPushdown, true, nil).
Querier(r.Context(), timestamp.FromTime(start), timestamp.FromTime(end))
q, err := qapi.queryableCreate(
enableDedup,
replicaLabels,
storeDebugMatchers,
math.MaxInt64,
enablePartialResponse,
qapi.enableQueryPushdown,
true,
nil,
query.NoopSeriesStatsReporter,
).Querier(r.Context(), timestamp.FromTime(start), timestamp.FromTime(end))

if err != nil {
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: err}, func() {}
}
@@ -737,8 +806,17 @@ func (qapi *QueryAPI) labelNames(r *http.Request) (interface{}, []error, *api.Ap
matcherSets = append(matcherSets, matchers)
}

q, err := qapi.queryableCreate(true, nil, storeDebugMatchers, 0, enablePartialResponse, qapi.enableQueryPushdown, true, nil).
Querier(r.Context(), timestamp.FromTime(start), timestamp.FromTime(end))
q, err := qapi.queryableCreate(
true,
nil,
storeDebugMatchers,
0,
enablePartialResponse,
qapi.enableQueryPushdown,
true,
nil,
query.NoopSeriesStatsReporter,
).Querier(r.Context(), timestamp.FromTime(start), timestamp.FromTime(end))
if err != nil {
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: err}, func() {}
}
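
In the `query` and `queryRange` handlers of this file, the per-store statistics are collected during execution, and only after a successful `Exec` are they aggregated and the elapsed time observed. The sketch below illustrates that flow with standard-library types only. The `aggregator` interface mirrors the two methods shown in the diff (`Aggregate`, `Observe`), but `seriesStats`, `sumAggregator`, `runQuery`, and `errExec` are hypothetical. The rule of skipping the observation when no series, chunks, or samples were touched is one reading of the commit messages "Do not record store api metrics when didn't touch series or samples" and "Also skip store api metrics on zero chunks touched", not a copy of the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// seriesStats is a hypothetical stand-in for storepb.SeriesStatsCounter.
type seriesStats struct{ Series, Chunks, Samples int }

// aggregator mirrors the two-method interface from the diff.
type aggregator interface {
	Aggregate(s seriesStats)
	Observe(duration float64)
}

// sumAggregator sums the stats of one query and keeps the observed durations.
type sumAggregator struct {
	total    seriesStats
	observed []float64
}

// Aggregate adds the stats of one store response to the running total.
func (a *sumAggregator) Aggregate(s seriesStats) {
	a.total.Series += s.Series
	a.total.Chunks += s.Chunks
	a.total.Samples += s.Samples
}

// Observe records the duration only when the query touched some data
// (assumed rule, see lead-in), then resets the total for the next query.
func (a *sumAggregator) Observe(duration float64) {
	if a.total.Series > 0 && a.total.Chunks > 0 && a.total.Samples > 0 {
		a.observed = append(a.observed, duration)
	}
	a.total = seriesStats{}
}

// errExec is a hypothetical execution error.
var errExec = errors.New("exec failed")

// runQuery times exec, collects whatever it reports, and feeds the aggregator
// only when exec succeeds. Failed queries leave the aggregator untouched.
func runQuery(agg aggregator, exec func(report func(seriesStats)) error) error {
	var collected []seriesStats
	start := time.Now()
	err := exec(func(s seriesStats) { collected = append(collected, s) })
	if err != nil {
		return err
	}
	for i := range collected {
		agg.Aggregate(collected[i])
	}
	agg.Observe(time.Since(start).Seconds())
	return nil
}

func main() {
	agg := &sumAggregator{}
	ok := func(report func(seriesStats)) error {
		report(seriesStats{Series: 2, Chunks: 8, Samples: 960})
		report(seriesStats{Series: 1, Chunks: 4, Samples: 480})
		return nil
	}
	failing := func(report func(seriesStats)) error { return errExec }

	_ = runQuery(agg, ok)
	fmt.Println(runQuery(agg, failing), len(agg.observed))
}
```

Collecting into a local slice first and aggregating afterwards means a query that fails halfway does not leave partial statistics behind in the shared aggregator.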
4 changes: 4 additions & 0 deletions pkg/api/query/v1_test.go
@@ -45,6 +45,7 @@ import (
promgate "github.com/prometheus/prometheus/util/gate"
"github.com/prometheus/prometheus/util/stats"
"github.com/thanos-io/thanos/pkg/compact"
"github.com/thanos-io/thanos/pkg/store/telemetry"

baseAPI "github.com/thanos-io/thanos/pkg/api"
"github.com/thanos-io/thanos/pkg/component"
@@ -198,6 +199,7 @@ func TestQueryEndpoints(t *testing.T) {
queryRangeHist: promauto.With(prometheus.NewRegistry()).NewHistogram(prometheus.HistogramOpts{
Name: "query_range_hist",
}),
seriesStatsAggregator: &telemetry.NoopSeriesStatsAggregator{},
}

start := time.Unix(0, 0)
@@ -737,6 +739,7 @@ func TestMetadataEndpoints(t *testing.T) {
queryRangeHist: promauto.With(prometheus.NewRegistry()).NewHistogram(prometheus.HistogramOpts{
Name: "query_range_hist",
}),
seriesStatsAggregator: &telemetry.NoopSeriesStatsAggregator{},
}
apiWithLabelLookback := &QueryAPI{
baseAPI: &baseAPI.BaseAPI{
@@ -750,6 +753,7 @@ func TestMetadataEndpoints(t *testing.T) {
queryRangeHist: promauto.With(prometheus.NewRegistry()).NewHistogram(prometheus.HistogramOpts{
Name: "query_range_hist",
}),
seriesStatsAggregator: &telemetry.NoopSeriesStatsAggregator{},
}

var tests = []endpointTestCase{