
Jaeger cleanup: much fewer spans, but with more stats - and more stats for meta section #1380

Merged
merged 9 commits into master from jaeger-cleanup
Jul 10, 2019

Conversation

Dieterbe
Contributor

@Dieterbe Dieterbe commented Jul 4, 2019

Goals:

  • honor our internal jaeger guidelines, specifically no spans on a per-series basis, as this was overloading our jaeger cluster. for now this means taking away many spans about the per-series getTargetsLocal, persistent store, cache and tank (aggmetrics) gets, etc. this means we lose a lot of fidelity. we can reintroduce these later if needed, once we come up with a better way, such as multi-series (batched) gets from tank/cache/store. but i'm not too concerned about this because of the point below.
  • another way i want to provide the most useful insights without having spans for all these per-series operations is by collecting stats (measured per-series at this point, but summed up per-request) and attaching these to the jaeger trace. stats such as the number of cache hits/misses/partial hits, and the number of chunks loaded (and perhaps also time spent loading) from tank/cache/store. this should all be a low-cost way to provide very useful insights in a digestible way. i also see this as a way to extend the metadata stats returned with query responses as introduced in render response metadata: stats #1344. a rough sketch of the idea follows below.
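A rough, illustrative sketch of that idea (names here are placeholders, not the actual metrictank API): per-series code paths bump counters on one request-scoped struct, and only the summed values are attached to the request's span as tags, instead of emitting a child span per series.

package stats

import (
	"sync/atomic"

	opentracing "github.com/opentracing/opentracing-go"
)

// requestStats is a hypothetical request-scoped accumulator; the PR's real
// type is StorageStats and carries more counters.
type requestStats struct {
	cacheHit        uint32
	chunksFromCache uint32
}

// recordCacheHit is called from per-series code paths, possibly concurrently.
func (r *requestStats) recordCacheHit(chunks uint32) {
	atomic.AddUint32(&r.cacheHit, 1)
	atomic.AddUint32(&r.chunksFromCache, chunks)
}

// tagSpan is called once per request: a handful of tags instead of one span per series.
func (r *requestStats) tagSpan(span opentracing.Span) {
	span.SetTag("cache-hit", atomic.LoadUint32(&r.cacheHit))
	span.SetTag("chunks-from-cache", atomic.LoadUint32(&r.chunksFromCache))
}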

too soon for in-depth review, but curious to hear anyone's thoughts?

@woodsaj
Member

woodsaj commented Jul 4, 2019

I like the idea of simplifying the traces.

On the implementation, instead of having the numerous getSeries* calls pass back RenderStats (which you will need to aggregate at each step, though the current code is missing this), I would just pass a single *RenderStats through the call pipeline. Each func can then just increment the counters as the data is processed. eg

type RenderStats struct {
	cacheMiss       uint32
	cacheHitPartial uint32
	cacheHit        uint32
	chunksFromTank  uint32
	chunksFromCache uint32
	chunksFromStore uint32
}
func (r *RenderStats) CacheMiss(i int) {
	atomic.AddUint32(&r.cacheMiss, uint32(i))
}
func (r *RenderStats) CacheHit(i int) {
	atomic.AddUint32(&r.cacheHit, uint32(i))
}
...
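A hypothetical call site for that pattern, assuming the RenderStats sketch above (lookupCache and the per-chunk helper are illustrative placeholders, not code from this PR): every fetch path increments the shared instance directly, so nothing needs to be merged on the way back up the call tree.

func getSeriesCachedStore(key string, stats *RenderStats) error {
	if cachedChunks, ok := lookupCache(key); ok { // lookupCache: illustrative only
		stats.CacheHit(1)
		stats.ChunksFromCache(len(cachedChunks)) // assuming a helper per counter, as the "..." above suggests
		return nil
	}
	stats.CacheMiss(1)
	// fall back to the store, bump ChunksFromStore, etc.
	return nil
}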

* don't trace on a per-series level, we don't want that many spans
* consistent error logging via logger and tracer, in every
  callsite of getSeriesCachedStore. meaning we can remove the logs
  and tracing stuff from getSeriesCachedStore and the store.Search
  functions
along these call trees:

executePlan <- records to span + reports stats into api response
  getTargets
    getTargetsRemote
    getTargetsLocal
      getTarget
        getSeriesFixed
          getSeries
            getSeriesCachedStore

prometheus.querier.Select <-- we ignore the stats here
  getTargets
    ....

Server.getData <- records to span + reports stats into rpc response
  getTargetsLocal
    getTarget
      getSeriesFixed
        getSeries
          getSeriesCachedStore

Note that:
* GetDataRespV0 and GetDataRespV1 responses can be used interchangeably in clusters
* we blend the different sources of stats into a unified presentation
  for the user, this required making the json marshaling slightly more
  complicated.
* in jaeger, the stats for executePlan are the sum of the (already
  reported) stats of the getData calls. (a rough sketch of this summing follows below)
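A hypothetical sketch of that summing (type and field names are illustrative; only the GetDataResp responses and the executePlan/getData roles come from this PR): the query node folds the stats carried in each peer's getData response into its own request-level total before tagging the span and filling the meta section. The deserialized peer stats are no longer being written to, so plain addition suffices in this sketch.

package example

// storageStats is a stand-in for the PR's StorageStats, with only a subset of counters.
type storageStats struct {
	CacheHit, CacheMiss, ChunksFromStore uint32
}

// getDataResp is a stand-in for the rpc response, with a hypothetical Stats field.
type getDataResp struct {
	Stats storageStats
}

// totalStats combines the local stats with the stats reported by every peer.
func totalStats(local storageStats, peerResps []getDataResp) storageStats {
	total := local
	for _, r := range peerResps {
		total.CacheHit += r.Stats.CacheHit
		total.CacheMiss += r.Stats.CacheMiss
		total.ChunksFromStore += r.Stats.ChunksFromStore
	}
	return total
}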
@Dieterbe
Contributor Author

Dieterbe commented Jul 6, 2019

extended version of #1344. I think @shanson7 will like this. note that the stats are collected across RPC boundaries and summed across all peers

curl (...) --data 'target=some.id.of.a.metric.1*&from=1562412706&until=1562412843&format=json&maxDataPoints=1920&meta=true' --compressed | jsonpp | less

{
    "version": "v0.1",
    "meta": {
        "stats": {
            "executeplan.resolve-series.ms": 1,
            "executeplan.get-targets.ms": 10,
            "executeplan.prepare-series.ms": 0,
            "executeplan.plan-run.ms": 0,
            "executeplan.series-fetch.count": 112,
            "executeplan.points-fetch.count": 15344,
            "executeplan.points-return.count": 15344,
            "executeplan.cache-miss.count": 0,
            "executeplan.cache-hit-partial.count": 0,
            "executeplan.cache-hit.count": 112,
            "executeplan.chunks-from-tank.count": 0,
            "executeplan.chunks-from-cache.count": 336,
            "executeplan.chunks-from-store.count": 0
        }
    },
    "series": [
        {
            "target": "some.id.of.a.metric.1",
            "tags": {
                "name": "some.id.of.a.metric.1"
            },
            "datapoints": [
                [
                    0.8159659229285896,
                    1562412707
                ],
                [
...

@Dieterbe
Contributor Author

Dieterbe commented Jul 6, 2019

jaeger traces are now much plainer, but with more stats:

[screenshots: jaeger, jaeger2]

otherwise we only had the total and the ones for other peers
@Dieterbe Dieterbe changed the title from "[WIP] Jaeger cleanup" to "Jaeger cleanup: much fewer spans, but with more stats - and more stats for meta section" Jul 6, 2019
@Dieterbe Dieterbe requested review from woodsaj and replay July 6, 2019 14:15
@Dieterbe
Contributor Author

Dieterbe commented Jul 6, 2019

cc @tomwilkie

@tomwilkie
Contributor

Looks good!

atomic.AddUint32(&ss.ChunksFromStore, n)
}

// Add adds a to ss. Note that a is presumed to be "done" (read unsafely)
Member

While it would only make sense to try and add a once it is 'done', it would be unwise/unsafe to presume that is always true.
We should just use atomic operations to read a to avoid the possibility of a data race. eg

atomic.AddUint32(&ss.CacheHit, atomic.LoadUint32(&a.CacheHit))

The call to Add() itself won't be atomic (i.e. individual counters might increase between when the first counter from a is read and when the last one is), but that won't have any impact on our use case.
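A possible shape for Add along those lines (a sketch only; the field names are taken from the snippets quoted in this PR, and the package/type boilerplate is added here just to make it self-contained):

package example

import "sync/atomic"

type StorageStats struct {
	CacheMiss       uint32
	CacheHitPartial uint32
	CacheHit        uint32
	ChunksFromTank  uint32
	ChunksFromCache uint32
	ChunksFromStore uint32
}

// Add adds a to ss. a is still presumed to be "done", but every read of it
// goes through atomic.LoadUint32, so a concurrently-updated a can never cause
// a data race. Add as a whole is still not atomic, which is fine for this use case.
func (ss *StorageStats) Add(a *StorageStats) {
	atomic.AddUint32(&ss.CacheMiss, atomic.LoadUint32(&a.CacheMiss))
	atomic.AddUint32(&ss.CacheHitPartial, atomic.LoadUint32(&a.CacheHitPartial))
	atomic.AddUint32(&ss.CacheHit, atomic.LoadUint32(&a.CacheHit))
	atomic.AddUint32(&ss.ChunksFromTank, atomic.LoadUint32(&a.ChunksFromTank))
	atomic.AddUint32(&ss.ChunksFromCache, atomic.LoadUint32(&a.ChunksFromCache))
	atomic.AddUint32(&ss.ChunksFromStore, atomic.LoadUint32(&a.ChunksFromStore))
}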

Contributor Author

@Dieterbe Dieterbe Jul 8, 2019

i have pondered the same thing.
I considered it the programmer's responsibility to make sure a is "done". I agree using atomics for a is safer, but at the same time, it would mask issues should they appear: it doesn't make sense to compute the aggregate stats if we're not done yet generating or collecting the stats. similar to how it also wouldn't make sense to build up our series slice if the inputs haven't been decoded yet.

I think this is in line with best practices everywhere in go code: whenever calling a function it's the programmer's responsibility to make sure there is no unsafe data access. this is true for stdlib and many 3rd party libraries.

note that Add in the current form is also not atomic. I agree there is no need for it.

Member

it would mask issues should they appear

The main issue we need to avoid is panics. Panics are bad and we should avoid them wherever possible. Wrapping the reads in atomic.LoadUint32 avoids panics. It doesn't avoid other issues that may result from users adding stats that are being updated, and that is fine. The comment about a being done should still stay.

I think this is in line with best practices everywhere in go code:

As a general statement that is true. But a more specific "best practice" is "When using atomic, all reads and writes need to use atomic". We have run into lots of race issues because this simple rule hasn't been followed. So, we either need to always use atomic operations or not use them at all.

Contributor Author

makes sense. actually the way the memory model works is you shouldn't mix atomic access with non-atomic ones, even when you know you're not accessing concurrently. Go's memory model provides no guarantees if you don't use atomics and the compiler/runtime actually take advantage of this to do memory optimizations behind the scenes.
it's so easy to forget this. funny, because i recently pointed it out somewhere else as well.

i'll fix
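A minimal, illustrative reproduction of that rule (not code from this PR): once a counter is updated via sync/atomic, concurrent reads of it must also go through sync/atomic.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var hits uint32
	go func() {
		for i := 0; i < 1000; i++ {
			atomic.AddUint32(&hits, 1) // the writer uses atomics...
		}
	}()
	time.Sleep(time.Millisecond)
	// fmt.Println(hits)                   // ...so this plain read races with the writer (go run -race can flag it)
	fmt.Println(atomic.LoadUint32(&hits)) // the read must use atomics too
}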

api/cluster.go Outdated
if err != nil {
// the only errors returned are from us catching panics, so we should treat them
// all as internalServerErrors
log.Errorf("HTTP getData() %s", err.Error())
response.Write(ctx, response.WrapError(err))
return
}
response.Write(ctx, response.NewMsgp(200, &models.GetDataResp{Series: series}))
ss.Trace(opentracing.SpanFromContext(ctx.Req.Context()))
Member

I don't think this is needed here. ss.Trace() is already called at the end of s.getTargetsLocal()

Contributor Author

makes sense. we only need it in executePlan (for individual nodes, both local and cluster peers), and executePlan (for aggregated-across-peers per-response stats)


func (ss StorageStats) MarshalJSONFastRaw(b []byte) ([]byte, error) {
b = append(b, `"executeplan.cache-miss.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheMiss), 10)
Member

needs atomic.LoadUint32()

b = append(b, `"executeplan.cache-miss.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheMiss), 10)
b = append(b, `,"executeplan.cache-hit-partial.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheHitPartial), 10)
Member

needs atomic.LoadUint32()

b = append(b, `,"executeplan.cache-hit-partial.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheHitPartial), 10)
b = append(b, `,"executeplan.cache-hit.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheHit), 10)
Member

needs atomic.LoadUint32()

b = append(b, `,"executeplan.cache-hit.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheHit), 10)
b = append(b, `,"executeplan.chunks-from-tank.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromTank), 10)
Member

needs atomic.LoadUint32()

b = append(b, `,"executeplan.chunks-from-tank.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromTank), 10)
b = append(b, `,"executeplan.chunks-from-cache.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromCache), 10)
Member

needs atomic.LoadUint32()

b = append(b, `,"executeplan.chunks-from-cache.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromCache), 10)
b = append(b, `,"executeplan.chunks-from-store.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromStore), 10)
Member

needs atomic.LoadUint32()
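One possible shape of the marshaling with the suggested atomic reads (a sketch, not necessarily the final code; it assumes the strconv and sync/atomic imports, the StorageStats type quoted earlier, and a pointer receiver so the struct itself isn't copied via plain reads):

func (ss *StorageStats) MarshalJSONFastRaw(b []byte) ([]byte, error) {
	b = append(b, `"executeplan.cache-miss.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.CacheMiss)), 10)
	b = append(b, `,"executeplan.cache-hit-partial.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.CacheHitPartial)), 10)
	b = append(b, `,"executeplan.cache-hit.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.CacheHit)), 10)
	// ... the chunks-from-tank/cache/store counters follow the same pattern
	return b, nil
}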

}

func (ss StorageStats) Trace(span opentracing.Span) {
span.SetTag("cache-miss", ss.CacheMiss)
Member

needs atomic.LoadUint32()


func (ss StorageStats) Trace(span opentracing.Span) {
span.SetTag("cache-miss", ss.CacheMiss)
span.SetTag("cache-hit-partial", ss.CacheHitPartial)
Member

needs atomic.LoadUint32()

func (ss StorageStats) Trace(span opentracing.Span) {
span.SetTag("cache-miss", ss.CacheMiss)
span.SetTag("cache-hit-partial", ss.CacheHitPartial)
span.SetTag("cache-hit", ss.CacheHit)
Member

needs atomic.LoadUint32()

span.SetTag("cache-miss", ss.CacheMiss)
span.SetTag("cache-hit-partial", ss.CacheHitPartial)
span.SetTag("cache-hit", ss.CacheHit)
span.SetTag("chunks-from-tank", ss.ChunksFromTank)
Member

needs atomic.LoadUint32()

span.SetTag("cache-hit-partial", ss.CacheHitPartial)
span.SetTag("cache-hit", ss.CacheHit)
span.SetTag("chunks-from-tank", ss.ChunksFromTank)
span.SetTag("chunks-from-cache", ss.ChunksFromCache)
Member

needs atomic.LoadUint32()

span.SetTag("cache-hit", ss.CacheHit)
span.SetTag("chunks-from-tank", ss.ChunksFromTank)
span.SetTag("chunks-from-cache", ss.ChunksFromCache)
span.SetTag("chunks-from-store", ss.ChunksFromStore)
Member

needs atomic.LoadUint32()
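And the same treatment for the span tags, again as a sketch (pointer receiver assumed, plus the opentracing and sync/atomic imports):

func (ss *StorageStats) Trace(span opentracing.Span) {
	span.SetTag("cache-miss", atomic.LoadUint32(&ss.CacheMiss))
	span.SetTag("cache-hit-partial", atomic.LoadUint32(&ss.CacheHitPartial))
	span.SetTag("cache-hit", atomic.LoadUint32(&ss.CacheHit))
	span.SetTag("chunks-from-tank", atomic.LoadUint32(&ss.ChunksFromTank))
	span.SetTag("chunks-from-cache", atomic.LoadUint32(&ss.ChunksFromCache))
	span.SetTag("chunks-from-store", atomic.LoadUint32(&ss.ChunksFromStore))
}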

@@ -23,12 +23,18 @@ func (rwm ResponseWithMeta) MarshalJSONFast(b []byte) ([]byte, error) {
// RenderMeta holds metadata about a render request/response
type RenderMeta struct {
Stats stats
StorageStats
Member

This just seems hacky. Let's clean it up and use Stats and StorageStats consistently. eg

type RenderMeta struct {
	RenderStats
	StorageStats
}

@Dieterbe
Contributor Author

Dieterbe commented Jul 9, 2019

addressed all feedback. PTAL

@Dieterbe Dieterbe merged commit da7e04a into master Jul 10, 2019
@Dieterbe Dieterbe deleted the jaeger-cleanup branch July 10, 2019 15:03