
Jaeger cleanup: much fewer spans, but with more stats - and more stats for meta section #1380

Merged
merged 9 commits into master from jaeger-cleanup
Jul 10, 2019

Conversation

Dieterbe
Contributor

@Dieterbe Dieterbe commented Jul 4, 2019

Goals:

  • honor our internal jaeger guidelines, specifically no spans on a per-series basis, as this was overloading our jaeger cluster. for now this means taking away many spans about the per-series getTargetsLocal, persistent store, cache and tank (aggmetrics) gets, etc. this means we lose a lot of fidelity. we can reintroduce these later if needed, once we come up with a better way, such as multi-series (batched) gets from tank/cache/store. but i'm not too concerned about this because of the point below.
  • another way i want to provide the most useful insights without having spans for all these per-series operations is by collecting stats (measured per-series at this point, but summed up per-request) and attaching these to the jaeger trace. stats such as the number of cache hits/misses/partial hits, and the number of chunks loaded (and perhaps also time spent loading) from tank/cache/store. this should all be a low-cost way to provide very useful insights in a digestible way. i also see this as a way to extend the metadata stats returned with query responses as introduced in render response metadata: stats #1344. a rough sketch of the idea follows below.
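A rough, illustrative sketch of that idea (names here are placeholders, not the actual metrictank API): per-series code paths bump counters on one request-scoped struct, and only the summed values are attached to the request's span as tags, instead of emitting a child span per series.

package stats

import (
	"sync/atomic"

	opentracing "github.com/opentracing/opentracing-go"
)

// requestStats is a hypothetical request-scoped accumulator; the PR's real
// type is StorageStats and carries more counters.
type requestStats struct {
	cacheHit        uint32
	chunksFromCache uint32
}

// recordCacheHit is called from per-series code paths, possibly concurrently.
func (r *requestStats) recordCacheHit(chunks uint32) {
	atomic.AddUint32(&r.cacheHit, 1)
	atomic.AddUint32(&r.chunksFromCache, chunks)
}

// tagSpan is called once per request: a handful of tags instead of one span per series.
func (r *requestStats) tagSpan(span opentracing.Span) {
	span.SetTag("cache-hit", atomic.LoadUint32(&r.cacheHit))
	span.SetTag("chunks-from-cache", atomic.LoadUint32(&r.chunksFromCache))
}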

too soon for in-depth review, but curious to hear anyone's thoughts?

@woodsaj
Member

woodsaj commented Jul 4, 2019

I like the idea of simplifying the traces.

On the implementation, instead of having the numerous getSeries* calls pass back RenderStats (which you will need to aggregate at each step, though the current code is missing this), I would just pass a single *RenderStats through the call pipeline. Each func can then just increment the counters as the data is processed. eg

type RenderStats struct {
	cacheMiss       uint32
	cacheHitPartial uint32
	cacheHit        uint32
	chunksFromTank  uint32
	chunksFromCache uint32
	chunksFromStore uint32
}
func (r *RenderStats) CacheMiss(i int) {
	atomic.AddUint32(&r.cacheMiss, uint32(i))
}
func (r *RenderStats) CacheHit(i int) {
	atomic.AddUint32(&r.cacheHit, uint32(i))
}
...
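A hypothetical call site for that pattern, assuming the RenderStats sketch above (lookupCache and the per-chunk helper are illustrative placeholders, not code from this PR): every fetch path increments the shared instance directly, so nothing needs to be merged on the way back up the call tree.

func getSeriesCachedStore(key string, stats *RenderStats) error {
	if cachedChunks, ok := lookupCache(key); ok { // lookupCache: illustrative only
		stats.CacheHit(1)
		stats.ChunksFromCache(len(cachedChunks)) // assuming a helper per counter, as the "..." above suggests
		return nil
	}
	stats.CacheMiss(1)
	// fall back to the store, bump ChunksFromStore, etc.
	return nil
}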

* don't trace on a per-series level, we don't want that many spans
* consistent error logging via logger and tracer, in every
  callsite of getSeriesCachedStore. meaning we can remove the logs
  and tracing stuff from getSeriesCachedStore and the store.Search
  functions
along these call trees:

executePlan <- records to span + reports stats into api response
  getTargets
    getTargetsRemote
    getTargetsLocal
      getTarget
        getSeriesFixed
          getSeries
            getSeriesCachedStore

prometheus.querier.Select <-- we ignore the stats here
  getTargets
    ....

Server.getData <- records to span + reports stats into rpc response
  getTargetsLocal
    getTarget
      getSeriesFixed
        getSeries
          getSeriesCachedStore

Note that:
* GetDataRespV0 and GetDataRespV1 responses can be used interchangeably in clusters
* we blend the different sources of stats into a unified presentation
  for the user, this required making the json marshaling slightly more
  complicated.
* in jaeger, the stats for executePlan are the sum of the (already
  reported) stats of the getData calls. (a rough sketch of this summing follows below)
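A hypothetical sketch of that summing (type and field names are illustrative; only the GetDataResp responses and the executePlan/getData roles come from this PR): the query node folds the stats carried in each peer's getData response into its own request-level total before tagging the span and filling the meta section. The deserialized peer stats are no longer being written to, so plain addition suffices in this sketch.

package example

// storageStats is a stand-in for the PR's StorageStats, with only a subset of counters.
type storageStats struct {
	CacheHit, CacheMiss, ChunksFromStore uint32
}

// getDataResp is a stand-in for the rpc response, with a hypothetical Stats field.
type getDataResp struct {
	Stats storageStats
}

// totalStats combines the local stats with the stats reported by every peer.
func totalStats(local storageStats, peerResps []getDataResp) storageStats {
	total := local
	for _, r := range peerResps {
		total.CacheHit += r.Stats.CacheHit
		total.CacheMiss += r.Stats.CacheMiss
		total.ChunksFromStore += r.Stats.ChunksFromStore
	}
	return total
}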
@Dieterbe
Contributor Author

Dieterbe commented Jul 6, 2019

extended version of #1344. I think @shanson7 will like this. note that the stats are collected across RPC boundaries and summed across all peers

curl (...) --data 'target=some.id.of.a.metric.1*&from=1562412706&until=1562412843&format=json&maxDataPoints=1920&meta=true' --compressed | jsonpp | less

{
    "version": "v0.1",
    "meta": {
        "stats": {
            "executeplan.resolve-series.ms": 1,
            "executeplan.get-targets.ms": 10,
            "executeplan.prepare-series.ms": 0,
            "executeplan.plan-run.ms": 0,
            "executeplan.series-fetch.count": 112,
            "executeplan.points-fetch.count": 15344,
            "executeplan.points-return.count": 15344,
            "executeplan.cache-miss.count": 0,
            "executeplan.cache-hit-partial.count": 0,
            "executeplan.cache-hit.count": 112,
            "executeplan.chunks-from-tank.count": 0,
            "executeplan.chunks-from-cache.count": 336,
            "executeplan.chunks-from-store.count": 0
        }
    },
    "series": [
        {
            "target": "some.id.of.a.metric.1",
            "tags": {
                "name": "some.id.of.a.metric.1"
            },
            "datapoints": [
                [
                    0.8159659229285896,
                    1562412707
                ],
                [
...

@Dieterbe
Contributor Author

Dieterbe commented Jul 6, 2019

jaeger traces are now much plainer, but with more stats:

[screenshots: jaeger, jaeger2]

otherwise we only had the total and the ones for other peers
@Dieterbe Dieterbe changed the title from "[WIP] Jaeger cleanup" to "Jaeger cleanup: much fewer spans, but with more stats - and more stats for meta section" Jul 6, 2019
@Dieterbe Dieterbe requested review from woodsaj and replay July 6, 2019 14:15
@Dieterbe
Contributor Author

Dieterbe commented Jul 6, 2019

cc @tomwilkie

@tomwilkie
Contributor

Looks good!

atomic.AddUint32(&ss.ChunksFromStore, n)
}

// Add adds a to ss. Note that a is presumed to be "done" (read unsafely)
Member

While it would only make sense to try and add a once it is 'done', it would be unwise/unsafe to presume that is always true.
We should just use atomic operations to read a to avoid the possibility of a data race. eg

atomic.AddUint32(&ss.CacheHit, atomic.LoadUint32(&a.CacheHit))

The call to Add() itself won't be atomic (i.e. individual counters might increase between when the first counter from a is read and when the last one is), but that won't have any impact on our use case.
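A possible shape for Add along those lines (a sketch only; the field names are taken from the snippets quoted in this PR, and the package/type boilerplate is added here just to make it self-contained):

package example

import "sync/atomic"

type StorageStats struct {
	CacheMiss       uint32
	CacheHitPartial uint32
	CacheHit        uint32
	ChunksFromTank  uint32
	ChunksFromCache uint32
	ChunksFromStore uint32
}

// Add adds a to ss. a is still presumed to be "done", but every read of it
// goes through atomic.LoadUint32, so a concurrently-updated a can never cause
// a data race. Add as a whole is still not atomic, which is fine for this use case.
func (ss *StorageStats) Add(a *StorageStats) {
	atomic.AddUint32(&ss.CacheMiss, atomic.LoadUint32(&a.CacheMiss))
	atomic.AddUint32(&ss.CacheHitPartial, atomic.LoadUint32(&a.CacheHitPartial))
	atomic.AddUint32(&ss.CacheHit, atomic.LoadUint32(&a.CacheHit))
	atomic.AddUint32(&ss.ChunksFromTank, atomic.LoadUint32(&a.ChunksFromTank))
	atomic.AddUint32(&ss.ChunksFromCache, atomic.LoadUint32(&a.ChunksFromCache))
	atomic.AddUint32(&ss.ChunksFromStore, atomic.LoadUint32(&a.ChunksFromStore))
}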

Contributor Author

@Dieterbe Dieterbe Jul 8, 2019

i have pondered the same thing.
I considered it the programmer's responsibility to make sure a is "done". I agree using atomics for a is safer, but at the same time, it would mask issues should they appear: it doesn't make sense to compute the aggregate stats if we're not done yet generating or collecting the stats. similar to how it also wouldn't make sense to build up our series slice if the inputs haven't been decoded yet.

I think this is in line with best practices everywhere in go code: whenever calling a function it's the programmer's responsibility to make sure there is no unsafe data access. this is true for stdlib and many 3rd party libraries.

note that Add in the current form is also not atomic. I agree there is no need for it.

Member

it would mask issues should they appear

The main issue we need to avoid is panics. Panics are bad and we should avoid them wherever possible. Wrapping the reads in atomic.LoadUint32 avoids panics. It doesn't avoid other issues that may result from users adding stats that are being updated, and that is fine. The comment about a being done should still stay.

I think this is in line with best practices everywhere in go code:

As a general statement that is true. But a more specific "best practice" is "When using atomic, all reads and writes need to use atomic". We have run into lots of race issues because this simple rule hasn't been followed. So, we either need to always use atomic operations or not use them at all.

Contributor Author

makes sense. actually the way the memory model works is you shouldn't mix atomic access with non-atomic ones, even when you know you're not accessing concurrently. Go's memory model provides no guarantees if you don't use atomics and the compiler/runtime actually take advantage of this to do memory optimizations behind the scenes.
it's so easy to forget this. funny, because i recently pointed it out somewhere else as well.

i'll fix
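A minimal, illustrative reproduction of that rule (not code from this PR): once a counter is updated via sync/atomic, concurrent reads of it must also go through sync/atomic.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var hits uint32
	go func() {
		for i := 0; i < 1000; i++ {
			atomic.AddUint32(&hits, 1) // the writer uses atomics...
		}
	}()
	time.Sleep(time.Millisecond)
	// fmt.Println(hits)                   // ...so this plain read races with the writer (go run -race can flag it)
	fmt.Println(atomic.LoadUint32(&hits)) // the read must use atomics too
}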

api/cluster.go Outdated
if err != nil {
// the only errors returned are from us catching panics, so we should treat them
// all as internalServerErrors
log.Errorf("HTTP getData() %s", err.Error())
response.Write(ctx, response.WrapError(err))
return
}
response.Write(ctx, response.NewMsgp(200, &models.GetDataResp{Series: series}))
ss.Trace(opentracing.SpanFromContext(ctx.Req.Context()))
Member

I don't think this is needed here. ss.Trace() is already called at the end of s.getTargetsLocal()

Contributor Author

makes sense. we only need it in executePlan (for individual nodes, both local and cluster peers), and executePlan (for aggregated-across-peers per-response stats)


func (ss StorageStats) MarshalJSONFastRaw(b []byte) ([]byte, error) {
b = append(b, `"executeplan.cache-miss.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheMiss), 10)
Member

needs atomic.LoadUint32()

b = append(b, `"executeplan.cache-miss.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheMiss), 10)
b = append(b, `,"executeplan.cache-hit-partial.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheHitPartial), 10)
Member

needs atomic.LoadUint32()

b = append(b, `,"executeplan.cache-hit-partial.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheHitPartial), 10)
b = append(b, `,"executeplan.cache-hit.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheHit), 10)
Member

needs atomic.LoadUint32()

b = append(b, `,"executeplan.cache-hit.count":`...)
b = strconv.AppendUint(b, uint64(ss.CacheHit), 10)
b = append(b, `,"executeplan.chunks-from-tank.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromTank), 10)
Member

needs atomic.LoadUint32()

b = append(b, `,"executeplan.chunks-from-tank.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromTank), 10)
b = append(b, `,"executeplan.chunks-from-cache.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromCache), 10)
Member

needs atomic.LoadUint32()

b = append(b, `,"executeplan.chunks-from-cache.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromCache), 10)
b = append(b, `,"executeplan.chunks-from-store.count":`...)
b = strconv.AppendUint(b, uint64(ss.ChunksFromStore), 10)
Member

needs atomic.LoadUint32()
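One possible shape of the marshaling with the suggested atomic reads (a sketch, not necessarily the final code; it assumes the strconv and sync/atomic imports, the StorageStats type quoted earlier, and a pointer receiver so the struct itself isn't copied via plain reads):

func (ss *StorageStats) MarshalJSONFastRaw(b []byte) ([]byte, error) {
	b = append(b, `"executeplan.cache-miss.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.CacheMiss)), 10)
	b = append(b, `,"executeplan.cache-hit-partial.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.CacheHitPartial)), 10)
	b = append(b, `,"executeplan.cache-hit.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.CacheHit)), 10)
	// ... the chunks-from-tank/cache/store counters follow the same pattern
	return b, nil
}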

}

func (ss StorageStats) Trace(span opentracing.Span) {
span.SetTag("cache-miss", ss.CacheMiss)
Member

needs atomic.LoadUint32()


func (ss StorageStats) Trace(span opentracing.Span) {
span.SetTag("cache-miss", ss.CacheMiss)
span.SetTag("cache-hit-partial", ss.CacheHitPartial)
Member

needs atomic.LoadUint32()

func (ss StorageStats) Trace(span opentracing.Span) {
span.SetTag("cache-miss", ss.CacheMiss)
span.SetTag("cache-hit-partial", ss.CacheHitPartial)
span.SetTag("cache-hit", ss.CacheHit)
Member

needs atomic.LoadUint32()

span.SetTag("cache-miss", ss.CacheMiss)
span.SetTag("cache-hit-partial", ss.CacheHitPartial)
span.SetTag("cache-hit", ss.CacheHit)
span.SetTag("chunks-from-tank", ss.ChunksFromTank)
Member

needs atomic.LoadUint32()

span.SetTag("cache-hit-partial", ss.CacheHitPartial)
span.SetTag("cache-hit", ss.CacheHit)
span.SetTag("chunks-from-tank", ss.ChunksFromTank)
span.SetTag("chunks-from-cache", ss.ChunksFromCache)
Member

needs atomic.LoadUint32()

span.SetTag("cache-hit", ss.CacheHit)
span.SetTag("chunks-from-tank", ss.ChunksFromTank)
span.SetTag("chunks-from-cache", ss.ChunksFromCache)
span.SetTag("chunks-from-store", ss.ChunksFromStore)
Member

needs atomic.LoadUint32()
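And the same treatment for the span tags, again as a sketch (pointer receiver assumed, plus the opentracing and sync/atomic imports):

func (ss *StorageStats) Trace(span opentracing.Span) {
	span.SetTag("cache-miss", atomic.LoadUint32(&ss.CacheMiss))
	span.SetTag("cache-hit-partial", atomic.LoadUint32(&ss.CacheHitPartial))
	span.SetTag("cache-hit", atomic.LoadUint32(&ss.CacheHit))
	span.SetTag("chunks-from-tank", atomic.LoadUint32(&ss.ChunksFromTank))
	span.SetTag("chunks-from-cache", atomic.LoadUint32(&ss.ChunksFromCache))
	span.SetTag("chunks-from-store", atomic.LoadUint32(&ss.ChunksFromStore))
}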

@@ -23,12 +23,18 @@ func (rwm ResponseWithMeta) MarshalJSONFast(b []byte) ([]byte, error) {
// RenderMeta holds metadata about a render request/response
type RenderMeta struct {
Stats stats
StorageStats
Member

This just seems hacky. Let's clean it up and use Stats and StorageStats consistently. eg

type RenderMeta struct {
	RenderStats
	StorageStats
}

@Dieterbe
Contributor Author

Dieterbe commented Jul 9, 2019

addressed all feedback. PTAL

@Dieterbe Dieterbe merged commit da7e04a into master Jul 10, 2019
@Dieterbe Dieterbe deleted the jaeger-cleanup branch July 10, 2019 15:03