This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

GC resulting in missed stats #1207

Closed
Dieterbe opened this issue Jan 15, 2019 · 16 comments · Fixed by #1494

Comments

@Dieterbe
Contributor

Dieterbe commented Jan 15, 2019

[screenshots: missed-stats, gc]

@Dieterbe
Contributor Author

Dieterbe commented Feb 11, 2019

This is also very noticeable during startup/backfill. We now have pods that don't send stats for several minutes.

awoods [9:46 AM]
So while this aligns with GC, I don't think it is GC directly that is the problem, but rather the increased CPU utilization.
I am seeing that memory.total_gc_cycles.counter64 is incrementing, then the following datapoint is missing for all series from the instance.
https://ops.grafana.net/d/65nw7zumk/mt-dropped-metrics?panelId=2&fullscreen&orgId=1

dieter [9:51 AM]
the graphite reporter is sensitive because if the routine is not ready to receive the tick, it is dropped.
why might it not be ready to receive the tick? when the CPU is saturated (because of GC)
note that GC should never use more than a certain percentage of CPU
if you plot metrictank.stats.*.mt-read*-7697599ccf-9gr7d.memory.gc.cpu_fraction.gauge32 you can see the per-mille amount of CPU used by GC, but of course it's likely too coarse-grained at 10s resolution, and that data is dropped as well
the other thing to keep in mind is that during GC, goroutines that allocate a lot have to execute GC work themselves (mark assist and sweep assist IIRC); this is why routines serving HTTP requests etc. get slow during GC. I'm not sure if this applies to the Graphite.reporter() routine as it only allocates some temporary buffers to hold the carbon protocol data

but if we can reproduce this locally, then we can look at a trace to see if that routine has to assist

dieter [10:26 AM]
well I think a trace is going to be useful regardless, to see exactly what all cores are executing at the time the reporter routine is supposed to be ready to receive the tick
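
(For illustration: in the Go standard library, a time.Ticker's channel buffers at most one pending tick, so any tick that fires while the consuming goroutine is busy or descheduled is silently dropped. The sketch below is a minimal standalone demonstration of that behaviour, not metrictank's actual reporter code.)

    package main

    import (
        "fmt"
        "time"
    )

    // Demonstration only: time.Ticker's channel holds at most one pending tick,
    // so any tick that fires while the consumer is busy is silently dropped --
    // the same failure mode as a reporter goroutine stalled by CPU saturation.
    func main() {
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()

        for i := 0; i < 3; i++ {
            t := <-ticker.C
            fmt.Println("received tick at", t.Format("15:04:05.000"))
            // simulate a consumer that is stalled for several tick intervals
            time.Sleep(350 * time.Millisecond)
        }
    }

Only one of the ticks that fire during each sleep is delivered; the rest are lost, which corresponds to the missing datapoints on the dashboard.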

@Dieterbe
Contributor Author

this could be our problem:
golang/go#19812
ReadMemStats blocks during GC.
apparently prometheus/client_golang#568 has a workaround that could serve as inspiration for us
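
(For reference, a rough sketch of that style of workaround: run runtime.ReadMemStats in a helper goroutine and fall back to the previously collected snapshot if it does not return within a deadline, e.g. because a GC cycle is blocking it. The names collectMemStats, cached, and maxWait are illustrative only; this is neither metrictank nor client_golang code.)

    package stats // illustrative placement only

    import (
        "runtime"
        "time"
    )

    // cached holds the last successfully collected snapshot; it is only
    // written from the reporting goroutine, so no locking is shown here.
    var cached runtime.MemStats

    // collectMemStats returns fresh MemStats if ReadMemStats completes within
    // maxWait, otherwise the cached (possibly stale) snapshot, so the reporter
    // never misses its flush tick waiting on a blocked ReadMemStats call.
    func collectMemStats(maxWait time.Duration) runtime.MemStats {
        done := make(chan runtime.MemStats, 1)
        go func() {
            var ms runtime.MemStats
            runtime.ReadMemStats(&ms) // may block while a GC cycle is in progress
            done <- ms                // buffered, so the goroutine can always exit
        }()
        select {
        case ms := <-done:
            cached = ms
            return ms
        case <-time.After(maxWait):
            return cached
        }
    }

The tradeoff is that after a timeout the reporter flushes slightly stale memory stats instead of blocking and dropping the tick entirely.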

@woodsaj
Member

woodsaj commented May 16, 2019

Wouldn't it make sense to just have MemoryReporter and ProcessReporter add metrics to the registry like everything else?

i.e.,

func NewMemoryReporter(interval int) *MemoryReporter {
    m := &MemoryReporter{
        interval:        interval,
        totalAllocated:  NewCounter64("memory.total_bytes_allocated"),
        allocatedInHeap: NewGauge32("memory.bytes.allocated_in_heap"),
        // ...
    }
    go m.run()
    return m
}

func (m *MemoryReporter) run() {
    ticker := time.NewTicker(time.Duration(m.interval) * time.Second)
    for range ticker.C {
        runtime.ReadMemStats(&m.mem)
        gcPercent := getGcPercent()

        m.totalAllocated.SetUint64(m.mem.TotalAlloc)
        m.allocatedInHeap.Set(m.mem.Alloc)
        // ... update the remaining gauges, including one from gcPercent ...
    }
}

I think it is generally a bad idea to try and "collect" metrics during the flush process. Collecting and flushing should be completely separate processes.

@Dieterbe
Contributor Author

I think it is generally a bad idea to try and "collect" metrics during the flush process. Collecting and flushing should be completely separate processes

I think it's a tradeoff. For collect operations that we know to be brief, we can do them "in sync" with the flush, so we always flush the latest stats.
If we can't guarantee they will be brief, then we have to do it out of band like your proposal, but that can result in flushing stats that are somewhat out of date.

@robert-milan robert-milan mentioned this issue Jul 4, 2019
@fkaleo
Contributor

fkaleo commented Sep 16, 2019

plan of action:

if none of that works, further investigation will be required

@fkaleo fkaleo self-assigned this Sep 26, 2019
@fkaleo
Contributor

fkaleo commented Oct 2, 2019

The upstream fix for golang/go#19812 was merged into the Go tree just a few days after the 1.13 release.
We will first upgrade all instances to Go 1.13 before updating them to Go 1.13 plus the few commits between 1.13 and the upstream fix.

@woodsaj
Member

woodsaj commented Oct 4, 2019

This means the fix is not available until Go 1.14 is released in Feb 2020 (we can't/shouldn't use a non-release version of Go).

@fkaleo
Contributor

fkaleo commented Oct 4, 2019

This means the fix is not available until Go 1.14 is released in Feb 2020 (we can't/shouldn't use a non-release version of Go).

Wouldn't it be ok to backport the 1 patch we need to 1.13?

@Dieterbe Dieterbe added this to the sprint-2 milestone Oct 7, 2019
@fkaleo
Contributor

fkaleo commented Oct 14, 2019

This means the fix is not available until Go 1.14 is released in Feb 2020 (we can't/shouldn't use a non-release version of Go).

Wouldn't it be ok to backport the 1 patch we need to 1.13?

The answer from awoods is no: part of the reason is that we are not equipped to support a patched version of Go, especially with regard to users who don't want to run a patched version of Go.

@Dieterbe
Contributor Author

Dieterbe commented Nov 22, 2019

It looks like this is still an issue.
Here is a screenshot of the dashboard for a pod booting with version metrictank:v0.13.0-186-g9e65a54 (current master):
[screenshot: stats-gaps]
Note that, again, the issue seems to disappear after backfill:

[screenshot: stats2]

@fkaleo fkaleo modified the milestones: sprint-2, sprint-5 Nov 23, 2019
@fkaleo
Contributor

fkaleo commented Jan 6, 2020

Try to reproduce locally with a docker-compose stack, then try to get a Go trace.
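
(One generic way to capture such a trace from a local run is the standard library's runtime/trace package; the snippet below is a standalone sketch with an arbitrary output file name and trace window, not metrictank-specific tooling. The resulting file can be inspected with go tool trace.)

    package main

    import (
        "log"
        "os"
        "runtime/trace"
        "time"
    )

    func main() {
        // write the execution trace to a file for later inspection
        f, err := os.Create("trace.out")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if err := trace.Start(f); err != nil {
            log.Fatal(err)
        }
        defer trace.Stop()

        // ... run or exercise the workload that reproduces the missed ticks ...
        time.Sleep(10 * time.Second)

        // afterwards, inspect scheduler and GC activity with:
        //   go tool trace trace.out
    }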

@fkaleo fkaleo modified the milestones: sprint-5, sprint-6 Jan 6, 2020
@Dieterbe
Contributor Author

FWIW a fix for golang/go#19812 has been merged in go.

@Dieterbe
Contributor Author

update: they reverted that fix :o

@replay replay modified the milestones: sprint-6, sprint-7 Jan 27, 2020
@replay replay removed this from the sprint-7 milestone Feb 24, 2020
@Dieterbe
Contributor Author

The Go team is working on another fix now.
Maybe this will be fixed by a new Go version; giving this some more time.
Also, it doesn't seem as urgent compared to other items.

@Dieterbe
Contributor Author

The new fix seems to be in Go tip/master:
golang/go#19812 (comment)

@stale

stale bot commented Jun 17, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 17, 2020
@stale stale bot closed this as completed Jun 24, 2020