This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

GC resulting in missed stats #1207

Closed
Dieterbe opened this issue Jan 15, 2019 · 16 comments · Fixed by #1494

Comments

@Dieterbe
Contributor

Dieterbe commented Jan 15, 2019

[screenshots: missed-stats, gc]

@Dieterbe
Contributor Author

Dieterbe commented Feb 11, 2019

This is also very noticeable during startup/backfill. We now have pods that don't send stats for several minutes.

awoods [9:46 AM]
So while this aligns with GC, I don't think it is GC directly that is the problem, but rather the increased CPU utilization.
I am seeing that memory.total_gc_cycles.counter64 is incrementing, then the following datapoint is missing for all series from the instance.
https://ops.grafana.net/d/65nw7zumk/mt-dropped-metrics?panelId=2&fullscreen&orgId=1

dieter [9:51 AM]
the graphite reporter is sensitive because if the routine is not ready to receive the tick, it is dropped.
why might it not be ready to receive the tick? when the CPU is saturated (because of GC)
note that GC should never use more than a certain percentage of CPU
if you plot metrictank.stats.*.mt-read*-7697599ccf-9gr7d.memory.gc.cpu_fraction.gauge32 you can see the per-mille amount of CPU used by GC, but of course it's likely too coarse-grained at 10s resolution, and that data is dropped as well
the other thing to keep in mind is that during GC, goroutines that allocate a lot have to execute GC work themselves (mark assist and sweep assist IIRC); this is why routines serving HTTP requests etc. get slow during GC. I'm not sure if this applies to the Graphite.reporter() routine as it only allocates some temporary buffers to hold the carbon protocol data

but if we can reproduce this locally, then we can look at a trace to see if that routine has to assist

dieter [10:26 AM]
well I think a trace is going to be useful regardless, to see exactly what all cores are executing at the time the reporter routine is supposed to be ready to receive the tick
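
(For illustration: in the Go standard library, a time.Ticker's channel buffers at most one pending tick, so any tick that fires while the consuming goroutine is busy or descheduled is silently dropped. The sketch below is a minimal standalone demonstration of that behaviour, not metrictank's actual reporter code.)

    package main

    import (
        "fmt"
        "time"
    )

    // Demonstration only: time.Ticker's channel holds at most one pending tick,
    // so any tick that fires while the consumer is busy is silently dropped --
    // the same failure mode as a reporter goroutine stalled by CPU saturation.
    func main() {
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()

        for i := 0; i < 3; i++ {
            t := <-ticker.C
            fmt.Println("received tick at", t.Format("15:04:05.000"))
            // simulate a consumer that is stalled for several tick intervals
            time.Sleep(350 * time.Millisecond)
        }
    }

Only one of the ticks that fire during each sleep is delivered; the rest are lost, which corresponds to the missing datapoints on the dashboard.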

@Dieterbe
Contributor Author

this could be our problem:
golang/go#19812
ReadMemStats blocks during GC.
apparently prometheus/client_golang#568 has a workaround that could serve as inspiration for us
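
(For reference, a rough sketch of that style of workaround: run runtime.ReadMemStats in a helper goroutine and fall back to the previously collected snapshot if it does not return within a deadline, e.g. because a GC cycle is blocking it. The names collectMemStats, cached, and maxWait are illustrative only; this is neither metrictank nor client_golang code.)

    package stats // illustrative placement only

    import (
        "runtime"
        "time"
    )

    // cached holds the last successfully collected snapshot; it is only
    // written from the reporting goroutine, so no locking is shown here.
    var cached runtime.MemStats

    // collectMemStats returns fresh MemStats if ReadMemStats completes within
    // maxWait, otherwise the cached (possibly stale) snapshot, so the reporter
    // never misses its flush tick waiting on a blocked ReadMemStats call.
    func collectMemStats(maxWait time.Duration) runtime.MemStats {
        done := make(chan runtime.MemStats, 1)
        go func() {
            var ms runtime.MemStats
            runtime.ReadMemStats(&ms) // may block while a GC cycle is in progress
            done <- ms                // buffered, so the goroutine can always exit
        }()
        select {
        case ms := <-done:
            cached = ms
            return ms
        case <-time.After(maxWait):
            return cached
        }
    }

The tradeoff is that after a timeout the reporter flushes slightly stale memory stats instead of blocking and dropping the tick entirely.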

@woodsaj
Member

woodsaj commented May 16, 2019

Wouldn't it make sense to just have MemoryReporter and ProcessReporter add metrics to the registry like everything else?

i.e.,

func NewMemoryReporter(interval int) *MemoryReporter {
    m := &MemoryReporter{
        interval:        interval,
        totalAllocated:  NewCounter64("memory.total_bytes_allocated"),
        allocatedInHeap: NewGauge32("memory.bytes.allocated_in_heap"),
        // ...
    }
    go m.run()
    return m
}

func (m *MemoryReporter) run() {
    ticker := time.NewTicker(time.Duration(m.interval) * time.Second)
    for range ticker.C {
        runtime.ReadMemStats(&m.mem)
        gcPercent := getGcPercent()

        m.totalAllocated.SetUint64(m.mem.TotalAlloc)
        m.allocatedInHeap.Set(m.mem.Alloc)
        // ... update the remaining gauges, including one from gcPercent ...
    }
}

I think it is generally a bad idea to try and "collect" metrics during the flush process. Collecting and flushing should be completely separate processes.

@Dieterbe
Contributor Author

I think it is generally a bad idea to try and "collect" metrics during the flush process. Collecting and flushing should be completely separate processes

I think it's a tradeoff. For collect operations that we know to be brief, we can do them "in sync" with the flush, so we always flush the latest stats.
If we can't guarantee they will be brief, then we have to do it out of band like your proposal, but that can result in flushing stats that are somewhat out of date.

@robert-milan robert-milan mentioned this issue Jul 4, 2019
@fkaleo
Contributor

fkaleo commented Sep 16, 2019

plan of action:

if none of that works, further investigation will be required

@fkaleo fkaleo self-assigned this Sep 26, 2019
@fkaleo
Contributor

fkaleo commented Oct 2, 2019

The upstream fix for golang/go#19812 was merged into the Go tree just a few days after the 1.13 release.
We will first upgrade all instances to Go 1.13 before updating them to Go 1.13 plus the few commits between 1.13 and the upstream fix.

@woodsaj
Member

woodsaj commented Oct 4, 2019

This means the fix is not available until Go 1.14 is released in Feb 2020 (we can't/shouldn't use a non-release version of Go).

@fkaleo
Contributor

fkaleo commented Oct 4, 2019

This means the fix is not available until Go 1.14 is released in Feb 2020 (we can't/shouldn't use a non-release version of Go).

Wouldn't it be ok to backport the 1 patch we need to 1.13?

@Dieterbe Dieterbe added this to the sprint-2 milestone Oct 7, 2019
@fkaleo
Contributor

fkaleo commented Oct 14, 2019

This means the fix is not available until Go 1.14 is released in Feb 2020 (we can't/shouldn't use a non-release version of Go).

Wouldn't it be ok to backport the 1 patch we need to 1.13?

The answer from awoods is no: part of the reason is that we are not equipped to support a patched version of Go, especially with regard to users who don't want to run a patched version of Go.

@Dieterbe
Contributor Author

Dieterbe commented Nov 22, 2019

It looks like this is still an issue.
Here is a screenshot of the dashboard for a pod booting with version metrictank:v0.13.0-186-g9e65a54 (current master):
[screenshot: stats-gaps]
Note that, again, the issue seems to disappear after backfill:

[screenshot: stats2]

@fkaleo fkaleo modified the milestones: sprint-2, sprint-5 Nov 23, 2019
@fkaleo
Contributor

fkaleo commented Jan 6, 2020

Try to reproduce locally with a docker-compose stack, then try to get a Go trace.
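
(One generic way to capture such a trace from a local run is the standard library's runtime/trace package; the snippet below is a standalone sketch with an arbitrary output file name and trace window, not metrictank-specific tooling. The resulting file can be inspected with go tool trace.)

    package main

    import (
        "log"
        "os"
        "runtime/trace"
        "time"
    )

    func main() {
        // write the execution trace to a file for later inspection
        f, err := os.Create("trace.out")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if err := trace.Start(f); err != nil {
            log.Fatal(err)
        }
        defer trace.Stop()

        // ... run or exercise the workload that reproduces the missed ticks ...
        time.Sleep(10 * time.Second)

        // afterwards, inspect scheduler and GC activity with:
        //   go tool trace trace.out
    }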

@fkaleo fkaleo modified the milestones: sprint-5, sprint-6 Jan 6, 2020
@Dieterbe
Contributor Author

FWIW a fix for golang/go#19812 has been merged in go.

@Dieterbe
Contributor Author

update: they reverted that fix :o

@replay replay modified the milestones: sprint-6, sprint-7 Jan 27, 2020
@replay replay removed this from the sprint-7 milestone Feb 24, 2020
@Dieterbe
Contributor Author

The Go team is working on another fix now.
Maybe this will be fixed by a new Go version; giving this some more time.
Also, it doesn't seem as urgent compared to other items.

@Dieterbe
Contributor Author

The new fix seems to be in Go tip/master:
golang/go#19812 (comment)

@stale

stale bot commented Jun 17, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 17, 2020
@stale stale bot closed this as completed Jun 24, 2020