Compactor: Does not exit on error #3966

Closed

wiardvanrij opened this issue Mar 24, 2021 · 20 comments

Comments

@wiardvanrij
Member

Thanos, Prometheus and Golang version used:
17.2

Object Storage Provider: s3

What happened: The compactor got an error but did not get killed and did not continue

What you expected to happen: The compactor either exits so it can be restarted, or continues regardless

How to reproduce it (as minimally and precisely as possible): n/a

Full logs to relevant components:

Logs

level=info ts=2021-03-18T13:35:24.13386452Z caller=clean.go:33 msg="started cleaning of aborted partial uploads"
level=info ts=2021-03-18T13:35:24.133906785Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
level=info ts=2021-03-18T13:35:24.13391986Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
level=info ts=2021-03-18T13:35:24.133930049Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
level=info ts=2021-03-18T13:35:29.026173574Z caller=fetcher.go:458 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=3.499913081s cached=5974 returned=5974 partial=0
level=error ts=2021-03-18T13:36:28.794804266Z caller=runutil.go:99 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: Access Denied"
level=error ts=2021-03-18T13:37:28.70842268Z caller=runutil.go:99 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: Access Denied"
level=warn ts=2021-03-18T13:38:24.507549166Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="syncing metas: BaseFetcher: iter bucket: Access Denied"
level=info ts=2021-03-18T13:38:24.507583914Z caller=http.go:65 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: BaseFetcher: iter bucket: Access Denied"
level=info ts=2021-03-18T13:38:25.007714323Z caller=http.go:84 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: BaseFetcher: iter bucket: Access Denied"
level=info ts=2021-03-18T13:38:25.007758137Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="syncing metas: BaseFetcher: iter bucket: Access Denied"

@GiedriusS
Member

Could you please get the goroutine dump via the debug/pprof endpoint when this happens?

@wiardvanrij
Member Author

Could you please get the goroutine dump via the debug/pprof endpoint when this happens?

That's not possible as this endpoint is shut down :(

@bwplotka
Member

Looks like a bug, but I would suggest upgrading Thanos so we are sure we know what's happening.

To repro, it looks like one could try running the compactor with AccessDenied (:

@bwplotka bwplotka added the bug label Mar 24, 2021
@wiardvanrij
Member Author

We had a KIAM outage, so I think the way to reproduce it is to allow access -> remove access.

@GiedriusS
Member

Another thing is that we should only turn off the HTTP server at the end of everything to permit debugging via pprof in cases such as this.
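
For illustration, a minimal Go sketch of that ordering idea: keep a debug HTTP server (which serves /debug/pprof) running until all other shutdown work has finished, and tear it down last. This is not the actual Thanos wiring; the port and the runMainWork placeholder are assumptions.

```go
package main

import (
	"context"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"time"
)

// runMainWork is a hypothetical placeholder for the component's real work and
// its shutdown path (compaction loop, meta syncing, cleanup, ...).
func runMainWork() {
	time.Sleep(2 * time.Second)
}

func main() {
	// Start the debug HTTP server first and keep it up for the whole lifetime
	// of the process, so /debug/pprof/goroutine stays reachable even if the
	// shutdown of everything else hangs.
	srv := &http.Server{Addr: ":10902"} // nil Handler -> DefaultServeMux, where pprof registered itself
	go func() { _ = srv.ListenAndServe() }()

	runMainWork()

	// Only now, as the very last step, shut the HTTP server down.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```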

@dswarbrick

dswarbrick commented Apr 9, 2021

Could you please get the goroutine dump via the debug/pprof endpoint when this happens?

That's not possible as this endpoint is shut down :(

Sounds like the same behaviour I observed in #3868.

@stale

stale bot commented Jun 9, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 9, 2021
@stale

stale bot commented Jun 23, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jun 23, 2021
@wiardvanrij
Member Author

Still relevant

@GiedriusS
Member

GiedriusS commented Jul 19, 2021

I have observed this a few times and AFAICT this happens because there aren't many cancellation points in the Prometheus compaction code or our downsampling code. So, two improvements can be made here:

  • Add more cancellation points to the code, i.e. places where we check whether the context is still valid (see the sketch below)
  • Close the HTTP server only at the end of the whole shutdown operation

Help welcome.
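
To illustrate the first point: a cancellation point is just an explicit context check inside a long-running loop, so the loop can stop as soon as shutdown is requested. A minimal Go sketch with hypothetical names (not the actual Prometheus/Thanos compaction code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// compactGroups is a hypothetical stand-in for a long-running compaction or
// downsampling loop. The select at the top of each iteration is the
// "cancellation point": once the context is cancelled (e.g. during shutdown),
// the loop returns promptly instead of keeping the process busy.
func compactGroups(ctx context.Context, groups []string) error {
	for _, g := range groups {
		select {
		case <-ctx.Done():
			return ctx.Err() // propagate the cancellation upwards
		default:
		}
		// ... one unit of potentially slow work for this group ...
		fmt.Println("compacting", g)
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	if err := compactGroups(ctx, []string{"a", "b", "c", "d"}); err != nil {
		fmt.Println("stopped early:", err)
	}
}
```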

@rudo-thomas

We've hit this with v0.21.1 running on Kubernetes against a locally-hosted S3 (Ceph with radosgw):

level=info ts=2021-07-26T08:49:29.403830908Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
level=info ts=2021-07-26T08:49:29.403850994Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
level=info ts=2021-07-26T08:49:29.403872117Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
level=info ts=2021-07-26T08:50:29.12294803Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=332.354878ms cached=924 returned=924 partial=0
level=error ts=2021-07-26T08:56:28.79212747Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=warn ts=2021-07-26T08:58:19.271976663Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=info ts=2021-07-26T08:58:19.272120069Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=error ts=2021-07-26T08:58:19.273050359Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context canceled"
level=info ts=2021-07-26T08:58:19.273219033Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=info ts=2021-07-26T08:58:19.273261935Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"

After this the process was not doing anything anymore (as described above) and did not respond to SIGTERM either.

@ianwoolf
Contributor

ianwoolf commented Oct 10, 2021

We've hit this with v0.21.1 running on Kubernetes against a locally-hosted S3 (Ceph with radosgw): [...]

After this the process was not doing anything anymore (as described above) and did not respond to SIGTERM either.

By default, the compactor does not crash on halt errors; there is a hidden flag to change that.

https://thanos.io/tip/components/compact.md/#halting

I also hit the same issue and am still investigating.
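
For reference, the hidden flag referred to above appears to be debug.halt-on-error (boolean, default true), per the linked Halting docs; disabling it should make the compactor exit on critical (halt) errors instead of staying up idle. A sketch of such an invocation, with placeholder paths, to be double-checked against your Thanos version:

```
# Illustrative only; paths are placeholders for your own setup.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --no-debug.halt-on-error
```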

@stale

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jan 9, 2022
@GiedriusS GiedriusS removed the stale label Jan 10, 2022
@stale

stale bot commented Apr 17, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 17, 2022
@stale

stale bot commented May 1, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed May 1, 2022
@Venture200

@rudo-thomas, @ianwoolf Hi guys, have you found any solution for that issue (meta.json)?

@rudo-thomas

@Venture200 we added a Kubernetes liveness probe.

@Venture200

@rudo-thomas Thanks for replying. So when the compactor crashes, it gets restarted again?

@rudo-thomas

@Venture200 there is no crash to speak of, see the comments above.

Compactor stops responding to liveness probes and gets restarted by Kubernetes.
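
For anyone replicating that workaround: a liveness probe against the compactor's HTTP port works here because the internal server is shut down once the component turns unhealthy, so the probe starts failing and Kubernetes restarts the pod. A rough sketch of the relevant container spec fragment; the port and thresholds are assumptions to adapt to your deployment:

```yaml
# Illustrative excerpt from the compactor container spec.
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 10902        # Thanos HTTP port (default)
  periodSeconds: 30
  failureThreshold: 4  # roughly two minutes of failed probes before a restart
```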

@Venture200

Venture200 commented Nov 28, 2022

level=error ts=2022-11-27T11:08:57.563444234Z caller=runutil.go:100 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context canceled"
level=error ts=2022-11-27T11:08:57.563603512Z caller=compact.go:499 msg="retriable error" err="compaction: sync: BaseFetcher: iter bucket: context canceled"
level=warn ts=2022-11-27T11:08:57.565572915Z caller=intrumentation.go:67 msg="changing probe status" status=not-ready reason="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get "https://s3./thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json": net/http: timeout awaiting response headers"
level=info ts=2022-11-27T11:08:57.565650049Z caller=http.go:84 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get "https://s3./thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json": net/http: timeout awaiting response headers"
level=info ts=2022-11-27T11:08:57.565803191Z caller=http.go:103 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get "https://s3./thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json": net/http: timeout awaiting response headers"

In my case the compactor kept running for more than 24 hours, but after this error appeared the process was killed. So I believe the restart happens after this error in your case as well? @rudo-thomas
