Compactor: Does not exit on error #3966

Closed

wiardvanrij opened this issue Mar 24, 2021 · 20 comments

Comments

@wiardvanrij
Member

Thanos, Prometheus and Golang version used:
17.2

Object Storage Provider: s3

What happened: The compactor got an error but did not get killed and did not continue

What you expected to happen: The compactor either exits so it can be restarted, or continues regardless

How to reproduce it (as minimally and precisely as possible): n/a

Full logs to relevant components:

Logs

level=info ts=2021-03-18T13:35:24.13386452Z caller=clean.go:33 msg="started cleaning of aborted partial uploads"
level=info ts=2021-03-18T13:35:24.133906785Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
level=info ts=2021-03-18T13:35:24.13391986Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
level=info ts=2021-03-18T13:35:24.133930049Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
level=info ts=2021-03-18T13:35:29.026173574Z caller=fetcher.go:458 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=3.499913081s cached=5974 returned=5974 partial=0
level=error ts=2021-03-18T13:36:28.794804266Z caller=runutil.go:99 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: Access Denied"
level=error ts=2021-03-18T13:37:28.70842268Z caller=runutil.go:99 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: Access Denied"
level=warn ts=2021-03-18T13:38:24.507549166Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="syncing metas: BaseFetcher: iter bucket: Access Denied"
level=info ts=2021-03-18T13:38:24.507583914Z caller=http.go:65 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: BaseFetcher: iter bucket: Access Denied"
level=info ts=2021-03-18T13:38:25.007714323Z caller=http.go:84 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: BaseFetcher: iter bucket: Access Denied"
level=info ts=2021-03-18T13:38:25.007758137Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="syncing metas: BaseFetcher: iter bucket: Access Denied"

@GiedriusS
Member

Could you please get the goroutine dump via the debug/pprof endpoint when this happens?

@wiardvanrij
Member Author

Could you please get the goroutine dump via the debug/pprof endpoint when this happens?

That's not possible as this endpoint is shut down :(

@bwplotka
Member

Looks like a bug, but I would suggest upgrading Thanos so we are sure we know what's happening.

To repro, it looks like one could try running the compactor with AccessDenied (:

@bwplotka bwplotka added the bug label Mar 24, 2021
@wiardvanrij
Member Author

We had a KIAM outage, so I think the way to reproduce it is to allow access -> remove access.

@GiedriusS
Member

Another thing is that we should only turn off the HTTP server at the end of everything to permit debugging via pprof in cases such as this.
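
For illustration, a minimal Go sketch of that ordering idea: keep a debug HTTP server (which serves /debug/pprof) running until all other shutdown work has finished, and tear it down last. This is not the actual Thanos wiring; the port and the runMainWork placeholder are assumptions.

```go
package main

import (
	"context"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"time"
)

// runMainWork is a hypothetical placeholder for the component's real work and
// its shutdown path (compaction loop, meta syncing, cleanup, ...).
func runMainWork() {
	time.Sleep(2 * time.Second)
}

func main() {
	// Start the debug HTTP server first and keep it up for the whole lifetime
	// of the process, so /debug/pprof/goroutine stays reachable even if the
	// shutdown of everything else hangs.
	srv := &http.Server{Addr: ":10902"} // nil Handler -> DefaultServeMux, where pprof registered itself
	go func() { _ = srv.ListenAndServe() }()

	runMainWork()

	// Only now, as the very last step, shut the HTTP server down.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```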

@dswarbrick

dswarbrick commented Apr 9, 2021

Could you please get the goroutine dump via the debug/pprof endpoint when this happens?

That's not possible as this endpoint is shut down :(

Sounds like the same behaviour I observed in #3868.

@stale

stale bot commented Jun 9, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 9, 2021
@stale

stale bot commented Jun 23, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jun 23, 2021
@wiardvanrij
Member Author

Still relevant

@GiedriusS
Member

GiedriusS commented Jul 19, 2021

I have observed this a few times and AFAICT this happens because there aren't many cancellation points in the Prometheus compaction code or our downsampling code. So, two improvements can be made here:

  • Add more cancellation points to the code, i.e. places where we check whether the context is still valid (see the sketch below)
  • Close the HTTP server only at the end of the whole shutdown operation

Help welcome.
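
To illustrate the first point: a cancellation point is just an explicit context check inside a long-running loop, so the loop can stop as soon as shutdown is requested. A minimal Go sketch with hypothetical names (not the actual Prometheus/Thanos compaction code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// compactGroups is a hypothetical stand-in for a long-running compaction or
// downsampling loop. The select at the top of each iteration is the
// "cancellation point": once the context is cancelled (e.g. during shutdown),
// the loop returns promptly instead of keeping the process busy.
func compactGroups(ctx context.Context, groups []string) error {
	for _, g := range groups {
		select {
		case <-ctx.Done():
			return ctx.Err() // propagate the cancellation upwards
		default:
		}
		// ... one unit of potentially slow work for this group ...
		fmt.Println("compacting", g)
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	if err := compactGroups(ctx, []string{"a", "b", "c", "d"}); err != nil {
		fmt.Println("stopped early:", err)
	}
}
```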

@rudo-thomas

We've hit this with v0.21.1 running on Kubernetes against a locally-hosted S3 (Ceph with radosgw):

level=info ts=2021-07-26T08:49:29.403830908Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
level=info ts=2021-07-26T08:49:29.403850994Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
level=info ts=2021-07-26T08:49:29.403872117Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
level=info ts=2021-07-26T08:50:29.12294803Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=332.354878ms cached=924 returned=924 partial=0
level=error ts=2021-07-26T08:56:28.79212747Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=warn ts=2021-07-26T08:58:19.271976663Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=info ts=2021-07-26T08:58:19.272120069Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=error ts=2021-07-26T08:58:19.273050359Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context canceled"
level=info ts=2021-07-26T08:58:19.273219033Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"
level=info ts=2021-07-26T08:58:19.273261935Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="syncing metas: incomplete view: meta.json file exists: 01DP93WG84FMQQ0AWQXC1BTC20/meta.json: stat s3 object: Head \"http://ceph.internal:8080/thanos/01DP93WG84FMQQ0AWQXC1BTC20/meta.json\": context deadline exceeded"

After this the process was not doing anything anymore (as described above) and did not respond to SIGTERM either.

@ianwoolf
Contributor

ianwoolf commented Oct 10, 2021

We've hit this with v0.21.1 running on Kubernetes against a locally-hosted S3 (Ceph with radosgw): [...]

After this the process was not doing anything anymore (as described above) and did not respond to SIGTERM either.

By default, the compactor does not crash on halt errors; there is a hidden flag to change that.

https://thanos.io/tip/components/compact.md/#halting

I also hit the same issue and am still investigating.
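
For reference, the hidden flag referred to above appears to be debug.halt-on-error (boolean, default true), per the linked Halting docs; disabling it should make the compactor exit on critical (halt) errors instead of staying up idle. A sketch of such an invocation, with placeholder paths, to be double-checked against your Thanos version:

```
# Illustrative only; paths are placeholders for your own setup.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --no-debug.halt-on-error
```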

@stale

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jan 9, 2022
@GiedriusS GiedriusS removed the stale label Jan 10, 2022
@stale

stale bot commented Apr 17, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 17, 2022
@stale

stale bot commented May 1, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed May 1, 2022
@Venture200

@rudo-thomas, @ianwoolf Hi guys, have you found any solution for that issue (meta.json)?

@rudo-thomas

@Venture200 we added a Kubernetes liveness probe.

@Venture200

@rudo-thomas Thanks for replying. So when the compactor crashes, it gets restarted again?

@rudo-thomas

@Venture200 there is no crash to speak of, see the comments above.

Compactor stops responding to liveness probes and gets restarted by Kubernetes.
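
For anyone replicating that workaround: a liveness probe against the compactor's HTTP port works here because the internal server is shut down once the component turns unhealthy, so the probe starts failing and Kubernetes restarts the pod. A rough sketch of the relevant container spec fragment; the port and thresholds are assumptions to adapt to your deployment:

```yaml
# Illustrative excerpt from the compactor container spec.
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 10902        # Thanos HTTP port (default)
  periodSeconds: 30
  failureThreshold: 4  # roughly two minutes of failed probes before a restart
```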

@Venture200

Venture200 commented Nov 28, 2022

level=error ts=2022-11-27T11:08:57.563444234Z caller=runutil.go:100 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: context canceled"
level=error ts=2022-11-27T11:08:57.563603512Z caller=compact.go:499 msg="retriable error" err="compaction: sync: BaseFetcher: iter bucket: context canceled"
level=warn ts=2022-11-27T11:08:57.565572915Z caller=intrumentation.go:67 msg="changing probe status" status=not-ready reason="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get "https://s3./thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json": net/http: timeout awaiting response headers"
level=info ts=2022-11-27T11:08:57.565650049Z caller=http.go:84 service=http/server component=compact msg="internal server is shutting down" err="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get "https://s3./thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json": net/http: timeout awaiting response headers"
level=info ts=2022-11-27T11:08:57.565803191Z caller=http.go:103 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: filter metas: filter blocks marked for deletion: get file: 01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json: Get "https://s3./thanos/01GHER92ZHMSSCW6N553ZBXW4D/deletion-mark.json": net/http: timeout awaiting response headers"

In my case the compactor kept running for more than 24 hours, but after this error appeared the process was killed. So I believe the restart happens after this error in your case as well? @rudo-thomas
