Abnormal metrics reported by tiflash-proxy #6347

Closed
JaySon-Huang opened this issue Nov 22, 2022 · 4 comments · Fixed by pingcap/tidb-engine-ext#224 or #6380

JaySon-Huang commented Nov 22, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Deploy a v6.4.0 cluster and run high pressure update workloads.

2. What did you expect to see? (Required)

3. What did you see instead (Required)

The CPU and memory usage reported by tiflash-proxy are abnormal, while the metrics reported by tiflash look normal.
[screenshots: tiflash-proxy CPU and memory metrics]

4. What is your TiFlash version? (Required)

v6.4.0

JaySon-Huang added the type/bug label (The issue is confirmed as a bug.) on Nov 22, 2022

hehechen commented Nov 23, 2022

This is because the profile takes more time than expected, causing the status_server thread to get stuck and be unable to process the HTTP requests from Prometheus.
For example, a 10s profile can take more than a minute:
[screenshot: log showing a 10s profile taking over a minute]

Why the profile takes longer than expected is still under investigation.
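To make the failure mode concrete, here is a minimal sketch, not the actual tiflash-proxy status server (the port and the handling are made up for illustration): a single-threaded status endpoint where one long-running profile request blocks every later /metrics request, so Prometheus scrapes time out and the reported metrics look abnormal.

```rust
// Sketch only: a single-threaded request loop standing in for the status server.
use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread::sleep;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // Hypothetical status port, chosen for illustration.
    let listener = TcpListener::bind("127.0.0.1:20292")?;

    // Requests are served one at a time on this thread.
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut buf = [0u8; 1024];
        let n = stream.read(&mut buf)?;
        let req = String::from_utf8_lossy(&buf[..n]);

        if req.starts_with("GET /debug/pprof/profile") {
            // Stand-in for collecting a CPU profile. If symbol resolution is slow,
            // this runs far past the nominal 10s window, and every /metrics
            // request below waits behind it.
            sleep(Duration::from_secs(60));
            stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 7\r\n\r\nprofile")?;
        } else {
            // Prometheus scrape; it times out long before the profile finishes.
            stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")?;
        }
    }
    Ok(())
}
```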

hehechen commented:

It seems that the overhead of DWARF is larger than that of frame pointers. I tried reverting this PR and the problem went away.
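For anyone who wants to compare the two modes themselves, here is a rough sketch using the `backtrace` crate (add `backtrace = "0.3"` as a dependency); it is not the measurement used in this issue. Whether frame pointers or DWARF are used depends on how the binary is built (for example `RUSTFLAGS="-C force-frame-pointers=yes"` or pprof's `frame-pointer` feature), so run it on both builds and compare.

```rust
// Sketch: time capture and symbol resolution of a single backtrace separately.
use std::time::Instant;

fn main() {
    // Capture first without resolving, then resolve, so the two costs are
    // visible separately. Symbol resolution is where DWARF-heavy builds
    // tend to pay more.
    let t0 = Instant::now();
    let mut bt = backtrace::Backtrace::new_unresolved();
    let captured = t0.elapsed();

    let t1 = Instant::now();
    bt.resolve();
    let resolved = t1.elapsed();

    println!(
        "frames: {}, capture: {:?}, resolve: {:?}",
        bt.frames().len(),
        captured,
        resolved
    );
}
```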


hehechen commented Nov 29, 2022

I added some logs in https://github.com/rust-lang/backtrace-rs/blob/a59e64f50c571d9675616321b22c2953d70d1a8b/src/symbolize/gimli.rs#L338 and found that the number of symbols resolved with DWARF is larger than with frame pointers.
DWARF:
[screenshot: symbol count logged with DWARF]
frame pointer:
[screenshot: symbol count logged with frame pointers]
So the symbol-resolution overhead of DWARF is larger than that of frame pointers.
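As a rough application-side approximation of that logging (this is not the actual patch to gimli.rs), one can walk a single backtrace and count frames and resolved symbols; a deeper backtrace means more symbols (and more object files) to resolve per sample.

```rust
// Sketch: count frames and named symbols in one backtrace using the
// `backtrace` crate.
fn count_frames_and_symbols() -> (usize, usize) {
    let mut frames = 0usize;
    let mut named = 0usize;
    backtrace::trace(|frame| {
        frames += 1;
        backtrace::resolve_frame(frame, |symbol| {
            if symbol.name().is_some() {
                named += 1;
            }
        });
        true // keep walking the whole stack
    });
    (frames, named)
}

fn main() {
    let (frames, named) = count_frames_and_symbols();
    println!("frames: {frames}, resolved symbols: {named}");
}
```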


hehechen commented Dec 1, 2022

The root cause is that the DWARF backtrace is deeper than the frame-pointer backtrace, so DWARF needs to resolve more shared libraries when resolving symbols. But the capacity of the lib cache in backtrace-rs is only 4, so cache misses occur frequently in the DWARF scenario.
After I increased the capacity of the lib cache in backtrace-rs to 10, the 10-second profile took less than 11 seconds and the metrics became normal.
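A simplified model of that cache behavior (this is not the backtrace-rs code; the library names, eviction details, and the "debug info" payload are illustrative): with a fixed capacity of 4 and a backtrace that repeatedly touches more than 4 shared libraries, every lookup misses and the expensive per-library state has to be rebuilt each time.

```rust
// Sketch of a small most-recently-used library cache with fixed capacity.
use std::collections::VecDeque;

struct LibCache {
    capacity: usize,
    // Front = most recently used. Entry is (library path, parsed debug info stand-in).
    entries: VecDeque<(String, String)>,
    hits: usize,
    misses: usize,
}

impl LibCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: VecDeque::new(), hits: 0, misses: 0 }
    }

    fn get_or_load(&mut self, lib: &str) -> &str {
        if let Some(pos) = self.entries.iter().position(|(name, _)| name.as_str() == lib) {
            self.hits += 1;
            let entry = self.entries.remove(pos).unwrap();
            self.entries.push_front(entry); // move to front (most recently used)
        } else {
            self.misses += 1;
            if self.entries.len() == self.capacity {
                self.entries.pop_back(); // evict least recently used
            }
            // Stand-in for the expensive part: mapping the library and parsing DWARF.
            self.entries.push_front((lib.to_string(), format!("debug info for {lib}")));
        }
        self.entries.front().unwrap().1.as_str()
    }
}

fn main() {
    // Hypothetical set of shared libraries touched by one deep (DWARF-style) backtrace.
    let libs = ["libc.so", "libgcc_s.so", "libstdc++.so", "libpthread.so", "libm.so", "libdl.so"];
    let mut cache = LibCache::new(4);
    for _ in 0..1000 {
        for lib in libs {
            cache.get_or_load(lib);
        }
    }
    // With 6 libraries cycling through a 4-entry cache, every access misses.
    println!("hits: {}, misses: {}", cache.hits, cache.misses);
}
```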

One backtrace comparison:
frame pointer:
[screenshot: frame-pointer backtrace]
DWARF:
[screenshot: DWARF backtrace]

I added a log when inserting into the lib cache, and it showed:
Libraries that need to be resolved in the DWARF scenario:
[screenshot: library list under DWARF]
Libraries that need to be resolved in the frame-pointer scenario:
[screenshot: library list under frame pointers]

The flame graph of DWARF symbol resolution:
[screenshot: flame graph]
