Crash in KgoVerifierWithSiTestLargeSegments.test_si_without_timeboxed
#5753
Checked if this was related to #5613 and no: this one is a storage assertion, that one is a segfault. |
If I had to guess, my suspicion would be that this is something in the SI cache changes, tearing down readers uncleanly on eviction perhaps. |
there are a couple of issues I can see (17GB log file so still analysing)
here the node has election leadership for the partition |
on further review, bad allocs are throughout the log during segment uploads. |
Regarding the error string in the OP (newlines mine):
Am I reading this right that two nodes failed, one with the |
Yes. I didn't look closely at the second one because we're in the "all bets are off" mode of bad_alloc exception paths: I've seen that kind of future error before when we're not cleanly shutting down things like input streams. |
Probably a recurrence: FAIL test: FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600 (1/1 runs) |
Seastar's abort-on-bad-alloc report showed a very large number of 128K entries (this core had 7GB of memory):
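For anyone reproducing: Seastar can abort and print a memory diagnostics report on allocation failure. To the best of my knowledge the relevant reactor flags are the ones below, but they should be verified against the Seastar version in use:

```
# verify these flags against your Seastar version
redpanda --abort-on-seastar-bad-alloc \
         --dump-memory-diagnostics-on-alloc-failure-kind=all
```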
Changing the write_behind value here (default_writebehind) and setting it to 1, instead of the current 128K (10 was apparently the intended value), brings the CPU and mem usage way down. It represents the number of buffers to write in parallel.
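To make the memory impact concrete, here is a minimal sketch of how the setting is consumed. The `file_output_stream_options` struct and its `write_behind`/`buffer_size` fields are Seastar's real API; the surrounding function is illustrative, not the actual cache_service code:

```cpp
// Illustrative sketch (not the actual cache_service code). In Seastar,
// write_behind is a *count of buffers* kept in flight per output stream,
// so memory held per stream is roughly write_behind * buffer_size.
#include <seastar/core/fstream.hh>

seastar::future<seastar::output_stream<char>>
make_cache_put_stream(seastar::file f) {
    seastar::file_output_stream_options opts;
    opts.buffer_size = 128 * 1024; // 128 KiB write buffer
    // Intended value was ~10; the bug set this to 131072, i.e. up to
    // 131072 * 128 KiB (~16 GiB) of in-flight buffers per stream.
    opts.write_behind = 10;
    return seastar::make_file_output_stream(std::move(f), opts);
}
```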
I did not see bad allocs with that change, but saw a (perhaps 503-related) error for which I need to raise another ticket. These are bad log lines, not a crash:
Still testing with more runs on cdt. |
I made some other changes to the build (reduce size of chunk encoder for http etc) which are probably not related to this issue. Need to also revert those changes and do a clean test on CDT. |
tested with just the write_behind count fixed, and still got a bad alloc. the write behind count is 10 and it uses a buffer size of 128KB. will try with a smaller count like 1 and same buffer size.
|
Not seeing any memory/allocation related errors with write behind (parallel writes) set to 1. Memory usage is stable throughout the test at around 65-70% of the node's 32 GB. However, with this set of values the test run is really slow, and even after running for hours the sequential consumer was not able to consume the full 20k messages (although it made constant, very slow progress). Need to investigate the slowness. It looks like the write-behind value for the cache service needs to be configurable, as @Lazin suggested. |
Excellent investigation @abhijat |
Is this issue something that needs to be resolved for the release? or is this strictly for scale testing? |
I think we should fix the writebehind value for the cache service from 131072 to the original intended value of 10 before release. I will raise a PR for it soon. Making it part of adjustable config can be done in a future PR perhaps. I ran the test overnight and it failed because the random consumer ran into a timeout after consuming around 15k of the 20k expected messages. The broker logs did not show any bad allocs, only some errors related to 503s from Amazon which were recovered, and no crashes. Will try once again with the consumer timeout set to a very high value to check if it finishes. |
Tried several more tests with varying parameters; any value >= 5 for the parameter default_writebehind results in bad allocs in my testing, but there are a few things not clear yet:
- the writebehind value bug is very old, at least from 2021, so it should have consistently affected the test, but it did not until recently. Why this new set of failures?
- the segment size in the failing test is 100*2**20 bytes and the cloud storage cache size is 5*2**20 bytes. This seems to result in the cache continuously evicting segments as the random consumer triggers segment downloads. Is the cache size to segment size ratio realistic? What is a good cache size for testing? Assume it would be some multiple of the segment size. Even so, why did this test pass earlier when the cache size is unrealistically small? pandaresults shows a few failures in the last 4 months but a majority of passes, which hints at some recent regression.
- The 128 KiB cache service write buffer doesn't seem to cause the bulk of the 51k objects present when the shard runs out of memory: I switched it to 256k, and when the shard aborted on memory exhaustion the 51k objects were still of 128k size, not 256k.
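(For scale, assuming the abort-report entries are individual 128 KiB allocations: 51,000 x 131,072 bytes is about 6.7 GB, i.e. roughly the entire 7 GB shard from the earlier report.)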
With increased cache size (around 500MB) the test passes very quickly. Still have not been able to capture a flame graph of memory usage from the cdt node, gdb complains about missing thread information/mismatch of libthread-db. Trying to get the redpanda process to dump core on abort currently, as well as attaching a local gdb to gdbserver running on remote, as the flamegraph script works locally and the libpthread on remote is stripped (which apparently can cause gdb not to be able to work with multithreaded applications). |
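For reference, a minimal recipe for the two approaches mentioned above (core-on-abort and remote gdbserver); these are standard tools, and the paths/ports are illustrative:

```
# enable core dumps on the node (path and limits are illustrative)
ulimit -c unlimited
sudo sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p

# on the remote node: attach gdbserver to the running redpanda process
gdbserver --attach :9999 $(pgrep -o redpanda)

# locally, with a matching binary and symbols available:
gdb /path/to/redpanda -ex 'target remote <remote-host>:9999'
```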
Maybe this is caused by the index encoder. SI index files are compressed and are below 1KiB most of the time, but the encoder/decoder are both using iobuf to store data under the hood. And it looks like the max allocation size is 128KiB
https://github.com/redpanda-data/redpanda/blob/9949ee880eeb5814ad01fd667d1269303d82ccc6/src/v/bytes/iobuf.h#L217
so maybe, when the segment index is materialized from disk, it somehow allocates a 128K chunk. |
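A minimal sketch of the failure mode being hypothesized here, with invented names (this is not the actual encoder or iobuf code): if materialization reserves the maximum fragment size up front, each tiny index pins a full 128 KiB.

```cpp
// Hypothetical sketch, not redpanda's actual code: reserving the iobuf
// max fragment size up front means a sub-1 KiB compressed index still
// pins a full 128 KiB of memory for as long as it stays materialized.
#include <algorithm>
#include <cstddef>
#include <memory>

constexpr size_t max_fragment = 128 * 1024; // the limit noted in iobuf.h

struct materialized_index {
    std::unique_ptr<char[]> buf;
    size_t used;
};

materialized_index materialize(const char* data, size_t len) {
    // suspected pattern: allocate max_fragment regardless of len
    materialized_index idx{std::make_unique<char[]>(max_fragment), len};
    std::copy(data, data + len, idx.buf.get()); // len is typically < 1 KiB
    return idx; // 51k of these would pin ~6.2 GiB
}
```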
Maybe we are not able to decrement the cache size here https://github.com/redpanda-data/redpanda/blob/dev/src/v/cloud_storage/cache_service.cc#L161 due to shadowing: the cache size variable being updated is a local variable? It seems to be the only place where the cache size is decremented, so possibly we are always trying to clean up. But if this variable is not shadowing the instance field then this should be okay. |
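For illustration, the kind of shadowing bug being suspected would look like this (names are invented; this is not the actual cache_service code):

```cpp
// Illustrative shadowing bug (hypothetical names, not the real code):
#include <cstdint>

class cache {
    uint64_t _current_cache_size{0};

public:
    void on_delete(uint64_t freed) {
        // BUG: declares a new local instead of updating the member, so
        // _current_cache_size never decreases and cleanup always retriggers.
        auto _current_cache_size = freed; // shadows the member
        (void)_current_cache_size;
    }

    void on_delete_fixed(uint64_t freed) {
        _current_cache_size -= freed; // updates the member as intended
    }
};
```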
Ran a build on CDT nodes which replaces the new per-cache-put eviction with the old 30-second periodic eviction logic, and it consistently passes the test within 4 minutes. The cache size was set to 5MB with segment size at 100MB for this test. Also saw the cache grow to much more than the allowed limit several times during the test:
So it seems that with the periodic cleanup, the consumers in the test were able to read during the windows where we had grown the cache to much larger than 5 MB. With the stricter eviction we have now this no longer happens, causing the test to fail: it creates a cycle of download -> cache::put -> evict the only segment in the cache. This still doesn't explain what causes the constantly increasing memory during the cache churn. We should adjust the test to the new eviction system and continue to investigate memory usage during the pathological case (small cache + larger segment size). Additionally we can add validation to avoid a cache size smaller than the segment size #5896 |
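A tiny self-contained simulation of that cycle, using the test's numbers (5 MiB cache, 100 MiB segments); this is illustrative, not redpanda code:

```cpp
// With cache limit << segment size, strict eviction removes each segment
// right after it is added, so every read forces a full re-download.
#include <cstdint>
#include <cstdio>

int main() {
    constexpr uint64_t cache_limit = 5ull << 20;    // 5 MiB (test config)
    constexpr uint64_t segment_size = 100ull << 20; // 100 MiB segments
    uint64_t cache_used = 0, downloads = 0;

    for (int read = 0; read < 3; ++read) {
        // cache miss every time: the segment was evicted on the last put
        cache_used += segment_size;
        ++downloads;
        while (cache_used > cache_limit) {
            cache_used -= segment_size; // evicts the only segment in cache
        }
    }
    std::printf("3 reads -> %llu downloads, cache ends empty\n",
                (unsigned long long)downloads);
    return 0;
}
```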
Probably won't help at this point, but here's another recent occurrence: https://buildkite.com/redpanda/vtools/builds/3133#01827642-8678-41b8-a243-c532aa5f80ce. |
I can reproduce this locally by setting a small cache size eg 5 MB and simply running kgo-verifier like this:
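The original command was not captured in this transcript; a representative invocation might look like the following. The flag names should be checked against kgo-verifier's --help, and the cache property is set via rpk:

```
# shrink the cloud storage cache to force constant eviction (5 MiB)
rpk cluster config set cloud_storage_cache_size 5242880

# representative kgo-verifier run: produce, then hammer random reads
# (flag names are illustrative; check kgo-verifier --help)
kgo-verifier --brokers localhost:9092 --topic topic-si \
    --msg_size 16384 --produce_msgs 20000 \
    --rand_read_msgs 1000 --parallel 8
```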
I suspected that the number of read requests was piling up as more and more requests came in so I added a counter in
When running kgo-verifier, the read requests keep steadily increasing, until at one point we start running out of FDs
this doesn't show up on CDT though, as the ulimit here was low (1024); but it does show that we opened up a lot of files.
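For reference, the FD growth is easy to observe from the shell while the verifier runs (standard tooling; process selection is illustrative):

```
# file descriptors currently open by the redpanda process
ls /proc/$(pgrep -o redpanda)/fd | wc -l

# per-process limit that the 1024 figure above refers to
prlimit --nofile -p $(pgrep -o redpanda)
```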
If enabling queue depth control via |
With the number of parallel reads set to 8, 50 reads, and qdc enabled, redpanda is able to finish the requests within a short period of time even with a small cache like 5 MB. But for something like 200 reads / 8 parallel it becomes really slow and takes nearly 10 minutes to finish. It does however finish without running into bad allocs with qdc enabled. |
FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600
KgoVerifierWithSiTest.test_si_without_timeboxed.segment_size=104857600
KgoVerifierWithSiTest.test_si_without_timeboxed.segment_size=104857600
KgoVerifierWithSiTestLargeSegments.test_si_without_timeboxed
This hasn't reoccurred in the last 30 days. We have tracking elsewhere for limiting reader concurrency:
So I think we can safely drop this ticket. |
https://buildkite.com/redpanda/vtools/builds/3073#018252da-dd82-4dac-bad9-ea37369ded6e