cloud_storage: Segment deleted while S3 upload in progress (Failure in FranzGoVerifiableWithSiTest.test_si_with_timeboxed and test_si_without_timeboxed) #4624
Looking at the recent change, the main thing is the segment sizes changing from 5MB to 20MB or 100MB. The larger segment sizes could go some way toward explaining the bad_alloc, but the file descriptor error is a bit surprising (bigger segments = fewer FDs).
On the FD error (https://buildkite.com/redpanda/redpanda/builds/9898#38cf2489-dd4e-4457-86db-9d3d1d27f849), the specific error is:
So this was happening during upload. I think it's happening because a segment is getting deleted out from under us.
It's possible this is showing up with larger segment sizes because they take longer to upload, creating a longer time window for the delete to overlap with the upload. AIUI we're not meant to delete things until they're in S3, no matter the retention settings, so this could be a pretty serious bug.
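For reference, a minimal self-contained sketch of that invariant; the names here (`max_collectible_offset`, `offset_t`) are hypothetical illustrations, not the actual Redpanda code. The point is simply that retention-driven gc should be capped by the upload watermark:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

// Hypothetical offset type standing in for Redpanda's model::offset.
using offset_t = int64_t;

// GC must not reclaim a segment until it is safely in S3, no matter what
// retention says, so the eviction point is capped by the upload watermark.
offset_t max_collectible_offset(offset_t retention_point, offset_t last_uploaded) {
    return std::min(retention_point, last_uploaded);
}

int main() {
    // Retention wants to evict up to offset 1000, but uploads have only
    // reached 400: gc may collect only up to 400.
    std::cout << max_collectible_offset(1000, 400) << '\n'; // prints 400
}
```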
The FD issue just reproduced on clustered ducktape, so it's not related to the resource constraints in docker. This reproduced 1 time in 40 runs (a FranzGoVerifiableWithSiTest.test_si_with_timeboxed failure; FranzGoVerifiableWithSiTest.test_si_without_timeboxed is also affected).
Yes, the changes were in response to a need to test varying SI params.
Another instance, same symptoms (removal before upload): https://buildkite.com/redpanda/redpanda/builds/9879#758a00a6-d312-4be5-b266-b12986129c3e
Last night's occurrence: https://buildkite.com/redpanda/redpanda/builds/10004#7237a2d6-f731-4c02-bd01-edda06032775
Discussed with @Lazin and @ztlpn (thanks for pointing me in the right direction): this issue is more related to leadership transfer than to a segment being deleted while being uploaded. Basically, leadership is transferred while the leader is waiting for an http client to do the upload, and by the time it gets a handle to the http client it has lost leadership. Because the new leader uploads the segment, the archival metadata stm offset is moved forward and gc deletes the file out from under the old leader's upload. The rough sequence of events (in one failure instance with two nodes, rp-4 and rp-15):
The only suspicious part is that the gc happened a second after the old leader acquired the http client, so the old leader should have been able to finish the upload in that duration. But looking at the http client acquisition, many uploads were queued up during that time period, so it may be that the uploads were slow. I'm currently confirming against another failure instance that this understanding is consistent; if so, it can be fixed with a leadership check before doing the upload after waiting for the http client, since we may have lost leadership while waiting for a client from the pool (see the sketch below). The existing read lock added in the linked PR can then be removed.
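A minimal sketch of that proposed fix, using hypothetical stand-in types (`partition_stub`, `client_pool_stub`) rather than the real archival service code:

```cpp
#include <iostream>
#include <optional>
#include <string>

// Hypothetical stand-ins for the archival service's collaborators.
struct http_client { /* connection state, etc. */ };

struct partition_stub {
    bool is_leader() const { return leader; }
    bool leader = true;
};

struct client_pool_stub {
    // In the real code this waits until a client is free; the wait can be
    // long when many uploads are queued, which is exactly the window in
    // which leadership can move to another node.
    http_client acquire() { return http_client{}; }
};

// Sketch of the fix: re-check leadership *after* the potentially long wait
// for a client, and bail out instead of uploading a segment that the new
// leader will upload (and that gc may then delete out from under us).
std::optional<std::string>
upload_segment(partition_stub& p, client_pool_stub& pool) {
    http_client client = pool.acquire(); // possibly a long wait
    if (!p.is_leader()) {
        // Leadership moved while we waited; abort before touching the file.
        return std::nullopt;
    }
    // ... perform the PUT to S3 with `client` ...
    return "uploaded";
}

int main() {
    partition_stub p;
    client_pool_stub pool;
    p.leader = false; // simulate losing leadership during the wait
    auto r = upload_segment(p, pool);
    std::cout << (r ? *r : "skipped: lost leadership") << '\n';
}
```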
In the other instance, it looks like the upload had already started when leadership transferred and the file was deleted; the leader does not wait on acquiring a client there.
We might have to handle the error with a leadership check on upload failure, and ignore the error if leadership was lost during the upload (sketched below).
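A minimal sketch of that error-handling idea, again with hypothetical stand-ins; the thrown message just mirrors the "Bad file descriptor" error seen in the node logs:

```cpp
#include <iostream>
#include <stdexcept>

// Hypothetical stand-in; see the previous sketch for context.
struct partition_stub {
    bool is_leader() const { return leader; }
    bool leader = true;
};

// Simulates an upload failing mid-flight because gc removed the file after
// the new leader uploaded it and advanced the archival stm offset.
void do_upload(bool fail) {
    if (fail) throw std::runtime_error("Bad file descriptor");
}

// Sketch: on upload failure, check leadership; if it was lost, the failure
// is expected (the new leader owns the segment now) and can be downgraded
// from an error to a benign skip.
bool upload_with_tolerant_errors(partition_stub& p, bool fail) {
    try {
        do_upload(fail);
        return true;
    } catch (const std::exception& e) {
        if (!p.is_leader()) {
            std::cerr << "ignoring upload error after leadership loss: "
                      << e.what() << '\n';
            return false; // not fatal; the new leader has the segment
        }
        throw; // still leader: this is a real failure
    }
}

int main() {
    partition_stub p;
    p.leader = false; // leadership moved while the upload was in flight
    upload_with_tolerant_errors(p, /*fail=*/true);
}
```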
Failing in 1/18 runs in the last 24h.
Failures:
https://buildkite.com/redpanda/redpanda/builds/9898#38cf2489-dd4e-4457-86db-9d3d1d27f849
There is a "Bad file descriptor" error in the log of one of the redpanda nodes.