Cluster membership did not stabilize TopicRecreateTest.test_topic_recreation_while_producing.workload=ACKS_1.cleanup_policy=delete #5510
Hoping this can be addressed with a longer timeout--but I'm digging into the logs to see if there is any buggy behavior behind this.
Was it caused by a redpanda process crashing?
Edit: looking at the first reproduction here. Some observations:
Question: Why is only one of the nodes reporting this, when the test runner is GETing from all of the nodes' admin APIs?
This codepath is
We're waiting 30 seconds for all brokers to join the cluster, which times out at
I looked for client error messages in the test log--to see if the admin client was failing based on this missing superuser (?)
In the first reproduction, we timed out waiting for
Side note: IMHO, this complex "wait until the cluster is stable" logic that clients have to do before they can use a cluster should really go away--clients should be able to submit requests iff the broker is in cluster quorum, à la #5076.
Looking at membership update messages:
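The "wait until all brokers join, with a 30-second timeout" check described above can be sketched as a simple poll loop. This is a hypothetical illustration, not the actual test-runner code; `get_broker_ids` is a stand-in for GETing each node's admin API.

```python
import time

def wait_for_stable_brokers(get_broker_ids, expected, timeout_s=30, poll_s=1.0):
    """Poll until every node reports the expected broker set, or time out.

    get_broker_ids: callable returning one broker-id set per node
    (hypothetical stand-in for querying each node's admin API).
    Returns True if the cluster stabilized within timeout_s, else False.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        views = get_broker_ids()
        # The cluster is "stable" only when every node agrees on membership;
        # a single lagging node (as seen in this failure) keeps us waiting.
        if all(view == expected for view in views):
            return True
        time.sleep(poll_s)
    return False
```

Under this framing, the failure mode is exactly one node returning a stale view for longer than the 30-second budget.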
Fixed the condition checking whether a range is locked. The incorrect check resulted in situations in which log truncation was blocked by pending readers. Log truncation is a multi-step process that ends with grabbing write locks on the necessary segments and deleting data. In order for truncation to grab the write lock, all readers that own read-lock units from the range being truncated must be evicted from `readers_cache`. Since log truncation contains multiple scheduling points, it may interleave with another fiber creating a reader for a log that is currently being truncated. This reader MUST NOT be cached, as truncation would then need to wait for it to be evicted. Additionally, no new readers can be created while the truncation-related write-lock request is waiting in the `read_write_lock` underlying semaphore waiters queue. To prevent readers requesting the truncated range from being cached, the readers cache maintains a list of locked ranges, i.e. ranges for which readers can not be cached. Previously, an incorrect condition checking whether a reader belongs to a locked range allowed it to be cached, preventing the `truncate` action from continuing. This stopped all other writes and truncation for 60 seconds; after this duration the reader was evicted from the cache, its lease was released, and truncation was able to finish. Fixed the incorrect condition checking whether a reader is within the locked range. Fixes: redpanda-data#5510 Signed-off-by: Michal Maslanka <michal@redpanda.com>
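The class of bug the commit describes can be sketched in a few lines: a reader must be rejected from the cache if it *overlaps* any locked (being-truncated) range, whereas a containment-style check lets partially overlapping readers slip through. This is a hypothetical illustration with invented names (`ReadersCache`, `maybe_cache`), not Redpanda's actual C++ implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ReadersCache:
    """Minimal sketch of a readers cache that refuses to cache readers
    touching a locked offset range (the range truncation is working on)."""
    locked_ranges: list = field(default_factory=list)  # [(first, last)] offsets
    cached: list = field(default_factory=list)

    def is_locked(self, first, last):
        # Correct check: reject if the reader's [first, last] OVERLAPS any
        # locked range. A buggy check that, e.g., requires the reader to sit
        # entirely inside a locked range would wrongly cache a partially
        # overlapping reader, forcing truncation to wait for its eviction.
        return any(first <= l_last and last >= l_first
                   for (l_first, l_last) in self.locked_ranges)

    def maybe_cache(self, first, last):
        if self.is_locked(first, last):
            return False  # truncation holds this range; caller must not cache
        self.cached.append((first, last))
        return True
```

With the correct overlap test, a reader spanning into the locked range is never cached, so truncation's write-lock request is not blocked behind a cached reader's lease.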
https://buildkite.com/redpanda/redpanda/builds/12713#01821520-04fc-48c2-9deb-4ba6466ce2d9