Cluster membership did not stabilize TopicRecreateTest.test_topic_recreation_while_producing.workload=ACKS_1.cleanup_policy=delete #5510

Closed
ZeDRoman opened this issue Jul 19, 2022 · 5 comments · Fixed by #5681

Comments

@ZeDRoman
Contributor

https://buildkite.com/redpanda/redpanda/builds/12713#01821520-04fc-48c2-9deb-4ba6466ce2d9

FAIL test: TopicRecreateTest.test_topic_recreation_while_producing.workload=ACKS_1.cleanup_policy=delete (1/27 runs)
  failure at 2022-07-19T08:21:10.289Z: TimeoutError('Cluster membership did not stabilize')
      in job https://buildkite.com/redpanda/redpanda/builds/12713#01821520-04fc-48c2-9deb-4ba6466ce2d9
@ajfabbri
Contributor

Hoping this can be addressed with a longer timeout--but I'm digging into the logs to see if there is any buggy behavior behind this.

@dotnwat
Member

dotnwat commented Jul 22, 2022

was it caused by a redpanda process crashing?

@ajfabbri
Contributor

ajfabbri commented Jul 22, 2022

Edit: looking at the first reproduction here. Some observations:

  • Don't see any backtraces
  • No ERROR log messages at all.
  • docker-rp-16 saw a lot of these WARN messages:
488:WARN  2022-07-19 07:33:52,124 [shard 0] admin_api_server - admin_server.cc:351 - Insecure Admin API listener on 172.16.16.8:9647, consider enabling `admin_api_require_auth`
667:WARN  2022-07-19 07:33:57,519 [shard 0] request_auth - request_auth.cc:107 - Client auth failure: user '{admin}' not found
670:WARN  2022-07-19 07:33:57,626 [shard 0] request_auth - request_auth.cc:107 - Client auth failure: user '{admin}' not found
690:WARN  2022-07-19 07:33:59,017 [shard 0] request_auth - request_auth.cc:107 - Client auth failure: user '{admin}' not found
693:WARN  2022-07-19 07:33:59,331 [shard 0] request_auth - request_auth.cc:107 - Client auth failure: user '{admin}' not found
708:WARN  2022-07-19 07:33:59,642 [shard 0] request_auth - request_auth.cc:107 - Client auth failure: user '{admin}' not found
723:WARN  2022-07-19 07:34:01,040 [shard 0] request_auth - request_auth.cc:107 - Client auth failure: user '{admin}' not found
731:WARN  2022-07-19 07:34:01,353 [shard 0] request_auth - request_auth.cc:107 - Client auth failure: user '{admin}' not found

Question: Why is only one of the nodes reporting this, when the test runner is GETting from all of the nodes' admin APIs?

@ajfabbri
Contributor

ajfabbri commented Jul 22, 2022

This codepath is start() in services/redpanda.py.

We're waiting 30 seconds for all brokers to join the cluster, which times out at 2022-07-19 07:34:29,437.

I looked for client error messages in the test log, to see if the admin client was failing because of this missing superuser (?) {admin}, and did not see any. Edit: this makes sense, since redpanda.admin_require_auth is 0.
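
For reference, a simplified sketch of the kind of membership check that start-up wait performs: poll every node's admin API until each one reports the expected broker count, otherwise fail with the timeout we saw above. The `/v1/brokers` endpoint and the helper names below are assumptions for illustration; the real logic is in services/redpanda.py.

```python
import time

import requests  # assumption: an HTTP client such as requests is available


def wait_for_stable_membership(admin_urls, expected_brokers,
                               timeout_sec=30, backoff_sec=2):
    """Poll each node's admin API until every node reports the expected
    number of brokers, or raise TimeoutError.

    Simplified sketch only; the real implementation lives in
    services/redpanda.py (RedpandaService.start()).
    """
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        try:
            # assumed admin API endpoint listing the brokers each node knows about
            counts = [
                len(requests.get(f"{url}/v1/brokers", timeout=5).json())
                for url in admin_urls
            ]
        except (requests.RequestException, ValueError):
            counts = []
        if counts and all(c == expected_brokers for c in counts):
            return
        time.sleep(backoff_sec)
    raise TimeoutError("Cluster membership did not stabilize")
```

With a per-node check like this, a single node that has not yet observed every peer is enough to trip the 30-second timeout.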

@ajfabbri
Contributor

In the first reproduction, we timed out waiting for docker-rp-20 to appear in docker-rp-16's broker list.

Side note: IMHO, this complex "wait until cluster is stable" logic that clients have to do before they can use a cluster should really go away; clients should be able to submit requests iff the broker is in cluster quorum, à la #5076.

Looking at membership update messages:

112/RedpandaService-0-140152690088640/docker-rp-20/redpanda.log
569:DEBUG 2022-07-19 07:33:55,561 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: added}
624:DEBUG 2022-07-19 07:33:55,580 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 2, type: added}
631:DEBUG 2022-07-19 07:33:55,583 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: reallocation_finished}
634:DEBUG 2022-07-19 07:33:55,588 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 3, type: added}
647:DEBUG 2022-07-19 07:33:55,597 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 4, type: added}
662:DEBUG 2022-07-19 07:33:55,619 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 2, type: reallocation_finished}
673:DEBUG 2022-07-19 07:33:55,634 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 5, type: added}
723:DEBUG 2022-07-19 07:33:57,453 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 3, type: reallocation_finished}
815:DEBUG 2022-07-19 07:34:02,648 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 4, type: reallocation_finished}
913:DEBUG 2022-07-19 07:34:07,760 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 5, type: reallocation_finished}

112/RedpandaService-0-140152690088640/docker-rp-16/redpanda.log
553:DEBUG 2022-07-19 07:33:52,334 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: added}
613:DEBUG 2022-07-19 07:33:52,356 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 2, type: added}
623:DEBUG 2022-07-19 07:33:52,364 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: reallocation_finished}
626:DEBUG 2022-07-19 07:33:52,373 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 3, type: added}
641:DEBUG 2022-07-19 07:33:52,389 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 4, type: added}

112/RedpandaService-0-140152690088640/docker-rp-11/redpanda.log
542:DEBUG 2022-07-19 07:33:45,816 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: added}
600:DEBUG 2022-07-19 07:33:45,835 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 2, type: added}
653:DEBUG 2022-07-19 07:33:47,569 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: reallocation_finished}
688:DEBUG 2022-07-19 07:33:49,103 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 3, type: added}
754:DEBUG 2022-07-19 07:33:52,321 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 4, type: added}
787:DEBUG 2022-07-19 07:33:52,429 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 2, type: reallocation_finished}
870:DEBUG 2022-07-19 07:33:55,559 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 5, type: added}
918:DEBUG 2022-07-19 07:33:57,448 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 3, type: reallocation_finished}
1041:DEBUG 2022-07-19 07:34:02,647 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 4, type: reallocation_finished}
1139:DEBUG 2022-07-19 07:34:07,760 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 5, type: reallocation_finished}

112/RedpandaService-0-140152690088640/docker-rp-12/redpanda.log
550:DEBUG 2022-07-19 07:33:49,101 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: added}
608:DEBUG 2022-07-19 07:33:49,119 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 2, type: added}
613:DEBUG 2022-07-19 07:33:49,124 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: reallocation_finished}
618:DEBUG 2022-07-19 07:33:49,128 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 3, type: added}
694:DEBUG 2022-07-19 07:33:52,323 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 4, type: added}
723:DEBUG 2022-07-19 07:33:52,432 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 2, type: reallocation_finished}
808:DEBUG 2022-07-19 07:33:55,558 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 5, type: added}
857:DEBUG 2022-07-19 07:33:57,450 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 3, type: reallocation_finished}
951:DEBUG 2022-07-19 07:34:02,647 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 4, type: reallocation_finished}
1035:DEBUG 2022-07-19 07:34:07,760 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 5, type: reallocation_finished}

112/RedpandaService-0-140152690088640/docker-rp-19/redpanda.log
415:DEBUG 2022-07-19 07:33:42,299 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: added}
616:DEBUG 2022-07-19 07:33:45,581 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 2, type: added}
823:DEBUG 2022-07-19 07:33:47,316 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 1, type: reallocation_finished}
921:DEBUG 2022-07-19 07:33:48,908 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 3, type: added}
1324:DEBUG 2022-07-19 07:33:52,201 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 4, type: added}
1503:DEBUG 2022-07-19 07:33:52,396 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 2, type: reallocation_finished}
1947:DEBUG 2022-07-19 07:33:55,469 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 5, type: added}
2230:DEBUG 2022-07-19 07:33:57,437 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 3, type: reallocation_finished}
2825:DEBUG 2022-07-19 07:34:02,463 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 4, type: reallocation_finished}
3326:DEBUG 2022-07-19 07:34:07,490 [shard 0] cluster - members_backend.cc:118 - membership update received: {node_id: 5, type: reallocation_finished}

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 27, 2022
Fixed the condition that checks whether a range is locked. The
incorrect check resulted in situations in which log truncation was
blocked by pending readers.

Log truncation is a multi-step process that ends with grabbing the
necessary segments' write locks and deleting data. In order for
truncation to grab the write lock, all readers which own read lock
units from the range being truncated must be evicted from
`readers_cache`. Since log truncation contains multiple scheduling
points, it may interleave with another fiber creating a reader for the
log that is currently being truncated. This reader MUST NOT be cached,
as truncation would then have to wait for it to be evicted.
Additionally, no new readers can be created while the truncation-related
write lock request is waiting in the waiters queue of the semaphore
underlying `read_write_lock`.

To prevent readers that request the truncated range from being cached,
the readers cache maintains a list of locked ranges, i.e. ranges for
which readers cannot be cached.

Previously, an incorrect condition checking whether a reader belongs to
a locked range allowed it to be cached, preventing the `truncate`
action from continuing. This stopped all other writes and truncation
for 60 seconds; after this duration the reader was evicted from the
cache, its lease was released, and truncation was able to finish.

Fixed the incorrect condition checking whether a reader is within the
locked range.

Fixes: redpanda-data#5510

Signed-off-by: Michal Maslanka <michal@redpanda.com>
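
The fix itself lives in the C++ `readers_cache`; purely as a conceptual sketch (Python, with hypothetical names rather than the real data structures), the check described above amounts to refusing to cache any reader whose offset range intersects a locked range:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class OffsetRange:
    # inclusive offset interval [first, last]
    first: int
    last: int


def intersects(a: OffsetRange, b: OffsetRange) -> bool:
    """True when the two offset ranges share at least one offset."""
    return a.first <= b.last and b.first <= a.last


def may_cache_reader(reader: OffsetRange,
                     locked_ranges: Iterable[OffsetRange]) -> bool:
    """A reader may only be cached if it does not touch any locked range;
    caching it would hold read lock units and stall the pending truncation."""
    return not any(intersects(reader, locked) for locked in locked_ranges)


# A truncation that locked offsets [100, 200] must keep out a reader over
# [150, 250], even though that reader is not fully inside the locked range.
locked = [OffsetRange(100, 200)]
assert not may_cache_reader(OffsetRange(150, 250), locked)
assert may_cache_reader(OffsetRange(201, 300), locked)
```

This only illustrates the intended behaviour; the commit does not spell out the exact form of the original incorrect condition, and the actual check operates on the cache's internal offset ranges.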
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 28, 2022
andrwng pushed a commit to andrwng/redpanda that referenced this issue Aug 5, 2022
This issue was closed.