
CI Failure "Node 6 decommissioning stopped making progress" in random_node_operations_test.RandomNodeOperationsTest.test_node_operations #9052

Closed
bharathv opened this issue Feb 22, 2023 · 32 comments · Fixed by #9414 or #11468
Labels: area/replication, ci-failure, kind/bug, sev/medium

@bharathv (Contributor) commented Feb 22, 2023
https://buildkite.com/redpanda/redpanda/builds/23623#01867359-b3a6-4354-b2c7-7a018c31dd5a

Module: rptest.tests.random_node_operations_test
Class:  RandomNodeOperationsTest
Method: test_node_operations
Arguments:
{
  "enable_failures": true
}
test_id:    rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True
status:     FAIL
run time:   5 minutes 16.515 seconds


    AssertionError('Node 6 decommissioning stopped making progress')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/random_node_operations_test.py", line 118, in test_node_operations
    executor.execute_operation(op)
  File "/root/tests/rptest/utils/node_operations.py", line 395, in execute_operation
    self.wait_for_removed(node_id)
  File "/root/tests/rptest/utils/node_operations.py", line 258, in wait_for_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 176, in wait_for_removal
    assert self._made_progress(
AssertionError: Node 6 decommissioning stopped making progress
@bharathv (Contributor, Author) commented Mar 1, 2023

Seems like a legit decommission hang; it's stuck in the reconciliation loop with a non-empty allocator (allocator_empty: false) and no subsequent re-allocations.

8960225:INFO  2023-02-21 10:12:20,994 [shard 0] cluster - members_backend.cc:854 - [update: {{node_id: 6, type: decommissioned, offset: 1216}}] decommissioning in progress. draining: true all_reallocations_finished: true, allocator_empty: false updates_in_progress:false
8960226:INFO  2023-02-21 10:12:20,994 [shard 0] cluster - members_backend.cc:861 - [update: {{node_id: 6, type: decommissioned, offset: 1216}}] decommissioning in progress. recalculating reallocations
8960227:TRACE 2023-02-21 10:12:20,994 [shard 0] cluster - members_backend.cc:175 - reconciliation loop result: Partition configuration update in progress

Not obvious to me what that stuck allocation is.

@bharathv added the sev/medium label on Mar 1, 2023
@bharathv (Contributor, Author) commented Mar 1, 2023

sev/medium: legit decommission hang (UX problem).

(workaround: restart the partition leader broker)

@mmaslankaprv (Member) commented

I've looked into the details of this problem; nothing obvious is blocking the allocator.

@jcsp added the kind/bug label on Mar 2, 2023
@bharathv self-assigned this on Mar 7, 2023
@dlex (Contributor) commented Mar 10, 2023

@dlex (Contributor) commented Mar 10, 2023

@piyushredpanda (Contributor) commented

@bharathv ^

@bharathv (Contributor, Author) commented Mar 13, 2023

> @bharathv ^

I doubt it; things are genuinely stuck (based on the logging) with a pending allocation unit. I think the issue is a race condition where an add, a decommission, and a topic delete all run at roughly the same time. Consider the following sequence of commands executing very close together:

  1. Add node 0 - computes a set of reconfigurations moving partitions onto node 0.
  2. Decommission node 0 - interrupts the add and issues a cancel_update for all of the reconfigurations above.
  3. Delete topic foo - interrupts the in-progress cancel_update partitions.

With cancel_update we swap the current assignment to reflect the state of the original move, but we never swap the replicas of the in-progress operation, so I think the topic-delete deallocation logic trips up and deallocates fewer units than it should (a toy model of the suspected leak is sketched below).

@mmaslankaprv ^^
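
A toy model of the suspected bookkeeping mismatch (plain Python with made-up names, not the actual members_backend / partition_allocator code), showing how a stale in-progress replica set could leave the allocator non-empty after the topic is deleted:

```python
from dataclasses import dataclass, field

@dataclass
class Allocator:
    allocated: dict[int, int] = field(default_factory=dict)  # node id -> allocated units

    def allocate(self, nodes: set[int]) -> None:
        for n in nodes:
            self.allocated[n] = self.allocated.get(n, 0) + 1

    def deallocate(self, nodes: set[int]) -> None:
        for n in nodes:
            self.allocated[n] -= 1
            if self.allocated[n] == 0:
                del self.allocated[n]

    @property
    def empty(self) -> bool:
        return not self.allocated

alloc = Allocator()

# 1. partition lives on nodes {1, 2, 3}
current = {1, 2, 3}
alloc.allocate(current)

# 2. "add node 0": a reconfiguration targeting {0, 2, 3} acquires a unit on node 0
in_progress_target = {0, 2, 3}
alloc.allocate(in_progress_target - current)

# 3. "decommission node 0": cancel_update swaps the current assignment back to the
#    original set, but (per the hypothesis above) the in-progress record keeps its
#    stale replica set, so nothing remembers that node 0 still holds a unit
current = {1, 2, 3}

# 4. "delete topic foo": deallocation walks the (swapped-back) current assignment only
alloc.deallocate(current)

print(alloc.empty)      # False -> matches "allocator_empty: false" in the logs
print(alloc.allocated)  # {0: 1}: the leaked unit pinned to the added node
```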

@BenPope (Member) commented Mar 15, 2023

And on v22.3.14-rc3 CDT: https://buildkite.com/redpanda/vtools/builds/6711#0186e2a3-fb06-4a7a-8288-539d4b760777

Module: rptest.scale_tests.node_operations_fuzzy_test
Class:  NodeOperationFuzzyTest
Method: test_node_operations
Arguments:
{
  "compacted_topics": false,
  "enable_failures": false,
  "num_to_upgrade": 3
}

@dotnwat (Member) commented Apr 18, 2023

@dotnwat reopened this on Apr 18, 2023

@dotnwat (Member) commented Apr 19, 2023

@VladLazar (Contributor) commented

FAIL test: RandomNodeOperationsTest.test_node_operations.enable_failures=True.num_to_upgrade=0.enable_controller_snapshots=True (1/67 runs)
  failure at 2023-05-21T07:14:52.104Z: AssertionError('Node 1 decommissioning stopped making progress')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29542#01883cf8-d944-4feb-9473-eb3ec8c11a04

@vshtokman (Contributor) commented

@andijcr (Contributor) commented Jun 7, 2023

@rockwotj (Contributor) commented

Let me dig in more in the morning; the linked logs look like things are working as intended: look up the pid, then kill it, and exit once there is no longer a matching pid.

[DEBUG - 2023-06-14 17:21:42,581 - remoteaccount - _log - lineno:166]: root@docker-rp-19: Running ssh command: pgrep --exact redpanda
[DEBUG - 2023-06-14 17:21:42,586 - remoteaccount - _log - lineno:166]: root@docker-rp-19: Running ssh command: kill -9 5386
[DEBUG - 2023-06-14 17:21:42,634 - remoteaccount - _log - lineno:166]: root@docker-rp-19: Running ssh command: pgrep --exact redpanda
[DEBUG - 2023-06-14 17:21:42,785 - remoteaccount - _log - lineno:166]: root@docker-rp-19: Running ssh command: pgrep --exact redpanda
[DEBUG - 2023-06-14 17:21:42,890 - remoteaccount - _log - lineno:166]: root@docker-rp-19: Running ssh command: pgrep --exact redpanda
[DEBUG - 2023-06-14 17:21:42,896 - remoteaccount - _log - lineno:166]: root@docker-rp-19: Running ssh command 'pgrep --exact redpanda' exited with status 1 and message: b''
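
For reference, a minimal standalone sketch of the loop those log lines show (a hypothetical helper using plain subprocess, not the actual FailureInjector code): look up the pid with pgrep, SIGKILL it, then poll until pgrep no longer finds a match.

```python
import subprocess
import time

def kill_redpanda(timeout_s: float = 10.0) -> None:
    # Find pids whose process name is exactly "redpanda" and SIGKILL them.
    out = subprocess.run(["pgrep", "--exact", "redpanda"],
                         capture_output=True, text=True)
    for pid in out.stdout.split():
        subprocess.run(["kill", "-9", pid], check=False)

    # Poll until pgrep exits with status 1, i.e. no matching process remains.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = subprocess.run(["pgrep", "--exact", "redpanda"],
                            capture_output=True).returncode
        if rc == 1:
            return
        time.sleep(0.1)
    raise TimeoutError("redpanda still running after SIGKILL")
```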

@ztlpn (Contributor) commented Jun 16, 2023

@rockwotj The problem is not that redpanda is not being killed, but that it is not started again. There is a concurrent thread that sometimes runs the redpanda --version command, and the FailureInjector._start method gets confused into thinking this is a real redpanda process, so it skips issuing the start command. This issue is one of the reasons the previous pid-detection command was so ugly :)

@piyushredpanda no, we shouldn't revert, because that PR fixes another issue with redpanda process detection; we should think of a better way.

@andijcr (Contributor) commented Jun 16, 2023

> There is a concurrent thread that sometimes runs redpanda --version command

One ugly alternative could be to rename the redpanda binary used for that check so it escapes pgrep; a symlink works too.
Tested locally:
ln -s /usr/bin/redpanda Redpanda
./Redpanda --version
gives the expected output and is not reported by pgrep --exact redpanda

@rockwotj (Contributor) commented

We could also look up the full command line and skip the process if --version appears in it (see the sketch below).

Why is a concurrent thread checking for versions? Can we not do that?
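
A minimal sketch of that filtering idea (a hypothetical standalone helper, not the actual rptest change), assuming version probes appear as `redpanda --version` on the command line: use pgrep -a to get pid plus full command line and ignore any entry containing --version.

```python
import subprocess

def running_redpanda_pids() -> list[int]:
    # -a prints "<pid> <full command line>"; --exact still matches only
    # processes whose name is exactly "redpanda".
    out = subprocess.run(["pgrep", "-a", "--exact", "redpanda"],
                         capture_output=True, text=True)
    pids = []
    for line in out.stdout.splitlines():
        pid, _, cmdline = line.partition(" ")
        if "--version" in cmdline:
            continue  # a transient version probe, not a running broker
        pids.append(int(pid))
    return pids

if __name__ == "__main__":
    print(running_redpanda_pids())  # empty if only version probes are running
```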

@rockwotj (Contributor) commented

I opened a PR to exclude --version commands from pid detection

@ztlpn (Contributor) commented Jun 16, 2023

> Why is a concurrent thread checking for versions? Can we not do that?

Yeah, I wish I knew the exact sequence... I think the test checks the version in order to write the config file before starting redpanda, but in this case the start is skipped, so that's a bit strange. I guess we'll need better debug logging if this resurfaces.

Here is the old issue for reference: #8753

@dotnwat (Member) commented Jun 27, 2023

@dotnwat reopened this on Jun 27, 2023
@bharathv assigned bharathv and unassigned ztlpn on Jun 29, 2023
@bharathv (Contributor, Author) commented

$ egrep "partition_balancer|redpanda/controller/0}] vote_stm.cc" RedpandaService-0-139865974157136/docker-rp-19/redpanda.log

DEBUG 2023-06-26 17:35:50,981 [shard 0] cluster - partition_balancer_backend.cc:159 - Tick scheduled to run in: 30000ms
INFO  2023-06-26 17:35:50,981 [shard 0] cluster - partition_balancer_backend.cc:99 - partition balancer started
DEBUG 2023-06-26 17:36:20,980 [shard 0] cluster - partition_balancer_backend.cc:268 - not leader, skipping tick
DEBUG 2023-06-26 17:36:20,980 [shard 0] cluster - partition_balancer_backend.cc:159 - Tick scheduled to run in: 30000ms
DEBUG 2023-06-26 17:36:50,981 [shard 0] cluster - partition_balancer_backend.cc:268 - not leader, skipping tick
DEBUG 2023-06-26 17:36:50,981 [shard 0] cluster - partition_balancer_backend.cc:159 - Tick scheduled to run in: 30000ms
TRACE 2023-06-26 17:36:52,300 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:310 - Voting for self in term 2
INFO  2023-06-26 17:36:52,310 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {4}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:36:52,310 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {0}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:36:52,310 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {3}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:36:52,311 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:77 - vote reply from {id: {3}, revision: {0}} - {term:{2}, target_node_id{id: {1}, revision: {0}}, vote_granted: 1, log_ok:1}
INFO  2023-06-26 17:36:52,317 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:77 - vote reply from {id: {4}, revision: {0}} - {term:{2}, target_node_id{id: {1}, revision: {0}}, vote_granted: 1, log_ok:1}
TRACE 2023-06-26 17:36:52,317 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:256 - vote acks in term 2 from: {{id: {1}, revision: {0}}, {id: {3}, revision: {0}}, {id: {4}, revision: {0}}}
INFO  2023-06-26 17:36:52,317 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:265 - becoming the leader term:2
INFO  2023-06-26 17:36:52,418 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:280 - became the leader term: 2
TRACE 2023-06-26 17:37:19,755 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:310 - Voting for self in term 3
INFO  2023-06-26 17:37:19,765 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {4}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:37:19,765 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {3}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:37:19,765 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {0}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:37:19,765 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:77 - vote reply from {id: {3}, revision: {0}} - {term:{2}, target_node_id{id: {1}, revision: {0}}, vote_granted: 0, log_ok:1}
INFO  2023-06-26 17:37:19,769 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:77 - vote reply from {id: {4}, revision: {0}} - {term:{3}, target_node_id{id: {1}, revision: {0}}, vote_granted: 1, log_ok:1}
INFO  2023-06-26 17:37:19,769 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:234 - Vote failed
DEBUG 2023-06-26 17:37:20,981 [shard 0] cluster - partition_balancer_backend.cc:268 - not leader, skipping tick
DEBUG 2023-06-26 17:37:20,981 [shard 0] cluster - partition_balancer_backend.cc:159 - Tick scheduled to run in: 30000ms
TRACE 2023-06-26 17:37:21,687 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:310 - Voting for self in term 4
INFO  2023-06-26 17:37:21,698 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {4}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:37:21,698 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {0}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:37:21,698 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {3}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:37:21,705 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:77 - vote reply from {id: {3}, revision: {0}} - {term:{4}, target_node_id{id: {1}, revision: {0}}, vote_granted: 1, log_ok:1}
INFO  2023-06-26 17:37:21,708 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:77 - vote reply from {id: {4}, revision: {0}} - {term:{4}, target_node_id{id: {1}, revision: {0}}, vote_granted: 1, log_ok:1}
TRACE 2023-06-26 17:37:21,708 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:256 - vote acks in term 4 from: {{id: {1}, revision: {0}}, {id: {3}, revision: {0}}, {id: {4}, revision: {0}}}
INFO  2023-06-26 17:37:21,708 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:265 - becoming the leader term:4
INFO  2023-06-26 17:37:21,797 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:280 - became the leader term: 4
DEBUG 2023-06-26 17:37:50,981 [shard 0] cluster - partition_balancer_backend.cc:268 - not leader, skipping tick
DEBUG 2023-06-26 17:37:50,981 [shard 0] cluster - partition_balancer_backend.cc:159 - Tick scheduled to run in: 30000ms
TRACE 2023-06-26 17:37:52,853 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:310 - Voting for self in term 5
INFO  2023-06-26 17:37:52,855 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {3}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:37:52,855 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {4}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:37:52,855 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:52 - Sending vote request to {id: {0}, revision: {0}} with timeout 1500
INFO  2023-06-26 17:37:52,866 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:77 - vote reply from {id: {3}, revision: {0}} - {term:{5}, target_node_id{id: {1}, revision: {0}}, vote_granted: 1, log_ok:1}
INFO  2023-06-26 17:37:52,866 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:77 - vote reply from {id: {4}, revision: {0}} - {term:{5}, target_node_id{id: {1}, revision: {0}}, vote_granted: 1, log_ok:1}
TRACE 2023-06-26 17:37:52,866 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:256 - vote acks in term 5 from: {{id: {4}, revision: {0}}, {id: {1}, revision: {0}}, {id: {3}, revision: {0}}}
INFO  2023-06-26 17:37:52,866 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:265 - becoming the leader term:5
INFO  2023-06-26 17:37:53,049 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:280 - became the leader term: 5
INFO  2023-06-26 17:38:16,089 [shard 0] cluster - partition_balancer_backend.cc:253 - stopping...

@jcsp (Contributor) commented Jul 5, 2023

@abhijat (Contributor) commented Jul 28, 2023

@rystsov (Contributor) commented Sep 1, 2023

According to pandatriage the issue is stale; please reopen it only if pandatriage marks it as "should reopen", otherwise create a new issue.

| failure id | issue id | freq | total | last occ.  | test                                                >
| ---------- | -------- | ---- | ----- | ---------- | ---------------------------------------------------->
| 359        | 11736    | 166  | 5     | 2023-06-27 | TopicDeleteCloudStorageTest.topic_delete_cloud_stora>
| 516        | 11665    | 245  | 2     | 2023-06-26 | TopicDeleteStressTest.stress_test                   >
| 473        | 9052     | 174  | 5     | 2023-06-15 | RandomNodeOperationsTest.test_node_operations       >
| 484        | 9052     | 208  | 6     | 2023-06-14 | RandomNodeOperationsTest.test_node_operations       >
| 478        | 9052     | 69   | 2     | 2023-06-12 | RandomNodeOperationsTest.test_node_operations       

@rystsov closed this as completed on Sep 1, 2023
@vbotbuildovich (Collaborator) reopened this on Jan 9, 2024
@michael-redpanda (Contributor) commented

The above linked builds are from September. Closing.
