redpanda: node decommission test failure #2142

Closed
dotnwat opened this issue Aug 24, 2021 · 5 comments · Fixed by #3205
dotnwat commented Aug 24, 2021

https://buildkite.com/vectorized/redpanda/builds/1723

--------------------------------------------------------------------------------
test_id:    rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node
status:     FAIL
run time:   1 minute 37.948 seconds
 
    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic7', partition=0): 1041, TopicPartition(topic='topic7', partition=3): 7586, TopicPartition(topic='topic7', partition=1): 891, TopicPartition(topic='topic7', partition=2): 5998} after waiting 45s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 67, in test_decommissioning_working_node
    self.run_validation(enable_idempotence=False, consumer_timeout_sec=45)
  File "/root/tests/rptest/tests/end_to_end.py", line 139, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 105, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.8/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='topic7', partition=0): 1041, TopicPartition(topic='topic7', partition=3): 7586, TopicPartition(topic='topic7', partition=1): 891, TopicPartition(topic='topic7', partition=2): 5998} after waiting 45s.
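
For context, the failing check is `run_validation` → `await_consumed_offsets` in `end_to_end.py`, which polls with ducktape's `wait_until` until the consumer has caught up to every offset the producer last acked. A minimal sketch of that pattern is below; `last_consumed_offsets()` is a hypothetical accessor standing in for the test harness state, not the actual rptest API.

```python
# Sketch of the await_consumed_offsets pattern the traceback points at.
# consumer.last_consumed_offsets() is a hypothetical accessor, not rptest API.
from ducktape.utils.util import wait_until


def await_consumed_offsets(consumer, last_acked_offsets, timeout_sec=45):
    def has_finished_consuming():
        consumed = consumer.last_consumed_offsets()
        # Done once every partition has reached at least the producer's
        # last acked offset.
        return all(
            consumed.get(tp, -1) >= offset
            for tp, offset in last_acked_offsets.items()
        )

    wait_until(
        has_finished_consuming,
        timeout_sec=timeout_sec,
        err_msg=lambda: (
            f"Consumer failed to consume up to offsets {last_acked_offsets} "
            f"after waiting {timeout_sec}s."
        ),
    )
```
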
dotnwat added this to the v21.8.1 milestone Aug 24, 2021
mmaslankaprv commented:

I am looking at this issue; it seems like a false positive.
I went over the logs and the consumer consumed each of the offsets mentioned above:

{"timestamp":1629762168193,"name":"records_consumed","count":4,"partitions":[{"topic":"topic7","partition":2,"count":4,"minOffset":5995,"maxOffset":5998}]}
{"timestamp":1629762157901,"name":"records_consumed","count":43,"partitions":

[{"topic":"topic7","partition":3,"count":4,"minOffset":1829,"maxOffset":1832},{"topic":"topic7","partition":2,"count":2,"minOffset":1452,"maxOffset":1453},{"topic":"topic7","partition":0,"count":37,"minOffset":1005,"maxOffset":1041}]}

{"timestamp":1629762168166,"name":"records_consumed","count":11,"partitions":[{"topic":"topic7","partition":3,"count":2,"minOffset":7585,"maxOffset":7586},{"topic":"topic7","partition":2,"count":9,"minOffset":5969,"maxOffset":5977}]}

{"timestamp":1629762155213,"name":"records_consumed","count":125,"partitions":[{"topic":"topic7","partition":3,"count":16,"minOffset":678,"maxOffset":693},{"topic":"topic7","partition":2,"count":48,"minOffset":616,"maxOffset":663},{"topic":"topic7","partition":1,"count":61,"minOffset":831,"maxOffset":891}]}

What is interesting, though, is that the consumer failed to commit offsets:

{"timestamp":1629762157568,"name":"offsets_committed","offsets":[{"topic":"topic7","partition":3,"offset":694},{"topic":"topic7","partition":2,"offset":664},{"topic":"topic7","partition":1,"offset":892}],
"error":"Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. 
This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.","success":false}
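
For illustration, this is the same condition that kafka-python reports as `CommitFailedError` when a synchronous commit races with a group rebalance. The sketch below is not the Java VerifiableConsumer the test actually drives, and the broker address and group id are placeholders:

```python
# kafka-python analogue of the commit failure above (illustration only; the
# test drives the Java VerifiableConsumer). Broker and group are placeholders.
from kafka import KafkaConsumer
from kafka.errors import CommitFailedError

consumer = KafkaConsumer(
    "topic7",
    bootstrap_servers="localhost:9092",   # placeholder
    group_id="decommission-test-group",   # placeholder
    enable_auto_commit=False,
)

for record in consumer:
    # ... process the record ...
    try:
        consumer.commit()
    except CommitFailedError:
        # The group already rebalanced and handed this member's partitions
        # to another consumer; rejoin and pick up the new assignment.
        continue
```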

dswang modified the milestones: v21.8.1, v21.8.2 Aug 25, 2021
dswang modified the milestones: v21.8.2, v21.9.1 Sep 2, 2021
jcsp added the kind/bug, area/raft, and ci-failure labels Sep 15, 2021
jcsp commented Sep 24, 2021

Related: #2388 is a failure on the same test, but looks to be a different failure mode.

@mmaslankaprv when you say this is a false positive, do you mean the test should be doing something different? The default poll interval threshold in the Kafka client is 5 minutes (I'm not aware of the test modifying it), so it doesn't seem likely that this was just a race.
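
For reference, the 5-minute default jcsp mentions is the client-side `max.poll.interval.ms` setting; a minimal kafka-python sketch of where that knob sits (placeholder broker and group, and again not the Java consumer the test runs):

```python
# Where the 5-minute default lives on the client side (kafka-python shown for
# illustration; the test drives the Java VerifiableConsumer).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "topic7",
    bootstrap_servers="localhost:9092",  # placeholder
    group_id="decommission-test-group",  # placeholder
    max_poll_interval_ms=300_000,        # 5 minutes, the client default
)
```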

dswang modified the milestones: v21.9.1, v21.10.1 Oct 12, 2021
dswang commented Oct 29, 2021

@dotnwat @jcsp Is this still an issue? Would this block 21.10.1?

jcsp commented Oct 29, 2021

@dswang No, this shouldn't block: it's unlikely to be a regression in 21.10.

dswang removed this from the v21.10.1 milestone Oct 29, 2021
jcsp added a commit to jcsp/redpanda that referenced this issue Dec 2, 2021
Believed to be stable now, explanation in ticket:
redpanda-data#2142

Signed-off-by: John Spray <jcs@vectorized.io>
jcsp commented Dec 2, 2021

In the last 2 weeks on test-staging, this test failed in 1 of 116 runs, and that failure was a case of #2388, which was just fixed. I think it's time to reinstate this test: #3139

jcsp closed this as completed Dec 2, 2021
mmaslankaprv reopened this Dec 9, 2021
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Dec 9, 2021
Fixed the mapping of raft error codes to Kafka error codes in the offset
commit handling code path. When an offset commit request fails we need to
preserve the error semantics of a Kafka broker. Every error indicating that
the current node is not a leader should be translated to the
`NOT_COORDINATOR` error. Other errors should be translated to a generic
server error, as the client should rebuild its state instead of assuming a
particular broker behavior.

Fixes: redpanda-data#2142

Signed-off-by: Michal Maslanka <michal@vectorized.io>
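
The commit message above describes the mapping in prose; the sketch below is an illustrative Python rendering of it, not the actual change (which lives in Redpanda's C++ offset-commit path), and the `RaftError` members are hypothetical stand-ins rather than Redpanda identifiers:

```python
# Illustrative only: the real fix is in Redpanda's C++ offset-commit code path.
# RaftError members are hypothetical stand-ins for raft-level results.
from enum import Enum


class KafkaError(Enum):
    UNKNOWN_SERVER_ERROR = -1
    NONE = 0
    NOT_COORDINATOR = 16


class RaftError(Enum):
    SUCCESS = 0
    NOT_LEADER = 1
    TIMEOUT = 2


def map_offset_commit_error(err: RaftError) -> KafkaError:
    """Translate a raft-level failure into the error a Kafka client expects."""
    if err == RaftError.SUCCESS:
        return KafkaError.NONE
    if err == RaftError.NOT_LEADER:
        # No longer the leader for the group's partition: tell the client to
        # rediscover the group coordinator and retry.
        return KafkaError.NOT_COORDINATOR
    # Anything else: generic server error, so the client rebuilds its state
    # rather than assuming a particular broker behavior.
    return KafkaError.UNKNOWN_SERVER_ERROR
```
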
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Dec 9, 2021
Fixed the mapping of raft error codes to Kafka error codes in the offset
commit handling code path. When an offset commit request fails we need to
preserve the error semantics of a Kafka broker. Every error indicating that
the current node is not a leader should be translated to the
`NOT_COORDINATOR` error. Other errors should be translated to a generic
server error, as the client should rebuild its state instead of assuming a
particular broker behavior.

Fixes: redpanda-data#2142

Signed-off-by: Michal Maslanka <michal@vectorized.io>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Dec 16, 2021
Fixed the mapping of raft error codes to Kafka error codes in the offset
commit handling code path. When an offset commit request fails we need to
preserve the error semantics of a Kafka broker. Every error indicating that
the current node is not a leader should be translated to the
`NOT_COORDINATOR` error. Other errors should be translated to a generic
server error, as the client should rebuild its state instead of assuming a
particular broker behavior.

Fixes: redpanda-data#2142

Signed-off-by: Michal Maslanka <michal@vectorized.io>
(cherry picked from commit 07ee7c5)
This issue was closed.