redpanda: node decommission test failure #2142

Closed
dotnwat opened this issue Aug 24, 2021 · 5 comments · Fixed by #3205
dotnwat commented Aug 24, 2021

https://buildkite.com/vectorized/redpanda/builds/1723

--------------------------------------------------------------------------------
test_id:    rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node
status:     FAIL
run time:   1 minute 37.948 seconds
 
    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic7', partition=0): 1041, TopicPartition(topic='topic7', partition=3): 7586, TopicPartition(topic='topic7', partition=1): 891, TopicPartition(topic='topic7', partition=2): 5998} after waiting 45s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 67, in test_decommissioning_working_node
    self.run_validation(enable_idempotence=False, consumer_timeout_sec=45)
  File "/root/tests/rptest/tests/end_to_end.py", line 139, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 105, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.8/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='topic7', partition=0): 1041, TopicPartition(topic='topic7', partition=3): 7586, TopicPartition(topic='topic7', partition=1): 891, TopicPartition(topic='topic7', partition=2): 5998} after waiting 45s.
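
For context, the failing check is `run_validation` → `await_consumed_offsets` in `end_to_end.py`, which polls with ducktape's `wait_until` until the consumer has caught up to every offset the producer last acked. A minimal sketch of that pattern is below; `last_consumed_offsets()` is a hypothetical accessor standing in for the test harness state, not the actual rptest API.

```python
# Sketch of the await_consumed_offsets pattern the traceback points at.
# consumer.last_consumed_offsets() is a hypothetical accessor, not rptest API.
from ducktape.utils.util import wait_until


def await_consumed_offsets(consumer, last_acked_offsets, timeout_sec=45):
    def has_finished_consuming():
        consumed = consumer.last_consumed_offsets()
        # Done once every partition has reached at least the producer's
        # last acked offset.
        return all(
            consumed.get(tp, -1) >= offset
            for tp, offset in last_acked_offsets.items()
        )

    wait_until(
        has_finished_consuming,
        timeout_sec=timeout_sec,
        err_msg=lambda: (
            f"Consumer failed to consume up to offsets {last_acked_offsets} "
            f"after waiting {timeout_sec}s."
        ),
    )
```
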
dotnwat added this to the v21.8.1 milestone Aug 24, 2021
mmaslankaprv commented:

I am looking at this issue; it seems like a false positive.
I went over the logs and the consumer consumed each of the offsets mentioned above:

{"timestamp":1629762168193,"name":"records_consumed","count":4,"partitions":[{"topic":"topic7","partition":2,"count":4,"minOffset":5995,"maxOffset":5998}]}
{"timestamp":1629762157901,"name":"records_consumed","count":43,"partitions":

[{"topic":"topic7","partition":3,"count":4,"minOffset":1829,"maxOffset":1832},{"topic":"topic7","partition":2,"count":2,"minOffset":1452,"maxOffset":1453},{"topic":"topic7","partition":0,"count":37,"minOffset":1005,"maxOffset":1041}]}

{"timestamp":1629762168166,"name":"records_consumed","count":11,"partitions":[{"topic":"topic7","partition":3,"count":2,"minOffset":7585,"maxOffset":7586},{"topic":"topic7","partition":2,"count":9,"minOffset":5969,"maxOffset":5977}]}

{"timestamp":1629762155213,"name":"records_consumed","count":125,"partitions":[{"topic":"topic7","partition":3,"count":16,"minOffset":678,"maxOffset":693},{"topic":"topic7","partition":2,"count":48,"minOffset":616,"maxOffset":663},{"topic":"topic7","partition":1,"count":61,"minOffset":831,"maxOffset":891}]}

What is interesting, though, is that the consumer failed to commit offsets:

{"timestamp":1629762157568,"name":"offsets_committed","offsets":[{"topic":"topic7","partition":3,"offset":694},{"topic":"topic7","partition":2,"offset":664},{"topic":"topic7","partition":1,"offset":892}],
"error":"Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. 
This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.","success":false}
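
For illustration, this is the same condition that kafka-python reports as `CommitFailedError` when a synchronous commit races with a group rebalance. The sketch below is not the Java VerifiableConsumer the test actually drives, and the broker address and group id are placeholders:

```python
# kafka-python analogue of the commit failure above (illustration only; the
# test drives the Java VerifiableConsumer). Broker and group are placeholders.
from kafka import KafkaConsumer
from kafka.errors import CommitFailedError

consumer = KafkaConsumer(
    "topic7",
    bootstrap_servers="localhost:9092",   # placeholder
    group_id="decommission-test-group",   # placeholder
    enable_auto_commit=False,
)

for record in consumer:
    # ... process the record ...
    try:
        consumer.commit()
    except CommitFailedError:
        # The group already rebalanced and handed this member's partitions
        # to another consumer; rejoin and pick up the new assignment.
        continue
```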

dswang modified the milestones: v21.8.1, v21.8.2 Aug 25, 2021
dswang modified the milestones: v21.8.2, v21.9.1 Sep 2, 2021
jcsp added the kind/bug, area/raft, and ci-failure labels Sep 15, 2021
jcsp commented Sep 24, 2021

Related: #2388 is a failure on the same test, but looks to be a different failure mode.

@mmaslankaprv when you say this is a false positive, do you mean the test should be doing something different? The default poll interval threshold in the Kafka client is 5 minutes (I'm not aware of the test modifying it), so it doesn't seem likely that this was just a race.
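
For reference, the 5-minute default jcsp mentions is the client-side `max.poll.interval.ms` setting; a minimal kafka-python sketch of where that knob sits (placeholder broker and group, and again not the Java consumer the test runs):

```python
# Where the 5-minute default lives on the client side (kafka-python shown for
# illustration; the test drives the Java VerifiableConsumer).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "topic7",
    bootstrap_servers="localhost:9092",  # placeholder
    group_id="decommission-test-group",  # placeholder
    max_poll_interval_ms=300_000,        # 5 minutes, the client default
)
```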

dswang modified the milestones: v21.9.1, v21.10.1 Oct 12, 2021
dswang commented Oct 29, 2021

@dotnwat @jcsp Is this still an issue? Would this block 21.10.1?

jcsp commented Oct 29, 2021

@dswang No, this shouldn't block: it's unlikely to be a regression in 21.10.

dswang removed this from the v21.10.1 milestone Oct 29, 2021
jcsp added a commit to jcsp/redpanda that referenced this issue Dec 2, 2021
Believed to be stable now, explanation in ticket:
redpanda-data#2142

Signed-off-by: John Spray <jcs@vectorized.io>
jcsp commented Dec 2, 2021

In the last 2 weeks on test-staging, this test failed in 1 of 116 runs, and that failure was a case of #2388, which was just fixed. I think it's time to reinstate this test: #3139

jcsp closed this as completed Dec 2, 2021
mmaslankaprv reopened this Dec 9, 2021
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Dec 9, 2021
Fixed the mapping of raft error codes to Kafka error codes in the offset
commit handling code path. When an offset commit request fails we need to
preserve the error semantics of a Kafka broker. Every error indicating that
the current node is not a leader should be translated to the
`NOT_COORDINATOR` error. Other errors should be translated to a generic
server error, as the client should rebuild its state instead of assuming a
particular broker behavior.

Fixes: redpanda-data#2142

Signed-off-by: Michal Maslanka <michal@vectorized.io>
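
The commit message above describes the mapping in prose; the sketch below is an illustrative Python rendering of it, not the actual change (which lives in Redpanda's C++ offset-commit path), and the `RaftError` members are hypothetical stand-ins rather than Redpanda identifiers:

```python
# Illustrative only: the real fix is in Redpanda's C++ offset-commit code path.
# RaftError members are hypothetical stand-ins for raft-level results.
from enum import Enum


class KafkaError(Enum):
    UNKNOWN_SERVER_ERROR = -1
    NONE = 0
    NOT_COORDINATOR = 16


class RaftError(Enum):
    SUCCESS = 0
    NOT_LEADER = 1
    TIMEOUT = 2


def map_offset_commit_error(err: RaftError) -> KafkaError:
    """Translate a raft-level failure into the error a Kafka client expects."""
    if err == RaftError.SUCCESS:
        return KafkaError.NONE
    if err == RaftError.NOT_LEADER:
        # No longer the leader for the group's partition: tell the client to
        # rediscover the group coordinator and retry.
        return KafkaError.NOT_COORDINATOR
    # Anything else: generic server error, so the client rebuilds its state
    # rather than assuming a particular broker behavior.
    return KafkaError.UNKNOWN_SERVER_ERROR
```
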
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Dec 9, 2021
Fixed the mapping of raft error codes to Kafka error codes in the offset
commit handling code path. When an offset commit request fails we need to
preserve the error semantics of a Kafka broker. Every error indicating that
the current node is not a leader should be translated to the
`NOT_COORDINATOR` error. Other errors should be translated to a generic
server error, as the client should rebuild its state instead of assuming a
particular broker behavior.

Fixes: redpanda-data#2142

Signed-off-by: Michal Maslanka <michal@vectorized.io>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Dec 16, 2021
Fixed the mapping of raft error codes to Kafka error codes in the offset
commit handling code path. When an offset commit request fails we need to
preserve the error semantics of a Kafka broker. Every error indicating that
the current node is not a leader should be translated to the
`NOT_COORDINATOR` error. Other errors should be translated to a generic
server error, as the client should rebuild its state instead of assuming a
particular broker behavior.

Fixes: redpanda-data#2142

Signed-off-by: Michal Maslanka <michal@vectorized.io>
(cherry picked from commit 07ee7c5)
This issue was closed.