test: rptest.tests.topic_delete_test.TopicDeleteStressTest.stress_test unclean client shutdown #4326

Closed
dotnwat opened this issue Apr 19, 2022 · 7 comments

@dotnwat
Member

dotnwat commented Apr 19, 2022

when looking into the logs around the time that the bad log line was reported, it seems as though ducktape may have run kill -9 on a kafka client, which could cause this issue. it may be that we need to improve how unclean client shutdown is handled.

this looks like a kafka client terminated or crashed in the test.

test_id:    rptest.tests.topic_delete_test.TopicDeleteStressTest.stress_test
status:     FAIL
run time:   6 minutes 15.546 seconds

<BadLogLines nodes=docker_n_13(1) example="ERROR 2022-04-19 18:24:06,405 [shard 0] rpc - server.cc:114 - kafka rpc protocol - Error[applying protocol] remote address: 172.18.0.10:55738 - std::out_of_range (Invalid skip(n). Expected:999597, but skipped:996533)">
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 47, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 849, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.redpanda.BadLogLines: <BadLogLines nodes=docker_n_13(1) example="ERROR 2022-04-19 18:24:06,405 [shard 0] rpc - server.cc:114 - kafka rpc protocol - Error[applying protocol] remote address: 172.18.0.10:55738 - std::out_of_range (Invalid skip(n). Expected:999597, but skipped:996533)">

Originally posted by @dotnwat in #4199 (comment)

@dotnwat
Member Author

dotnwat commented Apr 21, 2022

@rystsov when I disable idempotence this problem appears to go away.

diff --git a/src/v/config/configuration.cc b/src/v/config/configuration.cc
index e98586cb5..98dd96e94 100644
--- a/src/v/config/configuration.cc
+++ b/src/v/config/configuration.cc
@@ -415,7 +415,7 @@ configuration::configuration()
       "enable_idempotence",
       "Enable idempotent producer",
       {.visibility = visibility::user},
-      true)
+      false)
   , enable_transactions(
       *this,
       "enable_transactions",

The message std::out_of_range (Invalid skip(n). Expected:999597, but skipped:996533) is often associated with a client that crashes or stops uncleanly, though it could be due to other things. I haven't investigated beyond reproducing with and without idempotence.
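
To illustrate how a client that dies mid-request could plausibly lead to this kind of message, here is a minimal Python sketch. It is not Redpanda's implementation: `skip_exact` and `ShortSkipError` are hypothetical names, and the in-memory stream stands in for a connection that closed before the declared payload fully arrived. The byte counts are taken from the log line above.

```python
import io


class ShortSkipError(Exception):
    """Raised when fewer bytes could be skipped than the frame declared."""


def skip_exact(stream, n, chunk_size=64 * 1024):
    """Discard exactly n bytes from stream, or raise if the stream ends early."""
    skipped = 0
    while skipped < n:
        chunk = stream.read(min(chunk_size, n - skipped))
        if not chunk:
            # EOF: the peer went away before sending the full payload.
            raise ShortSkipError(
                f"Invalid skip(n). Expected:{n}, but skipped:{skipped}")
        skipped += len(chunk)
    return skipped


# Hypothetical scenario: the request declared 999597 bytes, but only
# 996533 arrived before the connection ended (numbers from the log line).
declared = 999597
received = io.BytesIO(b"\x00" * 996533)

try:
    skip_exact(received, declared)
except ShortSkipError as e:
    message = str(e)
    print(message)
```

Under these assumptions the shortfall is 999597 - 996533 = 3064 bytes, which fits a sender that stopped partway through a roughly 1 MB request.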

@dotnwat dotnwat added this to the v22.1.1 (Stale) milestone Apr 21, 2022
@jcsp jcsp added the kind/bug Something isn't working label Apr 21, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 21, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 21, 2022
This is a lighter touch than marking the test ok_to_fail:
we are specifically just tolerating failures that result
from the unexpected RPC error.

Related: redpanda-data#4326
@jcsp
Contributor

jcsp commented Apr 21, 2022

This failed 7/54 runs in the last 24h; the last example was https://buildkite.com/redpanda/redpanda/builds/9256#138b41aa-c3a9-439a-a21f-c113c5af1636

I was about to mark it ok_to_fail, but I think it's better to just make it temporarily tolerant of this error message: #4367
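
A rough sketch of what tolerating just this error message could look like, assuming an allow list of regular expressions checked against ERROR log lines. The names `LOG_ALLOW_LIST` and `find_bad_lines` are hypothetical and this is not the actual rptest code; the second and third sample log lines are made up for illustration.

```python
import re

# Hypothetical allow-list entry: tolerate only the specific RPC error
# from this issue, not every ERROR line.
LOG_ALLOW_LIST = [
    re.compile(
        r"kafka rpc protocol - Error\[applying protocol\].*Invalid skip\(n\)"),
]


def find_bad_lines(log_lines, allow_list):
    """Return ERROR lines that match none of the allow-list patterns."""
    bad = []
    for line in log_lines:
        if "ERROR" not in line:
            continue  # only ERROR lines are candidates
        if any(pattern.search(line) for pattern in allow_list):
            continue  # explicitly tolerated
        bad.append(line)
    return bad


lines = [
    # Real line from the failure report above.
    "ERROR 2022-04-19 18:24:06,405 [shard 0] rpc - server.cc:114 - "
    "kafka rpc protocol - Error[applying protocol] remote address: "
    "172.18.0.10:55738 - std::out_of_range (Invalid skip(n). "
    "Expected:999597, but skipped:996533)",
    # Made-up lines, for illustration only.
    "INFO example - nothing to see here",
    "ERROR example - some unrelated failure",
]

bad = find_bad_lines(lines, LOG_ALLOW_LIST)
print(bad)
```

Compared with marking the whole test ok_to_fail, a filter like this would still fail the test on any other unexpected ERROR line.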

jcsp added a commit to jcsp/redpanda that referenced this issue Apr 26, 2022
This is a lighter touch than marking the test ok_to_fail:
we are specifically just tolerating failures that result
from the unexpected RPC error.

Related: redpanda-data#4326
(cherry picked from commit 7bac61f)
@dotnwat dotnwat modified the milestones: v22.1.1 (Stale), v22.1.1 Apr 26, 2022
@dotnwat dotnwat modified the milestones: v22.1.1, v22.1.2 May 5, 2022
@dotnwat
Member Author

dotnwat commented May 5, 2022

@rystsov: not a release blocker; moving to the next patch release

@andrewhsu andrewhsu modified the milestones: v22.1.2, v22.1.3 May 9, 2022
@jcsp jcsp removed this from the v22.1.3 milestone May 11, 2022
abhijat pushed a commit to abhijat/redpanda that referenced this issue May 20, 2022
This is a lighter touch than marking the test ok_to_fail:
we are specifically just tolerating failures that result
from the unexpected RPC error.

Related: redpanda-data#4326
@rystsov
Contributor

rystsov commented Jul 8, 2022

#5238 introduces a new workload and online verifier that isn't subject to such errors. Once it's in, we can switch to it, and that will fix this issue.

@dotnwat
Member Author

dotnwat commented Jul 17, 2022

> #5238 introduces a new workload and online verifier that isn't subject to such errors. Once it's in, we can switch to it, and that will fix this issue.

nice!

@dotnwat
Member Author

dotnwat commented Nov 2, 2022

@rystsov can you decide if this is important, fixed, etc.?

@jcsp
Contributor

jcsp commented Nov 25, 2022

No repros in 30d

@jcsp jcsp closed this as completed Nov 25, 2022

5 participants