
test: failure in rptest.tests.mirror_maker_test.TestMirrorMakerService.test_simple_end_to_end.source_type #4317

Closed
dotnwat opened this issue Apr 19, 2022 · 9 comments

dotnwat (Member) commented Apr 19, 2022

https://buildkite.com/redpanda/redpanda/builds/9125#2d696ce8-3f6d-4ba9-88a8-d43939bbb242

test_id:    rptest.tests.mirror_maker_test.TestMirrorMakerService.test_simple_end_to_end.source_type=kafka
status:     FAIL
run time:   8 minutes 42.982 seconds

TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic-dxgpoxyoft', partition=0): 5339} after waiting 120s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/mirror_maker_test.py", line 138, in test_simple_end_to_end
    self.run_validation(consumer_timeout_sec=120)
  File "/root/tests/rptest/tests/end_to_end.py", line 173, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 139, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='topic-dxgpoxyoft', partition=0): 5339} after waiting 120s.
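
For context, the timeout is raised by ducktape's wait_until inside await_consumed_offsets: the validation only passes once the verifiable consumer has reached every offset the producer reported as acked. A rough sketch of that check (only wait_until is the real ducktape API; the consumer accessor name below is a placeholder, not the actual rptest code):

from ducktape.utils.util import wait_until


def await_consumed_offsets(consumer, last_acked_offsets, timeout_sec=120):
    # Paraphrased from rptest/tests/end_to_end.py: wait until the consumer's
    # position has reached the producer's last acked offset on every partition.
    def has_finished_consuming():
        consumed = consumer.last_consumed_offsets()  # placeholder accessor
        return all(consumed.get(tp, -1) >= offset
                   for tp, offset in last_acked_offsets.items())

    wait_until(has_finished_consuming,
               timeout_sec=timeout_sec,
               err_msg=lambda: ("Consumer failed to consume up to offsets "
                                f"{last_acked_offsets} after waiting {timeout_sec}s."))
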
dotnwat added the kind/bug label Apr 19, 2022
dotnwat added this to the v22.1.1 (Stale) milestone Apr 19, 2022

jcsp (Contributor) commented Apr 20, 2022

https://buildkite.com/redpanda/redpanda/builds/9181#23d07699-7a09-4a15-82f4-20b8d7379907

Latest failure at 2022-04-20T00:17:48.200Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic-xsoqofqzkt', partition=0): 5350} after waiting 120s.")

LenaAn (Contributor) commented Apr 29, 2022

Lazin (Contributor) commented May 4, 2022

graphcareful (Contributor) commented May 4, 2022

I was able to reproduce this locally, once after 40 runs of all the parameterized tests. What I observed was that MirrorMaker itself had not finished before the 120s timeout ducktape gives the test to complete the migration end to end. To see this, tail the MirrorMaker log file of a test that passed; it should look something like this:

...
[2022-05-03 23:51:00,981] DEBUG [Producer clientId=producer-2] Shutdown of Kafka producer I/O thread has completed. (org.apache.kafka.clients.producer.internals.Sender)
[2022-05-03 23:51:00,981] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,981] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,981] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,981] INFO App info kafka.producer for producer-2 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-05-03 23:51:00,981] DEBUG [Producer clientId=producer-2] Kafka producer has been closed (org.apache.kafka.clients.producer.KafkaProducer)
[2022-05-03 23:51:00,981] DEBUG [Consumer clientId=consumer-target-mm2-2, groupId=target-mm2] Executing onLeavePrepare with generation Generation{generationId=-1, memberId='', protocol='null'} and memberId  (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2022-05-03 23:51:00,981] INFO [Consumer clientId=consumer-target-mm2-2, groupId=target-mm2] Resetting generation due to: consumer pro-actively leaving the group (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2022-05-03 23:51:00,982] INFO [Consumer clientId=consumer-target-mm2-2, groupId=target-mm2] Request joining group due to: consumer pro-actively leaving the group (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2022-05-03 23:51:00,982] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,982] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,982] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,982] INFO App info kafka.consumer for consumer-target-mm2-2 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-05-03 23:51:00,982] DEBUG [Consumer clientId=consumer-target-mm2-2, groupId=target-mm2] Kafka consumer has been closed (org.apache.kafka.clients.consumer.KafkaConsumer)
[2022-05-03 23:51:00,982] INFO Stopped KafkaBasedLog for topic mm2-offsets.target.internal (org.apache.kafka.connect.util.KafkaBasedLog)
[2022-05-03 23:51:00,982] INFO Stopped KafkaOffsetBackingStore (org.apache.kafka.connect.storage.KafkaOffsetBackingStore)
[2022-05-03 23:51:00,982] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,982] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,982] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,982] DEBUG Unregistering Connect metrics with JMX for worker 'target->source' (org.apache.kafka.connect.runtime.ConnectMetrics)
[2022-05-03 23:51:00,982] INFO App info kafka.connect for target->source unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-05-03 23:51:00,982] INFO Worker stopped (org.apache.kafka.connect.runtime.Worker)
[2022-05-03 23:51:00,983] DEBUG [AdminClient clientId=adminclient-15] Initiating close operation. (org.apache.kafka.clients.admin.KafkaAdminClient)
[2022-05-03 23:51:00,983] DEBUG [AdminClient clientId=adminclient-15] Waiting for the I/O thread to exit. Hard shutdown in 31536000000 ms. (org.apache.kafka.clients.admin.KafkaAdminClient)
[2022-05-03 23:51:00,983] INFO App info kafka.admin.client for adminclient-15 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-05-03 23:51:00,984] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,984] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,984] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-05-03 23:51:00,984] DEBUG [AdminClient clientId=adminclient-15] Exiting AdminClientRunnable thread. (org.apache.kafka.clients.admin.KafkaAdminClient)
[2022-05-03 23:51:00,984] DEBUG [AdminClient clientId=adminclient-15] Kafka admin client closed. (org.apache.kafka.clients.admin.KafkaAdminClient)
[2022-05-03 23:51:00,984] INFO [Worker clientId=connect-2, groupId=target-mm2] Herder stopped (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2022-05-03 23:51:00,984] INFO [Worker clientId=connect-2, groupId=target-mm2] Herder stopped (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2022-05-03 23:51:00,984] INFO Kafka MirrorMaker stopped. (org.apache.kafka.connect.mirror.MirrorMaker)

and compare it against the MirrorMaker log file of a failed test:

...
[2022-04-29 19:06:13,789] DEBUG [Worker clientId=connect-2, groupId=target-mm2] No broker available to send FindCoordinator request (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator)
[2022-04-29 19:06:13,789] DEBUG [Producer clientId=producer-3] Connection with docker_n_4/172.18.0.18 disconnected (org.apache.kafka.common.network.Selector)
java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at org.apache.kafka.common.network.PlaintextTransportLayer.finishConnect(PlaintextTransportLayer.java:50)
	at org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:224)
	at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:526)
	at org.apache.kafka.common.network.Selector.poll(Selector.java:481)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:551)
	at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:328)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243)
	at java.base/java.lang.Thread.run(Thread.java:829)
[2022-04-29 19:06:13,789] DEBUG [Worker clientId=connect-2, groupId=target-mm2] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2022-04-29 19:06:13,789] DEBUG [Worker clientId=connect-2, groupId=target-mm2] No broker available to send FindCoordinator request
...

The MirrorMaker log files of failed tests are around 700M or larger, while a passing run's log is significantly smaller, around 15M-20M. The failed log files are filled with this retry loop, which starts right when ducktape shuts down all of the services.

As to why MirrorMaker has not completed its task within the 120s window, I am not sure. What is clear, though, is that this is not a Redpanda regression: Redpanda cannot return data that MirrorMaker never finished migrating in time.
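
If the root cause really is MirrorMaker lag rather than Redpanda, one way to make the test less sensitive would be to gate the validation on the mirror having caught up first, so the 120s consumer window measures Redpanda rather than MirrorMaker. A rough, hypothetical sketch of that idea using kafka-python and ducktape's wait_until (this is not code from the rptest suite; the helper name, broker arguments, and the assumption that the mirrored topic keeps the same name on the target are all illustrative):

from kafka import KafkaConsumer, TopicPartition
from ducktape.utils.util import wait_until


def wait_for_mirror_to_catch_up(source_brokers, target_brokers, topic, timeout_sec=120):
    # Hypothetical pre-check: poll the source and target end offsets and only
    # proceed once the target has reached the source's high watermarks.
    src = KafkaConsumer(bootstrap_servers=source_brokers)
    dst = KafkaConsumer(bootstrap_servers=target_brokers)
    partitions = [TopicPartition(topic, p) for p in src.partitions_for_topic(topic)]

    def caught_up():
        src_end = src.end_offsets(partitions)
        dst_end = dst.end_offsets(partitions)
        return all(dst_end[tp] >= src_end[tp] for tp in partitions)

    try:
        wait_until(caught_up,
                   timeout_sec=timeout_sec,
                   backoff_sec=2,
                   err_msg=f"MirrorMaker did not catch up on {topic} within {timeout_sec}s")
    finally:
        src.close()
        dst.close()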

dotnwat (Member, Author) commented May 5, 2022

Thanks for the analysis @graphcareful. Sounds like this isn't a release blocker and is more of a generic flaky test / flaky MM2 issue.

twmb (Contributor) commented May 5, 2022

twmb (Contributor) commented May 5, 2022

abhijat (Contributor) commented May 6, 2022

jcsp (Contributor) commented Aug 1, 2022

Not seen in 30d.

jcsp closed this as completed Aug 1, 2022