
tests: use rpk producer for even record distribution #4928

Merged

Conversation

@abhijat (Contributor) commented May 25, 2022

Cover letter

The kafka tools CLI used in some of the tests could produce a non-uniform message distribution, causing assertion errors because the segment is not uploaded to SI (shadow indexing). Using the rpk producer gives a better distribution because it uses random keys.

fixes #4886
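
For illustration, a minimal sketch of the idea (not this PR's actual test code, which uses the test suite's rpk producer service): producing with random keys so that the default key-hash partitioner spreads records roughly evenly across partitions, letting every partition accumulate enough data to roll a segment. The broker address, topic name, and record sizes below are placeholders.

```python
# Minimal sketch (assumed setup, not the PR's test code): random keys let the
# default key-hash partitioner spread records roughly evenly across partitions,
# so each partition accumulates enough data for a segment to roll and be
# uploaded to SI. Uses kafka-python; broker address and topic are placeholders.
import os
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
for _ in range(10000):
    producer.send(
        "panda-topic-1",
        key=os.urandom(16),       # random key -> pseudo-random partition
        value=os.urandom(1024),   # 1 KiB payload
    )
producer.flush()
```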

The kafka tools CLI used in some of the tests could produce a non-uniform
message distribution, causing assertion errors because the segment
is not uploaded to SI.

Using the rpk producer gives a better distribution as it uses random
keys.

This commit also fixes a log message and adds a topic-specific pattern
to the matching condition for topic_manifest.json. This is to avoid the
verification matching files for topics which are not part of the test.
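
A rough sketch of that topic-specific matching condition (hypothetical helper name; the exact SI path layout is assumed from the manifest paths in the logs further down):

```python
import re

def is_topic_manifest_for(topic: str, key: str) -> bool:
    """Match only the topic_manifest.json belonging to `topic` (hypothetical helper)."""
    # Assumed SI path layout: "<prefix>/meta/kafka/<topic>/topic_manifest.json"
    pattern = rf"meta/kafka/{re.escape(topic)}/topic_manifest\.json$"
    return re.search(pattern, key) is not None

# Manifests of unrelated topics no longer satisfy the verification condition:
assert is_topic_manifest_for("panda-topic-1",
                             "20000000/meta/kafka/panda-topic-1/topic_manifest.json")
assert not is_topic_manifest_for("panda-topic-1",
                                 "30000000/meta/kafka/other-topic/topic_manifest.json")
```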
@jcsp added the kind/bug (Something isn't working) and area/tests labels on May 26, 2022
@jcsp (Contributor) commented May 26, 2022

This looks good.

I wonder if it's worth converting all the cases in topic_recovery_test to use this producer? I think they all have a similar risk of partition imbalance, although I'm not sure why we saw more failures of test_missing_partition than the others. I did just see a failure of test_fast2 here (https://buildkite.com/redpanda/redpanda/builds/10542#0180faea-d582-4f72-8f1e-69f25b0eec77), which I haven't dug into but which maybe has the same underlying cause?

@abhijat (Contributor, Author) commented May 26, 2022

> This looks good.
>
> I wonder if it's worth converting all the cases in topic_recovery_test to use this producer? I think they all have a similar risk of partition imbalance, although I'm not sure why we saw more failures of test_missing_partition than the others. I did just see a failure of test_fast2 here (https://buildkite.com/redpanda/redpanda/builds/10542#0180faea-d582-4f72-8f1e-69f25b0eec77), which I haven't dug into but which maybe has the same underlying cause?

It is possible. Looking at the logs for that failure, partition index 2 of topic 1 never received enough records to trigger an upload to S3, so the condition was not reached (the manifest for 1/0 is uploaded twice, while 1/2 is never uploaded):

 $ grep -R 'Uploading manifest, ' * 
docker-rp-10/redpanda.log:DEBUG 2022-05-25 12:14:57,763 [shard 1] archival - [fiber6~3|0|10000ms kafka/panda-topic-1/0] - ntp_archiver_service.cc:215 - Uploading manifest, path: {"50000000/meta/kafka/panda-topic-1/0_16/manifest.json"}
docker-rp-22/redpanda.log:DEBUG 2022-05-25 12:14:45,003 [shard 1] archival - [fiber6~5|0|10000ms kafka/panda-topic-1/1] - ntp_archiver_service.cc:215 - Uploading manifest, path: {"d0000000/meta/kafka/panda-topic-1/1_16/manifest.json"}
docker-rp-22/redpanda.log:DEBUG 2022-05-25 12:14:45,462 [shard 1] archival - [fiber7~5|0|10000ms kafka/panda-topic-1/0] - ntp_archiver_service.cc:215 - Uploading manifest, path: {"50000000/meta/kafka/panda-topic-1/0_16/manifest.json"}
docker-rp-22/redpanda.log:DEBUG 2022-05-25 12:14:45,620 [shard 0] archival - [fiber4~2|0|10000ms kafka/panda-topic-2/1] - ntp_archiver_service.cc:215 - Uploading manifest, path: {"80000000/meta/kafka/panda-topic-2/1_18/manifest.json"}
docker-rp-48/redpanda.log:DEBUG 2022-05-25 12:14:45,520 [shard 1] archival - [fiber6~4|0|10000ms kafka/panda-topic-2/2] - ntp_archiver_service.cc:215 - Uploading manifest, path: {"50000000/meta/kafka/panda-topic-2/2_18/manifest.json"}
docker-rp-48/redpanda.log:DEBUG 2022-05-25 12:14:45,554 [shard 1] archival - [fiber7~5|0|10000ms kafka/panda-topic-2/0] - ntp_archiver_service.cc:215 - Uploading manifest, path: {"60000000/meta/kafka/panda-topic-2/0_18/manifest.json"}

For this partition, all the logs show the candidate segment still open, as below:

docker-rp-10/redpanda.log:DEBUG 2022-05-25 12:15:41,871 [shard 0] archival - [fiber4 kafka/panda-topic-1/2] - ntp_archiver_service.cc:363 - scheduling uploads, start_upload_offset: 0, last_stable_offset: 1032
docker-rp-10/redpanda.log-DEBUG 2022-05-25 12:15:41,871 [shard 0] archival - archival_policy.cc:87 - Upload policy for {kafka/panda-topic-1/2} invoked, start offset: 0
docker-rp-10/redpanda.log-DEBUG 2022-05-25 12:15:41,871 [shard 0] archival - archival_policy.cc:141 - Upload policy for {kafka/panda-topic-1/2}: can't find candidate, candidate is not closed
docker-rp-10/redpanda.log-DEBUG 2022-05-25 12:15:41,871 [shard 0] archival - [fiber4 kafka/panda-topic-1/2] - ntp_archiver_service.cc:261 - upload candidate not found, start_upload_offset: 0, last_stable_offset: 1032
--
docker-rp-10/redpanda.log:DEBUG 2022-05-25 12:15:51,888 [shard 0] archival - [fiber4 kafka/panda-topic-1/2] - ntp_archiver_service.cc:363 - scheduling uploads, start_upload_offset: 0, last_stable_offset: 1032
docker-rp-10/redpanda.log-DEBUG 2022-05-25 12:15:51,888 [shard 0] archival - archival_policy.cc:87 - Upload policy for {kafka/panda-topic-1/2} invoked, start offset: 0
docker-rp-10/redpanda.log-DEBUG 2022-05-25 12:15:51,888 [shard 0] archival - archival_policy.cc:141 - Upload policy for {kafka/panda-topic-1/2}: can't find candidate, candidate is not closed
docker-rp-10/redpanda.log-DEBUG 2022-05-25 12:15:51,888 [shard 0] archival - [fiber4 kafka/panda-topic-1/2] - ntp_archiver_service.cc:261 - upload candidate not found, start_upload_offset: 0, last_stable_offset: 1032

Although in this case the difference in bytes across the partitions does not seem very large, probably because there are six partitions to upload.
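
To make the mechanism concrete, a toy simulation (all constants made up; this is neither broker nor test code): the archiver only uploads closed segments, so a partition that receives too few bytes never rolls its segment and never produces an upload candidate, whereas random keys spread the bytes so every partition eventually rolls one.

```python
# Toy illustration with made-up numbers: a segment only "closes" (and so becomes
# an upload candidate) once it reaches SEGMENT_BYTES. With a fixed key all
# records land on one partition; with random keys every partition fills up.
import random

SEGMENT_BYTES = 1024 * 1024    # assumed roll threshold
RECORD_BYTES = 2 * 1024        # assumed record size
PARTITIONS = 3
RECORDS = 3000

def bytes_per_partition(random_keys: bool) -> list[int]:
    sizes = [0] * PARTITIONS
    fixed = random.randrange(PARTITIONS)
    for _ in range(RECORDS):
        p = random.randrange(PARTITIONS) if random_keys else fixed
        sizes[p] += RECORD_BYTES
    return sizes

for random_keys in (False, True):
    sizes = bytes_per_partition(random_keys)
    print(f"random_keys={random_keys}",
          [f"{b} bytes, closed={b >= SEGMENT_BYTES}" for b in sizes])
```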

@abhijat added the ci-repeat-5 label (repeat tests 5x concurrently to check for flaky tests; self-cancelling) on May 26, 2022
@vbotbuildovich removed the ci-repeat-5 label on May 26, 2022
@abhijat force-pushed the rpkproducer-for-missing-partitions-test branch 5 times, most recently from 494541c to ef38553 on May 27, 2022 15:00
Use the rpk producer for all TopicRecoveryTest instances. Additionally,
some cleanups according to the Python linter.
@abhijat force-pushed the rpkproducer-for-missing-partitions-test branch from ef38553 to 4329fa5 on May 28, 2022 04:25
@abhijat added the ci-repeat-5 label (repeat tests 5x concurrently to check for flaky tests; self-cancelling) on May 28, 2022
@vbotbuildovich removed the ci-repeat-5 label on May 28, 2022
@abhijat marked this pull request as ready for review on May 28, 2022 06:40
@abhijat (Contributor, Author) commented May 28, 2022

I used the rpk producer for most tests but left the size- and time-based retention tests alone for now; they do not seem to work well with that producer and more investigation is required into why they fail. I will keep this PR focused on the tests where the record count is not uniform across partitions.

@abhijat (Contributor, Author) commented May 28, 2022

@jcsp merged commit 363127f into redpanda-data:dev on May 30, 2022
@abhijat (Contributor, Author) commented Jul 29, 2022

/backport v22.1.x

@vbotbuildovich (Collaborator) commented

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x 9d764b26d96733e46cfcec9c92b7b52173d376d2 4329fa5b9cd870f0bc7cb4545afef8117dc012c3

Workflow run logs.

@abhijat (Contributor, Author) commented Jul 29, 2022

/backport v21.1.x

@vbotbuildovich (Collaborator) commented

Branch name "v21.1.x" not found.

Workflow run logs.

@abhijat (Contributor, Author) commented Jul 29, 2022

/backport v21.11.x

@vbotbuildovich (Collaborator) commented

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x 9d764b26d96733e46cfcec9c92b7b52173d376d2 4329fa5b9cd870f0bc7cb4545afef8117dc012c3

Workflow run logs.

@abhijat (Contributor, Author) commented Jul 29, 2022

/backport v21.11.x

This branch is really far out of sync with dev.

@mmedenjak added the ci-failure and area/cloud-storage (Shadow indexing subsystem) labels on Jul 29, 2022
Labels: area/cloud-storage (Shadow indexing subsystem), area/tests, ci-failure, kind/bug (Something isn't working)

Successfully merging this pull request may close these issues.

Failure in TopicRecoveryTest.test_missing_partition