
CI Failure (wait_until segment_number_matches - compaction stuck) in CompactionE2EIdempotencyTest.test_basic_compaction #8492

Closed
rystsov opened this issue Jan 29, 2023 · 16 comments · Fixed by #8927
Labels: area/storage, ci-failure, kind/bug, sev/low

Comments

rystsov (Contributor) commented Jan 29, 2023

This failure is in a new test introduced in #8413.

https://buildkite.com/redpanda/redpanda/builds/22034#0185fbf7-26e3-4971-b008-bc94644644c1

Module: rptest.tests.compaction_e2e_test
Class:  CompactionE2EIdempotencyTest
Method: test_basic_compaction
Arguments:
{
  "initial_cleanup_policy": "compact",
  "workload": "IDEMPOTENCY"
}
test_id:    rptest.tests.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=compact.workload=Workload.IDEMPOTENCY
status:     FAIL
run time:   5 minutes 28.082 seconds

    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/compaction_e2e_test.py", line 140, in test_basic_compaction
    wait_until(lambda: segment_number_matches(lambda s: s < 5),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
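
For context, the check that times out is a ducktape wait_until loop around a segment_number_matches helper that polls per-partition segment counts and expects every partition to drop below 5 segments once compaction runs. A minimal sketch of that pattern, with the count-fetching step left as an assumed stand-in (the real helper lives in rptest/tests/compaction_e2e_test.py):

# Minimal sketch of the failing check; get_segment_counts is a hypothetical
# stand-in for however the real test queries per-partition segment counts.
from ducktape.utils.util import wait_until

def wait_for_compaction(get_segment_counts, logger, topic, timeout_sec=300):
    def segment_number_matches(predicate):
        counts = get_segment_counts(topic)  # e.g. [11, 7, 8]
        logger.debug(f"Topic {topic} segments per partition: {counts}")
        return all(predicate(c) for c in counts)

    # Compaction should eventually merge adjacent segments, dropping every
    # partition below 5; the test hits this timeout when one node is stuck.
    wait_until(lambda: segment_number_matches(lambda s: s < 5),
               timeout_sec=timeout_sec,
               backoff_sec=2,
               err_msg="compaction did not reduce segment counts")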
rystsov added the kind/bug and ci-failure labels on Jan 29, 2023
rystsov (Contributor, Author) commented Jan 29, 2023

It looks like compaction isn't happening for two out of three partitions:

[INFO  - 2023-01-29 06:02:12,202 - compaction_e2e_test - test_basic_compaction - lineno:135]: waiting for compaction to happen
[DEBUG - 2023-01-29 06:02:12,310 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-tqjzzltraj segments per partition: [11, 7, 8]
[DEBUG - 2023-01-29 06:02:14,415 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-tqjzzltraj segments per partition: [11, 7, 8]
...
[DEBUG - 2023-01-29 06:07:08,870 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-tqjzzltraj segments per partition: [2, 7, 8]
[DEBUG - 2023-01-29 06:07:10,932 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-tqjzzltraj segments per partition: [2, 7, 8]

rystsov (Contributor, Author) commented Jan 29, 2023

If we grep for executed_compaction: true in the redpanda logs, node docker-rp-1 has 23 matches while nodes docker-rp-2 and docker-rp-3 have 43 matches. So it looks like compaction is stuck on node docker-rp-1.
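
A quick way to reproduce that per-node comparison from the collected artifacts (a sketch; the glob is an assumption about where each node's redpanda.log lands in the results directory):

# Count "executed_compaction: true" occurrences per node log.
import glob

needle = "executed_compaction: true"
for path in sorted(glob.glob("**/docker-rp-*/redpanda.log", recursive=True)):
    with open(path, errors="replace") as f:
        hits = sum(needle in line for line in f)
    node = path.split("/")[-2]
    print(f"{node}: {hits} matches")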

rystsov changed the title from "CI Failure (wait_until segment_number_matches) in CompactionE2EIdempotencyTest.test_basic_compaction" to "CI Failure (wait_until segment_number_matches - compaction stuck) in CompactionE2EIdempotencyTest.test_basic_compaction" on Jan 29, 2023
VadimPlh self-assigned this on Jan 30, 2023
jcsp (Contributor) commented Jan 31, 2023

@andijcr @VadimPlh just checking you guys have been in contact? Please can you update this ticket with the analysis so far.

andijcr (Contributor) commented Feb 8, 2023

Tinkering with this test in #8687.

What I see on the latest rebase is that the tests no longer fail at wait_until(lambda: segment_number_matches(lambda s: s < 5), ...).

The runs that are failing do so in two spots:

While checking that enough segments are produced before reducing the compaction interval

            # wait for multiple segments to appear in topic partitions
            wait_until(lambda: segment_number_matches(lambda s: s >= 5),

and at remote_wait_consumer; this one is more concerning:

        self.logger.info(f"enable consumer and validate consumed records")
        rw_verifier.remote_start_consumer()
        rw_verifier.remote_wait_consumer()
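
To place the two spots, here is a hedged outline of the test's checkpoints as they appear from the snippets in this thread (the timeouts are illustrative and the produce phase is elided, so this is not the real code):

# Rough outline of test_basic_compaction's checkpoints.
from ducktape.utils.util import wait_until

def outline(self, rw_verifier, segment_number_matches):
    # (produce phase elided)

    # wait for multiple segments to appear in topic partitions
    # -- the first spot where some runs time out
    wait_until(lambda: segment_number_matches(lambda s: s >= 5),
               timeout_sec=120, backoff_sec=2)

    # reduce log_compaction_interval_ms, then wait for the segment counts
    # to drop -- the failure tracked in this issue
    wait_until(lambda: segment_number_matches(lambda s: s < 5),
               timeout_sec=300, backoff_sec=2)

    # re-read and validate everything -- the second, more concerning spot
    self.logger.info("enable consumer and validate consumed records")
    rw_verifier.remote_start_consumer()
    rw_verifier.remote_wait_consumer()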

piyushredpanda (Contributor) commented

@andijcr to have 1 PR for 1st issue. @andijcr: what was the conclusion for the second thing here?

andijcr (Contributor) commented Feb 9, 2023

@andijcr to have 1 PR for 1st issue. @andijcr: what was the conclusion for the second thing here?

The second thing could be a proper bug: the test fails to consume all the messages in a partition. I'm trying to reproduce it in a live cluster to get an idea of whether the data is missing.

rystsov (Contributor, Author) commented Feb 9, 2023

@andijcr there is already an issue for that: #8698. There is some difference between the position and the last offset obtained by jumping to the end, but all written data was consumed; I've added a workaround there (made the test less strict) - 3e4a479.

But the current issue (wait_until segment_number_matches) still fails!

See the same PR #8624 failing build - https://buildkite.com/redpanda/redpanda/builds/22887#01863754-84ea-4212-aec6-2c3ccf6138c8

andijcr (Contributor) commented Feb 10, 2023

@rystsov here in #8783 I rebased your PR #8624 and included my fix on type.h; running test_basic_compaction in a loop, I see no failures.

The PR with this fix is here: #8797

rystsov (Contributor, Author) commented Feb 10, 2023

I don't see the link between the initialization of the txn fields and the segment problem; can you explain it?

rystsov (Contributor, Author) commented Feb 10, 2023

You ran 20 iterations; on my PR the failure rate was 2/50, so we don't have a clear signal that it's truly fixed.

rystsov (Contributor, Author) commented Feb 11, 2023

OK, after 50 retries there was still a failure: #8783 (comment)

andijcr (Contributor) commented Feb 15, 2023

I don't see the link between the initialization of the txn fields and the segment problem; can you explain it?

The reasoning was to expand on Bharath's fix by initializing fields of type duration, since it's UB to access them without initialization. I wasn't able to find a specific code path, so I took a cautionary approach and initialized whatever seemed suspicious. At least from ducktape it seems like the frequency of the failure has decreased.

I tried to run msan and valgrind, but even though I was able to compile with some modifications, the first crashed immediately in nettle and the second seems to raise false positives.

As of now I'm analyzing the latest failed ducktape run.

3 nodes and 3 partitions: counting the lines with "Creating new segment" and "Removing <segment_path>", we see that partitions 1 and 2 have a proportional number of creations/removals across the three nodes, but partition 0 gets no removal lines on node docker-rp-9.
segments-removed-partition-0.txt
segments-created-partition-0.txt

This makes the test fail, but one weirdness is that test_log.debug reports partition 0 as decreasing and the others as stuck.
Some lines with a timestamp older than the lines in the above files:

[DEBUG - 2023-02-11 01:02:44,765 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [8, 9, 8]
[DEBUG - 2023-02-11 01:02:47,742 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [7, 9, 8]
[DEBUG - 2023-02-11 01:02:50,678 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [6, 9, 8]
[DEBUG - 2023-02-11 01:02:53,613 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [5, 9, 8]
[DEBUG - 2023-02-11 01:02:56,549 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [5, 9, 8]
[DEBUG - 2023-02-11 01:02:59,486 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [4, 9, 8]
[DEBUG - 2023-02-11 01:03:02,421 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [3, 9, 8]
[DEBUG - 2023-02-11 01:03:05,357 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [2, 9, 8]
[DEBUG - 2023-02-11 01:03:08,285 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [2, 9, 8]
[DEBUG - 2023-02-11 01:03:11,221 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [2, 9, 8]
[DEBUG - 2023-02-11 01:03:14,153 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [2, 9, 8]
[DEBUG - 2023-02-11 01:03:17,089 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [2, 9, 8]
[DEBUG - 2023-02-11 01:03:20,021 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [2, 9, 8]
[DEBUG - 2023-02-11 01:03:22,957 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [2, 9, 8]
[DEBUG - 2023-02-11 01:03:25,885 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [2, 9, 8]

These are the lines for partition 0 on docker-rp-9: rp-9-partition-0.txt
I'm comparing them against the lines for partition 1 on the same node (Uploading rp-9-partition-1.log…),
partition 0 on docker-rp-10 (rp-10-partition-0.log),
and the lines from the same test in a run that did not fail (rp-1-partition-0.log),
trying to understand what prevented compaction on rp-9.

andijcr (Contributor) commented Feb 16, 2023

While checking this line from test_log.debug:

[DEBUG - 2023-02-11 01:01:49,005 - compaction_e2e_test - segment_number_matches - lineno:100]: Topic topic-slyegrhbto segments per partition: [12, 9, 8]

I noticed that the checks are run only on the first node.
Given the irregular failure of this test, it might be that other green runs were hiding failures.

I opened a PR to extend the check to the other 2 nodes: #8927,
and I'm checking the previous green run for hidden failures.

Slightly unrelated to the main bug:
segments per partition: [12, 9, 8] refers to node 0's partitions [2, 1, 0], not [0, 1, 2] (based on redpanda.log and the number of segments created).
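
A sketch of what extending the check described above could look like (illustrative only; #8927 contains the actual change, and segment_counts_for_node is a hypothetical helper returning one node's per-partition segment counts):

# Require the segment-count predicate to hold on every node, not only the
# first one, so a single stuck node can no longer hide behind a green run.
def segment_number_matches_all_nodes(predicate, nodes, segment_counts_for_node, logger):
    ok = True
    for node in nodes:
        counts = segment_counts_for_node(node)  # e.g. [12, 9, 8]
        logger.debug(f"{node} segments per partition: {counts}")
        ok = ok and all(predicate(c) for c in counts)
    return ok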

andijcr (Contributor) commented Feb 16, 2023

The failure is present in other runs in https://buildkite.com/redpanda/redpanda/builds/23027#01863dee-e467-4fa2-a905-ccb824ac66bf
Some other failures:
https://ci-artifacts.dev.vectorized.cloud/redpanda/01863dee-e467-4fa2-a905-ccb824ac66bf/vbuild/ducktape/results/2023-02-11--001/CompactionE2EIdempotencyTest/test_basic_compaction/initial_cleanup_policy=compact.workload=Workload.IDEMPOTENCY/111/
https://ci-artifacts.dev.vectorized.cloud/redpanda/01863dee-e467-4fa2-a905-ccb824ac66bf/vbuild/ducktape/results/2023-02-11--001/CompactionE2EIdempotencyTest/test_basic_compaction/initial_cleanup_policy=compact.workload=Workload.IDEMPOTENCY/113/
https://ci-artifacts.dev.vectorized.cloud/redpanda/01863dee-e467-4fa2-a905-ccb824ac66bf/vbuild/ducktape/results/2023-02-11--001/CompactionE2EIdempotencyTest/test_basic_compaction/initial_cleanup_policy=compact.workload=Workload.IDEMPOTENCY/29/
https://ci-artifacts.dev.vectorized.cloud/redpanda/01863dee-e467-4fa2-a905-ccb824ac66bf/vbuild/ducktape/results/2023-02-11--001/CompactionE2EIdempotencyTest/test_basic_compaction/initial_cleanup_policy=compact.workload=Workload.IDEMPOTENCY/37/

and others

Summarizer script:

#!/bin/bash
# For every node directory in the downloaded ducktape artifacts, count the
# per-partition "Creating new segment" and "Removing" lines in redpanda.log
# and flag nodes where compaction appears stuck.

sp="/-\|"
sc=0
# Small progress spinner while the node directories are processed.
spin() {
	printf '\b%s' "${sp:sc++:1}"

	((sc==${#sp})) && sc=0
}
endspin() {
	printf "\r%s\n" "$@"
}


check_segments_at () (
	set -e
	cd -- "$1"

	# Segments created for partitions 0, 1 and 2.
	grep -c "Creating new segment /var/lib/redpanda/data/kafka/topic-[^/]*/0" redpanda.log > segments_created  || true
	grep -c "Creating new segment /var/lib/redpanda/data/kafka/topic-[^/]*/1" redpanda.log >> segments_created || true
	grep -c "Creating new segment /var/lib/redpanda/data/kafka/topic-[^/]*/2" redpanda.log >> segments_created || true

	# Segments removed for partitions 0, 1 and 2.
	grep -c "Removing \"/var/lib/redpanda/data/kafka/topic-[^/]*/0" redpanda.log > segments_removed  || true
	grep -c "Removing \"/var/lib/redpanda/data/kafka/topic-[^/]*/1" redpanda.log >> segments_removed || true
	grep -c "Removing \"/var/lib/redpanda/data/kafka/topic-[^/]*/2" redpanda.log >> segments_removed || true

	# For each partition, leave a res_not marker if (created - removed) >= 5,
	# res_ok otherwise.
	paste segments_created segments_removed | awk '{if(($1 - $2) < 5) printf "" > "res_ok"; else printf "" > "res_not"; }'

)

shopt -s nullglob

nodes=(ducktape*/results/latest/CompactionE2EIdempotencyTest/test_basic_compaction/*/*/RedpandaService-0*/docker-rp-*)
for path in "${nodes[@]}"; do
	spin
	check_segments_at "$path"
done
endspin

# Print every node directory flagged as suspicious.
find . -name "res_not"

andijcr added the sev/high label and removed the ci-disabled-test label on Feb 17, 2023
VladLazar added the sev/low label and removed the sev/high label on Feb 17, 2023
VladLazar (Contributor) commented

It's a low-impact redpanda bug: under certain circumstances the update to log_compaction_interval_ms is only honored after one more period with the current value. The test sets log_compaction_interval_ms=1h initially, and when we hit this bug, compaction doesn't happen before the test times out.
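
To illustrate the failure mode, here is a toy model of that timing pattern in Python (this is not the actual Redpanda housekeeping code): if each period is armed with the interval value read at its start, an update from 1h down to a few seconds made mid-period is only honored after the already-armed 1h wait expires.

# Toy model of the log_compaction_interval_ms timing issue; not Redpanda code.
import threading
import time

class CompactionTimer:
    def __init__(self, interval_s):
        self.interval_s = interval_s
        self._changed = threading.Event()

    def set_interval(self, interval_s):
        # e.g. the test dropping log_compaction_interval_ms from 1h to seconds
        self.interval_s = interval_s
        self._changed.set()

    def wait_one_period_buggy(self):
        # The sleep is armed with the value read now, so an update made during
        # the sleep only takes effect after this (possibly 1h-long) period.
        time.sleep(self.interval_s)

    def wait_one_period_fixed(self):
        # Wake up when the interval changes and re-arm with the new value,
        # so a shorter interval is honored within the current period.
        while self._changed.wait(timeout=self.interval_s):
            self._changed.clear()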

andijcr (Contributor) commented Feb 17, 2023

#8927 now contains the fix; I'll run CI to gain confidence.

This issue was closed.