CI Failure (serde overflow or underflow) in CompactionE2EIdempotencyTest.test_basic_compaction #8491

Closed
rystsov opened this issue Jan 29, 2023 · 5 comments · Fixed by #8700
Labels: area/transactions, ci-failure, kind/bug, low-hanging-fruit, sev/medium

rystsov commented Jan 29, 2023

https://buildkite.com/redpanda/redpanda/builds/22032#0185fb59-d754-4805-9a3c-eca75b9bda6b

results/2023-01-29--001/CompactionE2EIdempotencyTest/test_basic_compaction/initial_cleanup_policy=compact.workload=Workload.TX/51

Module: rptest.tests.compaction_e2e_test
Class:  CompactionE2EIdempotencyTest
Method: test_basic_compaction
Arguments:
{
  "initial_cleanup_policy": "compact",
  "workload": "TX"
}
test_id:    rptest.tests.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=compact.workload=Workload.TX
status:     FAIL
run time:   1 minute 9.720 seconds

    <BadLogLines nodes=docker-rp-17(2) example="ERROR 2023-01-29 03:17:06,952 [shard 1] serde - serde.h:232 - Overflow or underflow detected when casting to nanoseconds, clamping to a limit. Input: -9223372036854775808  type: std::chrono::duration<long long, std::ratio<1, 1000>>">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 67, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1740, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-17(2) example="ERROR 2023-01-29 03:17:06,952 [shard 1] serde - serde.h:232 - Overflow or underflow detected when casting to nanoseconds, clamping to a limit. Input: -9223372036854775808  type: std::chrono::duration<long long, std::ratio<1, 1000>>">
rystsov added the kind/bug and ci-failure labels on Jan 29, 2023
rystsov changed the title from "CI Failure (key symptom) in CompactionE2EIdempotencyTest.test_basic_compaction" to "CI Failure (serde overflow or underflow) in CompactionE2EIdempotencyTest.test_basic_compaction" on Jan 29, 2023
ztlpn added the area/transactions and sev/medium labels and removed the area/raft label on Jan 30, 2023

ztlpn commented Jan 30, 2023

Right before the bad log line the following logs are produced:

TRACE 2023-01-29 03:17:06,952 [shard 1] tx - tm_stm_cache.cc:39 - looking for tx:{da9dc38a-1cf8-42fb-9477-068014ab6576} etag:1
TRACE 2023-01-29 03:17:06,952 [shard 1] tx - tm_stm_cache.cc:50 - looking for tx:{da9dc38a-1cf8-42fb-9477-068014ab6576} etag:1: can't find term:1 in _state
TRACE 2023-01-29 03:17:06,952 [shard 0] tx - tm_stm_cache.cc:39 - looking for tx:{da9dc38a-1cf8-42fb-9477-068014ab6576} etag:1
TRACE 2023-01-29 03:17:06,952 [shard 0] tx - tm_stm_cache.cc:50 - looking for tx:{da9dc38a-1cf8-42fb-9477-068014ab6576} etag:1: can't find term:1 in _state

This means the error is logged on the following path: we searched for the transaction in tm_stm_cache, could not find it, and returned fetch_tx_reply(tx_errc::tx_not_found). Most probably the problem is that the fetch_tx_reply constructor taking a tx_errc does not explicitly initialize the timeout_ms field, so it is filled with junk. The problem is benign in practice because this field of an errored fetch_tx_reply should not be used anywhere, but reading it is still technically UB, hence sev/medium.
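To illustrate why a junk timeout_ms ends up in that log line, here is a minimal sketch of a clamping milliseconds-to-nanoseconds conversion. It is not the code in serde.h; `checked_ms_to_ns` and `ns_cast_result` are hypothetical names, and the sketch assumes `std::chrono::milliseconds` uses a 64-bit representation, as the type in the log line suggests.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical result type (not from Redpanda): the converted value plus a
// flag saying whether the input had to be clamped.
struct ns_cast_result {
    std::chrono::nanoseconds value;
    bool clamped;
};

// Hypothetical helper: converts milliseconds to nanoseconds and clamps to
// the nanoseconds range instead of overflowing the signed multiplication.
inline ns_cast_result checked_ms_to_ns(std::chrono::milliseconds in) {
    using rep = std::chrono::nanoseconds::rep;
    constexpr rep ratio = 1'000'000; // nanoseconds per millisecond
    constexpr rep max_ms = std::numeric_limits<rep>::max() / ratio;
    constexpr rep min_ms = std::numeric_limits<rep>::min() / ratio;

    if (in.count() > max_ms) {
        return {std::chrono::nanoseconds::max(), true};
    }
    if (in.count() < min_ms) {
        return {std::chrono::nanoseconds::min(), true};
    }
    // In range: the multiplication cannot overflow here.
    return {std::chrono::nanoseconds(in.count() * ratio), false};
}
```

With this helper, `checked_ms_to_ns(std::chrono::milliseconds(5))` is in range and is not clamped, while an input holding the smallest 64-bit value (the `Input: -9223372036854775808` from the log) falls below `min_ms` and is clamped. An uninitialized field can hold any bit pattern, so whether the warning fires depends on whatever happened to be in memory.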

ztlpn commented Jan 30, 2023

Since there is no CompactionE2EIdempotencyTest in dev yet, I think it makes sense to fix this bug and test the fix as part of #8413.

ztlpn assigned rystsov and unassigned ztlpn on Jan 30, 2023
rystsov assigned rystsov and unassigned rystsov on Jan 31, 2023

rystsov commented Jan 31, 2023

@ztlpn the error is unrelated to the PR and shouldn't block it; I'll add ok_to_fail.

@mmaslankaprv

We need to default initialize the duration fields
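
One way to read this suggestion is to give the duration members default member initializers, so that a constructor which only sets the error code still leaves them with a defined value. The sketch below uses hypothetical stand-in types (`tx_errc_sketch`, `fetch_tx_reply_sketch`); the real fetch_tx_reply has more members, and only the timeout_ms field name is taken from this thread.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for tx_errc; the values are illustrative only.
enum class tx_errc_sketch : int32_t { none = 0, tx_not_found = 1 };

// Hypothetical stand-in for fetch_tx_reply.
struct fetch_tx_reply_sketch {
    tx_errc_sketch ec{tx_errc_sketch::none};

    // Without the {0}, a std::chrono::duration member that no constructor
    // mentions is default-initialized, which leaves its tick count
    // indeterminate. The default member initializer gives it a defined value.
    std::chrono::milliseconds timeout_ms{0};

    fetch_tx_reply_sketch() = default;

    // The error-only constructor does not mention timeout_ms, so the
    // default member initializer above applies.
    explicit fetch_tx_reply_sketch(tx_errc_sketch e)
      : ec(e) {}
};
```

With this layout, `fetch_tx_reply_sketch reply(tx_errc_sketch::tx_not_found);` yields `reply.timeout_ms == std::chrono::milliseconds(0)`, so serializing the errored reply no longer reads an indeterminate value.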

mmaslankaprv assigned bharathv and unassigned mmaslankaprv on Feb 2, 2023

dotnwat commented Feb 2, 2023

> We need to default initialize the duration fields

which fields?
