
Fix duplicates consistency error by caching already translated offsets #5356

Merged (13 commits, Jul 12, 2022)

Conversation

@rystsov (Contributor) commented Jul 6, 2022

Cover letter

Redpanda idempotency works by caching the mapping between sequence numbers and the corresponding offsets. Currently we cache the raft offset and rely on the offset translator to turn it into a kafka offset. Unfortunately, the offset translator can't translate offsets of already evicted segments, while the cache may outlive the eviction.

We fix the problem by caching the already translated offsets.

The failing offset translator caused unknown_server_error, which triggered a client bug resulting in a consistency violation.

Fixes #5355

Release notes

@mmedenjak added the kind/bug label (Jul 6, 2022)
@mmaslankaprv (Member) commented:
The main issue I see with this is the inconsistent use of the newly introduced kafka_offset. I think that after introducing the kafka_offset abstraction we should use it consistently in all the layers where we deal with offsets that are translated to the kafka layer.

@mmaslankaprv (Member) commented:
Also, I think it would be better to keep the partition interface consistent in terms of returned offsets, i.e. if we decide to return kafka offsets from partition, all of its interfaces should return kafka offsets (we may expose raft and log interfaces for access to raw, untranslated offsets).

@mmaslankaprv (Member) commented:
Another idea I have is to not return the kafka offset directly but to cache enough information to make offset translation possible. That way, when translation is impossible using the offset translator, we may fall back to the rm_stm cache to perform it. This approach would not require changing the place where translation takes place.

Review thread on src/v/cluster/types.h (outdated, resolved)
@rystsov (Contributor, Author) commented Jul 6, 2022

Another idea that i have is not to return kafka offset directly but cache enough information to make offset translation possible this way when translation is impossible using offsets_translator we may fallback to rm_stm cache to perform translation. This approach would not require changing the place where translation take place

I don't think I understand the idea. It sounds too complex:

  • creating a parallel offset translator
  • caching not only the offset but extra information
  • implementing a fallback mechanism

@rystsov (Contributor, Author) commented Jul 6, 2022

Also i think it would be better to keep the partition interface consistent when it comes to returned offsets. i.e. if we decide to return kafka offsets from partition all of the interfaces should return kafka offsets. (we may return raft and log to access to raw not translated offsets).

I'll take a look in the context of fixing the unit tests, but I deliberately introduced kafka_offset to avoid the confusion of mixing offsets in the same class and to reduce the scope of this PR.

Review threads (resolved):
  • src/v/cluster/rm_stm.cc (outdated)
  • src/v/kafka/server/handlers/produce.cc
  • src/v/cluster/partition.cc (two threads)
  • src/v/kafka/server/group.cc
  • src/v/cluster/types.h (outdated)
  • src/v/kafka/server/replicated_partition.cc (two threads)
  • src/v/cluster/rm_stm.cc (two threads)
@rystsov force-pushed the seq-kafka-offset-cache branch 2 times, most recently from d121aee to c23ea21 (July 7, 2022 00:03)
@rystsov added and @vbotbuildovich removed the ci-repeat-5 label (repeat tests 5x concurrently to check for flaky tests; self-cancelling), twice, on Jul 8, 2022
The Kafka client doesn't process unknown_server_error correctly, and it may lead to duplicates, violating idempotency. See the following issue for more info: https://issues.apache.org/jira/browse/KAFKA-14034

request_timed_out, just like unknown_server_error, means that the true outcome of the operation is unknown; unlike unknown_server_error, it doesn't trigger the client bug, so we switch to it to avoid the problem.
Update all partition::replicate dependees which don't perform offset translation to bypass it via a direct raft reference.
Updating consumer groups to use conditional replication to prevent a situation where, after a check, leadership jumps away, something invalidates the check, and leadership jumps back just in time for the post-check replication:

check condition
  leadership goes to a new node
  the new node replicates something which invalidates the condition
  leadership jumps back
the original node successfully replicates, assuming the condition still holds

Switched to a conditional replicate to fix the problem. When a group manager detects a leadership change, it replays the group's records to reconstruct the group's state. We cache the current term in the state and use it as a condition on replicate. This way we know that if leadership bounces, the replication won't pass.
We're going to mix raft and kafka offsets in the same class; since both offsets use the same underlying type, it's easy to make an error and treat one as the other. Introducing a kafka offset type to rely on the type system to prevent such errors.
Shifting offset translation down the abstraction stack to eventually reach rm_stm.
Preparing rm_stm to use a kafka::offset based seq-offset cache. Right now it uses raft offsets, but there is a problem with that: once the cache items become older than the head of the log (eviction), Redpanda becomes unable to use offset translation, so we need to store already translated offsets.

Since the cache is persisted as part of the snapshot, we need to change the on-disk format and provide backward compatibility. The change is split into two commits. The current commit introduces types to represent the old format (seq_cache_entry_v1 and tx_snapshot_v1) and adds compatibility machinery to convert the old snapshot (tx_snapshot_v1) to the new snapshot (tx_snapshot).

The follow-up commit updates the default types to use the new format and updates the mapping between the old and default types.
Switching the seq cache to store kafka offsets to avoid out-of-range errors when translating offsets beyond the eviction point.
Manually validated the tests by tweaking active_snapshot_version() to ignore the feature manager and always use the newest version, and checked that the tests fail in that case.
@rystsov (Contributor, Author) commented Jul 12, 2022

The CI failures aren't caused by this PR; see #5437.


Successfully merging this pull request may close these issues.

Idempotency doesn't prevent duplicates
5 participants