
Fix duplicates consistency error by caching already translated offsets #5356

Merged (13 commits, Jul 12, 2022)

Conversation

@rystsov (Contributor) commented Jul 6, 2022

Cover letter

Redpanda idempotency works by caching the mapping between sequence numbers and the corresponding offsets. Currently we cache the raft offset and rely on the offset translator to turn it into a kafka offset. Unfortunately, the offset translator can't translate offsets of already evicted segments, while the cache may outlive the eviction.

We fix the problem by caching the already translated offsets.

The failing offset translator caused unknown_server_error, which triggered a client bug resulting in a consistency violation.

Fixes #5355

Release notes

@mmedenjak added the kind/bug label (Jul 6, 2022)
@mmaslankaprv (Member) commented:
The main issue I see with this is the inconsistent use of the newly introduced kafka_offset. I think that after introducing the kafka_offset abstraction we should use it consistently in all the layers where we deal with offsets that are translated to the kafka layer.

@mmaslankaprv (Member) commented:
Also, I think it would be better to keep the partition interface consistent in terms of returned offsets, i.e. if we decide to return kafka offsets from partition, all of its interfaces should return kafka offsets (we may expose raft and log interfaces for access to raw, untranslated offsets).

@mmaslankaprv (Member) commented:
Another idea I have is to not return the kafka offset directly but to cache enough information to make offset translation possible. That way, when translation is impossible using the offset translator, we may fall back to the rm_stm cache to perform it. This approach would not require changing the place where translation takes place.

Review thread on src/v/cluster/types.h (outdated, resolved)
@rystsov (Contributor, Author) commented Jul 6, 2022

Another idea that i have is not to return kafka offset directly but cache enough information to make offset translation possible this way when translation is impossible using offsets_translator we may fallback to rm_stm cache to perform translation. This approach would not require changing the place where translation take place

I don't think I understand the idea. It sounds too complex:

  • creating a parallel offset translator
  • caching not only the offset but extra information
  • implementing a fallback mechanism

@rystsov (Contributor, Author) commented Jul 6, 2022

Also i think it would be better to keep the partition interface consistent when it comes to returned offsets. i.e. if we decide to return kafka offsets from partition all of the interfaces should return kafka offsets. (we may return raft and log to access to raw not translated offsets).

I'll take a look in the context of fixing the unit tests, but I deliberately introduced kafka_offset to avoid the confusion of mixing offsets in the same class and to reduce the scope of this PR.

Review threads (resolved):
  • src/v/cluster/rm_stm.cc (outdated)
  • src/v/kafka/server/handlers/produce.cc
  • src/v/cluster/partition.cc (two threads)
  • src/v/kafka/server/group.cc
  • src/v/cluster/types.h (outdated)
  • src/v/kafka/server/replicated_partition.cc (two threads)
  • src/v/cluster/rm_stm.cc (two threads)
@rystsov force-pushed the seq-kafka-offset-cache branch 2 times, most recently from d121aee to c23ea21 (July 7, 2022 00:03)
@rystsov added and @vbotbuildovich removed the ci-repeat-5 label (repeat tests 5x concurrently to check for flaky tests; self-cancelling), twice, on Jul 8, 2022
The Kafka client doesn't process unknown_server_error correctly, and it may lead to duplicates, violating idempotency. See the following issue for more info: https://issues.apache.org/jira/browse/KAFKA-14034

request_timed_out, just like unknown_server_error, means that the true outcome of the operation is unknown; unlike unknown_server_error, it doesn't trigger the client bug, so we switch to it to avoid the problem.
Update all partition::replicate dependees which don't perform offset translation to bypass it via a direct raft reference.
Updating consumer groups to use conditional replication to prevent a situation where, after a check, leadership jumps away, something invalidates the check, and leadership jumps back just in time for the post-check replication:

check condition
  leadership goes to a new node
  the new node replicates something which invalidates the condition
  leadership jumps back
the original node successfully replicates, assuming the condition still holds

Switched to a conditional replicate to fix the problem. When a group manager detects a leadership change, it replays the group's records to reconstruct the group's state. We cache the current term in the state and use it as a condition on replicate. This way we know that if leadership bounces, the replication won't pass.
We're going to mix raft and kafka offsets in the same class; since both offsets use the same underlying type, it's easy to make an error and treat one as the other. Introducing a kafka offset type to rely on the type system to prevent such errors.
Shifting offset translation down the abstraction stack to eventually reach rm_stm.
Preparing rm_stm to use a kafka::offset based seq-offset cache. Right now it uses raft offsets, but there is a problem with that: once the cache items become older than the head of the log (eviction), Redpanda becomes unable to use offset translation, so we need to store already translated offsets.

Since the cache is persisted as part of the snapshot, we need to change the on-disk format and provide backward compatibility. The change is split into two commits. The current commit introduces types to represent the old format (seq_cache_entry_v1 and tx_snapshot_v1) and adds compatibility machinery to convert the old snapshot (tx_snapshot_v1) to the new snapshot (tx_snapshot).

The follow-up commit updates the default types to use the new format and updates the mapping between the old and default types.
Switching the seq cache to store kafka offsets to avoid out-of-range errors when translating offsets beyond the eviction point.
Manually validated the tests by tweaking active_snapshot_version() to ignore the feature manager and always use the newest version, and checked that the tests fail in that case.
@rystsov (Contributor, Author) commented Jul 12, 2022

The CI failures aren't caused by this PR; see #5437.


Successfully merging this pull request may close these issues.

Idempotency doesn't prevent duplicates
5 participants