
storage: assertion failure in offset_translator_state.cc #4466

Closed
dotnwat opened this issue Apr 28, 2022 · 11 comments · Fixed by #4493

@dotnwat (Member) commented Apr 28, 2022

Version: v21.11.10

ERROR 2022-04-26 19:57:36,136 [shard 0] assert - Assert failure: (../../../src/v/storage/offset_translator_state.cc:194) 'base_offset > last_offset && base_offset < offset' ntp {kafka/topic/1}: inconsistent add_absolute_delta (offset 767, delta 20), but last_offset: 761, last_delta: 33

Do we have any more information or context about this failure, @bpraseed?

@dotnwat added the kind/bug label Apr 28, 2022
@ztlpn (Contributor) commented Apr 28, 2022

offset_translator_state::add_absolute_delta is used in shadow indexing when reading from remote segments. For each remote segment, the offset delta is stored in the manifest metadata, so this error is possible when those offset deltas are inconsistent.
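
For reference, a minimal sketch of the invariant that fires here, assuming a simplified state that only tracks the last recorded (offset, delta) pair and using a guessed formula for base_offset (the real offset_translator_state presumably keeps richer state). Plugging in the values from the assertion above (offset 767, delta 20, last_offset 761, last_delta 33) violates the check, since the new delta is smaller than the last one:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical, simplified model of the state behind the assertion; not the
// actual Redpanda implementation.
struct translator_state {
    int64_t last_offset = -1; // highest log offset we already have a delta for
    int64_t last_delta = 0;   // non-data batches accounted for up to last_offset

    // Record that `delta` non-data batches precede log offset `offset`
    // (both values come from the remote segment's manifest metadata).
    void add_absolute_delta(int64_t offset, int64_t delta) {
        // The gap of newly reported non-data batches would have to start at
        // base_offset, strictly between what we already know and `offset`.
        const int64_t base_offset = offset - (delta - last_delta);
        if (!(base_offset > last_offset && base_offset < offset)) {
            // The real code vasserts; throwing keeps the sketch self-contained.
            throw std::logic_error(
              "inconsistent add_absolute_delta (offset " + std::to_string(offset)
              + ", delta " + std::to_string(delta) + "), but last_offset: "
              + std::to_string(last_offset) + ", last_delta: "
              + std::to_string(last_delta));
        }
        last_offset = offset;
        last_delta = delta;
    }
};
```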

One possible way I can think of for this to happen is shadow indexing data from different topic instances getting mixed together (e.g. a topic was recreated in a different Redpanda instance that reused the same S3 bucket, and the revision numbers coincided by bad luck).

@piyushredpanda (Contributor)

This was reproduced with 22.1.1 as well.

@dotnwat (Member, Author) commented Apr 28, 2022

> This was reproduced with 22.1.1 as well.

@piyushredpanda do you have a reference to the instance of this occurring in 22.1.1?

@dotnwat added this to the v22.1.1 milestone Apr 28, 2022
@piyushredpanda (Contributor)

@VadimPlh do you mind adding the details? cc: @dotnwat

@ztlpn (Contributor) commented Apr 28, 2022

OK, I think I found it.

The problem is at the intersection of partition movement, offset translation and shadow indexing.

  1. ntp kafka/topic-swiubxneva/0 got moved from {node_id: 1, shard: 1} to {node_id: 1, shard: 3}. Crucially, the new partition placement is on the same node.
  2. Offset translator state is not found in the kvstore and gets initialized from the configuration manager state: INFO 2022-04-28 12:01:56,259 [shard 3] offset_translator - ntp: {kafka/topic-swiubxneva/0} - offset_translator.cc:131 - offset translation kvstore state not found, loading from provided bootstrap state
  3. Because the offset delta from the configuration manager doesn't take archival metadata batches into account, the offset translator state gets clobbered (see the worked example after this list).
  4. Later, a segment with an incorrect offset delta gets uploaded.
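
To make step 3 concrete, here is a hypothetical worked example (the numbers are chosen to echo the assertion message above, not taken from the incident logs). The translator's delta counts non-data batches, i.e. raft configuration batches plus archival metadata batches, below a given log offset, and a Kafka offset is the log offset minus that delta; a bootstrap that only sees the configuration manager's batches undercounts it:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    int64_t log_offset = 767;            // redpanda (raft) offset
    int64_t config_batches = 20;         // known to the configuration manager
    int64_t archival_meta_batches = 13;  // invisible to the configuration manager

    // Correct translation counts every non-data batch.
    int64_t correct_delta = config_batches + archival_meta_batches; // 33
    // After the cross-core move the state was re-bootstrapped from the
    // configuration manager only, so the archival batches were dropped.
    int64_t clobbered_delta = config_batches; // 20

    std::cout << "correct kafka offset:   " << log_offset - correct_delta << "\n"    // 734
              << "clobbered kafka offset: " << log_offset - clobbered_delta << "\n"; // 747
}
```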

@dotnwat (Member, Author) commented Apr 29, 2022

> partition movement, offset translation and shadow indexing.

😅 just a few minor subsystems

@dotnwat (Member, Author) commented Apr 29, 2022

Nice find. That looks like it took a lot of digging.

> Crucially, the new partition placement is on the same node.

Given that we try to keep state partitioned and isolated on each core, does this suggest that something global to the node is involved? Or rather, reading the rest of the bullet points, I'm not sure which part is tied to the fact that the movement stays on the same node.

Do you think a reproducer in ducktape is feasible? Do we have insight yet into what the fix may look like?

@ztlpn (Contributor) commented Apr 29, 2022

> Given that we try to keep state partitioned and isolated on each core, does this suggest that something global to the node is involved?

This global thing is the log itself. When partition movement is cross-node, we have to download the log via recovery, and the offset translator state gets rebuilt correctly. When the movement is cross-core, on the other hand, the log stays in the same place but we have to move the raft kvstore state. Looks like in this case we forgot to move the offset translator bits.
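
A self-contained sketch of the failure mode being described, with made-up key names and helper functions; the only detail borrowed from the fix (#4493, whose commit message is quoted later in this thread) is that raft::details::move_persistent_state ended up being called twice due to a typo, presumably in place of the call that should have moved the offset translator's kvstore keys:

```cpp
#include <iostream>
#include <map>
#include <string>

// All names and the key layout here are illustrative only.
using kvstore = std::map<std::string, std::string>;

// Move every key with the given prefix from one shard's kvstore to another's.
void move_keys(kvstore& from, kvstore& to, const std::string& prefix) {
    for (auto it = from.begin(); it != from.end();) {
        if (it->first.rfind(prefix, 0) == 0) {
            to.insert(*it);
            it = from.erase(it);
        } else {
            ++it;
        }
    }
}

int main() {
    kvstore source_shard{
      {"raft/voted_for", "node-1"},
      {"raft/snapshot", "..."},
      {"offset_translator/state", "last_offset=761,last_delta=33"},
    };
    kvstore target_shard;

    // Intended cross-core move: both the raft state and the offset
    // translator state should follow the partition to the new shard.
    move_keys(source_shard, target_shard, "raft/");
    // The typo (per the fix commit): the raft move was effectively repeated
    // instead of moving the offset translator keys, so the new shard found
    // no translator state and re-bootstrapped from the configuration manager.
    move_keys(source_shard, target_shard, "raft/"); // should have been "offset_translator/"

    std::cout << "translator state moved: "
              << target_shard.count("offset_translator/state") << "\n"; // prints 0
}
```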

> Do you think a reproducer in ducktape is feasible?

Sure, it should be easily reproducible.

@dotnwat (Member, Author) commented Apr 29, 2022

> When partition movement is cross-node, we have to download the log via recovery and the offset translator state gets rebuilt correctly.

Oh, of course. Thanks. That's a really clear explanation.

ztlpn added a commit to ztlpn/redpanda that referenced this issue Apr 29, 2022
Previously due to a typo raft::details::move_persistent_state was called
twice. Fixes redpanda-data#4466
@ztlpn (Contributor) commented Apr 29, 2022

/backport v22.1.x

@ztlpn (Contributor) commented Apr 29, 2022

/backport v21.11.x

ztlpn added a commit to ztlpn/redpanda that referenced this issue Apr 29, 2022
Previously due to a typo raft::details::move_persistent_state was called
twice. Fixes redpanda-data#4466

(cherry picked from commit 3152509)
ztlpn added a commit to ztlpn/redpanda that referenced this issue Apr 29, 2022
Previously due to a typo raft::details::move_persistent_state was called
twice. Fixes redpanda-data#4466

(cherry picked from commit 3152509)
abhijat pushed a commit to abhijat/redpanda that referenced this issue May 20, 2022
Previously due to a typo raft::details::move_persistent_state was called
twice. Fixes redpanda-data#4466
This issue was closed.