Fix data loss consistency violation #6019
PartitionBalancerTest.test_full_nodes - #5884
        help='Path to the log desired to be analyzed')
    parser.add_argument(
        '--path',
        type=str,
If you'd like to integrate more tightly with argument parsing, you can change type=str to type=validate_path and change validate_path to:

    # Imports this snippet relies on.
    import os
    from argparse import ArgumentTypeError
    from os.path import join

    def validate_path(path):
        # argparse reports ArgumentTypeError as a normal usage error.
        if not os.path.exists(path):
            raise ArgumentTypeError(f"Path doesn't exist {path}")
        controller = join(path, "redpanda", "controller")
        if not os.path.exists(controller):
            raise ArgumentTypeError(f"Each redpanda data dir should have controller piece but {controller} isn't found")
        return path
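For illustration, here is a minimal sketch of how the suggested validate_path could be wired into argparse through the type= parameter. The build_parser helper, the parser description, and the help string for --path are hypothetical and are not taken from this PR; only validate_path follows the suggestion above.

```python
import os
import tempfile
from argparse import ArgumentParser, ArgumentTypeError
from os.path import join


def validate_path(path):
    # Same checks as in the review suggestion above.
    if not os.path.exists(path):
        raise ArgumentTypeError(f"Path doesn't exist {path}")
    controller = join(path, "redpanda", "controller")
    if not os.path.exists(controller):
        raise ArgumentTypeError(
            f"Each redpanda data dir should have controller piece but {controller} isn't found")
    return path


def build_parser():
    # Hypothetical parser; the description and help text are assumptions.
    parser = ArgumentParser(description="Toy parser for a redpanda data dir")
    parser.add_argument('--path',
                        type=validate_path,
                        help='Path to a redpanda data dir (hypothetical help text)')
    return parser


if __name__ == "__main__":
    # Build a throwaway directory that satisfies both checks.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(join(root, "redpanda", "controller"))
        args = build_parser().parse_args(["--path", root])
        print("validated:", args.path)
```

With type=validate_path, argparse turns an ArgumentTypeError into its usual usage error and exits with status 2, so the script needs no separate validation step after parse_args.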
The Kafka API doesn't have an explicit begin txn API: when a transaction coordinator receives the first add_partition or add_group, it starts a transaction. Also, Redpanda defers disk flushes to the commit moment. The combination of those two things causes a problem:

1. a client issues add_partition to the txn coordinator
2. the txn coordinator starts a transaction (this is in-memory state)
3. the client writes a message to a data partition
4. redpanda treats the write as acks=1 and acks the request before the replication is finished (this is safe because on commit redpanda checks that all pending replication is done)
5. the txn coordinator and the data partition experience re-election, and the in-memory state is lost
6. the client issues add_group to the txn coordinator
7. the txn coordinator is unaware of the ongoing txn and starts a new transaction
8. the client commits the txn and only the consumer group change is written

Since the data record was written with acks=1 just before re-election, it gets lost, and the client never finds out that something went wrong with the transaction.

The fix is to write to the txn coordinator's log when a transaction starts; then, when a crash-induced re-election happens, the new txn coordinator has an opportunity to detect the ongoing txn and fail it.
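The sequence described above can be sketched as a toy model. This is not Redpanda's implementation: ToyCoordinator, TxFailed, run_scenario, and the persist_begin flag are hypothetical names introduced only to illustrate why persisting a begin marker lets a newly elected coordinator detect and fail an ongoing transaction.

```python
class TxFailed(Exception):
    """Raised when the toy coordinator rejects a request for a failed txn."""


class ToyCoordinator:
    def __init__(self, persist_begin):
        self.persist_begin = persist_begin  # True models the fix
        self.log = []         # durable state: survives re-election
        self.ongoing = set()  # in-memory state: lost on re-election
        self.failed = set()   # rebuilt from the log after re-election

    def _ensure_started(self, tx_id):
        if tx_id in self.failed:
            raise TxFailed(tx_id)
        if tx_id not in self.ongoing:
            # The first add_partition / add_group implicitly begins the txn.
            if self.persist_begin:
                self.log.append(("begin", tx_id))
            self.ongoing.add(tx_id)

    def add_partition(self, tx_id):
        self._ensure_started(tx_id)

    def add_group(self, tx_id):
        self._ensure_started(tx_id)

    def commit(self, tx_id):
        if tx_id in self.failed:
            raise TxFailed(tx_id)
        self.log.append(("commit", tx_id))
        self.ongoing.discard(tx_id)

    def re_elect(self):
        # The new leader only sees what reached the log.
        self.ongoing = set()
        finished = {tx for kind, tx in self.log if kind == "commit"}
        self.failed = {tx for kind, tx in self.log
                       if kind == "begin" and tx not in finished}


def run_scenario(persist_begin):
    coordinator = ToyCoordinator(persist_begin)
    unreplicated = []                   # acks=1 writes not yet replicated
    coordinator.add_partition("tx1")    # steps 1-2
    unreplicated.append("record")       # steps 3-4
    coordinator.re_elect()              # step 5: coordinator loses memory
    unreplicated.clear()                # step 5: data partition loses the write
    try:
        coordinator.add_group("tx1")    # steps 6-7
        coordinator.commit("tx1")       # step 8
    except TxFailed:
        return "failed"
    return "committed, record lost"


if __name__ == "__main__":
    print("without begin marker:", run_scenario(persist_begin=False))
    print("with begin marker:   ", run_scenario(persist_begin=True))
```

Without the begin marker, the toy commit succeeds even though the record is gone; with it, the re-elected coordinator finds an unfinished begin entry in its log and rejects the follow-up request, so the client learns that the transaction failed.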
The checks didn't handle empty dirs well.
new changes lgtm.
🎾 🎾 🎾 🎾 🎾 🎾 🎾 🎾 🎾
SIPartitionMovementTest.test_shadow_indexing - #4702
Cover letter
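The Kafka API doesn't have an explicit begin txn API: when a transaction coordinator receives the first add_partition or add_group, it starts a transaction. Also, Redpanda defers disk flushes to the commit moment. The combination of those two things causes a problem:

1. a client issues add_partition to the txn coordinator
2. the txn coordinator starts a transaction (this is in-memory state)
3. the client writes a message to a data partition
4. redpanda treats the write as acks=1 and acks the request before the replication is finished (this is safe because on commit redpanda checks that all pending replication is done)
5. the txn coordinator and the data partition experience re-election, and the in-memory state is lost
6. the client issues add_group to the txn coordinator
7. the txn coordinator is unaware of the ongoing txn and starts a new transaction
8. the client commits the txn and only the consumer group change is written

Since the data record was written with acks=1 just before re-election, it gets lost, and the client never finds out that something went wrong with the transaction.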
Fixes #6018
Backport Required
UX changes
Release notes
Bug Fixes