Currently offset_translator::maybe_checkpoint uses a fixed 64MiB threshold to decide whether to write a checkpoint.
For example, with 40k partitions on a node with 2TB of storage, no partition can ever hold more than 50MB (2TB / 40k), which is below the 64MiB threshold, so a checkpoint is never written. This results in a large amount of read IO on restart: currently ManyPartitionsTest has to wait as long for node startup as it spent writing the data in the first place (many minutes).
We could make this configurable, similar to the falloc step size.
We could use our knowledge of the disk size and partition count to dynamically select a size that makes sure we are never reading more than a certain fraction of the disk size on startup.
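A minimal sketch of this option, assuming C++17. The function `select_checkpoint_threshold` and all of its parameters are hypothetical and do not exist in redpanda; the idea is simply to divide a replay budget (a percentage of the disk) evenly across partitions and clamp the result to sane bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not an existing redpanda function.
// Picks a per-partition checkpoint threshold so that, if every partition sits
// just below its threshold, the total un-checkpointed data stays within
// `max_replay_percent` of the disk (unless the lower clamp kicks in).
inline uint64_t select_checkpoint_threshold(
  uint64_t disk_bytes,
  uint64_t partition_count,
  uint64_t max_replay_percent,
  uint64_t min_threshold,
  uint64_t max_threshold) {
    assert(partition_count > 0);
    assert(max_replay_percent > 0 && max_replay_percent <= 100);
    assert(min_threshold <= max_threshold);

    // Total bytes we are willing to replay on startup.
    const uint64_t replay_budget = disk_bytes / 100 * max_replay_percent;

    // Even share of the budget per partition, clamped to the allowed range.
    const uint64_t per_partition = replay_budget / partition_count;
    return std::clamp(per_partition, min_threshold, max_threshold);
}
```

With the numbers from this issue (2TB disk, 40k partitions) and an assumed 5% budget, this yields a 2.5MB threshold per partition instead of 64MiB. Note that the lower clamp means the bound can be exceeded at extreme partition counts, so the minimum would need to be chosen with care.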
We could keep a global count of the number of un-checkpointed bytes across all partitions, and trigger checkpoints based on that -- this would be the most direct way of bounding the amount of data that redpanda has to replay on startup, at the cost of more coordination.
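A rough sketch of the global-count option, again assuming C++17. The class `uncheckpointed_tracker` and its methods are hypothetical, and the sketch is single-threaded for brevity; a real version would need the cross-shard coordination mentioned above. It tracks un-checkpointed bytes per partition plus a running total, and when the total exceeds a limit it picks the partitions with the largest backlog to checkpoint first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical tracker, not part of redpanda. Single-threaded sketch.
class uncheckpointed_tracker {
public:
    explicit uncheckpointed_tracker(uint64_t global_limit)
      : _limit(global_limit) {}

    // Record bytes appended to a partition since its last checkpoint.
    void on_append(int partition, uint64_t bytes) {
        _dirty[partition] += bytes;
        _total += bytes;
    }

    // Record that a partition has written a checkpoint.
    void on_checkpoint(int partition) {
        auto it = _dirty.find(partition);
        if (it != _dirty.end()) {
            assert(_total >= it->second);
            _total -= it->second;
            _dirty.erase(it);
        }
    }

    uint64_t total() const { return _total; }

    // Partitions to checkpoint, largest backlog first, stopping as soon as
    // the remaining un-checkpointed total is at or below the limit.
    std::vector<int> pick_partitions_to_checkpoint() const {
        std::vector<std::pair<uint64_t, int>> by_size;
        for (const auto& [partition, bytes] : _dirty) {
            by_size.emplace_back(bytes, partition);
        }
        std::sort(by_size.begin(), by_size.end(), std::greater<>());

        std::vector<int> out;
        uint64_t remaining = _total;
        for (const auto& [bytes, partition] : by_size) {
            if (remaining <= _limit) {
                break;
            }
            out.push_back(partition);
            remaining -= bytes;
        }
        return out;
    }

private:
    uint64_t _limit;
    uint64_t _total = 0;
    std::unordered_map<int, uint64_t> _dirty;
};
```

Checkpointing the largest backlogs first keeps the number of checkpoint writes low for a given reduction in replay size, but other policies (oldest first, round robin) would fit the same interface.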
It may be that the solution to this can also be used to drive a dynamic falloc step size (this has a similar issue where the default 32MiB threshold doesn't make much sense for systems with huge partition counts).