Martin Pool edited this page Jun 22, 2021 · 19 revisions

Conserve Design/Feature Ideas

The features below are grouped by theme, in only rough priority order.

Development proceeds a bit sporadically depending on developer time.

In general Conserve's tip tree should always be usable, and we will make a new release in any month in which there has been significant development.

Diff

Perhaps run a named command (like diff) on files that differ.

It'd be really nice to account for which changes might have been due to later changes in the source:

  • The files differ and the file in the source is newer.

  • The file is missing from the source and was potentially deleted since the backup was created.
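The classification above could be sketched roughly as follows; the enum and function names are illustrative, not Conserve's actual API, and mtime comparison is only a heuristic:

```rust
use std::time::SystemTime;

/// Hypothetical classification of one diff entry, taking mtimes into account.
#[derive(Debug, PartialEq)]
enum DiffKind {
    /// Content differs and the source copy is newer than the backup.
    ChangedLater,
    /// Content differs but the mtimes don't explain the change.
    Changed,
    /// Missing from the source; possibly deleted since the backup was made.
    MaybeDeleted,
    Unchanged,
}

fn classify(
    in_source: bool,
    content_differs: bool,
    source_mtime: Option<SystemTime>,
    backup_mtime: SystemTime,
) -> DiffKind {
    match (in_source, content_differs) {
        (false, _) => DiffKind::MaybeDeleted,
        (true, false) => DiffKind::Unchanged,
        (true, true) => {
            if source_mtime.map_or(false, |m| m > backup_mtime) {
                DiffKind::ChangedLater
            } else {
                DiffKind::Changed
            }
        }
    }
}
```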

Index format

These changes should be possible without making a new major archive format, although they will create new bands that can't be read by old versions.

  • Try using Rust Prost to write out protobufs.

    • If done using derive macros, the build process should still be reasonably clean.

    • Allows storing small files inline as bytes.

    • Consider also dictionary-compressing hashes within each index hunk rather than relying on byte compression to do this.

  • Alternatively, try CBOR, though it's probably less efficient than binary protobufs. This would allow storing tiny files inline in the index, and should make the uncompressed form smaller: in particular, block references can be binary rather than hex, so they will be half the size.

    • Measure the size and time impact of this change.

    • This probably requires a translation between the in-memory index entry and the serialized object.

  • Consider compressing the index with zstd rather than Snappy.

    Last time I tried this the results were actually not so impressive, but we could try again.
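The dictionary-compression idea could look something like this sketch: within one hunk, each distinct block hash is stored once in a per-hunk table, and entries refer to hashes by a small integer. Names and types here are illustrative only:

```rust
use std::collections::HashMap;

/// Per-hunk hash dictionary: each distinct block hash is stored once, and
/// index entries carry a small index into the dictionary instead of repeating
/// the full hash. Returns (dictionary, references).
fn dictionary_compress(hashes: &[Vec<u8>]) -> (Vec<Vec<u8>>, Vec<u32>) {
    let mut dict: Vec<Vec<u8>> = Vec::new();
    let mut seen: HashMap<Vec<u8>, u32> = HashMap::new();
    let mut refs = Vec::with_capacity(hashes.len());
    for h in hashes {
        let idx = *seen.entry(h.clone()).or_insert_with(|| {
            dict.push(h.clone());
            (dict.len() - 1) as u32
        });
        refs.push(idx);
    }
    (dict, refs)
}
```

Serialized, this costs one full hash per distinct block plus a small varint per reference, instead of one full hash per reference, which should help when large files reuse the same blocks across versions.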

Tiny files inline

https://github.com/sourcefrog/conserve/issues/154

  • Store tiny files in the index rather than in blocks. For tiny files (for example 10 bytes) it's actually smaller to store the content than a reference to the block, and it avoids separate IO to the block store.

    This depends on having an index that can efficiently include binary content.
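One way to model this in the index is a content enum, sketched below with illustrative names; the threshold is a guess at roughly where inline content becomes smaller than a block reference:

```rust
/// Sketch: a file's content is either stored inline in the index entry
/// (tiny files) or as references to blocks in the blockdir.
enum FileContent {
    /// Content stored directly in the index entry.
    Inline(Vec<u8>),
    /// Binary block hashes (half the size of hex references).
    Blocks(Vec<[u8; 32]>),
}

/// Illustrative cutoff: below this, inlining is likely smaller than a
/// block reference plus the separate IO to the block store.
const INLINE_THRESHOLD: usize = 100;

fn content_for(data: &[u8], store_block: impl Fn(&[u8]) -> [u8; 32]) -> FileContent {
    if data.len() <= INLINE_THRESHOLD {
        FileContent::Inline(data.to_vec())
    } else {
        FileContent::Blocks(vec![store_block(data)])
    }
}
```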

Store and restore more metadata

  • Store permissions

  • Store and restore ownership.

    • In many cases the ownership will be all the same. Perhaps it can be omitted from the index.

    • Perhaps there should be an option to request storing ownership: for many cases it doesn't matter so much.

Being unable to set the group or owner should, by default, be only a warning rather than an error.

Backup with O_NOATIME?

In-memory cache of blockdir info

https://github.com/sourcefrog/conserve/issues/106

In some cases backups spend a lot of time in stat on the blockdir. This could be helped by keeping an in-memory cache of which blocks and block prefix subdirs are present.
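A minimal sketch of such a cache, with a hypothetical API where `stat_block` stands in for the real filesystem check; blocks observed present (or written by us) are never stat'd again:

```rust
use std::collections::HashSet;

/// In-memory cache of block hashes known to exist in the blockdir, so
/// repeated presence checks can skip `stat`.
struct BlockPresenceCache {
    known_present: HashSet<String>,
}

impl BlockPresenceCache {
    fn new() -> Self {
        BlockPresenceCache { known_present: HashSet::new() }
    }

    /// True if the block is present; consults the filesystem only on a miss.
    fn contains(&mut self, hash: &str, stat_block: impl Fn(&str) -> bool) -> bool {
        if self.known_present.contains(hash) {
            return true;
        }
        let present = stat_block(hash);
        if present {
            self.known_present.insert(hash.to_owned());
        }
        present
    }

    /// Record a block we just wrote ourselves.
    fn note_written(&mut self, hash: &str) {
        self.known_present.insert(hash.to_owned());
    }
}
```

Negative results are deliberately not cached here: another block may be written concurrently, so absence is only trustworthy at the moment it's observed.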

Avoid re-reading block data

During restore we will read data from combined blocks multiple times, and at the moment they're read and decompressed each time. This is pretty inefficient.

One way to help is to keep a cache, maybe an adaptive replacement cache, of data blocks in memory and read out from that.

Also, we could look ahead through the index hunk and see all the blocks that are needed for the files that are to be restored. We don't need to be "surprised" to see something is needed again when we can see arbitrarily far forward which ones will be needed. Perhaps this can turn into cache hints.

In both of these there needs to be some reasonable cap, either fixed size or perhaps a fraction of system memory.
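A byte-capped cache of decompressed block data might look like the sketch below; it uses plain LRU eviction for simplicity, where an adaptive replacement cache would be smarter about scan-like access patterns:

```rust
use std::collections::{HashMap, VecDeque};

/// Size-capped cache of decompressed block data, keyed by hash.
/// Simple LRU eviction; front of `order` is least recently used.
struct BlockCache {
    map: HashMap<String, Vec<u8>>,
    order: VecDeque<String>,
    max_bytes: usize,
    cur_bytes: usize,
}

impl BlockCache {
    fn new(max_bytes: usize) -> Self {
        BlockCache { map: HashMap::new(), order: VecDeque::new(), max_bytes, cur_bytes: 0 }
    }

    fn get(&mut self, hash: &str) -> Option<&Vec<u8>> {
        if self.map.contains_key(hash) {
            // Move to the back: most recently used.
            if let Some(pos) = self.order.iter().position(|h| h == hash) {
                let h = self.order.remove(pos).unwrap();
                self.order.push_back(h);
            }
        }
        self.map.get(hash)
    }

    fn put(&mut self, hash: String, data: Vec<u8>) {
        self.cur_bytes += data.len();
        self.map.insert(hash.clone(), data);
        self.order.push_back(hash);
        // Evict least-recently-used blocks until under the cap.
        while self.cur_bytes > self.max_bytes {
            match self.order.pop_front() {
                Some(old) => {
                    if let Some(d) = self.map.remove(&old) {
                        self.cur_bytes -= d.len();
                    }
                }
                None => break,
            }
        }
    }
}
```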

Delta indexes

Incremental backups still write a full copy of the index, listing all the entries in the current tree. This in practice seems to work pretty reasonably, with an index only about 1/1000th the size of the tree. (For each file there's about 100 bytes in the name and block references.)

(I used to think this would be very important, but experience seems to show it's not so much.)

We could add a concept of higher-tier versions, that record only files stored since a basis index.

  1. An index concept of a whiteout.

  2. A tree reader that reads several indexes in parallel and merges them. (Something much like this will be needed to read incomplete trees.)

  3. A tree writer that notices only the differences versus the parent tree, and records them, including whiteouts.

It seems like we'd need some heuristic for when to make a delta rather than full index. One possibility is to look at the length of the previous delta index: if it's getting too long (perhaps 1/4 of the full index?) then just store a full index.
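Step 2's merged view could be sketched as below, with `None` as a whiteout; a real reader would stream sorted index hunks in parallel rather than materialize whole maps, and the entry type here is just a stand-in string:

```rust
use std::collections::BTreeMap;

/// Merge a full parent index with a delta index. A delta value of `None`
/// is a whiteout: the file was deleted relative to the parent.
fn merge_indexes(
    parent: &BTreeMap<String, String>,        // apath -> entry (stand-in)
    delta: &BTreeMap<String, Option<String>>, // apath -> entry or whiteout
) -> BTreeMap<String, String> {
    let mut merged = parent.clone();
    for (apath, change) in delta {
        match change {
            Some(entry) => {
                merged.insert(apath.clone(), entry.clone());
            }
            None => {
                merged.remove(apath); // whiteout hides the parent entry
            }
        }
    }
    merged
}
```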

More validation

Validation checks some invariants of the format, to catch either bugs or issues originating in the environment, like disk corruption.

Perhaps more of the work here is in creating tests that construct variously broken archives and then validate them: positive test cases for validation.

What bugs are actually plausible? What failures could be caused by interruption or machine crash or other likely underlying failures?

How much is this similar to just doing a restore and throwing away the results?

  • For the archive
    • [done] No unexpected directories or files
    • [done] All band directories are in the canonical format
  • For every band
    • The index block numbers are contiguous and correctly formatted
    • No unexpected files or directories
  • For every entry in the index:
    • Filenames are in order (and without duplicates)
    • Filenames don't contain / or . or ..
    • The referenced blocks exist
    • (Deep only) The blocks can be extracted and they reconstitute the expected hash
  • For the blockdir:
    • No unexpected top-level files or directories
    • Every prefix subdirectory is a hex prefix of the right length
    • Every file inside a prefix subdirectory matches the prefix
    • There are no unexpected files or directories inside prefix subdirectories
    • No zero-byte files
    • No temporary files
  • For every block in the blockdir:
    • [done] The hash of the block is what the name says.
    • Every block is referenced by at least one index

Validation should report (and gc could clean up) any old leftover temporary files.
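Two of the blockdir layout checks above can be sketched as predicates; the prefix length of 3 is an assumption here, so check it against the actual blockdir constant:

```rust
/// Assumed prefix-subdirectory length; illustrative, not Conserve's constant.
const PREFIX_LEN: usize = 3;

/// A prefix subdirectory name must be lowercase hex of the right length.
fn valid_prefix_dir(name: &str) -> bool {
    name.len() == PREFIX_LEN
        && name.chars().all(|c| c.is_ascii_hexdigit() && !c.is_ascii_uppercase())
}

/// A block file inside a prefix subdirectory must start with that prefix
/// and be longer than the prefix itself.
fn file_matches_prefix(prefix: &str, file_name: &str) -> bool {
    file_name.starts_with(prefix) && file_name.len() > prefix.len()
}
```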

Better ignore patterns

  • Automatically ignore directories marked with CACHEDIR.TAG, using the Rust library that matches them.

Built-in profiling?

Robustness

  • Test handling of various broken archives - perhaps needs some scripts or infrastructure to construct them
  • decompression failure
  • missing block
  • bad block
  • missing index file
  • File is removed during reading of index

Testing

  • Add more unit tests for restore.
  • Interesting Unicode names? (What's interesting?)
  • Filenames that cause trouble across Windows/Unix.
  • Test performance of block storage by looking at counts: semi-white-box test of side effects
  • Filesystem wrapper to allow injecting faults
  • Detection of corrupt block:
    • Wrong hash
    • Decompression fails
  • Helper to compare trees and show diff
  • Helper for blackbox tests: show all output if something fails in the test. (Is it enough to just print output unconditionally?)
  • Rename testsupport to a separable treebuilder?

Resume interrupted backup

  • Detect there's an interrupted band
  • Look at what index blocks are already present
  • Find the last stored name from the last stored index block
  • Maybe check all the data blocks from the last index block are actually stored, to know that the interruption was safe?
  • Resume from that filename
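Finding the resume point from the stored hunks could be as simple as this sketch, which treats each hunk as a sorted list of apaths (the real entry type is elided):

```rust
/// Given the index hunks of an interrupted band, each a sorted list of
/// apaths, find the name to resume from: the last entry of the last
/// non-empty hunk.
fn resume_point(hunks: &[Vec<String>]) -> Option<&String> {
    hunks.iter().rev().find(|h| !h.is_empty())?.last()
}
```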

Parallelism

  • Reading, hashing, and compressing non-small files within a group can be parallelized.

  • Alternatively, just reading non-small files into memory can be parallelized. Hashing and compressing them would still need to be serial, but that should be much cheaper.

  • Parallelize finding referenced blocks across all the hunks of a band.

Both reading and writing do a lot of CPU-intensive hashing and de/compression, and are fairly easy to parallelize.

Parallelizing within a single file is probably possible, but doing random IO within the file would be complicated, especially for non-local filesystems. Similarly, entries must be written into the index in order: they could arrive a bit out of order, but we do need to finish one chunk at a time.

However it should be easy to parallelize across multiple files, and index chunks give an obvious granularity for doing this:

  • Read a thousand filenames.
  • Compress and store all of them, generating index entries in the right order. (Or, sort the index entries if necessary.)
  • Write out the index chunk and move to the next.

It seems like it'll fit naturally on Rayon, which is great.

I do also want to combine small blocks together, which means the index entry isn't available immediately after the file is written, only when the chunk is complete. This could potentially be done on a per-thread basis.
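The three steps above can be sketched with scoped threads; in the real implementation Rayon's parallel iterators would replace the manual spawning, and the hash function here is a stand-in for BLAKE2:

```rust
use std::thread;

/// Hash one chunk of already-read file contents in parallel, collecting
/// results in input order so index entries stay sorted by filename.
fn hash_chunk(file_contents: &[Vec<u8>]) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = file_contents
            .iter()
            .map(|data| s.spawn(move || cheap_hash(data)))
            .collect();
        // Joining in spawn order preserves the original file order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

/// Stand-in for the real content hash (FNV-1a, for illustration only).
fn cheap_hash(data: &[u8]) -> u64 {
    data.iter()
        .fold(0xcbf29ce484222325u64, |h, b| (h ^ *b as u64).wrapping_mul(0x100000001b3))
}
```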

Draw progress bars from a thread

At the moment, whenever new information is passed to the progress bar, we check whether it was repainted recently, and if not, redraw it.

If there is a burst of updates and then a pause, this can leave the terminal showing an update that was not the last one, and that doesn't reflect where the process is up to.

We could instead have a thread that repaints every 500ms, and calls from the work threads only update the state.

This would also avoid contention of worker threads on the progress-bar mutex: painting to the terminal is potentially somewhat slow.

Or, alternatively, try to make sure that there just are frequent updates, and so frequent opportunities to repaint. In particular we could emit ticks while processing blocks within a large file.
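The painter-thread approach can be sketched as follows; the state fields and output format are illustrative, and a final repaint after the stop flag guarantees the terminal ends up showing the true last state:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Shared progress state; workers only update this, never paint.
struct ProgressState {
    files_done: u64,
    bytes_done: u64,
}

/// Spawn a thread that repaints every 500ms until `stop` is set, then
/// paints once more so the final state always reaches the terminal.
fn run_painter(state: Arc<Mutex<ProgressState>>, stop: Arc<AtomicBool>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while !stop.load(Ordering::Relaxed) {
            {
                let s = state.lock().unwrap();
                // Stand-in for actual terminal repainting.
                eprint!("\r{} files, {} bytes", s.files_done, s.bytes_done);
            }
            thread::sleep(Duration::from_millis(500));
        }
        let s = state.lock().unwrap();
        eprintln!("\r{} files, {} bytes", s.files_done, s.bytes_done);
    })
}
```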

Security

  • Salt the hashes to avoid DoS collision attacks, and to enable encryption. (Store the salt in the base tier? Requires version bump.)
  • Asymmetric encryption? Perhaps better to rely on the underlying storage?
  • Signing?

Remote storage

Perhaps do SFTP first. https://github.com/sourcefrog/conserve/issues/83 However, at least on Linux, SFTP can be done pretty well by sshfs.

Replicate from one archive to another

https://github.com/sourcefrog/conserve/issues/122

  • conserve replicate to copy bands from an archive without changing the content?
    • Like an ordering-aware gsutil rsync or rsync
  • Test on GCS FUSE
  • For remote or slow storage, keep a local cache of which blocks are present?

Questionable features

  • Store inode numbers and attempt to restore hard links
  • Store file types other than file/dir/symlink
  • How can we avoid every user needing to manually configure what to exclude?

  • Exclude files from future backups but don't mark them as deleted. In other words, a pattern which will be assumed to be unchanged. (Is it useful?)

Measuring archive space usage

conserve size on an archive should probably give the size of the archive by default, not the size of the last stored tree.

conserve versions --sizes won't say much useful about the version sizes any more, because most of the disk usage isn't in the band directory. Maybe we need a conserve archive describe or conserve archive measure.

We could say the total size of all blocks referenced by that version.

Perhaps it would also be good to report how many blocks are used by that version and not by any newer version.