Only in use on 64-bit systems. Use the upper 28 bits of the id of an
index entry as a bloom filter. This allows skipping the index entry
traversal most of the time if an id is not stored in the hashmap.
The bloom filter embedded in the index entry id is checked each time
before following a reference to an index entry. This further reduces
the risk of false positives. The bloom filter itself is basically for
free on modern CPUs.
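A minimal sketch of the scheme, with names and bit layout chosen for
illustration only (not restic's actual code):

    // With m=28 and k=1, a single bit in the upper 28 bits of the 64-bit
    // index entry id acts as the bloom filter for the blob ID behind it.
    const bloomShift = 64 - 28 // the lower 36 bits remain for the entry reference

    // bloomMask picks one of the 28 filter bits based on the (already random) blob ID.
    func bloomMask(id [32]byte) uint64 {
        return 1 << (bloomShift + uint64(id[0])%28)
    }

    // newEntryID embeds the bloom bit for id into the entry reference ref.
    func newEntryID(ref uint64, id [32]byte) uint64 {
        return ref | bloomMask(id)
    }

    // maybeContains is checked before following the reference: if the bloom
    // bit for id is not set, the referenced entry cannot store id.
    func maybeContains(entryID uint64, id [32]byte) bool {
        return entryID&bloomMask(id) != 0
    }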
The main performance cost of checking for unknown blobs in the index
comes from the essentially random RAM accesses for the initial bucket
lookup as well as from following the next pointer in the index entries.
With the bloom filter, most of the time only the initial bucket lookup
is necessary. This speeds up checking for unknown blobs by a factor of
5 (!), while having no effect on the lookup of known blobs:
$ benchstat no-bloom with-bloom
name                old time/op  new time/op  delta
IndexHasUnknown-16   49.0ms ± 2%   9.9ms ± 7%  -79.70%  (p=0.000 n=10+10)
IndexHasKnown-16     48.0ms ± 3%  47.9ms ± 3%     ~     (p=0.968 n=10+9)
The bloom filter parameters m=28, k=1 were derived empirically, while
also leaving sufficient room for very large repositories: with 28 bits
used for the filter, the remaining 36 bits of the id still allow
addressing 2^36 index entries. Before this commit, the final merge
index step took roughly 1 second per million index entries, so just
merging an index of that maximum size would currently take 19 hours.
It is safe to assume that such large repositories don't exist.
Comparison with other parameter sets:
$ m=28 k=1 versus m=32 k=1
name                old time/op  new time/op  delta
IndexHasUnknown-16   49.0ms ± 2%   9.7ms ±16%  -80.17%  (p=0.000 n=10+10)
IndexHasKnown-16     48.0ms ± 3%  48.4ms ± 3%     ~     (p=0.436 n=10+10)
$ m=28 k=1 versus m=24 k=1
name                old time/op  new time/op  delta
IndexHasUnknown-16   49.0ms ± 2%  10.8ms ±13%  -77.90%  (p=0.000 n=10+10)
IndexHasKnown-16     48.0ms ± 3%  47.9ms ± 3%     ~     (p=0.684 n=10+10)
$ m=28 k=1 versus m=28 k=2
name                old time/op  new time/op  delta
IndexHasUnknown-16   49.0ms ± 2%  24.9ms ± 5%  -49.27%  (p=0.000 n=10+10)
IndexHasKnown-16     48.0ms ± 3%  48.0ms ± 4%     ~     (p=1.000 n=10+10)
`k=2` outright wrecks the performance, most likely because
it performs worse on longer index entry chains, which also happen to be
the expensive ones to process.
`m=32` yields diminishing returns, while shrinking the remaining id
space to within an order of magnitude of the largest known restic
repositories.
Design alternatives:
In principle it would be possible to add a single large bloom filter
instead of embedding one in each index entry id. However, this bloom
filter would necessarily incur additional random memory accesses and
thus slow things down overall.
Do not require a full index reload if only a few additional index files
have been added. This can drastically speed up loading the index in the
mount command.
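A rough sketch of the idea, using assumed types rather than restic's
actual API:

    // Remember which index files are already loaded and only fetch new ones.
    type ID [32]byte

    type indexCache struct {
        loaded map[ID]bool
    }

    // refresh lists the index files currently present in the repository and
    // loads only those that were added since the last (re)load.
    func (c *indexCache) refresh(list func() []ID, load func(ID) error) error {
        for _, id := range list() {
            if c.loaded[id] {
                continue // already part of the in-memory index
            }
            if err := load(id); err != nil {
                return err
            }
            c.loaded[id] = true
        }
        return nil
    }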
When using `restic mount` to serve a repository via a POSIX host's file
system, all files backed up from Windows systems will show up on the
mounting host with a (hard)link count of 0.
While this is not a problem in general and most programs do not even
register this strange value, some others (such as Samba's smbd(8)) will
go the extra mile and check a file's stat() results for various
properties (such as a positive non-zero link count), and refuse to
operate if any of the reported values appear off.
Since other inode properties absent from non-POSIX backup sources are
also "faked up" (e.g., the inode number) during mount and work fine in
general, the chances of this change being in any way harmful are
probably rather slim.
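The implied fix amounts to reporting a sane default; a small sketch with
assumed names and types:

    // linkCount returns the value to report via fuse/stat(): if the backup
    // source did not record a link count (e.g. Windows), report 1 instead of
    // 0 so that strict consumers such as smbd accept the result.
    func linkCount(recordedLinks uint64) uint64 {
        if recordedLinks == 0 {
            return 1 // fake up a sane value, as is already done for inode numbers
        }
        return recordedLinks
    }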
ui: mention compressed size of added files in `backup -vv`
This is already shown for modified files, but the added files message
wasn't updated when compression was implemented in restic.
Co-authored-by: Ilya Grigoriev <ilyagr@users.noreply.github.com>
cmd/restic/cmd_rewrite.go:
introduction of include filters for this command:
- add include filters, add error checking code
- add new parameter 'keepEmptyDirectoryFunc' to 'walker.NewSnapshotSizeRewriter()',
so that empty directories can be kept where needed to keep the directory
structure intact (see the sketch after this file list)
- add parameter 'keepEmptySnapshot' to 'filterAndReplaceSnapshot()' to keep snapshots
intact when nothing is to be included
- introduce helper functions 'gatherIncludeFilters()' and 'gatherExcludeFilters()' to
keep the code flow clean
cmd/restic/cmd_rewrite_integration_test.go:
add several new tests around the 'include' functionality
internal/filter/include.go:
this is where the include filter is defined
internal/walker/rewriter.go:
- struct RewriteOpts gains the field 'KeepEmtpyDirectory', a 'NodeKeepEmptyDirectoryFunc()'
which defaults to nil, so that all subdirectories are kept
- function 'NewSnapshotSizeRewriter()' gains the parameter 'keepEmptyDirecoryFilter', which
controls how empty subdirectories are handled when include filters are active
internal/data/tree.go:
gains a function Count() for checking the number of node elements in a newly built tree
internal/walker/rewriter_test.go:
function 'NewSnapshotSizeRewriter()' gets an additional nil parameter to keep things happy
cmd/restic/cmd_repair_snapshots.go:
function 'filterAndReplaceSnapshot()' gets an additional parameter 'keepEmptySnapshot=nil'
doc/045_working_with_repos.rst:
gets to mention include filters
changelog/unreleased/issue-4278:
the usual announcement file
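A very rough sketch of the keep-empty-directory decision described above
(all names are assumptions, not the actual restic API):

    // KeepEmptyDirectoryFunc decides per path whether a directory that ended
    // up empty after filtering should survive in the rewritten snapshot.
    type KeepEmptyDirectoryFunc func(path string) bool

    // keepDir decides whether a rewritten directory is kept: directories that
    // still have content always stay; with a nil callback (the default) empty
    // subdirectories are kept as well, otherwise the callback decides.
    func keepDir(path string, remainingNodes int, keep KeepEmptyDirectoryFunc) bool {
        if remainingNodes > 0 {
            return true // never drop a directory that still has content
        }
        if keep == nil {
            return true // default: all subdirectories are kept
        }
        return keep(path) // with include filters active, decide per directory
    }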
git rebase master -i produced this
restic rewrite include - keep linter happy
cmd/restic/cmd_rewrite_integration_test.go:
linter likes strings.Contains() better than my strings.Index() >= 0
While not planned, it's also not completely impossible that a tree node
might get additional top-level fields. As the tree iterator is built
with a strict expectation of the top-level fields, this would result in
a parsing error. Future-proof the code by simply skipping unknown
fields.
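An illustrative sketch of such tolerant decoding, assuming the iterator
streams through the top-level JSON object of a tree blob (not restic's
actual code):

    package tree

    import (
        "encoding/json"
        "io"
    )

    // decodeTree decodes only the fields it knows about; the value of any
    // other top-level field is consumed and discarded instead of causing a
    // parse error.
    func decodeTree(rd io.Reader) ([]json.RawMessage, error) {
        dec := json.NewDecoder(rd)
        if _, err := dec.Token(); err != nil { // opening '{'
            return nil, err
        }
        var nodes []json.RawMessage
        for dec.More() {
            key, err := dec.Token() // field name
            if err != nil {
                return nil, err
            }
            if key == "nodes" {
                // known field: decode the node list as before
                if err := dec.Decode(&nodes); err != nil {
                    return nil, err
                }
                continue
            }
            // unknown top-level field: skip its value
            var skip json.RawMessage
            if err := dec.Decode(&skip); err != nil {
                return nil, err
            }
        }
        _, err := dec.Token() // closing '}'
        return nodes, err
    }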
The previous implementation stored the whole tree in a map and used it
for checking overlap between trees. This is now replaced with the
DualTreeIterator, which iterates over two trees in parallel and returns
the merge stream in order. In case of overlap between both trees, it
returns both nodes at the same time. Otherwise, only a single node is
returned.
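A sketch of the merge logic only, on slices instead of streams for
brevity: nodes in a tree are sorted by name, so walking two trees in
lockstep yields the merge stream in order.

    type Node struct{ Name string }

    type mergedNode struct {
        a, b *Node // nil if the name exists only in the other tree
    }

    func mergeSorted(a, b []*Node) []mergedNode {
        var out []mergedNode
        i, j := 0, 0
        for i < len(a) && j < len(b) {
            switch {
            case a[i].Name == b[j].Name: // overlap: return both nodes together
                out = append(out, mergedNode{a: a[i], b: b[j]})
                i++
                j++
            case a[i].Name < b[j].Name:
                out = append(out, mergedNode{a: a[i]})
                i++
            default:
                out = append(out, mergedNode{b: b[j]})
                j++
            }
        }
        for ; i < len(a); i++ {
            out = append(out, mergedNode{a: a[i]})
        }
        for ; j < len(b); j++ {
            out = append(out, mergedNode{b: b[j]})
        }
        return out
    }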
The TreeNodeIterator decodes nodes while iterating over a tree blob.
This should reduce peak memory usage as now only the serialized tree
blob and a single node have to be alive at the same time. Using the
iterator has implications for the error handling, however. Now it is
necessary that all loops that iterate through a tree check for errors
before using the node returned by the iterator.
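A sketch of the loop shape this requires; the iterator signature here is
an assumption, not restic's actual API:

    import "io"

    type Node struct{ Type string }

    // countFiles checks the error before touching the node, since the
    // iterator may fail partway through decoding a tree blob.
    func countFiles(next func() (*Node, error)) (int, error) {
        n := 0
        for {
            node, err := next()
            if err == io.EOF {
                return n, nil // iteration finished cleanly
            }
            if err != nil {
                return 0, err // check the error first ...
            }
            if node.Type == "file" { // ... only then use the node
                n++
            }
        }
    }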
The other change is that it is no longer possible to iterate over a tree
multiple times. Instead it must be loaded a second time. This only
affects the tree rewriting code.
The tree.Nodes field will be replaced by an iterator that loads and
decodes tree nodes on demand. Thus, the processing moves from
StreamTrees into the callback. Schedule the callbacks onto the workers
used by StreamTrees for proper
load distribution.
Always serialize trees via TreeJSONBuilder. Add a wrapper called
TreeWriter which combines serialization and saving the tree blob in the
repository. In the future, TreeJSONBuilder will have to upload tree
chunks while the tree is still being serialized. This will require a wrapper like
TreeWriter, so add it right now already.
The archiver.treeSaver still directly uses the TreeJSONBuilder as it
requires special handling.
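A rough sketch of the wrapper idea with simplified types and signatures
(not restic's actual API):

    type ID [32]byte

    // TreeWriter pairs serializing a tree with storing the resulting blob,
    // so callers no longer have to perform both steps themselves.
    type TreeWriter struct {
        serialize func() ([]byte, error)   // e.g. the JSON builder's finalize step
        saveBlob  func([]byte) (ID, error) // stores the tree blob in the repository
    }

    // Finish serializes the accumulated tree and saves it, returning the blob ID.
    func (w *TreeWriter) Finish() (ID, error) {
        buf, err := w.serialize()
        if err != nil {
            return ID{}, err
        }
        return w.saveBlob(buf)
    }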
Calling t.Fatal internally triggers runtime.Goexit. This kills the
current goroutine while only running deferred code. Add an extra context
that gets canceled if the goroutine exits while within the user-provided
callback.
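A sketch of the pattern (simplified):

    import "context"

    // runCallback runs the user-provided callback with a context whose
    // deferred cancel also fires on runtime.Goexit (e.g. triggered by
    // t.Fatal), so goroutines waiting on that context are not left hanging.
    func runCallback(ctx context.Context, callback func(context.Context)) {
        ctx, cancel := context.WithCancel(ctx)
        // Deferred functions still run on runtime.Goexit, so the context is
        // canceled even if the callback kills the goroutine.
        defer cancel()
        callback(ctx)
    }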
data.TestCreateSnapshot, which is used in particular by
TestFindUsedBlobs, could generate trees with duplicate file names.
This is invalid and going forward will result in an error.