Performance Numbers ‐ v0.09
With this page I want to provide some example performance numbers so users have an idea of what to expect when they run duperemove on non-trivial data sets.
The following tests were run on a Dell Precision T3610 workstation with a copy of /home from my workstation rsynced to a fresh btrfs partition. You can find more information about the hardware and software setup here.
The version of duperemove used here is v0.09beta2 plus a few extra bug fixes (no performance improvements) that will be part of v0.09beta3.
There are 1151400 files in the data set (about 760 gigabytes of data). Of those files, duperemove finds 1151142 to be hashable. The average file size works out to about 700K, but in truth it's a very mixed set of general user data (dotfiles, dotfile directories, source code, documents) and media files (ISO images, music, movies, books).
The first two tests measure performance of the file hash and extent finding steps independently of each other. Finally, we do a full combined run with dedupe to get a more realistic test.
```
weyoun2:~ # time ./duperemove -hr --hash-threads=16 --write-hashes=/root/slash-home-pre-dedupe.dup /btrfs/ &> slash-home-write-hashes.log

real    26m54.741s
user    79m3.896s
sys     5m2.168s
```
The large user time is partially attributable to the hash function in use here (and the fact that 16 threads are at work). I expect to be merging alternative hash algorithms in the near future.
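If you want to see how the hash stage scales on your own hardware, you can time it in isolation the same way, by skipping dedupe and writing the hashes out. Below is a minimal sketch that sweeps a few thread counts; it assumes the duperemove binary is in the current directory (as in the runs on this page), the data set is mounted at /btrfs/, and /tmp has room for the hash files. The paths and thread counts are just examples, and only switches already shown on this page are used.

```
#!/bin/bash
# Sketch: time the hash stage alone at a few thread counts.
# Assumes ./duperemove exists, the data set is mounted at /btrfs/,
# and /tmp has room for the hash files (example paths only).
for n in 4 8 16; do
    echo "== hash-threads=$n =="
    # Drop the page cache so every pass starts cold, as in the tests above.
    sync && echo 3 > /proc/sys/vm/drop_caches
    time ./duperemove -hr --hash-threads="$n" \
        --write-hashes="/tmp/hashes-$n.dup" /btrfs/ > /dev/null 2>&1
done
```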
```
weyoun2:~ # time ./duperemove --read-hashes=/root/slash-home-pre-dedupe.dup &> /dev/null

real    18m2.981s
user    18m1.232s
sys     0m1.676s
```
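Splitting the work this way is also useful in practice: the hashes only need to be computed once, and the extent search can then be repeated from the saved file without re-reading the whole 760G from disk. For reference, the two phases above boil down to the following (same switches and example hash file name as in the runs shown):

```
# Phase 1: hash the files once and save the result (I/O and CPU heavy).
./duperemove -hr --hash-threads=16 \
    --write-hashes=/root/slash-home-pre-dedupe.dup /btrfs/

# Phase 2: find duplicate extents from the saved hashes.
# (CPU bound and effectively single threaded here: user time ~= real time.)
./duperemove --read-hashes=/root/slash-home-pre-dedupe.dup
```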
We reboot so the run starts with no disk cache present. The numbers up to this point just broke down the first two steps for informational purposes; the following run is representative of what a user would actually experience running duperemove against this data set. I saved the results to a file to check for errors.
```
weyoun2:~ # time ./duperemove -dhr --hash-threads=16 /btrfs/ &> full_run.txt

real    120m37.026s
user    99m28.944s
sys     62m46.664s
```
So, on this hardware duperemove took about 2 hours to hash and dedupe 760 gigabytes of data. The dedupe step was the longest at around 1.25 hours, whereas the other two steps together took around 0.75 hours. Performance optimizations to the dedupe step are planned (see Development Tasks), so hopefully we can get that number down in the future.
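For a rough sense of overall throughput, dividing the data set size by the wall-clock time of the full run gives a ballpark figure. This is back-of-the-envelope only, using the numbers quoted on this page:

```
# ~760G processed in 120m37s (7237 seconds):
echo "scale=1; (760 * 1024) / (120 * 60 + 37)" | bc
# => 107.5, i.e. a little over 100 megabytes per second averaged
#    across the hash, extent search and dedupe steps combined
```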