Duperemove Design

3 steps of operation

Duperemove has 3 primary phases of operation, each of which feeds data into the next. Almost all duperemove command line options either change which steps are executed or alter the behavior of a specific step. The three stages are described below. In order, they are the checksum, find-dupes and dedupe stages.

Every stage of duperemove is highly threaded.

General

Tips for hacking on duperemove

You can skip this if you just want to learn about Duperemove design.

Please do a git checkout of the Duperemove Testing Project and familiarize yourself with it. It has tests, sample hashfiles and some software for generating test data.

Pay attention to the --read-hashes and --write-hashes switches as they allow us to isolate different stages of operation.

For instructions on how to do performance runs, see the Performance Testing page. It also has examples on using the --read-hashes and --write-hashes switches.

Do make DEBUG=1 for all tests that are not performance sensitive.

To get per-file stats on how much was deduped (very useful when hacking on find-dupes), do make CFLAGS=-DPRINT_STATS. The stats can take a while to generate, so go get a cup of coffee while they run.

When hacking on the hashfile format, use sqlite3_analyzer to get an idea of how much space your changes take. It is extremely thorough and can even tell us the cost of things like an index, etc.

Hashfiles

Hashfiles are essentially sqlite3 database files with 3 tables: config, files and hashes. Hashfiles are meant to be reused: duperemove will store the results of the scan and dedupe stages to speed up subsequent runs.

If no hashfile option is specified, duperemove uses an in-memory sqlite database and otherwise proceeds as though the --hashfile option was given on the command line. Here is the database schema as of v0.11:

$ echo .schema | sqlite3 testing.dup
CREATE TABLE config(keyname TEXT PRIMARY KEY NOT NULL, keyval BLOB);
CREATE TABLE files(filename TEXT PRIMARY KEY NOT NULL, ino INTEGER, subvol INTEGER, size INTEGER, blocks INTEGER, mtime INTEGER, dedupe_seq INTEGER);
CREATE TABLE hashes(digest BLOB KEY NOT NULL, ino INTEGER, subvol INTEGER, loff INTEGER, flags INTEGER);
CREATE INDEX idx_digest on hashes(digest);
CREATE INDEX idx_hashes_inosub on hashes(ino, subvol);
CREATE INDEX idx_inosub on files(ino, subvol);

Each index will reduce write performance considerably, so we must be careful when adding them to be sure they're an overall win.

Memstats

Running duperemove with --debug will print structure size and allocation statistics, which gives an easy way to gauge structure size and memory usage. The counters are not thread safe in all versions of duperemove, so be careful not to rely directly on the exact counts collected.

For reference, here is the output from a very short run of duperemove v0.11:

struct file_block num: 0 sizeof: 88 total: 0 max: 16384 max total: 1441792
struct dupe_blocks_list num: 0 sizeof: 112 total: 0 max: 1 max total: 112
struct dupe_extents num: 0 sizeof: 112 total: 0 max: 1 max total: 112
struct extent num: 1 sizeof: 128 total: 128 max: 135 max total: 17280
struct extent_dedupe_info num: 0 sizeof: 24 total: 0 max: 0 max total: 0
struct filerec num: 0 sizeof: 184 total: 0 max: 128 max total: 23552
struct filerec_token num: 0 sizeof: 32 total: 0 max: 8128 max total: 260096
struct file_hash_head num: 0 sizeof: 48 total: 0 max: 128 max total: 6144
struct find_dupes_cmp num: 1 sizeof: 16 total: 16 max: 8060 max total: 128960
Sqlite3 used: 0  highwater: 2269920
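
The columns follow a simple per-structure allocation counter: num is the count of currently allocated objects, max is its high-water mark, and total / max total are those counts multiplied by sizeof. As a rough, hypothetical sketch of such a counter (names and layout are illustrative, not the actual memstats code):

/* Hypothetical per-structure allocation counter, in the spirit of the
 * memstats output above. Not the real implementation. */
#include <stdio.h>
#include <stdlib.h>

struct alloc_stats {
    const char *name;
    size_t size;        /* sizeof() of the tracked structure */
    unsigned long num;  /* objects currently allocated */
    unsigned long max;  /* high-water mark of num */
};

static void *count_alloc(struct alloc_stats *s)
{
    void *p = calloc(1, s->size);

    if (p && ++s->num > s->max)
        s->max = s->num;
    return p;
}

static void count_free(struct alloc_stats *s, void *p)
{
    if (p)
        s->num--;
    free(p);
}

static void print_one_stat(struct alloc_stats *s)
{
    printf("struct %s num: %lu sizeof: %zu total: %zu max: %lu max total: %zu\n",
           s->name, s->num, s->size, (size_t)(s->num * s->size),
           s->max, (size_t)(s->max * s->size));
}

int main(void)
{
    struct alloc_stats filerec_stats = { "filerec", 184, 0, 0 };
    void *p = count_alloc(&filerec_stats);

    print_one_stat(&filerec_stats);
    count_free(&filerec_stats, p);
    return 0;
}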

Code layout

Here are some of the important duperemove source files and what they take care of. This is not an exhaustive list. Most of the code in duperemove is directed from the first 4 files. The rest provide core functionality for everything else to rely on.

Higher level code:

  • duperemove.c: Main program, drives all 3 stages
  • file_scan.c: Drives most of the file scan and csum stage (1)
  • find_dupes.c: Drives the find dupes stage (2)
  • run_dedupe.c: Drives the dedupe stage (3)

Lower level code:

  • csum.c, csum-murmur3.c, csum-xxhash.c: checksumming core
  • filerec.c: file management core, defines struct filerec
  • dbfile.c: manages serialization of our data set to/from hashfiles (via sqlite3)
  • dedupe.c: provides an API for using the extent-same ioctl
  • hash-tree.c: tree of duplicate hashes, defines struct dupe_blocks_list and struct file_block
  • results-tree.c: tree of duplicate extents, defines struct dupe_extents and struct extent

Step 1: Discover dedupe targets and checksum

Description

In this phase duperemove discovers which files can be deduped, then hashes them. If a hashfile is provided, duperemove will only compute hashes for those files which have changed since the last time a scan was run.

We start by taking any files on the command line and passing them to add_file(). If the filename is a directory, add_file() will scan it as well. If the file is a reasonable dedupe candidate, add_file() will create a filerec for it.

Next we take every file in the db and pass it to add_file_db(), which has the job of reconciling what is in the database with what we just found via add_file(). Most importantly, if a file already exists in the filerec hash, add_file_db() compares the mtime on the filerec with the one stored in the db. If they differ, the file is marked to be scanned; otherwise the scan flag is cleared. This way we never rescan a file that has not changed.
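
A hedged sketch of that decision follows. The structure and field names here are illustrative only; the real logic lives in add_file_db() and struct filerec.

/* Sketch of the mtime-based rescan decision. Illustrative names only. */
#include <stdio.h>
#include <time.h>

struct filerec_sketch {
    const char *filename;
    time_t mtime;       /* mtime seen while building the file list */
    int needs_scan;     /* whether the file must be rehashed */
};

static void resolve_against_db(struct filerec_sketch *file, time_t db_mtime)
{
    /* A differing mtime means the file changed since the stored scan,
     * so it must be rehashed; otherwise the hashes in the db are reused. */
    file->needs_scan = (file->mtime != db_mtime);
}

int main(void)
{
    struct filerec_sketch f = { "example.dat", 1000, 0 };

    resolve_against_db(&f, 999);
    printf("%s: %s\n", f.filename,
           f.needs_scan ? "rescan" : "reuse db hashes");
    return 0;
}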

Next, duperemove will compute hashes for the dataset. Each file marked for scan is read in blocksize chunks, each of which is checksummed and inserted directly into the database.
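
In outline, the per-file work looks something like the following. This is a simplified, hypothetical sketch rather than csum_whole_file() itself: the checksum function is a stand-in for the real murmur3/xxhash code in csum-*.c, and error handling is minimal.

/* Sketch: hash a file in blocksize chunks and print one digest per block. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCKSIZE   (128 * 1024)    /* assumed default block size */

/* Placeholder checksum (FNV-1a); the real code uses murmur3 or xxhash. */
static uint64_t toy_csum(const unsigned char *buf, size_t len)
{
    uint64_t h = 14695981039346656037ULL;
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(int argc, char **argv)
{
    unsigned char *buf = malloc(BLOCKSIZE);
    uint64_t loff = 0;
    ssize_t ret;
    int fd;

    if (argc != 2 || !buf)
        return 1;

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;

    /* Read in blocksize chunks; each chunk gets its own digest, which
     * duperemove would insert into the hashes table. */
    while ((ret = read(fd, buf, BLOCKSIZE)) > 0) {
        printf("loff %llu digest %016llx\n",
               (unsigned long long)loff,
               (unsigned long long)toy_csum(buf, ret));
        loff += ret;
    }

    close(fd);
    free(buf);
    return 0;
}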

After writing our hashes to the database, we then load only those hashes which have at least one duplicate elsewhere in the data set.
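
The SQL below is only an illustration of that "duplicates only" load against the v0.11 schema shown earlier; it is not necessarily the exact statement dbfile_load_hashes() issues. It uses the sqlite3 C API, as dbfile.c does.

/* Sketch: load only blocks whose digest occurs more than once. */
#include <sqlite3.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *sql =
        "SELECT ino, subvol, loff, digest FROM hashes "
        "WHERE digest IN (SELECT digest FROM hashes "
        "                 GROUP BY digest HAVING count(*) > 1);";
    sqlite3 *db;
    sqlite3_stmt *stmt;

    if (argc != 2)
        return 1;
    if (sqlite3_open_v2(argv[1], &db, SQLITE_OPEN_READONLY, NULL))
        return 1;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL))
        return 1;

    /* Each returned row is one block that has at least one duplicate. */
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("ino %lld subvol %lld loff %lld\n",
               (long long)sqlite3_column_int64(stmt, 0),
               (long long)sqlite3_column_int64(stmt, 1),
               (long long)sqlite3_column_int64(stmt, 2));

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}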

Phase results

After this phase, duperemove knows which data is potentially duplicated. The data structures containing hash and file information are held in memory by default, ready for phase #2.
This phase can be bypassed to speed up subsequent runs on the same dataset; see the "--read-hashes=hashfile" switch.

Key data structures and functions

struct file_block
struct filerec

add_file()
add_file_db()
dbfile_scan_files()
csum_whole_file()
dbfile_load_hashes()
insert_hashed_block()

Options

This phase is affected by the following options: "--hashfile=", "-b size", "-r", "--io-threads=N", "--lookup-extents=[yes|no]".
This phase requires a list of files or directories from either the command line or the files table in the database.

Resource Usage

  • High I/O usage - mostly streaming reads.
  • Moderate CPU usage for block hashing; reading the files will usually be the bottleneck.
  • Memory usage will be 1 struct file_block per duplicated block in our data set (see hash-tree.h).

Step 2: Find-dupes

Description

This phase uses the hashes loaded in step 1 to create lists of duplicate extents. For each set of duplicate hashes, duperemove builds a list of files that reference that hash. These files then have their duplicated blocks compared against each other to build a tree of duplicate extents (we call it the 'results-tree').
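
As a rough illustration of the block-to-extent step, the sketch below walks two files' duplicate-block lists and grows runs of adjacent matching blocks into candidate extents. It is a toy, self-contained version of the idea, not find_dupes.c: the digest is shrunk to a u64 and results are simply printed instead of being inserted into the results-tree.

/* Sketch: grow runs of matching blocks into candidate duplicate extents. */
#include <stdint.h>
#include <stdio.h>

struct block {
    uint64_t digest;    /* stand-in for the real checksum */
    uint64_t loff;      /* logical offset within the file */
};

static void find_extents(const struct block *a, int na,
                         const struct block *b, int nb)
{
    int i, j, len;

    for (i = 0; i < na; i++) {
        for (j = 0; j < nb; j++) {
            /* Skip positions already covered by a previously reported run. */
            if (i && j && a[i - 1].digest == b[j - 1].digest)
                continue;

            len = 0;
            while (i + len < na && j + len < nb &&
                   a[i + len].digest == b[j + len].digest)
                len++;

            if (len)
                printf("dupe extent: fileA@%llu fileB@%llu len %d blocks\n",
                       (unsigned long long)a[i].loff,
                       (unsigned long long)b[j].loff, len);
        }
    }
}

int main(void)
{
    /* Toy offsets/digests: blocks 1 and 2 of each file match. */
    struct block a[] = { { 1, 0 }, { 2, 4096 }, { 3, 8192 } };
    struct block b[] = { { 9, 0 }, { 2, 4096 }, { 3, 8192 } };

    find_extents(a, 3, b, 3);
    return 0;
}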

Key data structures and functions

Resource Usage

  • No disk consumption.
  • Memory to store the extent lists
  • Lots of memory to store accounting info (such as which files have already been compared)
  • CPU intensive and can use as many cores as are available. Hyperthreading will hurt performance, so we halve the number of cores used in that case.

Hacking on find-dupes

This can be tricky because we're trying to balance at least 3 very important resources:

  1. cpu usage
  2. memory usage
  3. quality of dedupe

Reducing CPU usage typically means storing more items in memory to track our compares. We have to be careful with this, though, as users can have hundreds of thousands of files (or more) in a data set, and memory usage can trivially skyrocket to untenable levels. At the same time, we want to get the most dedupe possible. If we give up on limiting CPU usage (compare everything) or memory usage (track everything), we can achieve the best dedupe results, but very few people will have the resources to actually get through a run. Cutting back on CPU or memory usage means making choices about which file compares to do, which in turn affects the quality of the dedupe.

Improving find-dupes is an ongoing project. Today it is balanced towards more dedupe and more memory usage, but there are some tools to help adjust that balance.

  • Firstly, we can isolate this stage with the --read-hashes option and an existing hashfile. See the Duperemove Testing Project for some sample hashfiles and software to generate data sets. This will make iterative development and testing very fast as you won't have to rescan files every test.
  • Once we're using --read-hashes, 'time' is an easy way to figure out how much CPU we're using.
  • For memory usage, duperemove tracks allocation of every non-trivial object. To get printouts in the code, insert calls to print_mem_stats(). memstats doesn't build with locking by default; you'll want to turn that on with make CFLAGS=-DLOCK_MEMSTATS so that the counters stay coherent.
  • Dedupe coverage stats are calculated in stats.c. These are also not computed by default as they are very CPU expensive, but they can tell us, on a file-by-file basis, the total amount of duplicated blocks found and the total count of duplicated extents. If there's a difference, then we know we're missing some amount of dedupe. For dedupe stats, build with make CFLAGS=-DPRINT_STATS.

Step 3: Deduplication

Description

This phase handles the actual deduplication work. For each duplicate extent list found in step 2 (or each duplicate block found in step 1), a request structure is prepared and passed to the kernel via ioctl.
Extents are looked up via fiemap to get an idea of how many were marked shared before and after our dedupe. We also compare the extent start offsets against our other dedupe candidates, which lets us skip some instances of already-deduped data.
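
For reference, here is a minimal, standalone sketch of a single dedupe request using the kernel's generic dedupe-range ioctl (FIDEDUPERANGE, the successor to the btrfs-specific extent-same ioctl wrapped by dedupe.c). It assumes both files are at least LEN bytes long and identical over that range; duperemove itself batches many destination extents into one request.

/* Sketch: ask the kernel to dedupe the first LEN bytes of two files. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define LEN (128 * 1024)

int main(int argc, char **argv)
{
    struct file_dedupe_range *range;
    int src, dst, ret;

    if (argc != 3)
        return 1;

    src = open(argv[1], O_RDONLY);
    dst = open(argv[2], O_RDWR);
    if (src < 0 || dst < 0)
        return 1;

    /* One request with a single destination; duperemove batches many. */
    range = calloc(1, sizeof(*range) + sizeof(struct file_dedupe_range_info));
    if (!range)
        return 1;
    range->src_offset = 0;
    range->src_length = LEN;
    range->dest_count = 1;
    range->info[0].dest_fd = dst;
    range->info[0].dest_offset = 0;

    ret = ioctl(src, FIDEDUPERANGE, range);
    if (ret < 0)
        perror("FIDEDUPERANGE");
    else
        printf("bytes deduped: %llu (status %d)\n",
               (unsigned long long)range->info[0].bytes_deduped,
               range->info[0].status);

    free(range);
    close(src);
    close(dst);
    return 0;
}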

Options

To run this phase, you must set the "-d" switch.
This stage is affected by the following options: "--io-threads=N", "--dedupe-options=XXX". See "--read-hashes=hashfile" if you want to bypass step 1.

CPU / memory / disk usage

  • Little disk I/O: the kernel reads the duplicate extents to verify the data byte by byte, and after that the writes are just metadata changes.
  • No additional memory usage.
  • Multi-threaded; kernel space will be doing lots of memcmp work to verify the data.