Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to a custom Myers-based LCS engine #103

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

kov
Copy link

@kov kov commented Oct 27, 2024

This is a big one. It basically allows our version of diff to be used on larger files, which right now crash due to an impossibly large allocation, due to the library we use implementing the LCS algorithm using a full matrix matching lines of both files.

I looked at implementing this on the library, but the author already has a patch for months and has been unresponsive. I also think it makes sense to have our own engine that can be customized and optimized as required by the goals of the tool rather than relying on a library which may be more conservative. The author mentioned in their PR that the plan there is to make the Myers diagonal search optional due to the pathological cases - which to me means heuristics will probably not be welcome.

I split the actual feature in 3 commits:

  • implementation and adoption of the Myers diagonal search
  • importing some heuristics used by GNU diff to bail out on "good enough" or "too expensive" searches
  • adapting yet another heuristic to give up on the algorithm completely for files that seem just too large and different

See each commit for more details of performance, it is looking pretty good. I would prefer to keep the commits separate, like I did for the cmp optimizations, so that we have a good base for debugging and bisecting if we find some weird corner case.

Tests all pass, the diffs produced I looked manually at seem to have good quality. To weed out any corner cases (and I did find some!) I have been running the following script for the last couple of days. Since I last fired it up it's been going for over 12 hours now, basically diffing every single text and source code file I have in my home directory (over 370k files, I have multiple browser code bases fwiw) against /etc/passwd, an empty file, the previous file, all of them back and forth, then trying to apply it with patch and checking for correctness for each of those. So far no further issues found!

#!/bin/env python3
import os
import subprocess
import sys
import shutil

def run_diff(a, b):
    shutil.copyfile(a, "a");
    shutil.copyfile(b, "b");
    ret = subprocess.run(f"./target/debug/diffutils diff a b".split(' '), stdout=open('diff', 'w'))
    print(f"ret: {ret}")
    if ret.returncode < 0 or ret.returncode == 101:
        print(f"diff killed: {ret.returncode}")
        sys.exit(2)

    ret = subprocess.run(f"patch -p0 a diff".split(' '))
    print(f"ret: {ret}")
    if ret.returncode != 0:
        print(f"patch failed: {ret.returncode}")
        sys.exit(2)

    assert open('a', 'rb').read() == open('b', 'rb').read()

previous_file = '/etc/passwd'
for current_file in open('../diff.rs/text-files'):
    current_file = current_file.strip()
    print(f"Diffing {previous_file} to {current_file}")
    for reference in ['/etc/passwd', 'empty', previous_file]:
        run_diff(reference, current_file)
        run_diff(current_file, reference)
    previous_file = current_file

Copy link

codecov bot commented Oct 27, 2024

Codecov Report

Attention: Patch coverage is 92.50000% with 15 lines in your changes missing coverage. Please review.

Project coverage is 70.14%. Comparing base (1910cbf) to head (21e7a07).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/engine.rs 85.71% 15 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #103       +/-   ##
===========================================
- Coverage   85.11%   70.14%   -14.98%     
===========================================
  Files          12       13        +1     
  Lines        5818     7646     +1828     
  Branches      475      548       +73     
===========================================
+ Hits         4952     5363      +411     
- Misses        846     2266     +1420     
+ Partials       20       17        -3     
Flag Coverage Δ
macos_latest 70.05% <92.50%> (-15.09%) ⬇️
ubuntu_latest 75.09% <92.50%> (-10.30%) ⬇️
windows_latest 23.77% <51.00%> (+0.78%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

GNU patch is getting confused and deciding the patch the wrong file
on MacOS.
@sylvestre
Copy link
Collaborator

Interesting.
Did you try with hyperfine for benchmarking ? Thanks

Maybe create a BENCHMARKING.md file to document the improvement ?

@kov
Copy link
Author

kov commented Oct 27, 2024

Interesting. Did you try with hyperfine for benchmarking ? Thanks

Yep! There are some details in the individual commit messages, but the most important improvement from my perspective is being able to handle larger files at all. For very short files it should be more or less the same.

The worst case for this algorithm is completely different files, and for those (on files short enough for the old version to be able to process ;D) we have a performance regression:

Before:

hyperfine -N -i --output=pipe --warmup 2 './target/release/diffutils diff -u './target/release/diffutils diff -u /usr/share/licenses/glib2/LGPL-2.1-or-later.txt /usr/share/licenses/coreutils-common/COPYING'
Benchmark 1: ./target/release/diffutils diff -u /usr/share/licenses/glib2/LGPL-2.1-or-later.txt /usr/share/licenses/coreutils-common/COPYING
Time (mean ± σ): 2.0 ms ± 0.2 ms [User: 1.2 ms, System: 0.7 ms]
Range (min … max): 1.6 ms … 3.4 ms 900 runs

After:

hyperfine -N -i --output=pipe --warmup 2 './target/release/diffutils diff -u /usr/share/licenses/glib2/LGPL-2.1-or-later.txt /usr/share/licenses/coreutils-common/COPYING'
Benchmark 1: ./target/release/diffutils diff -u /usr/share/licenses/glib2/LGPL-2.1-or-later.txt /usr/share/licenses/coreutils-common/COPYING
Time (mean ± σ): 4.8 ms ± 0.3 ms [User: 3.9 ms, System: 0.8 ms]
Range (min … max): 4.2 ms … 6.8 ms 444 runs

Maybe create a BENCHMARKING.md file to document the improvement ?

Will do later today!

@sylvestre
Copy link
Collaborator

Please run hyperfine once with the two commands. It produces better results and comparison

Until now we have used the engine provided by the 'diff' crate to do
the actual diffing. It implements a Longest Common Subsequence algorithm
that explores all potential offsets between the two collections.

This produces high quality diffs, but has the downside of requiring a
huge amount of memory for the left x right lines matrix, which makes it
unable to process big files (~36MB):

  > ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar
  memory allocation of 2202701222500 bytes failed
  fish: Job 1, './target/release/diffutils diff…' terminated by signal SIGABRT (Abort)

The author has begun an implementation of the Myers algorithm, which
will be offered as an alternative to the full LCS one, but has not
made any progress on merging it for months, and has not been responsive.

It probably makes sense for us to have our own engine, in any case,
so that we can evolve it along with the tool and make any adjustments
or apply any heuristics we decide could be helpful for matching GNU
diff's behavior or performance.

The Myers algorithm is a more efficient implementation of LCS, as it
only uses a couple vectors with 2 * (m + n) positions, rather than the
m * n positions used by the full LCS matrix, where m and n are the
number of lines of each file.

With this new engine we outperform GNU diff significantly when comparing
those two big files that are largely equal, with changes at the top and
bottom while producing the exact same diff and using almost exactly the
same amount of memory at the peak:

  Benchmark 1: ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar
    Time (mean ± σ):     105.0 ms ±   2.5 ms    [User: 62.2 ms, System: 41.6 ms]
    Range (min … max):   101.7 ms … 111.3 ms    28 runs

    Warning: Ignoring non-zero exit code.

  Benchmark 2: diff test-data/huge-base test-data/huge-very-similar
    Time (mean ± σ):      1.119 s ±  0.003 s    [User: 1.068 s, System: 0.044 s]
    Range (min … max):    1.115 s …  1.126 s    10 runs

    Warning: Ignoring non-zero exit code.

  Summary
    ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar ran
     10.66 ± 0.26 times faster than diff test-data/huge-base test-data/huge-very-similar

It's not all flowers, however. Without heuristics we suffer on files
which are very different, especially if they are large, but even if
they are small. Diffing two ~36MB and completely different  files may
take tens of minutes - but it at least works. This is where our ability
to add custom heuristics is helpful, though - we can avoid some of the
most pathological cases. Those come on the next couple commits.

  Benchmark 1: ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3
    Time (mean ± σ):       6.5 ms ±   0.3 ms    [User: 5.5 ms, System: 0.8 ms]
    Range (min … max):     6.1 ms …   8.0 ms    435 runs

    Warning: Ignoring non-zero exit code.

  Benchmark 2: diff test-data/LGPL2.1 test-data/GPL3
    Time (mean ± σ):       1.5 ms ±   0.1 ms    [User: 1.1 ms, System: 0.3 ms]
    Range (min … max):     1.4 ms …   4.1 ms    1968 runs

    Warning: Ignoring non-zero exit code.
    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Benchmark 3: ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3
    Time (mean ± σ):       2.1 ms ±   0.2 ms    [User: 1.2 ms, System: 0.8 ms]
    Range (min … max):     1.8 ms …   2.9 ms    1435 runs

    Warning: Ignoring non-zero exit code.

  Summary
    diff test-data/LGPL2.1 test-data/GPL3 ran
      1.42 ± 0.17 times faster than ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3
      4.35 ± 0.43 times faster than ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3

It is worth pointing out as well that the reason GNU diff is outperformed
in that best case scenario is because it does a lot more work to enable
other optimizations we do not implement such as hashing each line and
separating out those that only appear on one of the files. That work
adds up on big files, but allows GNU diff to outperform by a similar
factor when the files are not just different by rearranging lines, but
by having completely different lines.
This change adds some checks to decide that the search for the best
place to split the diffing process has gone for too long, or long
enough while finding a good chunk of matches.

They are based on similar heuristics that GNU diff applies and will
help in cases in which files are very long and have few common
sequences.

This brings comparing some large files (~36MB) that are very different
from ~1 hour to ~8 seconds, but it will still hit some pathological
cases, such as some very large cpp files I created for some benchmarking
that still take 1 minute.

  Benchmark 1: diff test-data/huge-base test-data/huge-very-different
    Time (mean ± σ):      2.790 s ±  0.005 s    [User: 2.714 s, System: 0.063 s]
    Range (min … max):    2.781 s …  2.798 s    10 runs

    Warning: Ignoring non-zero exit code.

  Benchmark 2: ./target/release/diffutils.no-heuristics diff test-data/huge-base test-data/huge-very-different
    Time (mean ± σ):     4755.084 s ± 172.607 s    [User: 4727.169 s, System: 0.330 s]
    Range (min … max):   4607.522 s … 5121.135 s    10 runs

    Warning: Ignoring non-zero exit code.

  Benchmark 3: ./target/release/diffutils diff test-data/huge-base test-data/huge-very-different
    Time (mean ± σ):      7.197 s ±  0.099 s    [User: 7.055 s, System: 0.094 s]
    Range (min … max):    7.143 s …  7.416 s    10 runs

    Warning: Ignoring non-zero exit code.
    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Summary
    diff test-data/huge-base test-data/huge-very-different ran
      2.58 ± 0.04 times faster than ./target/release/diffutils diff test-data/huge-base test-data/huge-very-different
   1704.04 ± 61.93 times faster than ./target/release/diffutils.no-heuristics diff test-data/huge-base test-data/huge-very-different

Note that the worse that should happen by heuristics causing the search
to end early is a suboptimal diff, but the diff will still be correct
and usable with patch.
This is the last piece of the puzzle to get somewhat comparable to GNU
diff performance without implementing all of its tricks - although this
one is also used by GNU diff, in its own way. It brings down a diff
which still takes over a minute with the previous commit to under a
second.

  Benchmark 1: diff test-data/b.cpp test-data/c.cpp
    Time (mean ± σ):      2.533 s ±  0.011 s    [User: 2.494 s, System: 0.027 s]
    Range (min … max):    2.519 s …  2.553 s    10 runs

    Warning: Ignoring non-zero exit code.

  Benchmark 2: ./target/release/diffutils.local-heuristics diff test-data/b.cpp test-data/c.cpp
    Time (mean ± σ):     65.798 s ±  1.080 s    [User: 65.367 s, System: 0.053 s]
    Range (min … max):   64.962 s … 68.137 s    10 runs

    Warning: Ignoring non-zero exit code.

  Benchmark 3: ./target/release/diffutils diff test-data/b.cpp test-data/c.cpp
    Time (mean ± σ):     580.6 ms ±   6.5 ms    [User: 521.9 ms, System: 38.8 ms]
    Range (min … max):   570.7 ms … 589.6 ms    10 runs

    Warning: Ignoring non-zero exit code.

  Summary
    ./target/release/diffutils diff test-data/b.cpp test-data/c.cpp ran
      4.36 ± 0.05 times faster than diff test-data/b.cpp test-data/c.cpp
    113.33 ± 2.26 times faster than ./target/release/diffutils.local-heuristics diff test-data/b.cpp test-data/c.cpp

It basically keeps track of how much work we have done overall for a
diff job and enables giving up completely on trying to find ideal split
points if the cost implies we had to trigger the "too expensive"
heuristic too often.

From that point forward it only does naive splitting of the work.
This should not generate diffs which are much worse than doing the
diagonal search, as it should only trigger in cases in which the
files are so different it won't find good split points anyway.

This is another case in which GNU diff's additional work with hashing
and splitting large chunks of inclusion / deletion from the diff work
and trying harder to find ideal splits seem to cause it to perform
slightly poorer:

That said, GNU diff probably still generates better diffs not due to
this, but due to its post-processing of the results, trying to create
more hunks with nearby changes staying close to each other, which we
do not do (but we didn't do that before anyway).
@oSoMoN
Copy link
Collaborator

oSoMoN commented Nov 1, 2024

I have yet to do a proper code review, but the approach looks sound to me, and a huge undertaking, so thank you very much for this!

@sylvestre any objection to not depending on the external diff crate and re-implementing it in-house?

@kov can you add a heading comment to src/engine.rs that briefly explains the algorithm being implemented, together with the heuristics, like you did in the commit messages? The GNU diff implementation also quotes and provides a (paywall) link to the original paper by Eugene W. Myers, could you add that too?

Add links to the original paper, and an explanation of the overall
design to the implementation file.
@kov
Copy link
Author

kov commented Nov 2, 2024

FWIW, I joked on Discord that this was potentially controversial because I would completely understand not wanting to maintain a custom engine, especially if the current diff crate gets its Myers implementation merged, and maybe we could instead do a simpler decision of which "backend" of the diff crate to use based on file size, for instance, and potentially not support the most niche pathological cases (diffing completely different massive files is not that common after all).

I don't know if you guys may have an easier time reaching the diff.rs maintainer. I would be totally fine with whatever decision, I have had a lot of fun learning about this stuff and writing this patch, but I am not emotionally tied to it, up to you!

@sylvestre
Copy link
Collaborator

@sylvestre any objection to not depending on the external diff crate and re-implementing it in-house?

fine with that but I am wondering if the algo itself could not a into a different crate (in case others would like to use it) ?

@kov
Copy link
Author

kov commented Nov 5, 2024

@sylvestre any objection to not depending on the external diff crate and re-implementing it in-house?

fine with that but I am wondering if the algo itself could not a into a different crate (in case others would like to use it) ?

I'd be fine with that, do you envision it being a crate of the uutils project or do yu mean external? When I pondered this as I was starting I thought it may make more sense if we exported the diffutils crate as a library because then you'd get the whole functionality, but there is a benefit to having the generic implementation that can be used for file lines or anything else exposed separately, I suppose.

@sylvestre
Copy link
Collaborator

yeah, we can create https://github.com/uutils/diff-mayers-based-lcs-engine for example :)

@kov
Copy link
Author

kov commented Nov 11, 2024

Let me make it into a separate crate then, I'll start on my namespace and we can move. Will give API a bit more thought for that. I think we want to add some parameters to control the heuristics and whether we only care about a difference existing, too. But feel free to comment on the current code!

@sylvestre
Copy link
Collaborator

i noticed this: https://crates.io/crates/lcs-diff
did you try with it ? :)

@kov
Copy link
Author

kov commented Nov 13, 2024

i noticed this: https://crates.io/crates/lcs-diff did you try with it ? :)

I looked at this one when I was starting my research, yeah. It's basically implementing the same algorithm as the diff.rs crate that is used now, which creates a full matrix to compute the LCS:

fn create_table<T: PartialEq + Clone>(old: &[T], new: &[T]) -> Vec<Vec<u32>> {
    let new_len = new.len();
    let old_len = old.len();
    let mut table = vec![vec![0; old_len + 1]; new_len + 1];
    <snip>
    table
}

So it will still do a large allocation for medium files and be unable to diff very large files. I could try to reach out to the author and see if they have plans of implementing a myers-based algo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants