Switch to a custom Myers-based LCS engine #103

kov · 2024-10-27T00:41:24Z

This is a big one. It basically allows our version of diff to be used on larger files, which right now crash due to an impossibly large allocation, due to the library we use implementing the LCS algorithm using a full matrix matching lines of both files.

I looked at implementing this on the library, but the author already has a patch for months and has been unresponsive. I also think it makes sense to have our own engine that can be customized and optimized as required by the goals of the tool rather than relying on a library which may be more conservative. The author mentioned in their PR that the plan there is to make the Myers diagonal search optional due to the pathological cases - which to me means heuristics will probably not be welcome.

I split the actual feature in 3 commits:

implementation and adoption of the Myers diagonal search
importing some heuristics used by GNU diff to bail out on "good enough" or "too expensive" searches
adapting yet another heuristic to give up on the algorithm completely for files that seem just too large and different

See each commit for more details of performance, it is looking pretty good. I would prefer to keep the commits separate, like I did for the cmp optimizations, so that we have a good base for debugging and bisecting if we find some weird corner case.

Tests all pass, the diffs produced I looked manually at seem to have good quality. To weed out any corner cases (and I did find some!) I have been running the following script for the last couple of days. Since I last fired it up it's been going for over 12 hours now, basically diffing every single text and source code file I have in my home directory (over 370k files, I have multiple browser code bases fwiw) against /etc/passwd, an empty file, the previous file, all of them back and forth, then trying to apply it with patch and checking for correctness for each of those. So far no further issues found!

#!/bin/env python3
import os
import subprocess
import sys
import shutil

def run_diff(a, b):
    shutil.copyfile(a, "a");
    shutil.copyfile(b, "b");
    ret = subprocess.run(f"./target/debug/diffutils diff a b".split(' '), stdout=open('diff', 'w'))
    print(f"ret: {ret}")
    if ret.returncode < 0 or ret.returncode == 101:
        print(f"diff killed: {ret.returncode}")
        sys.exit(2)

    ret = subprocess.run(f"patch -p0 a diff".split(' '))
    print(f"ret: {ret}")
    if ret.returncode != 0:
        print(f"patch failed: {ret.returncode}")
        sys.exit(2)

    assert open('a', 'rb').read() == open('b', 'rb').read()

previous_file = '/etc/passwd'
for current_file in open('../diff.rs/text-files'):
    current_file = current_file.strip()
    print(f"Diffing {previous_file} to {current_file}")
    for reference in ['/etc/passwd', 'empty', previous_file]:
        run_diff(reference, current_file)
        run_diff(current_file, reference)
    previous_file = current_file

codecov · 2024-10-27T00:43:58Z

Codecov Report

Attention: Patch coverage is 92.50000% with 15 lines in your changes missing coverage. Please review.

Project coverage is 70.14%. Comparing base (1910cbf) to head (21e7a07).
Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
src/engine.rs	85.71%	15 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #103       +/-   ##
===========================================
- Coverage   85.11%   70.14%   -14.98%     
===========================================
  Files          12       13        +1     
  Lines        5818     7646     +1828     
  Branches      475      548       +73     
===========================================
+ Hits         4952     5363      +411     
- Misses        846     2266     +1420     
+ Partials       20       17        -3

Flag	Coverage Δ
macos_latest	`70.05% <92.50%> (-15.09%)`	⬇️
ubuntu_latest	`75.09% <92.50%> (-10.30%)`	⬇️
windows_latest	`23.77% <51.00%> (+0.78%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

GNU patch is getting confused and deciding the patch the wrong file on MacOS.

sylvestre · 2024-10-27T07:29:14Z

Interesting.
Did you try with hyperfine for benchmarking ? Thanks

Maybe create a BENCHMARKING.md file to document the improvement ?

kov · 2024-10-27T13:48:30Z

Interesting. Did you try with hyperfine for benchmarking ? Thanks

Yep! There are some details in the individual commit messages, but the most important improvement from my perspective is being able to handle larger files at all. For very short files it should be more or less the same.

The worst case for this algorithm is completely different files, and for those (on files short enough for the old version to be able to process ;D) we have a performance regression:

Before:

hyperfine -N -i --output=pipe --warmup 2 './target/release/diffutils diff -u './target/release/diffutils diff -u /usr/share/licenses/glib2/LGPL-2.1-or-later.txt /usr/share/licenses/coreutils-common/COPYING'
Benchmark 1: ./target/release/diffutils diff -u /usr/share/licenses/glib2/LGPL-2.1-or-later.txt /usr/share/licenses/coreutils-common/COPYING
Time (mean ± σ): 2.0 ms ± 0.2 ms [User: 1.2 ms, System: 0.7 ms]
Range (min … max): 1.6 ms … 3.4 ms 900 runs

After:

hyperfine -N -i --output=pipe --warmup 2 './target/release/diffutils diff -u /usr/share/licenses/glib2/LGPL-2.1-or-later.txt /usr/share/licenses/coreutils-common/COPYING'
Benchmark 1: ./target/release/diffutils diff -u /usr/share/licenses/glib2/LGPL-2.1-or-later.txt /usr/share/licenses/coreutils-common/COPYING
Time (mean ± σ): 4.8 ms ± 0.3 ms [User: 3.9 ms, System: 0.8 ms]
Range (min … max): 4.2 ms … 6.8 ms 444 runs

Maybe create a BENCHMARKING.md file to document the improvement ?

Will do later today!

sylvestre · 2024-10-27T15:34:03Z

Please run hyperfine once with the two commands. It produces better results and comparison

Until now we have used the engine provided by the 'diff' crate to do the actual diffing. It implements a Longest Common Subsequence algorithm that explores all potential offsets between the two collections. This produces high quality diffs, but has the downside of requiring a huge amount of memory for the left x right lines matrix, which makes it unable to process big files (~36MB): > ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar memory allocation of 2202701222500 bytes failed fish: Job 1, './target/release/diffutils diff…' terminated by signal SIGABRT (Abort) The author has begun an implementation of the Myers algorithm, which will be offered as an alternative to the full LCS one, but has not made any progress on merging it for months, and has not been responsive. It probably makes sense for us to have our own engine, in any case, so that we can evolve it along with the tool and make any adjustments or apply any heuristics we decide could be helpful for matching GNU diff's behavior or performance. The Myers algorithm is a more efficient implementation of LCS, as it only uses a couple vectors with 2 * (m + n) positions, rather than the m * n positions used by the full LCS matrix, where m and n are the number of lines of each file. With this new engine we outperform GNU diff significantly when comparing those two big files that are largely equal, with changes at the top and bottom while producing the exact same diff and using almost exactly the same amount of memory at the peak: Benchmark 1: ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar Time (mean ± σ): 105.0 ms ± 2.5 ms [User: 62.2 ms, System: 41.6 ms] Range (min … max): 101.7 ms … 111.3 ms 28 runs Warning: Ignoring non-zero exit code. Benchmark 2: diff test-data/huge-base test-data/huge-very-similar Time (mean ± σ): 1.119 s ± 0.003 s [User: 1.068 s, System: 0.044 s] Range (min … max): 1.115 s … 1.126 s 10 runs Warning: Ignoring non-zero exit code. Summary ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar ran 10.66 ± 0.26 times faster than diff test-data/huge-base test-data/huge-very-similar It's not all flowers, however. Without heuristics we suffer on files which are very different, especially if they are large, but even if they are small. Diffing two ~36MB and completely different files may take tens of minutes - but it at least works. This is where our ability to add custom heuristics is helpful, though - we can avoid some of the most pathological cases. Those come on the next couple commits. Benchmark 1: ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3 Time (mean ± σ): 6.5 ms ± 0.3 ms [User: 5.5 ms, System: 0.8 ms] Range (min … max): 6.1 ms … 8.0 ms 435 runs Warning: Ignoring non-zero exit code. Benchmark 2: diff test-data/LGPL2.1 test-data/GPL3 Time (mean ± σ): 1.5 ms ± 0.1 ms [User: 1.1 ms, System: 0.3 ms] Range (min … max): 1.4 ms … 4.1 ms 1968 runs Warning: Ignoring non-zero exit code. Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark 3: ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3 Time (mean ± σ): 2.1 ms ± 0.2 ms [User: 1.2 ms, System: 0.8 ms] Range (min … max): 1.8 ms … 2.9 ms 1435 runs Warning: Ignoring non-zero exit code. Summary diff test-data/LGPL2.1 test-data/GPL3 ran 1.42 ± 0.17 times faster than ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3 4.35 ± 0.43 times faster than ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3 It is worth pointing out as well that the reason GNU diff is outperformed in that best case scenario is because it does a lot more work to enable other optimizations we do not implement such as hashing each line and separating out those that only appear on one of the files. That work adds up on big files, but allows GNU diff to outperform by a similar factor when the files are not just different by rearranging lines, but by having completely different lines.

This change adds some checks to decide that the search for the best place to split the diffing process has gone for too long, or long enough while finding a good chunk of matches. They are based on similar heuristics that GNU diff applies and will help in cases in which files are very long and have few common sequences. This brings comparing some large files (~36MB) that are very different from ~1 hour to ~8 seconds, but it will still hit some pathological cases, such as some very large cpp files I created for some benchmarking that still take 1 minute. Benchmark 1: diff test-data/huge-base test-data/huge-very-different Time (mean ± σ): 2.790 s ± 0.005 s [User: 2.714 s, System: 0.063 s] Range (min … max): 2.781 s … 2.798 s 10 runs Warning: Ignoring non-zero exit code. Benchmark 2: ./target/release/diffutils.no-heuristics diff test-data/huge-base test-data/huge-very-different Time (mean ± σ): 4755.084 s ± 172.607 s [User: 4727.169 s, System: 0.330 s] Range (min … max): 4607.522 s … 5121.135 s 10 runs Warning: Ignoring non-zero exit code. Benchmark 3: ./target/release/diffutils diff test-data/huge-base test-data/huge-very-different Time (mean ± σ): 7.197 s ± 0.099 s [User: 7.055 s, System: 0.094 s] Range (min … max): 7.143 s … 7.416 s 10 runs Warning: Ignoring non-zero exit code. Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary diff test-data/huge-base test-data/huge-very-different ran 2.58 ± 0.04 times faster than ./target/release/diffutils diff test-data/huge-base test-data/huge-very-different 1704.04 ± 61.93 times faster than ./target/release/diffutils.no-heuristics diff test-data/huge-base test-data/huge-very-different Note that the worse that should happen by heuristics causing the search to end early is a suboptimal diff, but the diff will still be correct and usable with patch.

This is the last piece of the puzzle to get somewhat comparable to GNU diff performance without implementing all of its tricks - although this one is also used by GNU diff, in its own way. It brings down a diff which still takes over a minute with the previous commit to under a second. Benchmark 1: diff test-data/b.cpp test-data/c.cpp Time (mean ± σ): 2.533 s ± 0.011 s [User: 2.494 s, System: 0.027 s] Range (min … max): 2.519 s … 2.553 s 10 runs Warning: Ignoring non-zero exit code. Benchmark 2: ./target/release/diffutils.local-heuristics diff test-data/b.cpp test-data/c.cpp Time (mean ± σ): 65.798 s ± 1.080 s [User: 65.367 s, System: 0.053 s] Range (min … max): 64.962 s … 68.137 s 10 runs Warning: Ignoring non-zero exit code. Benchmark 3: ./target/release/diffutils diff test-data/b.cpp test-data/c.cpp Time (mean ± σ): 580.6 ms ± 6.5 ms [User: 521.9 ms, System: 38.8 ms] Range (min … max): 570.7 ms … 589.6 ms 10 runs Warning: Ignoring non-zero exit code. Summary ./target/release/diffutils diff test-data/b.cpp test-data/c.cpp ran 4.36 ± 0.05 times faster than diff test-data/b.cpp test-data/c.cpp 113.33 ± 2.26 times faster than ./target/release/diffutils.local-heuristics diff test-data/b.cpp test-data/c.cpp It basically keeps track of how much work we have done overall for a diff job and enables giving up completely on trying to find ideal split points if the cost implies we had to trigger the "too expensive" heuristic too often. From that point forward it only does naive splitting of the work. This should not generate diffs which are much worse than doing the diagonal search, as it should only trigger in cases in which the files are so different it won't find good split points anyway. This is another case in which GNU diff's additional work with hashing and splitting large chunks of inclusion / deletion from the diff work and trying harder to find ideal splits seem to cause it to perform slightly poorer: That said, GNU diff probably still generates better diffs not due to this, but due to its post-processing of the results, trying to create more hunks with nearby changes staying close to each other, which we do not do (but we didn't do that before anyway).

oSoMoN · 2024-11-01T23:03:36Z

I have yet to do a proper code review, but the approach looks sound to me, and a huge undertaking, so thank you very much for this!

@sylvestre any objection to not depending on the external diff crate and re-implementing it in-house?

@kov can you add a heading comment to src/engine.rs that briefly explains the algorithm being implemented, together with the heuristics, like you did in the commit messages? The GNU diff implementation also quotes and provides a (paywall) link to the original paper by Eugene W. Myers, could you add that too?

Add links to the original paper, and an explanation of the overall design to the implementation file.

kov · 2024-11-02T15:04:39Z

FWIW, I joked on Discord that this was potentially controversial because I would completely understand not wanting to maintain a custom engine, especially if the current diff crate gets its Myers implementation merged, and maybe we could instead do a simpler decision of which "backend" of the diff crate to use based on file size, for instance, and potentially not support the most niche pathological cases (diffing completely different massive files is not that common after all).

I don't know if you guys may have an easier time reaching the diff.rs maintainer. I would be totally fine with whatever decision, I have had a lot of fun learning about this stuff and writing this patch, but I am not emotionally tied to it, up to you!

sylvestre · 2024-11-02T18:46:33Z

@sylvestre any objection to not depending on the external diff crate and re-implementing it in-house?

fine with that but I am wondering if the algo itself could not a into a different crate (in case others would like to use it) ?

kov · 2024-11-05T23:12:51Z

@sylvestre any objection to not depending on the external diff crate and re-implementing it in-house?

fine with that but I am wondering if the algo itself could not a into a different crate (in case others would like to use it) ?

I'd be fine with that, do you envision it being a crate of the uutils project or do yu mean external? When I pondered this as I was starting I thought it may make more sense if we exported the diffutils crate as a library because then you'd get the whole functionality, but there is a benefit to having the generic implementation that can be used for file lines or anything else exposed separately, I suppose.

sylvestre · 2024-11-08T08:39:12Z

yeah, we can create https://github.com/uutils/diff-mayers-based-lcs-engine for example :)

kov · 2024-11-11T11:48:39Z

Let me make it into a separate crate then, I'll start on my namespace and we can move. Will give API a bit more thought for that. I think we want to add some parameters to control the heuristics and whether we only care about a difference existing, too. But feel free to comment on the current code!

sylvestre · 2024-11-11T13:46:25Z

i noticed this: https://crates.io/crates/lcs-diff
did you try with it ? :)

kov · 2024-11-13T00:27:42Z

i noticed this: https://crates.io/crates/lcs-diff did you try with it ? :)

I looked at this one when I was starting my research, yeah. It's basically implementing the same algorithm as the diff.rs crate that is used now, which creates a full matrix to compute the LCS:

fn create_table<T: PartialEq + Clone>(old: &[T], new: &[T]) -> Vec<Vec<u32>> {
    let new_len = new.len();
    let old_len = old.len();
    let mut table = vec![vec![0; old_len + 1]; new_len + 1];
    <snip>
    table
}

So it will still do a large allocation for medium files and be unable to diff very large files. I could try to reach out to the author and see if they have plans of implementing a myers-based algo.

diff tests: make file to patch explicit for missing_lines

82cdac6

GNU patch is getting confused and deciding the patch the wrong file on MacOS.

kov force-pushed the custom-engine branch from f2736bf to e79cdf0 Compare October 27, 2024 01:08

kov added 4 commits October 31, 2024 21:07

diff: add tests to cover heuristics

7485daf

Add BENCHMARKING.md file and more comments

21e7a07

Add links to the original paper, and an explanation of the overall design to the implementation file.

kov force-pushed the custom-engine branch from e79cdf0 to 21e7a07 Compare November 2, 2024 14:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to a custom Myers-based LCS engine #103

Switch to a custom Myers-based LCS engine #103

kov commented Oct 27, 2024

codecov bot commented Oct 27, 2024 •

edited

Loading

sylvestre commented Oct 27, 2024

kov commented Oct 27, 2024 •

edited

Loading

sylvestre commented Oct 27, 2024

oSoMoN commented Nov 1, 2024

kov commented Nov 2, 2024 •

edited

Loading

sylvestre commented Nov 2, 2024

kov commented Nov 5, 2024

sylvestre commented Nov 8, 2024

kov commented Nov 11, 2024

sylvestre commented Nov 11, 2024

kov commented Nov 13, 2024

Switch to a custom Myers-based LCS engine #103

Are you sure you want to change the base?

Switch to a custom Myers-based LCS engine #103

Conversation

kov commented Oct 27, 2024

codecov bot commented Oct 27, 2024 • edited Loading

Codecov Report

sylvestre commented Oct 27, 2024

kov commented Oct 27, 2024 • edited Loading

sylvestre commented Oct 27, 2024

oSoMoN commented Nov 1, 2024

kov commented Nov 2, 2024 • edited Loading

sylvestre commented Nov 2, 2024

kov commented Nov 5, 2024

sylvestre commented Nov 8, 2024

kov commented Nov 11, 2024

sylvestre commented Nov 11, 2024

kov commented Nov 13, 2024

codecov bot commented Oct 27, 2024 •

edited

Loading

kov commented Oct 27, 2024 •

edited

Loading

kov commented Nov 2, 2024 •

edited

Loading