Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

icdiff hanging on large files #213

Open
ChamiLamelas opened this issue Jul 1, 2023 · 5 comments
Open

icdiff hanging on large files #213

ChamiLamelas opened this issue Jul 1, 2023 · 5 comments

Comments

@ChamiLamelas
Copy link

ChamiLamelas commented Jul 1, 2023

Hello,

My colleague and I were interested in using icdiff inside of a university autograding software (linked here). However, we have encountered an issue where calling icdiff on somewhat large files seems to hang.

I have attached two ~1.5 MB UTF-8 files here, these files do differ. Some notes from running diff and icdiff on these files manually:

  • Standard diff runs in less than a second.
  • However, icdiff as well as python3 -m icdiff both seem to hang after displaying file1.txt and file2.txt.

Any input on this issue would be appreciated.

file1.txt
file2.txt

@jeffkaufman
Copy link
Owner

This is an issue in python's difflib. You can get diff.py from https://docs.python.org/3/library/difflib.html and you'll see python3 diff.py file1.txt file2.txt hangs (or just is very slow).

I think this is "expected behavior" for difflib: "The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear." --https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher

You could consider filing a bug on difflib, or contributing something there that improves worst-case behavior?

@mattrussell2
Copy link

mattrussell2 commented Jul 4, 2023

Is adding an option to use cdifflib something you'd consider?

https://stackoverflow.com/a/66694154/8742968

https://pypi.org/project/cdifflib/

@jeffkaufman
Copy link
Owner

@mattrussell2 I think that wouldn't solve the problem? Skimming that package. It looks like a re-implementation in C, but still the same algorithm?

(It's also missing some parts of difflib that icdiff depends on, but that might not be insurmountable?)

@mattrussell2
Copy link

Fair enough. Yeah 4x is certainly not going to fix polynomial-time issues.

If anyone's interested, the relevant issue re: difflib is here: python/cpython#51180

@mattrussell2
Copy link

What about using google's diff-match-patch?

https://github.com/google/diff-match-patch

This also takes an optional timeout parameter.

From my very quick reading of the source code here, seems like this library should be interoperable....but I'll dig into it a bit more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants