Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: report file name of file that chardet fails to read #3524

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
33 changes: 17 additions & 16 deletions codespell_lib/_codespell.py
Original file line number Diff line number Diff line change
Expand Up @@ -227,12 +227,12 @@ def init_chardet(self) -> None:

self.encdetector = UniversalDetector()

def open(self, filename: str) -> Tuple[List[str], str]:
def open(self, filename: str) -> Tuple[List[str], Optional[str]]:
if self.use_chardet:
return self.open_with_chardet(filename)
return self.open_with_internal(filename)

def open_with_chardet(self, filename: str) -> Tuple[List[str], str]:
def open_with_chardet(self, filename: str) -> Tuple[List[str], Optional[str]]:
self.encdetector.reset()
with open(filename, "rb") as fb:
for line in fb:
Expand All @@ -241,26 +241,26 @@ def open_with_chardet(self, filename: str) -> Tuple[List[str], str]:
break
self.encdetector.close()
encoding = self.encdetector.result["encoding"]

try:
f = open(filename, encoding=encoding, newline="")
except UnicodeDecodeError:
print(f"ERROR: Could not detect encoding: {filename}", file=sys.stderr)
raise
except LookupError:
if not encoding:
print(
f"ERROR: Don't know how to handle encoding {encoding}: {filename}",
f"WARNING: Chardet failed to detect encoding for file {filename}.",
file=sys.stderr,
)
try:
with open(filename, encoding=encoding, newline="") as f:
lines = self.get_lines(f)
except LookupError: # Raised by open() if encoding is unknown
error_msg = f"ERROR: Chardet returned unknown encoding for: {filename}."
print(error_msg, file=sys.stderr)
raise
except UnicodeDecodeError: # Raised by self.get_lines() if decoding fails
error_msg = f"ERROR: Failed decoding file: {filename}"
print(error_msg, file=sys.stderr)
raise
else:
lines = self.get_lines(f)
f.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To minimize changes, I would suggest:

        try:
            f = open(filename, encoding=encoding, newline="")
        except LookupError:
            print(
                f"ERROR: Don't know how to handle encoding {encoding}: {filename}",
                file=sys.stderr,
            )
            raise
        else:
            try:
                lines = f.readlines()
            except UnicodeDecodeError:
                print(f"ERROR: Could not detect encoding: {filename}", file=sys.stderr)
                raise
            finally:
                f.close()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to minimize changes? I'm happy to go with whatever you want but let me try to convince you (with Zen of Python):

  • "There should be an obvious way to do it": context managers are the way one should open files. Not use finally, it's messy
  • "Flat is better than nested": Your suggestion has loads of try/else/try/finally nesting, it's hard to grok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really think this is not much more readable?

        try:
            with open(filename, encoding=encoding, newline="") as f:
                lines = self.get_lines(f)
        except LookupError:  # Raised by open() if encoding is unknown
            error_msg = f"ERROR: Chardet returned unknown encoding for: {filename}."
            print(error_msg, file=sys.stderr)
            raise
        except UnicodeDecodeError:  # Raised by self.get_lines() if decoding fails
            error_msg = f"ERROR: Failed decoding file: {filename}"
            print(error_msg, file=sys.stderr)
            raise

Also note that you introduced a bug by replacing self.get_lines(f) with f.readlines() in your suggestion 🙈

Copy link
Collaborator

@DimitriPapadopoulos DimitriPapadopoulos Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I agree that in general flat is better, but I would also like to limit the code under try to the strict minimum, as a way to document which exceptions each piece of code is expected to raise. Some linters do enforce that.
    • open only raises LookupError
    • readlines only raises UnicodeDecodeError
  2. I'd like to keep the codebase consistent. See Fix uncaught exception on empty files #2195.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with using context managers to open files. I just didn't know how to make it compatible with try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, nice to keep try local, but it's a tradeoff with nesting. I commented instead to make clear what raises what.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific regarding your "I would like to keep codebase consistent"? I don't know what exactly you would like me to change. Is anything inconsistent?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open_with_chardet and open_with_internal (already fixed in #2195) should be kept as similar as possible and even share code if possible at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!


return lines, f.encoding
return lines, encoding
DimitriPapadopoulos marked this conversation as resolved.
Show resolved Hide resolved

def open_with_internal(self, filename: str) -> Tuple[List[str], str]:
encoding = None
first_try = True
for encoding in ("utf-8", "iso-8859-1"):
if first_try:
Expand Down Expand Up @@ -869,7 +869,7 @@ def apply_uri_ignore_words(
return check_matches


def parse_file(
def parse_file( # noqa: PLR0915
filename: str,
colors: TermColors,
summary: Optional[Summary],
Expand All @@ -887,6 +887,7 @@ def parse_file(
bad_count = 0
lines = None
changed = False
encoding: Optional[str]

if filename == "-":
f = sys.stdin
Expand Down
45 changes: 41 additions & 4 deletions codespell_lib/tests/test_basic.py
Original file line number Diff line number Diff line change
Expand Up @@ -569,14 +569,51 @@ def test_encoding(
assert "WARNING: Binary file" in stderr


def test_unknown_encoding_chardet(
def test_chardet_exceptions(
tmp_path: Path,
capsys: pytest.CaptureFixture[str],
) -> None:
"""Test opening a file with unknown encoding using chardet"""
"""Test encoding handling with chardet exceptions."""
fname = tmp_path / "tmp"
fname.touch()
assert cs.main("--hard-encoding-detection", fname) == 0
fname.write_bytes("naïve\n".encode())
with mock.patch(
"chardet.universaldetector.UniversalDetector"
) as mock_detector_class:
# Configure the mock to simulate an incorrect encoding detection
mock_detector = mock.MagicMock()
mock_detector.result = {"encoding": None}
mock_detector.done = True
mock_detector_class.return_value = mock_detector

# Simulate chardet not detecting any encoding
result = cs.main("-e", fname, std=True, count=False)
assert isinstance(result, tuple)
code, stdout, stderr = result
assert code == 0
assert not stdout
assert "WARNING: Chardet failed to detect encoding" in stderr
assert str(fname) in stderr

# Simulate chardet falsely detecting utf-8, instead of the correct iso-8859-1
mock_detector.result = {"encoding": "utf-8"} # Simulate wrong encoding detected
mock_detector_class.return_value = mock_detector
fname.write_bytes(b"Speling error, non-ASCII: h\xe9t\xe9rog\xe9n\xe9it\xe9\n")
with pytest.raises(UnicodeDecodeError) as exc_info_unicode:
cs.main("-e", fname, std=True, count=False)
stderr = capsys.readouterr().err
assert "ERROR: Failed decoding file:" in stderr
assert str(fname) in stderr
assert "utf-8" in str(exc_info_unicode.value)

# Simulate chardet detecting non-existent encoding
mock_detector.result = {"encoding": "UTF-doesnotexist"}
mock_detector_class.return_value = mock_detector
with pytest.raises(LookupError) as exc_info_lookup:
cs.main("-e", fname, std=True, count=False)
stderr = capsys.readouterr().err
assert "ERROR: Chardet returned unknown encoding" in stderr
assert str(fname) in stderr
assert "UTF-doesnotexist" in str(exc_info_lookup.value)


def test_ignore(
Expand Down
Loading