All notable changes to this project will be documented in this file.
This project adheres to Semantic Versioning.
- Set minimum supported Python version to 3.9
- Remove pyre2 as it hasn't been updated in a long time and regex seems to be the better lib here
- Remove long-deprecated methods "decode_email", "decode_email_b"
- Add ruff config
- Add Python 3.12 to tests
- Remove obsolete methods.
- Rework hashing wrapper methods.
- Add a custom e-mail parsing policy for fixing invalid values as soon as possible.
- Currently implemented for invalid message-id and date parsing.
- Renamed eml_parser.eml_parser to eml_parser.parser to make imports safer. This should not break any usage but nonetheless make sure to verify that you are not importing eml_parser.eml_parser.
- While adding tests for Python 3.11, cchardet fails to install. Turns out it seems to be abandoned and as such it has been replaced with charset-normalizer.
- Migrate setup.cfg to pyproject.toml
- Fix typing and linter issues.
- Fix parsing bad message-id formats #79.
- When serialising RFC822 payloads, use a custom policy which has no limit on line lengths, as such a limit breaks badly encoded messages.
- Fix issue #76 "If a CR or LF is found in a malformed email address header fields (From/To/etc.), the ValueError breaks the parsing." (@malvidin, @cccs-rs)
- Add Public Suffix List validation options for URLs and email addresses. (@malvidin)
- Add ip_force_routable option to filter out non-routable IPs. (@malvidin)
- Add domain_force_tld option to filter out domains with invalid TLDs. (@malvidin)
- Add include_www option to include potential URLs without a scheme. (@malvidin)
- Add IP, domain, and Public Suffix List filtering tests. (@malvidin)
- Add www_regex and dom_regex tests. (@malvidin)
- Add optional matching for HTML SRC and HREF. (@malvidin)
- Moved URL parsing options to EmlParser from get_uri_ondata. (@malvidin)
- Ensure string_sliding_window_loop includes the last slice of the body. (@malvidin)
- Keep subsequent URLs if URLs are comma separated. (@malvidin)
- Fix linter warnings.
- Add typing dev dependencies.
- Fix catastrophic backtracking on url regex, add related tests for backtracking, unicode, and IPv6. (thanks @malvidin)
- Add Unicode character ranges for re2. (thanks @malvidin)
- Add tests for url_regex_simple, change where parens are matched in url_regex_simple, specify which re engine needs which expression. (thanks @malvidin)
- Match URLs with trailing ? with url_regex_simple. (thanks @malvidin)
As reported in #62 and #63, certain regular expressions (in this case the URL regex) can cause the regex engine to run seemingly forever, a problem commonly referred to as "catastrophic backtracking". To make it easier to test two seemingly popular alternative regex engines (both with good cross-platform wheel support), two extra flags have been introduced:
Note-1: These are temporary extra tags which may be removed in future releases.
Note-2: eml_parser will transparently use regex if it is found, or pyre2 (in that order).
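
The lookup order described in Note-2 can be pictured with the following minimal sketch. It is not the library's actual code: the helper name pick_regex_engine is hypothetical, the module name re2 for pyre2 is an assumption, and the final fallback to the standard library re module is added here only so that the sketch always runs.

```python
import importlib


def pick_regex_engine(candidates=("regex", "re2", "re")):
    """Return the first importable module from candidates (hypothetical helper)."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This engine is not installed, try the next one in order.
            continue
    raise ImportError("no regular expression engine available")


# Whichever engine was found is then used through the familiar re-style API.
engine = pick_regex_engine()
pattern = engine.compile(r"https?://\S+")
print(engine.__name__, pattern.findall("see https://example.com/a for details"))
```

Which engine name is printed depends on what is installed in the environment.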
- eml_parser.regex has been renamed to eml_parser.regexes in order not to clash with the regex python module.
- Converted the documentation to mkdocs.
- Fixed a bug in FROM header field parsing. In case the display name part contained an e-mail address, that one was naively used instead of properly parsing the field.
- Cleanup example scripts.
- Handle extra case of when chardet detects VISCII text which Python is currently unable to decode (thanks @cccs-rs #59).
- Add multipart boundary marker as discussed in #56, to make parts easier to distinguish.
- Fixed a major bug which resulted in not all URLs being returned because of a variable which was overwritten instead of being extended.
- Handle URL parsing issue and only emit a warning with the problematic URL but do not break the rest of the parsing.
- Filter out any scheme-only URLs.
- Make sure the URL parsing regex only matches URLs with scheme (as it is supposed to).
- Try to detect partial URLs (looking for a scheme) and extend the sliced body window accordingly. This allows for better URL extraction.
- Prevent routing.parserouting() from throwing an exception on unparsable receive lines (thanks @kinoute #54).
- Do not unnecessarily call eml_parser.decode.robust_string2date on an empty string.
- Fix routing.parserouting() to handle domains containing the word 'from' by themselves (thanks @jgru #51).
Adapted the examples/simple_test.py to use the eml_parser class instead of the deprecated method.
- When parsing URLs from the body:
- do not try to replace "hxxp" by "http" as we do not parse "hxxp" anyway (legacy)
- skip URLs with no "."
- update the regex for searching for URLs based on https://gist.github.com/gruber/8891611 in order to prevent infinite runs in certain cases (thanks @kevin-dunas)
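
As a rough illustration of the "skip URLs with no '.'" rule above, the sketch below filters a list of candidate URLs. The function name keep_url is hypothetical, and the additional checks (a scheme and a host must be present, unparsable URLs are dropped) are assumptions made for this example rather than a description of the library's internals.

```python
from urllib.parse import urlparse


def keep_url(url):
    """Hypothetical filter: keep URLs that parse, have a scheme and host, and contain a dot."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse rejects some malformed inputs; drop them instead of failing.
        return False
    if not parsed.scheme or not parsed.netloc:
        return False
    return "." in url


candidates = ["https://example.com/a", "http://", "http://localhost", "example.com"]
print([u for u in candidates if keep_url(u)])
```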
Implemented a workaround for an upstream bug (https://bugs.python.org/issue30681) which breaks EML parsing if the source EML contains an unparsable date-time field (thanks @nth-attempt).
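
The general idea of tolerating an unparsable date can be sketched as follows. This is only an illustration using the standard library, not the workaround shipped in eml_parser; the helper name safe_parse_date is hypothetical, and returning None for bad input is an assumption.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def safe_parse_date(value: str) -> Optional[datetime]:
    """Parse an RFC 2822 style date, returning None instead of raising."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Depending on the Python version, bad input raises one or the other.
        return None


print(safe_parse_date("Mon, 20 Nov 1995 19:12:08 -0500"))
print(safe_parse_date("not a date"))
```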
Fixed a bug which prevented correct attachment parsing in certain situations (thanks @ninoseki).
Use a simpler, less time-consuming regular expression for finding IPv4 address candidates, and then validate both IPv4 and IPv6 addresses with ipaddress, which is fast and leads to more correct matches.
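
The two-step approach (a loose pattern to find candidates, then strict validation) might look like the sketch below. The pattern and the function name extract_ipv4 are illustrative assumptions, not the expressions used in eml_parser.

```python
import ipaddress
import re

# Deliberately loose: four dot-separated groups of one to three digits.
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")


def extract_ipv4(text):
    """Return the candidates in text that ipaddress accepts as valid addresses."""
    found = []
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            found.append(str(ipaddress.ip_address(candidate)))
        except ValueError:
            # Matches the loose pattern but is not a real address, e.g. 999.1.1.1.
            continue
    return found


print(extract_ipv4("from 192.168.1.10 via 999.1.1.1 and 10.0.0.1"))
```

Keeping the pattern simple avoids expensive matching, while ipaddress takes care of the range checks the pattern does not attempt.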
- Simplify the code by using a sliding window body slicing method
- Use alternative URL extraction regular-expression
- Fix other regular-expressions (non-required escaping and ^)
- No longer support parsing hxxp(s) style URLs
- In some cases the extracted features (i.e. domain, IP, URL, e-mail) were not correct due to wrongfully cutting through the body. This has been fixed by extending the text slice to a character unrelated to the match pattern.
- Added EmlParser class in order to simplify inner workings.
- Moved typing annotations inline.
- Replaced a couple of regular expressions with simpler string operations for improved parsing speed.
- Renamed (internal) method give_dom_ip to give_dom_ip.
- Simplify mime-type detection
- Deprecated Python support for versions <3.7.
- Deprecated the usage of eml_parser.decode_email and eml_parser.decode_email_b. You should use the class instead.
- Fixed docstrings.
- Removed any broad Exception usage.
- Fixed import orders.
- Extra requires option file-magic was renamed to filemagic, as pip does not seem to work with "-" in the name.
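
Several entries above mention slicing the body with a sliding window and extending a slice so that a match is not cut in half. A minimal sketch of that idea follows. It assumes non-overlapping slices and whitespace as the boundary characters; the function name sliding_window and its parameters are hypothetical and do not reflect the library's actual implementation.

```python
def sliding_window(body, window=20, boundary=" \t\r\n"):
    """Yield consecutive slices of body, each extended to the next boundary character."""
    start = 0
    length = len(body)
    while start < length:
        end = min(start + window, length)
        # Grow the slice until a boundary character so tokens such as URLs stay whole.
        while end < length and body[end] not in boundary:
            end += 1
        yield body[start:end]
        start = end


body = "see https://example.com/path now"
for chunk in sliding_window(body, window=8):
    print(repr(chunk))
```

Because each slice ends on a boundary character or at the end of the body, the final slice is always emitted and no token is split between two slices.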