Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve code support #401

Merged
merged 13 commits into from
Sep 5, 2023
Merged

improve code support #401

merged 13 commits into from
Sep 5, 2023

Conversation

idoshamun
Copy link
Contributor

This PR improves support in detecting and handling code snippets as follows:

Add support in inline and multi-line code snippets from pip
Example: https://pypi.org/project/openai-function-call/
When fetching Medium pages without javascript enabled, code snippets are structured as <pre><span>code</span></pre>
Example: https://medium.com/@jxnlco/seamless-integration-with-openai-and-pydantic-a-powerful-duo-for-output-parsing-fcb1e616167b
Use `<lb/>` instead of `\n` for code snippets
@codecov
Copy link

codecov bot commented Aug 3, 2023

Codecov Report

Merging #401 (375fc9a) into master (088283c) will increase coverage by 0.01%.
Report is 1 commits behind head on master.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #401      +/-   ##
==========================================
+ Coverage   96.81%   96.82%   +0.01%     
==========================================
  Files          22       22              
  Lines        3323     3334      +11     
==========================================
+ Hits         3217     3228      +11     
  Misses        106      106              
Files Changed Coverage Δ
trafilatura/core.py 97.98% <100.00%> (-0.01%) ⬇️
trafilatura/htmlprocessing.py 95.96% <100.00%> (+0.22%) ⬆️
trafilatura/utils.py 97.53% <100.00%> (ø)
trafilatura/xml.py 98.83% <100.00%> (ø)

@adbar
Copy link
Owner

adbar commented Aug 7, 2023

Hi @idoshamun, thanks for the PR! Could you please add coverage for the line that is missing?

@idoshamun
Copy link
Contributor Author

@adbar will do!

@idoshamun
Copy link
Contributor Author

@adbar we should be ready to go!

trafilatura/utils.py Outdated Show resolved Hide resolved
trafilatura/core.py Outdated Show resolved Hide resolved
trafilatura/xpaths.py Outdated Show resolved Hide resolved
@adbar
Copy link
Owner

adbar commented Aug 14, 2023

@idoshamun Thanks for adding a corresponding test!

While your changes make perfect sense the cases tackled by them are quite specific and subject to change over time. I just want to make sure that the software can continue being used as a general-purpose extractor, the questions above arise from this fact.

@idoshamun
Copy link
Contributor Author

@adbar I'll give it a look hopefully during the weekend :)

@adbar
Copy link
Owner

adbar commented Aug 23, 2023

@idoshamun I tried to address part of our concerns, feel free to have a look at the code.

The tests are failing right now but this is something else, I'll retry later.

@idoshamun
Copy link
Contributor Author

@adbar sorry for the delay on my side kinda busy these days. We'll do my best to review all changes asap

trafilatura/utils.py Outdated Show resolved Hide resolved
@idoshamun
Copy link
Contributor Author

@adbar thanks for the fixes, I only have one concern. I don't see where we replace spaces with ;cs; in code elements. I see only the other way around but the tests pass so maybe I missed something.

@adbar
Copy link
Owner

adbar commented Aug 31, 2023

It's at the end of this long line:
line = line.replace('&#13;', '\r').replace('&#10;', '\n').replace('&nbsp;', '\u00A0').replace(';cs;', ' ')

I still need to address the questions above but we're moving forward, thanks for your help!

@idoshamun
Copy link
Contributor Author

@adbar this long line switches the encoded space to a regular space. But I don't see where we do the other way around. It was removed in your changes but I didn't find the alternative

@adbar
Copy link
Owner

adbar commented Sep 1, 2023

@idoshamun I see what you mean, you added lines like this: child.text = child.text.replace(' ', CODE_SPACE)

The tests are passing though, so either it's not so important or you didn't add a corresponding test. Could you please check if the current version works for you? If not feel free to put back the corresponding lines along with a suitable test case.

@idoshamun
Copy link
Contributor Author

@adbar I think I saw in one of the commits you removed the spaces in the tests. I'll give it a look

@adbar
Copy link
Owner

adbar commented Sep 1, 2023

@idoshamun yes, my bad, it was in 52688ea because I didn't understand the point. Now I see.

@idoshamun
Copy link
Contributor Author

@adbar is there anything I can do at this point to help?

@adbar
Copy link
Owner

adbar commented Sep 5, 2023

@idoshamun I was able to solve most problems mentioned above and will now merge the PR as substantial progress has been made.

The spacing issue remains, it's not clear to me how to address it at this point. You could work on it in another PR, as well as add further tests if you witness issues with the robustness of code extraction.

@adbar adbar merged commit f101371 into adbar:master Sep 5, 2023
13 checks passed
@idoshamun
Copy link
Contributor Author

@adbar thanks a lot. Can you give me some pointers of where to look for the spaces? What will be the best place?
For some languages it's crucial to keep proper spacing

@adbar
Copy link
Owner

adbar commented Sep 5, 2023

@idoshamun This commit was fine but I'm not sure it works with the latest changes, the function handle_code_blocks() seems like a good fit.

6ca5a20#diff-abef8bfc97e60c84035950c7da832f5ca07580db79e4fb9e77ef0f8b82112b30R229

@adbar
Copy link
Owner

adbar commented Sep 5, 2023

The problem is that the spaces are stripped at some point, I can look further into it if you don't find the culprit.

@idoshamun
Copy link
Contributor Author

I'm gonna have a look now at what will be a better way to implement the spaces. Any pointers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants