The content extracted by newspaper is out of order #947

Open
riusksk opened this issue Aug 17, 2022 · 0 comments

riusksk commented Aug 17, 2022

When newspaper is used to extract articles that contain code, the extracted content comes out in the wrong order.
For example, http://akat1.pl/?id=2 contains:

The error is placed in the pass-through() function of mail.local:
<code>

After extraction, it becomes:

<code>
The error is placed in the pass() function of mail.local: 

The bug is in the convert_to_text() function of outputformatters.py:

    def convert_to_text(self):
        txts = []
        for node in list(self.get_top_node()):  # Bug!!!!
            try:
                txt = self.parser.getText(node)

If you get txt with the single call below instead, the order is correct (it just doesn't wrap the lines correctly). With the for loop above, the output is out of order.

    txt = self.parser.getText(self.get_top_node())
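
For anyone trying to narrow this down, here is a minimal sketch of how the two approaches can diverge. It is not newspaper's code: it uses the standard library's xml.etree.ElementTree rather than lxml, the markup is a made-up stand-in for the article's top node, and text_per_child / text_whole_node are hypothetical helper names. It only shows that a per-child loop never sees text that sits directly inside the top node or after a child element, which may be related to what happens here. It does not reproduce the reordering itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the article's top node: the prose sits directly
# inside <div>, between the <pre> children, not in elements of its own.
HTML = (
    "<div>"
    "The error is in foo():"
    "<pre>int foo(void);</pre>"
    "It is fixed in bar():"
    "<pre>int bar(void);</pre>"
    "</div>"
)


def text_per_child(top):
    """Mimic the loop from the issue: extract text one child at a time."""
    return ["".join(child.itertext()).strip() for child in list(top)]


def text_whole_node(top):
    """Mimic the single call from the issue: walk the whole subtree in order."""
    return [t.strip() for t in top.itertext() if t.strip()]


top = ET.fromstring(HTML)
per_child = text_per_child(top)
whole = text_whole_node(top)

print(per_child)  # only the text inside the <pre> children
print(whole)      # prose and code interleaved in document order
```

Comparing the two lists on the real top node (for example by logging both inside convert_to_text()) should show which pieces of text the loop handles differently.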
