Can pypdf correctly extract text from two-column PDF documents? #2467

rybshik · 2024-02-24T10:34:57Z

Consider the following article: https://arxiv.org/pdf/2106.13823.pdf

It is an academic paper formatted in two columns.

Can pypdf correctly extract text from PDF documents with a two-column format?

MartinThoma · 2024-02-24T10:41:18Z

Maintainer

PyPDF2 is deprecated and PyPDF3 / PyPDF4 are different projects. I've edited your question to be only about pypdf.

MartinThoma · 2024-02-24T10:43:37Z

Maintainer

Is the layout mode text extraction what you're looking for?

https://pypdf.readthedocs.io/en/stable/user/extract-text.html

3 replies

rybshik Feb 24, 2024
Author

I want to extract text from a two-column PDF in the natural reading order: first down the left column, then down the right.
Does PyPDF2 do this by default?

pubpub-zz Feb 24, 2024
Maintainer

The standard extract_text() seems to work properly on your document. Give it a try.

pubpub-zz Feb 24, 2024
Maintainer

I mean .extract_text(extraction_mode='plain'). In contrast, .extract_text(extraction_mode='layout') will put the columns on the same line.
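
For anyone who wants to compare the two modes side by side, here is a minimal sketch. It only relies on pypdf's PdfReader and the extraction_mode argument of extract_text() mentioned above. The helper names text_per_page and compare_modes are made up for illustration, and whether plain mode returns the columns in reading order depends on how the PDF was produced, so check the output on your own file.

```python
def text_per_page(reader, extraction_mode="plain"):
    """Return one extracted string per page of an already opened reader.

    `reader` is expected to behave like pypdf.PdfReader: it exposes `.pages`,
    and each page has `extract_text(extraction_mode=...)`.
    """
    return [
        page.extract_text(extraction_mode=extraction_mode)
        for page in reader.pages
    ]


def compare_modes(pdf_path):
    """Extract every page in both modes so the results can be compared.

    Hypothetical convenience wrapper, not part of pypdf.
    """
    # Imported here so text_per_page stays usable with any reader-like object.
    from pypdf import PdfReader  # requires: pip install pypdf

    reader = PdfReader(pdf_path)
    return {
        # "plain" follows the order of the content stream, which for this
        # kind of paper is often left column first, then right column.
        "plain": text_per_page(reader, "plain"),
        # "layout" tries to mimic the visual page, so text from both columns
        # ends up on the same output line.
        "layout": text_per_page(reader, "layout"),
    }
```

Usage, assuming the paper was saved locally as 2106.13823.pdf: call compare_modes("2106.13823.pdf") and print the first entry of result["plain"] and result["layout"] to see which one matches the reading order you need.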
