Extract mathematical formulas from PDF files. #850

Open
ggservice007 opened this issue Mar 14, 2024 · 2 comments

@ggservice007
Collaborator

what

Extract mathematical formulas from PDF files.

For example:
[image]

Extract the formula and save it as LaTeX code.

@ggservice007 ggservice007 added priority-low data-processing Data Processing Database labels Mar 14, 2024
@ggservice007 ggservice007 added this to the v0.3.0 milestone Mar 14, 2024
@ggservice007 ggservice007 self-assigned this Mar 14, 2024
@nkwangleiGIT
Contributor

nkwangleiGIT commented Mar 15, 2024

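pdfimages can list the images embedded in a PDF file with the command pdfimages -list <path of pdf file>. To write the images to disk instead, pass an output format and a file prefix, for example pdfimages -png <path of pdf file> <output prefix>. Sample output of -list: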

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     164   164  index   1   8  image  no        18  0   197   197  647B 2.4%
   1     1 smask     164   164  gray    1   8  image  no        18  0   197   197   51B 0.2%
   1     2 image     893   550  index   1   8  image  no        19  0   150   150 1601B 0.3%
   1     3 smask     893   550  gray    1   8  image  no        19  0   150   150  545B 0.1%
   1     4 image     166    43  icc     3   8  image  no        20  0   151   152 9226B  43%
   1     5 smask     166    43  gray    1   8  image  no        20  0   151   152   32B 0.4%
   1     6 image     183   254  icc     3   8  jpeg   no        21  0   220   220 10.1K 7.4%
   3     7 image     615   579  rgb     3   8  jpx    yes       64  0   220   220 35.7K 3.4%
   4     8 image     606   589  rgb     3   8  jpx    yes       69  0   220   220 25.4K 2.4%
   7     9 image     606   672  rgb     3   8  jpx    yes       82  0   220   220 40.0K 3.4%

page: The page number of the image in the PDF file.
num: The sequential number of the image within the PDF file; numbering continues across pages.
type: The type of image, such as "image" or "smask" (soft mask).
width: The width of the image in pixels.
height: The height of the image in pixels.
color: The color space of the image, such as "index" (indexed color) or "gray" (grayscale).
comp: The number of color components in the image.
bpc: The number of bits per color component.
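enc: The encoding of the image data, such as "image" (raster data with no image-specific encoding), "jpeg", or "jpx" (JPEG 2000).
interp: Whether the image's interpolation flag is set, i.e. whether a viewer should interpolate when scaling the image up.
object ID: The object number and generation number of the image in the PDF file (for example, 18 0).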
x-ppi: The horizontal resolution of the image in pixels per inch (PPI).
y-ppi: The vertical resolution of the image in pixels per inch (PPI).
size: The size of the embedded image data in the PDF file (B = bytes, K = kilobytes).
ratio: The embedded size as a percentage of the uncompressed image size.
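
As a rough illustration of how this listing could be consumed, here is a minimal Python sketch that parses the -list output into dictionaries and keeps only the opaque images (dropping the smask rows). The function names, the field keys (including "gen" for the second part of the object ID), and the SAMPLE string are assumptions made for this example, not part of pdfimages or of this project. It assumes every data row has the 16 whitespace-separated columns shown above.

```python
import subprocess

# Column names for one data row; "object ID" is two columns (number, generation).
FIELDS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
          "enc", "interp", "object", "gen", "x_ppi", "y_ppi", "size", "ratio"]
INT_FIELDS = ("page", "num", "width", "height", "comp", "bpc",
              "object", "gen", "x_ppi", "y_ppi")

# Three rows copied from the listing above, used only for demonstration.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     164   164  index   1   8  image  no        18  0   197   197  647B 2.4%
   1     1 smask     164   164  gray    1   8  image  no        18  0   197   197   51B 0.2%
   1     6 image     183   254  icc     3   8  jpeg   no        21  0   220   220 10.1K 7.4%
"""


def parse_pdfimages_list(text):
    """Parse the text printed by `pdfimages -list` into a list of dicts."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the header and the dashed separator: data rows have 16 tokens
        # and start with a numeric page number.
        if len(tokens) != len(FIELDS) or not tokens[0].isdigit():
            continue
        row = dict(zip(FIELDS, tokens))
        for key in INT_FIELDS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def opaque_images(rows):
    """Keep rows whose type is "image", dropping soft masks and other types."""
    return [row for row in rows if row["type"] == "image"]


def list_pdf_images(pdf_path):
    """Run `pdfimages -list` on a PDF (requires poppler-utils on PATH)."""
    result = subprocess.run(["pdfimages", "-list", pdf_path],
                            check=True, capture_output=True, text=True)
    return parse_pdfimages_list(result.stdout)


def extract_pdf_images(pdf_path, prefix):
    """Write every embedded image to <prefix>-NNN.png using `pdfimages -png`."""
    subprocess.run(["pdfimages", "-png", pdf_path, prefix], check=True)
```

Calling opaque_images(parse_pdfimages_list(SAMPLE)) keeps the rows numbered 0 and 6. Whether a given image actually contains a formula cannot be decided from this metadata alone; that still needs a separate detection or OCR step.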

@nkwangleiGIT
Contributor

For LaTeX OCR, refer to https://github.com/lukas-blecher/LaTeX-OCR
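
A minimal Python sketch of how extracted images might be passed through that project follows. It assumes the pix2tex package from the linked repository and Pillow are installed, and it uses the LatexOCR class the way that repository's README shows it. The helper names (load_pix2tex_ocr, images_to_latex, save_latex) and the output file layout are hypothetical, chosen only for this example.

```python
from pathlib import Path


def load_pix2tex_ocr():
    """Return a callable that maps an image path to a LaTeX string.

    Assumes `pip install pix2tex` (the LaTeX-OCR package) and Pillow.
    The imports are lazy so the other helpers work without these packages.
    """
    from PIL import Image
    from pix2tex.cli import LatexOCR

    model = LatexOCR()
    return lambda path: model(Image.open(path))


def images_to_latex(image_paths, ocr):
    """Run `ocr` on each image path and collect {path: latex} results."""
    return {str(path): ocr(path) for path in image_paths}


def save_latex(results, out_path):
    """Write each formula as a display-math block, preceded by its source path."""
    blocks = [f"% {name}\n\\[ {latex} \\]\n" for name, latex in results.items()]
    Path(out_path).write_text("\n".join(blocks), encoding="utf-8")


# Example usage (hypothetical file names; loads the model on first call):
# ocr = load_pix2tex_ocr()
# results = images_to_latex(sorted(Path("out").glob("img-*.png")), ocr)
# save_latex(results, "formulas.tex")
```

Passing the OCR step in as a callable keeps the loop independent of any one model, so another recognizer could be substituted without changing the rest. This only covers formulas that are embedded in the PDF as images; formulas typeset as text and vector glyphs would not appear in the pdfimages output at all.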

@bjwswang bjwswang modified the milestones: v0.3.0, v0.4.0 Apr 9, 2024