Multiple panics on Arxiv.org PDFs #75

Open
jlandahl opened this issue Dec 4, 2023 · 2 comments

jlandahl commented Dec 4, 2023

I'm attempting to extract the text from multiple PDFs from arxiv.org, and 15 out of the 20 I just attempted resulted in panics, many (but not all) apparently Unicode-related. Here are the links to the PDFs that failed:

Here are some of the errors:

For http://arxiv.org/pdf/2312.00064v1:

Unicode mismatch true fl "fl" Ok("fl") [64258]
Unicode mismatch true fi "fi" Ok("fi") [64257]
Unicode mismatch true fl "fl" Ok("fl") [64258]
thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 16 in map {60: "\u{f8f2}", 208: "Γ", 218: "Ω", 65: "\u{f8f8}", 217: "Ψ", 210: "Θ", 213: "Π", 63: "\u{f8e6}", 50: "\u{f8ee}", 160: " ", 57: "\u{f8fc}", 64: "\u{f8ed}", 212: "Ξ", 55: "\u{f8fa}", 209: "∆", 66: "\u{f8ec}", 49: "\u{f8f6}", 59: "\u{f8fe}", 48: "\u{f8eb}", 67: "\u{f8f7}", 51: "\u{f8f9}", 61: "\u{f8fd}", 52: "\u{f8f0}", 62: "\u{f8f4}", 211: "Λ", 159: "√", 53: "\u{f8fb}", 215: "Υ", 58: "\u{f8f3}", 214: "Σ", 54: "\u{f8ef}", 56: "\u{f8f1}", 216: "Φ"} for <</Type /Font/Subtype /Type1/BaseFont /VSLKGG+CMEX10/FirstChar 0/FontDescriptor 4273 0 R/LastChar 125/ToUnicode 4304 0 R/Widths 4259 0 R>>

For http://arxiv.org/pdf/2312.00140v1:

thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 0 in map {50: "\u{f8ee}", 54: "\u{f8ef}", 67: "\u{f8f7}", 53: "\u{f8fb}", 48: "\u{f8eb}", 160: " ", 63: "\u{f8e6}", 215: "Υ", 61: "\u{f8fd}", 214: "Σ", 57: "\u{f8fc}", 66: "\u{f8ec}", 60: "\u{f8f2}", 64: "\u{f8ed}", 209: "∆", 65: "\u{f8f8}", 208: "Γ", 218: "Ω", 159: "√", 213: "Π", 211: "Λ", 49: "\u{f8f6}", 212: "Ξ", 58: "\u{f8f3}", 56: "\u{f8f1}", 51: "\u{f8f9}", 62: "\u{f8f4}", 210: "Θ", 217: "Ψ", 52: "\u{f8f0}", 55: "\u{f8fa}", 216: "Φ", 59: "\u{f8fe}"} for <</Type /Font/Subtype /Type1/BaseFont /BJKPRR+CMEX10/FirstChar 0/FontDescriptor 1313 0 R/LastChar 88/ToUnicode 1374 0 R/Widths 1287 0 R>>

For http://arxiv.org/pdf/2309.02511v2:

thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 44 in map {43: "⇁", 165: "Ξ", 13: "γ", 91: "♭", 184: "λ", 46: "▷", 74: "J", 85: "U", 78: "N", 121: "y", 111: "o", 28: "τ", 89: "Y", 101: "e", 176: "γ", 191: "τ", 162: "∆", 15: "ϵ", 5: "Π", 109: "m", 178: "ϵ", 177: "δ", 103: "g", 98: "b", 174: "α", 173: "Ω", 125: "℘", 194: "χ", 100: "d", 8: "Φ", 94: "⌣", 26: "ρ", 68: "D", 30: "ϕ", 12: "β", 75: "K", 54: "6", 70: "F", 175: "β", 181: "θ", 104: "h", 34: "ε", 4: "Ξ", 42: "⇀", 62: ">", 23: "ν", 119: "w", 38: "ς", 11: "α", 90: "Z", 195: "ψ", 193: "ϕ", 180: "η", 86: "V", 17: "η", 124: "ȷ", 35: "ϑ", 128: "ψ", 73: "I", 36: "ϖ", 166: "Π", 189: "ρ", 112: "p", 170: "Ψ", 107: "k", 77: "M", 120: "x", 99: "c", 76: "L", 93: "♯", 27: "σ", 64: "∂", 190: "σ", 50: "2", 29: "υ", 53: "5", 188: "π", 24: "ξ", 115: "s", 97: "a", 168: "Υ", 164: "Λ", 9: "Ψ", 39: "φ", 41: "↽", 25: "π", 118: "v", 66: "B", 67: "C", 187: "ξ", 81: "Q", 83: "S", 88: "X", 179: "ζ", 95: "⌢", 3: "Λ", 52: "4", 14: "δ", 122: "z", 31: "χ", 183: "κ", 22: "µ", 113: "q", 80: "P", 60: "<", 102: "f", 47: "◁", 82: "R", 32: "ψ", 6: "Σ", 110: "n", 169: "Φ", 84: "T", 123: "ı", 167: "Σ", 192: "υ", 87: "W", 161: "Γ", 106: "j", 37: "ϱ", 48: "0", 117: "u", 71: "G", 72: "H", 65: "A", 108: "l", 49: "1", 1: "∆", 96: "ℓ", 2: "Θ", 51: "3", 186: "ν", 59: ",", 63: "⋆", 16: "ζ", 105: "i", 92: "♮", 7: "Υ", 56: "8", 55: "7", 21: "λ", 160: " ", 33: "ω", 57: "9", 20: "κ", 58: ".", 69: "E", 116: "t", 18: "θ", 10: "Ω", 40: "↼", 114: "r", 19: "ι", 182: "ι", 0: "Γ", 185: "µ", 126: "\u{20d7}", 79: "O", 163: "Θ", 61: "/"} for <</Type /Font/Subtype /Type1/BaseFont /APPDUE+CMMI10/FirstChar 11/FontDescriptor 1143 0 R/LastChar 122/ToUnicode 1193 0 R/Widths 1129 0 R>>

For http://arxiv.org/pdf/2312.00735v1:

thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 118 in map {159: "√", 62: "\u{f8f4}", 57: "\u{f8fc}", 218: "Ω", 213: "Π", 63: "\u{f8e6}", 64: "\u{f8ed}", 50: "\u{f8ee}", 66: "\u{f8ec}", 212: "Ξ", 55: "\u{f8fa}", 65: "\u{f8f8}", 58: "\u{f8f3}", 49: "\u{f8f6}", 215: "Υ", 53: "\u{f8fb}", 56: "\u{f8f1}", 67: "\u{f8f7}", 208: "Γ", 59: "\u{f8fe}", 216: "Φ", 160: " ", 210: "Θ", 217: "Ψ", 211: "Λ", 51: "\u{f8f9}", 54: "\u{f8ef}", 52: "\u{f8f0}", 60: "\u{f8f2}", 214: "Σ", 48: "\u{f8eb}", 61: "\u{f8fd}", 209: "∆"} for <</Type /Font/Subtype /Type1/BaseFont /KFVYMG+CMEX10/FirstChar 16/FontDescriptor 638 0 R/LastChar 118/ToUnicode 671 0 R/Widths 617 0 R>>
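
Until the underlying lookups are fixed, a caller can at least keep one bad PDF from taking down the whole batch. The following is a minimal sketch, not part of pdf-extract: `extract_or_error` is a hypothetical helper, and the `extract` closure stands in for whatever call performs the extraction. It relies on `std::panic::catch_unwind`, so it only helps when the binary is built with the default `panic = "unwind"` strategy.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs `extract` and converts a panic into an `Err` carrying the panic message.
/// `extract` is a hypothetical stand-in for the call that performs the extraction.
pub fn extract_or_error<F>(extract: F) -> Result<String, String>
where
    F: FnOnce() -> String,
{
    // AssertUnwindSafe is acceptable here on the assumption that any state
    // touched by the closure is discarded when extraction fails.
    panic::catch_unwind(AssertUnwindSafe(extract)).map_err(|payload| {
        // Panics raised with format arguments carry a String payload;
        // panics raised with a plain literal carry a &'static str.
        if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else if let Some(msg) = payload.downcast_ref::<&str>() {
            (*msg).to_string()
        } else {
            "extraction panicked with a non-string payload".to_string()
        }
    })
}
```

The default panic hook still prints the message to stderr, but the caller gets an `Err` and can move on to the next file.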

jrmuizel commented Dec 5, 2023

Commit aeb9a9d fixes the first PDF. I haven't tested the other ones yet.

jrmuizel commented

2312.00577v1 crashes with:

thread 'main' panicked at src/lib.rs:820:27:
missing char 17 in map {49: "1", 71: "Γ", 103: "g", 48: "0", 69: "E", 110: "n", 13: "K", 45: "−", 81: "Q", 53: "5", 114: "r", 101: "e", 50: "2", 121: "y", 77: "M", 46: "."}
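
All of the panics above have the same shape: a character code that is absent from the font's code-to-Unicode map. One possible direction is to substitute a placeholder instead of panicking. The sketch below only illustrates that idea under stated assumptions. It is not the crate's actual code and not necessarily what aeb9a9d does; `decode_codes` and its parameters are hypothetical names.

```rust
use std::collections::HashMap;

/// Decodes character codes through a code-to-Unicode map, emitting U+FFFD
/// (REPLACEMENT CHARACTER) for any code the map does not cover.
/// Hypothetical helper for illustration only.
pub fn decode_codes(codes: &[u32], to_unicode: &HashMap<u32, String>) -> String {
    let mut out = String::new();
    for code in codes {
        match to_unicode.get(code) {
            // Mapped code: append its Unicode text (may be several chars, e.g. ligatures).
            Some(text) => out.push_str(text),
            // Unmapped code: keep going rather than aborting the whole document.
            None => out.push('\u{FFFD}'),
        }
    }
    out
}
```

With a map containing only 49 → "1" and 71 → "Γ", decoding the codes 49, 17, 71 yields "1", a replacement character, then "Γ". Whether a placeholder, an empty string, or a fallback to the font's base encoding is the better choice is a design decision for the maintainer.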
