bibtextparser does not properly handle escaped dollar signs in input file #264

simsong · 2020-08-03T16:31:17Z

When it reads this record:

@book{bug1,
 title="Demonstrate a bug. A \$1 billion!",
 author="Dr. Evil",
 year=2000
}

bibtexparser writes it this way:

@book{bug1,
 author = {Dr. Evil},
 title = {{D}emonstrate a bug. {A} \textbackslash \textdollar 1 billion!},
 year = {2000}
}

This breaks the file!

Here is my demo program:

import sys
import io
import bibtexparser
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import homogenize_latex_encoding

BUG1 = """
@book{bug1,
 title="Demonstrate a bug. A \$1 billion!",
 author="Dr. Evil",
 year=2000
}
"""

if __name__=="__main__":
    bugfile = io.StringIO(BUG1)
    parser = BibTexParser()
    parser.customization = homogenize_latex_encoding
    bib_database = bibtexparser.load(bugfile, parser=parser)

    print("INPUT:")
    print(BUG1)
    print("OUTPUT:")
    bibtexparser.dump(bib_database, sys.stdout)

The text was updated successfully, but these errors were encountered:

MiWeiss · 2022-07-09T13:20:25Z

Successfully reproduced:

Created branch https://github.com/sciunto-org/python-bibtexparser/compare/issue/264-dollarsign with a test capturing this.
Tracked down the problem to the (as it appears) latexenc.py > latex_to_unicode method.

Someone should have a good look at latex_to_unicode - this does not appear to be the only problem. E.g. one line calls replace("{", "") only for the next line to check if there is still a } present in the same string.

We cannot be the first ones with this problems. E.g., a quick google search pointed to this lib which may be helpful.

@simsong , or anyone else, would you be willing to look into this and attempt a fix of the method?

simsong · 2022-07-09T21:58:53Z

Hi. I have lots of code that I've written for automatically escaping LaTeX code.

I like this a lot:

https://github.com/simsong/ctools/blob/5e2387668852c5858367f6e8c2b75e06bba5839e/latex_tools.py#L129

MiWeiss · 2022-07-10T10:15:56Z

Sounds great. Would you be able to work that into a fix? Unicode to Latex and vice versa in the current code has quite some technical dept and it would be great if someone with experience on the problem could take it on :-)

simsong · 2022-07-10T12:46:55Z

Sure thing. Let me try to do it sometime today. I didn’t know how open you would be to pull request.

simsong · 2022-07-10T12:54:46Z

Actually, before I go forward: I see you have a lot of test vectors but do not have the correct outputs from reading the test factors. Do you think this is in oversight?

Also, do you have a theory of operation document? In particular, I want to know if you have a policy on how information is represented in the dictionary for each entry. Is it supposed to be plain Unicode, or is it supposed to have some LaTeX interpretation?

Finally, I think it would be useful for your regression tests to also run bibtex and LaTeX on the code produced to make sure it is valid. Do they do this? If not, would that be useful for me to add?

MiWeiss · 2022-07-10T13:06:28Z

Actually, before I go forward: I see you have a lot of test vectors but do not have the correct outputs from reading the test factors. Do you think this is in oversight?

That sounds strange. Not sure which tests you are referring to. The only possible explanation I can imagine right now, other than an oversight, would be to use these tests as smoke tests, i.e., to "just make sure no exceptions are raised". Note that I joined the project rather recently - so I would have to dig into this.

Also, do you have a theory of operation document? In particular, I want to know if you have a policy on how information is represented in the dictionary for each entry. Is it supposed to be plain Unicode, or is it supposed to have some LaTeX interpretation?

No such document, to the best of my knowledge. From reading the source code, it appears that first, all latex is converted to unicode, and then this unicode is properly escaped.
IMHO: There's quite some technical dept in that part of the code - and it is one of our largest sources of errors. The ones who wrote it are nowhere to be asked ;-) Hence it's probably good if you think about a sustainable solution in the "bigger picture", even if that means that we have to change the writer here and there. What I would optimally expect is just nice, plain unicode in the parsed entries, with a (optional) backwards escaping when writing back to the bibtex file.

IMO, in the long term, all these things should be optional, i.e., users should decide if conversions and special escapings are applied - this refactoring should probably only be addressed once the current backlog of issues is resolved :-)

Finally, I think it would be useful for your regression tests to also run bibtex and LaTeX on the code produced to make sure it is valid. Do they do this? If not, would that be useful for me to add?

Love the idea. But this should be in a separate PR, and outside of the current unit tests. Kind of an "integration test suite", which we run on PRs and releases - while keeping the unit test suite simple and fast.

Edit: Might be worth also looking at the "convert_to_unicode" customization.

MiWeiss · 2022-07-10T13:11:33Z

Related issues (might or might not be fixed by your changes):
#272
#274

MiWeiss · 2022-07-10T15:41:51Z

Just read my comment again, and to be more precise: All of this should of course remain part of the customizations. I think enforcing an encoding change would just break too many applications relying on... ;-)

If changes are large, we can then just make these "new" customizations, allowing users to slowly migrate to the better (your) implementation...

Is that in line with what you were thinking?

simsong · 2022-07-10T15:59:34Z

I need to have a better idea of how the internal representations work. At very least, the program should not be outputting invalid BiBTeX. The only way to truly validate it is to run a LaTeX, BiBTeX, LaTeX cycle, so I think that needs to be added to the tests. I can do that with a separate issue and PR.

I can't assign things to myself, so please assign #313 to me.

Separately, proper Unicode discipline is to hold all strings as Unicode characters internally and then to transcode them to your representation (in this case, LaTeX-encoded ASCII) on output. I discuss this in an article I wrote for ;login: several years ago, but it's still pretty accurate. See also the Python Unicode How-To.

I'm nor proposing a complete rewrite of the code base here.

I also have limited availability help on this project, but this is important to me for a variety of my tasks, so I'll make improvements to it over the next few weeks and send them to you as PRs.

Thanks!

simsong · 2022-07-10T19:38:43Z

Actually, before I go forward: I see you have a lot of test vectors but do not have the correct outputs from reading the test factors. Do you think this is in oversight?

That sounds strange. Not sure which tests you are referring to. The only possible explanation I can imagine right now, other than an oversight, would be to use these tests as smoke tests, i.e., to "just make sure no exceptions are raised". Note that I joined the project rather recently - so I would have to dig into this.

The tests are here: https://github.com/sciunto-org/python-bibtexparser/tree/master/bibtexparser/tests

The test data are here: https://github.com/sciunto-org/python-bibtexparser/tree/master/bibtexparser/tests/data

Note that the test files consist of BiBTeX files that are parsed.

Consider the test file string.bib

It was added for #90.

However, I'm unable to find a single unit test that references the file stirng.bib or verifies that db['booktitle']='Proceedings of the {IEEE} Symposium on Security and Privacy'

Also, do you have a theory of operation document? In particular, I want to know if you have a policy on how information is represented in the dictionary for each entry. Is it supposed to be plain Unicode, or is it supposed to have some LaTeX interpretation?

No such document, to the best of my knowledge. From reading the source code, it appears that first, all latex is converted to unicode, and then this unicode is properly escaped.

Right. But it the unicode escaped in the internal Python data structures, or when the bibtex file is output? The later is appropriate.

IMO, in the long term, all these things should be optional, i.e., users should decide if conversions and special escapings are applied - this refactoring should probably only be addressed once the current backlog of issues is resolved :-)

Okay. Why would you want the module to be able to output invalid BiBTeX?

simsong · 2022-07-10T21:16:53Z

Hi. I have a question about this comment:

Just read my comment again, and to be more precise: All of this should of course remain part of the customizations. I think enforcing an encoding change would just break too many applications relying on... ;-)

If changes are large, we can then just make these "new" customizations, allowing users to slowly migrate to the better (your) implementation...

Is that in line with what you were thinking?

What do you mean by "enforcing an encoding" ?

MiWeiss · 2022-07-16T19:24:45Z

Thanks for your comments!

However, I'm unable to find a single unit test that references the file stirng.bib (sic!) or verifies that db['booktitle']='Proceedings of the {IEEE} Symposium on Security and Privacy'

This file, including the above-mentioned verification, are part of this test (link).

Right. But it the unicode escaped in the internal Python data structures, or when the bibtex file is output? The later is appropriate.

This, to some extent, is customizable. See e.g. the convert_to_unicode customizer. We have >1200 users with many diverse use cases, and while I do agree with you on what's "appropriate", some of them rely on receiving escaped strings. We want to be backwards compatible, hence we cannot aggressively change "uncustomizable" parts of the code. Proposing new and better customizations thus appears to be the most logical way to go. We can describe them in the docs as the best way to use the library, without breaking anything downstream for our existing users.

Your remaining two questions should also be answered by the above, right?

Of course, if we identify clear bugs (such as the dollar sign mentioned in the title), we still need a fix. But if I did not misunderstand you above, you were proposing to implement a whole new "bibtex to unicode" conversion, thus going beyond a small fix. Please don't hesitate to correct me if I misunderstood your intentions.

fgrosshans · 2023-03-25T13:11:27Z

I guess the following is an instance of the same bug. Or is it worth opening a new one ?

>>>` import bibtexparser
>>> bibtexparser.latexenc.latex_to_unicode(r"V\'ictor, V\'ictor")
'Víctor, V̧́tor'
>>> #Expected output : 'Víctor, Víctor'
>>> bibtexparser.__version__
'1.1.0'

MiWeiss · 2023-03-25T19:45:02Z

Hi @fgrosshans

Thanks. I guess it could be seen as a separate bug, but I think your intuition is right: The conceptional fix, by just releasing a new customization would have to be aiming to fix both of these points - and there's a very tiny chance that they get fixed individually. Hence, I guess, you can leave this here. 👍

MiWeiss · 2023-05-26T14:07:36Z

The initially reported problem (escaped dollar sign) is fixed in v2

Test case: "I payed $10" is converted to "I payed $10" by applying the latex cecoding middleware

MiWeiss added bug good first issue needs help labels Jul 9, 2022

MiWeiss assigned simsong Jul 10, 2022

MiWeiss removed good first issue needs help labels Jul 10, 2022

MiWeiss mentioned this issue Jul 10, 2022

Failing in converting strings to unicode #272

Closed

MiWeiss closed this as completed May 26, 2023

MiWeiss added the fixed in v2 label May 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bibtextparser does not properly handle escaped dollar signs in input file #264

bibtextparser does not properly handle escaped dollar signs in input file #264

simsong commented Aug 3, 2020

MiWeiss commented Jul 9, 2022

simsong commented Jul 9, 2022

MiWeiss commented Jul 10, 2022

simsong commented Jul 10, 2022

simsong commented Jul 10, 2022

MiWeiss commented Jul 10, 2022 •

edited

Loading

MiWeiss commented Jul 10, 2022

MiWeiss commented Jul 10, 2022

simsong commented Jul 10, 2022

simsong commented Jul 10, 2022

simsong commented Jul 10, 2022

MiWeiss commented Jul 16, 2022

fgrosshans commented Mar 25, 2023

MiWeiss commented Mar 25, 2023

MiWeiss commented May 26, 2023

bibtextparser does not properly handle escaped dollar signs in input file #264

bibtextparser does not properly handle escaped dollar signs in input file #264

Comments

simsong commented Aug 3, 2020

MiWeiss commented Jul 9, 2022

simsong commented Jul 9, 2022

MiWeiss commented Jul 10, 2022

simsong commented Jul 10, 2022

simsong commented Jul 10, 2022

MiWeiss commented Jul 10, 2022 • edited Loading

MiWeiss commented Jul 10, 2022

MiWeiss commented Jul 10, 2022

simsong commented Jul 10, 2022

simsong commented Jul 10, 2022

simsong commented Jul 10, 2022

MiWeiss commented Jul 16, 2022

fgrosshans commented Mar 25, 2023

MiWeiss commented Mar 25, 2023

MiWeiss commented May 26, 2023

MiWeiss commented Jul 10, 2022 •

edited

Loading