Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bibtextparser does not properly handle escaped dollar signs in input file #264

Closed
simsong opened this issue Aug 3, 2020 · 15 comments
Closed
Assignees

Comments

@simsong
Copy link
Contributor

simsong commented Aug 3, 2020

When it reads this record:

@book{bug1,
 title="Demonstrate a bug. A \$1 billion!",
 author="Dr. Evil",
 year=2000
}

bibtexparser writes it this way:

@book{bug1,
 author = {Dr. Evil},
 title = {{D}emonstrate a bug. {A} \textbackslash \textdollar 1 billion!},
 year = {2000}
}

This breaks the file!

Here is my demo program:

import sys
import io
import bibtexparser
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import homogenize_latex_encoding

BUG1 = """
@book{bug1,
 title="Demonstrate a bug. A \$1 billion!",
 author="Dr. Evil",
 year=2000
}
"""

if __name__=="__main__":
    bugfile = io.StringIO(BUG1)
    parser = BibTexParser()
    parser.customization = homogenize_latex_encoding
    bib_database = bibtexparser.load(bugfile, parser=parser)

    print("INPUT:")
    print(BUG1)
    print("OUTPUT:")
    bibtexparser.dump(bib_database, sys.stdout)
@MiWeiss
Copy link
Collaborator

MiWeiss commented Jul 9, 2022

Successfully reproduced:

Someone should have a good look at latex_to_unicode - this does not appear to be the only problem. E.g. one line calls replace("{", "") only for the next line to check if there is still a } present in the same string.

We cannot be the first ones with this problems. E.g., a quick google search pointed to this lib which may be helpful.

@simsong , or anyone else, would you be willing to look into this and attempt a fix of the method?

@simsong
Copy link
Contributor Author

simsong commented Jul 9, 2022

Hi. I have lots of code that I've written for automatically escaping LaTeX code.

I like this a lot:

https://github.com/simsong/ctools/blob/5e2387668852c5858367f6e8c2b75e06bba5839e/latex_tools.py#L129

@MiWeiss
Copy link
Collaborator

MiWeiss commented Jul 10, 2022

Sounds great. Would you be able to work that into a fix? Unicode to Latex and vice versa in the current code has quite some technical dept and it would be great if someone with experience on the problem could take it on :-)

@simsong
Copy link
Contributor Author

simsong commented Jul 10, 2022

Sure thing. Let me try to do it sometime today. I didn’t know how open you would be to pull request.

@simsong
Copy link
Contributor Author

simsong commented Jul 10, 2022

Actually, before I go forward: I see you have a lot of test vectors but do not have the correct outputs from reading the test factors. Do you think this is in oversight?

Also, do you have a theory of operation document? In particular, I want to know if you have a policy on how information is represented in the dictionary for each entry. Is it supposed to be plain Unicode, or is it supposed to have some LaTeX interpretation?

Finally, I think it would be useful for your regression tests to also run bibtex and LaTeX on the code produced to make sure it is valid. Do they do this? If not, would that be useful for me to add?

@MiWeiss
Copy link
Collaborator

MiWeiss commented Jul 10, 2022

Actually, before I go forward: I see you have a lot of test vectors but do not have the correct outputs from reading the test factors. Do you think this is in oversight?

That sounds strange. Not sure which tests you are referring to. The only possible explanation I can imagine right now, other than an oversight, would be to use these tests as smoke tests, i.e., to "just make sure no exceptions are raised". Note that I joined the project rather recently - so I would have to dig into this.

Also, do you have a theory of operation document? In particular, I want to know if you have a policy on how information is represented in the dictionary for each entry. Is it supposed to be plain Unicode, or is it supposed to have some LaTeX interpretation?

No such document, to the best of my knowledge. From reading the source code, it appears that first, all latex is converted to unicode, and then this unicode is properly escaped.
IMHO: There's quite some technical dept in that part of the code - and it is one of our largest sources of errors. The ones who wrote it are nowhere to be asked ;-) Hence it's probably good if you think about a sustainable solution in the "bigger picture", even if that means that we have to change the writer here and there. What I would optimally expect is just nice, plain unicode in the parsed entries, with a (optional) backwards escaping when writing back to the bibtex file.

IMO, in the long term, all these things should be optional, i.e., users should decide if conversions and special escapings are applied - this refactoring should probably only be addressed once the current backlog of issues is resolved :-)

Finally, I think it would be useful for your regression tests to also run bibtex and LaTeX on the code produced to make sure it is valid. Do they do this? If not, would that be useful for me to add?

Love the idea. But this should be in a separate PR, and outside of the current unit tests. Kind of an "integration test suite", which we run on PRs and releases - while keeping the unit test suite simple and fast.

Edit: Might be worth also looking at the "convert_to_unicode" customization.

@MiWeiss
Copy link
Collaborator

MiWeiss commented Jul 10, 2022

Related issues (might or might not be fixed by your changes):
#272
#274

@MiWeiss
Copy link
Collaborator

MiWeiss commented Jul 10, 2022

Just read my comment again, and to be more precise: All of this should of course remain part of the customizations. I think enforcing an encoding change would just break too many applications relying on... ;-)

If changes are large, we can then just make these "new" customizations, allowing users to slowly migrate to the better (your) implementation...

Is that in line with what you were thinking?

@simsong
Copy link
Contributor Author

simsong commented Jul 10, 2022

I need to have a better idea of how the internal representations work. At very least, the program should not be outputting invalid BiBTeX. The only way to truly validate it is to run a LaTeX, BiBTeX, LaTeX cycle, so I think that needs to be added to the tests. I can do that with a separate issue and PR.

I can't assign things to myself, so please assign #313 to me.

Separately, proper Unicode discipline is to hold all strings as Unicode characters internally and then to transcode them to your representation (in this case, LaTeX-encoded ASCII) on output. I discuss this in an article I wrote for ;login: several years ago, but it's still pretty accurate. See also the Python Unicode How-To.

I'm nor proposing a complete rewrite of the code base here.

I also have limited availability help on this project, but this is important to me for a variety of my tasks, so I'll make improvements to it over the next few weeks and send them to you as PRs.

Thanks!

@simsong
Copy link
Contributor Author

simsong commented Jul 10, 2022

Actually, before I go forward: I see you have a lot of test vectors but do not have the correct outputs from reading the test factors. Do you think this is in oversight?

That sounds strange. Not sure which tests you are referring to. The only possible explanation I can imagine right now, other than an oversight, would be to use these tests as smoke tests, i.e., to "just make sure no exceptions are raised". Note that I joined the project rather recently - so I would have to dig into this.

The tests are here: https://github.com/sciunto-org/python-bibtexparser/tree/master/bibtexparser/tests

The test data are here: https://github.com/sciunto-org/python-bibtexparser/tree/master/bibtexparser/tests/data

Note that the test files consist of BiBTeX files that are parsed.

Consider the test file string.bib

It was added for #90.

However, I'm unable to find a single unit test that references the file stirng.bib or verifies that db['booktitle']='Proceedings of the {IEEE} Symposium on Security and Privacy'

Also, do you have a theory of operation document? In particular, I want to know if you have a policy on how information is represented in the dictionary for each entry. Is it supposed to be plain Unicode, or is it supposed to have some LaTeX interpretation?

No such document, to the best of my knowledge. From reading the source code, it appears that first, all latex is converted to unicode, and then this unicode is properly escaped.

Right. But it the unicode escaped in the internal Python data structures, or when the bibtex file is output? The later is appropriate.

IMO, in the long term, all these things should be optional, i.e., users should decide if conversions and special escapings are applied - this refactoring should probably only be addressed once the current backlog of issues is resolved :-)

Okay. Why would you want the module to be able to output invalid BiBTeX?

@simsong
Copy link
Contributor Author

simsong commented Jul 10, 2022

Hi. I have a question about this comment:

Just read my comment again, and to be more precise: All of this should of course remain part of the customizations. I think enforcing an encoding change would just break too many applications relying on... ;-)

If changes are large, we can then just make these "new" customizations, allowing users to slowly migrate to the better (your) implementation...

Is that in line with what you were thinking?

What do you mean by "enforcing an encoding" ?

@MiWeiss
Copy link
Collaborator

MiWeiss commented Jul 16, 2022

Thanks for your comments!

However, I'm unable to find a single unit test that references the file stirng.bib (sic!) or verifies that db['booktitle']='Proceedings of the {IEEE} Symposium on Security and Privacy'

This file, including the above-mentioned verification, are part of this test (link).

Right. But it the unicode escaped in the internal Python data structures, or when the bibtex file is output? The later is appropriate.

This, to some extent, is customizable. See e.g. the convert_to_unicode customizer. We have >1200 users with many diverse use cases, and while I do agree with you on what's "appropriate", some of them rely on receiving escaped strings. We want to be backwards compatible, hence we cannot aggressively change "uncustomizable" parts of the code. Proposing new and better customizations thus appears to be the most logical way to go. We can describe them in the docs as the best way to use the library, without breaking anything downstream for our existing users.

Your remaining two questions should also be answered by the above, right?

Of course, if we identify clear bugs (such as the dollar sign mentioned in the title), we still need a fix. But if I did not misunderstand you above, you were proposing to implement a whole new "bibtex to unicode" conversion, thus going beyond a small fix. Please don't hesitate to correct me if I misunderstood your intentions.

@fgrosshans
Copy link

I guess the following is an instance of the same bug. Or is it worth opening a new one ?

>>>` import bibtexparser
>>> bibtexparser.latexenc.latex_to_unicode(r"V\'ictor, V\'ictor")
'Víctor, V̧́tor'
>>> #Expected output : 'Víctor, Víctor'
>>> bibtexparser.__version__
'1.1.0'

@MiWeiss
Copy link
Collaborator

MiWeiss commented Mar 25, 2023

Hi @fgrosshans

Thanks. I guess it could be seen as a separate bug, but I think your intuition is right: The conceptional fix, by just releasing a new customization would have to be aiming to fix both of these points - and there's a very tiny chance that they get fixed individually. Hence, I guess, you can leave this here. 👍

@MiWeiss
Copy link
Collaborator

MiWeiss commented May 26, 2023

The initially reported problem (escaped dollar sign) is fixed in v2

Test case: "I payed $10" is converted to "I payed $10" by applying the latex cecoding middleware

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants