-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bibtextparser does not properly handle escaped dollar signs in input file #264
Comments
Successfully reproduced:
Someone should have a good look at We cannot be the first ones with this problems. E.g., a quick google search pointed to this lib which may be helpful. @simsong , or anyone else, would you be willing to look into this and attempt a fix of the method? |
Hi. I have lots of code that I've written for automatically escaping LaTeX code. I like this a lot: https://github.com/simsong/ctools/blob/5e2387668852c5858367f6e8c2b75e06bba5839e/latex_tools.py#L129 |
Sounds great. Would you be able to work that into a fix? Unicode to Latex and vice versa in the current code has quite some technical dept and it would be great if someone with experience on the problem could take it on :-) |
Sure thing. Let me try to do it sometime today. I didn’t know how open you would be to pull request. |
Actually, before I go forward: I see you have a lot of test vectors but do not have the correct outputs from reading the test factors. Do you think this is in oversight? Also, do you have a theory of operation document? In particular, I want to know if you have a policy on how information is represented in the dictionary for each entry. Is it supposed to be plain Unicode, or is it supposed to have some LaTeX interpretation? Finally, I think it would be useful for your regression tests to also run bibtex and LaTeX on the code produced to make sure it is valid. Do they do this? If not, would that be useful for me to add? |
That sounds strange. Not sure which tests you are referring to. The only possible explanation I can imagine right now, other than an oversight, would be to use these tests as smoke tests, i.e., to "just make sure no exceptions are raised". Note that I joined the project rather recently - so I would have to dig into this.
No such document, to the best of my knowledge. From reading the source code, it appears that first, all latex is converted to unicode, and then this unicode is properly escaped. IMO, in the long term, all these things should be optional, i.e., users should decide if conversions and special escapings are applied - this refactoring should probably only be addressed once the current backlog of issues is resolved :-)
Love the idea. But this should be in a separate PR, and outside of the current unit tests. Kind of an "integration test suite", which we run on PRs and releases - while keeping the unit test suite simple and fast. Edit: Might be worth also looking at the "convert_to_unicode" customization. |
Just read my comment again, and to be more precise: All of this should of course remain part of the customizations. I think enforcing an encoding change would just break too many applications relying on... ;-) If changes are large, we can then just make these "new" customizations, allowing users to slowly migrate to the better (your) implementation... Is that in line with what you were thinking? |
I need to have a better idea of how the internal representations work. At very least, the program should not be outputting invalid BiBTeX. The only way to truly validate it is to run a LaTeX, BiBTeX, LaTeX cycle, so I think that needs to be added to the tests. I can do that with a separate issue and PR. I can't assign things to myself, so please assign #313 to me. Separately, proper Unicode discipline is to hold all strings as Unicode characters internally and then to transcode them to your representation (in this case, LaTeX-encoded ASCII) on output. I discuss this in an article I wrote for ;login: several years ago, but it's still pretty accurate. See also the Python Unicode How-To. I'm nor proposing a complete rewrite of the code base here. I also have limited availability help on this project, but this is important to me for a variety of my tasks, so I'll make improvements to it over the next few weeks and send them to you as PRs. Thanks! |
The tests are here: https://github.com/sciunto-org/python-bibtexparser/tree/master/bibtexparser/tests The test data are here: https://github.com/sciunto-org/python-bibtexparser/tree/master/bibtexparser/tests/data Note that the test files consist of BiBTeX files that are parsed. Consider the test file string.bib It was added for #90. However, I'm unable to find a single unit test that references the file
Right. But it the unicode escaped in the internal Python data structures, or when the bibtex file is output? The later is appropriate.
Okay. Why would you want the module to be able to output invalid BiBTeX? |
Hi. I have a question about this comment:
What do you mean by "enforcing an encoding" ? |
Thanks for your comments!
This file, including the above-mentioned verification, are part of this test (link).
This, to some extent, is customizable. See e.g. the Your remaining two questions should also be answered by the above, right? Of course, if we identify clear bugs (such as the dollar sign mentioned in the title), we still need a fix. But if I did not misunderstand you above, you were proposing to implement a whole new "bibtex to unicode" conversion, thus going beyond a small fix. Please don't hesitate to correct me if I misunderstood your intentions. |
I guess the following is an instance of the same bug. Or is it worth opening a new one ?
|
Hi @fgrosshans Thanks. I guess it could be seen as a separate bug, but I think your intuition is right: The conceptional fix, by just releasing a new customization would have to be aiming to fix both of these points - and there's a very tiny chance that they get fixed individually. Hence, I guess, you can leave this here. 👍 |
The initially reported problem (escaped dollar sign) is fixed in Test case: "I payed $10" is converted to "I payed $10" by applying the latex cecoding middleware |
When it reads this record:
bibtexparser writes it this way:
This breaks the file!
Here is my demo program:
The text was updated successfully, but these errors were encountered: