More flexible handling of case sensitivity in all keys #477

Open
kmccurley opened this issue Mar 31, 2024 · 3 comments

@kmccurley

Is your feature request related to a problem? Please describe.
The bibtex file format is ill-defined when it comes to case sensitivity on keys. This is NOT a duplicate of #453, because that only talks about entry types.

There is a great deal of confusion about case sensitivity in bibtex. This applies to:

  1. entry types
  2. entry keys
  3. field keys

There are also different use cases for bibtexparser. Some people want to use it to parse a bibtex file and get the same thing back when they print it out. Others want bibtexparser to parse in a way that is close to the behavior of some other tool that parses bibtex files (notably the bibtex binary and the biblatex package). This is part of the problem, because different processing tools exhibit different behavior when they encounter keys that agree in lower case.

For example, consider the following LaTeX file:

\begin{filecontents}[overwrite]{the.bib}
@misc{CamelCase,
  author = {Fester Bestertester},
  Title = {What happens to this title?},
  title = {This has a camel case key},
}
@misc{camelcase,
  author = {Foster Bostertoster},
  title = {This has a lower case key}
}
@misc{another, 
 aUTHOR = {Anthony Ordinary},
 title = {Just a third entry},
}
\end{filecontents}
%%%%%%%%%%%%%%%%%%%
\documentclass{article}
% uncomment this out and use biber to see the difference. You will have to remove main.bbl and main.aux first.
%\usepackage{biblatex}
\usepackage{hyperref}
\IfPackageLoadedTF{biblatex}{\addbibresource{the.bib}}{\bibliographystyle{alpha}}
\begin{document}
I don't have much to say.
\cite{camelcase} and \cite{CamelCase} and \cite{another}.
\IfPackageLoadedTF{biblatex}{\printbibliography}{\bibliography{the}}
\end{document}

This example can be used to illustrate the difference between bibtex and biblatex. If you process this with the bibtex binary, it reports the following errors and warning:

Case mismatch error between cite keys CamelCase and camelcase
---line 7 of file main.aux
 : \citation{CamelCase
 :                    }
I'm skipping whatever remains of this command
Database file #1: the.bib
Warning--I'm ignoring camelcase's extra "title" field
--line 8 of file the.bib
Repeated entry---line 11 of file the.bib
 : @misc{camelcase
 :                ,
I'm skipping whatever remains of this entry

If you view the PDF, you will see that bibtex took the first Title field and dropped the second title field. It also dropped the second camelcase entry, producing an undefined reference. Hence you may consider the bibtex binary to treat both entry keys and field keys as case-insensitive. From my observation of author behavior, about 90% use bibtex, and maybe 10% use biblatex. Since the bibtex file format was originally bundled with the bibtex binary, I consider this to be the proper interpretation of case sensitivity, but others may disagree.

Now consider the case of biblatex. Uncomment the line to load biblatex, remove main.aux and main.bbl, and run pdflatex main;biber main;pdflatex main;pdflatex main. The resulting PDF file contains three references, and the first reference takes the second title "This has a camel case key".

The decades-long problem here is that the syntax of the original bibtex file format was never really defined (and it's still on version 0.99d). There are various tools to parse and handle such files, but they behave differently because they interpret the file format differently. You could argue that both bibtex and biblatex treat entry keys and field keys as lower case, but they behave differently when they encounter keys that agree in lower case. Perhaps other tools have their own weird behavior based on their own interpretation of the incomplete bibtex file format.
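
The two ways of handling colliding entry keys described above can be sketched in a few lines of Python. This is only an illustration of the observed behavior, not code from bibtex, biber, or bibtexparser; the function name `split_duplicates` and the `fold_case` flag are made up for this sketch.

```python
from typing import List, Tuple

# Entry keys in file order, taken from the example the.bib above.
ENTRY_KEYS = ["CamelCase", "camelcase", "another"]


def split_duplicates(keys: List[str], fold_case: bool) -> Tuple[List[str], List[str]]:
    """Return (kept, dropped) entry keys, keeping the first of each collision.

    fold_case=True  -> keys that agree in lower case collide (bibtex-like).
    fold_case=False -> only identical keys collide (biber-like).
    """
    seen = set()
    kept: List[str] = []
    dropped: List[str] = []
    for key in keys:
        ident = key.lower() if fold_case else key
        if ident in seen:
            dropped.append(key)
        else:
            seen.add(ident)
            kept.append(key)
    return kept, dropped


bibtex_like = split_duplicates(ENTRY_KEYS, fold_case=True)
biber_like = split_duplicates(ENTRY_KEYS, fold_case=False)
print(bibtex_like)  # (['CamelCase', 'another'], ['camelcase'])
print(biber_like)   # (['CamelCase', 'camelcase', 'another'], [])
```

With `fold_case=True` the second `camelcase` entry is dropped, matching the "Repeated entry" message from bibtex; with `fold_case=False` all three entries survive, matching the three references biber produced.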

I came across this problem because I was using bibtexparser to produce an HTML format for the bibtex entries, and I wanted our system to emulate the behavior of both biblatex and bibtex.

The solutions that I came to:

  1. I wrote middleware to convert entry types to lower case, and both bibtex and biblatex do the same. That way it's easier to decide how to format the entries. The only reason I can see to preserve this is if the bibtexparser user is expecting to see the same thing after parsing and writing out again.
  2. Because \cite is case-sensitive, I decided not to convert entry keys to lower case. It appears that bibtexparser.parse_string does not check the case of keys, and only declares a duplicate if the keys match in their original case. The second and subsequent entries with the same key are kept as DuplicateBlockKeyBlocks but are not treated as entries. This is not the same behavior as the bibtex tool, which drops entries if the lower case key is the same as something already seen. It is consistent with how biber parses the entries.
  3. I wrote middleware to convert field keys to lower case (it's easier to look them up that way, and both biblatex and bibtex treat them as such). There is a question as to whether to take the first or last field encountered when there are duplicates, and it depends on whether you are trying to mimic bibtex or biblatex (or something else). I see no reason to keep both 'title' and 'Title' field keys, but this depends on the use case. I use a flag in the constructor to choose between "keep all", "keep first", or "keep last".
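
The "keep all" / "keep first" / "keep last" choice in point 3 could look roughly like the following. This is a minimal sketch on plain `(key, value)` pairs rather than my actual middleware or any bibtexparser class; the name `normalize_field_keys` and the `on_collision` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str]  # (field key, value), in file order

POLICIES = {"keep_all", "keep_first", "keep_last"}


def normalize_field_keys(fields: List[Field], on_collision: str = "keep_first") -> List[Field]:
    """Lower-case all field keys and resolve collisions by the given policy."""
    if on_collision not in POLICIES:
        raise ValueError(f"unknown policy: {on_collision!r}")
    lowered = [(key.lower(), value) for key, value in fields]
    if on_collision == "keep_all":
        return lowered
    result: Dict[str, str] = {}
    for key, value in lowered:
        if key in result and on_collision == "keep_first":
            continue  # bibtex-like: ignore later duplicates
        # keep_last: overwrite the value; the key keeps its first position
        result[key] = value
    return list(result.items())


# Fields of the CamelCase entry from the example above.
camel_fields = [
    ("author", "Fester Bestertester"),
    ("Title", "What happens to this title?"),
    ("title", "This has a camel case key"),
]
first = normalize_field_keys(camel_fields, "keep_first")
last = normalize_field_keys(camel_fields, "keep_last")
print(dict(first)["title"])  # What happens to this title?
print(dict(last)["title"])   # This has a camel case key
```

Here "keep_first" reproduces what the bibtex binary did with the example, and "keep_last" reproduces the biber result.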

We are using bibtexparser in a system to process latex+bibtex that is uploaded by authors. Our system uses bibexport to extract the entries that are actually cited, and this uses the bibtex binary in the script. This tool only works if the authors use the bibtex tool, since it looks in the .aux file for \bibcite. In order to get around this for authors who use biblatex, our system creates an artificial .aux file that looks like it was produced by the bibtex tool, and we process that with bibexport so that it can extract the entries. Of course biber and bibtex treat duplicate keys differently, so this will fail if authors depend on the biber behavior to save entries with keys that collide in lower case with others.

Describe the solution you'd like

The bottom line here is that software tools that handle the bibtex file format are inconsistent in how they treat keys. It seems useful to offer options for bibtexparser to emulate the behavior of other tools that process the bibtex file format. This can be customized by the use of middleware, and it might be useful to have additional standard middleware classes to support the different behaviors required. It also seems like a replacement for the bibtex file format is long overdue. There are too many nonstandard entry types and field types. It's probably too late to fix the definition of the bibtex file format unless we add something like @version at the beginning of the file to say what tools the file is intended to be processed with. I don't think that's the job of bibtexparser though, unless it is used in a tool to replace the bibtex or biber tools.

I would be willing to contribute a PR to offer other middleware to handle these cases.

@MiWeiss
Collaborator

MiWeiss commented Apr 2, 2024

Hey @kmccurley

First and foremost, thanks a lot for opening the issue, for your detailed research into the matter, and for your willingness to contribute a PR.

A couple of (unordered) notes / thoughts from my side:

  1. Note that as of very recently, on the main branch, there's a new NormalizeFieldKeys. It is not yet part of a tagged/released version and is likely subject to future change, but it should partially support your use case. Instructions for installing from the main branch are in the readme.
  2. I fully agree with you that our users may expect different behaviors (some want to extract keys and types case sensitive, as they are, others want to normalize them). Thus, we should indeed make this configurable and a middleware is certainly the way to go.
  3. You remark that we create a DuplicateBlockKeyBlocks instead of dropping one item. That's on purpose, as it makes potential errors in the bibfile explicit and allows more general use cases: after all, it's easy to drop all DuplicateBlockKeyBlocks ex-post, while retrieving previously deleted entries is not possible. Note that by calling failed_block.ignore_error_block you get the duplicate block as entry, while by calling failed_block.previous_block you get the originally parsed block with the same key.
  4. Our docs foresee a listing of great community middleware here. I believe that your use case is such a valid one that we may actually want to merge that into our main branch and make it part of the middlewares maintained and shipped directly in this project. I'm still mentioning the community middleware in case you have implemented more that you'd want to share but don't think it belongs in this repo :-)

Possible solution:

  1. We create two additional middlewares in line with NormalizeFieldKeys, named NormalizeEntryTypes and NormalizeEntryKeys?
  2. Rename the NormalizeFieldKeys to NormalizeKeys and allow boolean constructor arguments to specify which keys: FieldKeys (default true), EntryKeys (default false, as \cite is case-sensitive) and EntryType (default true).

The advantage of (1) would be that it's very obvious what's going to be done. The advantage of (2) is that it's less boilerplate code and also somewhat better regarding performance (we only have to loop over each block once instead of three times).
I'd tend to option (2), but I'd be interested to hear your opinion, and also comments from @tdegeus if he's available (he has been quite involved in the merging of the NormalizeFieldKeys middleware).
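
To make option (2) a bit more concrete, here is a rough sketch of a single-pass normalizer with three boolean switches. It is deliberately written against a stand-in `Entry` dataclass instead of the real bibtexparser model and middleware base classes, and the class name, the snake_case argument names, and the `transform` method are all assumptions made for illustration. Colliding field keys simply resolve to the last one seen in this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    """Stand-in for a parsed entry; not bibtexparser's own model class."""
    entry_type: str
    key: str
    fields: Dict[str, str] = field(default_factory=dict)


class NormalizeKeys:
    """Hypothetical normalizer following option (2): one loop, three switches."""

    def __init__(self, field_keys: bool = True, entry_keys: bool = False,
                 entry_types: bool = True) -> None:
        self.field_keys = field_keys
        self.entry_keys = entry_keys  # off by default, since \cite is case-sensitive
        self.entry_types = entry_types

    def transform(self, entries: List[Entry]) -> List[Entry]:
        out: List[Entry] = []
        for entry in entries:  # each entry is visited exactly once
            entry_type = entry.entry_type.lower() if self.entry_types else entry.entry_type
            key = entry.key.lower() if self.entry_keys else entry.key
            if self.field_keys:
                # On a collision the later field overwrites the earlier one.
                fields = {k.lower(): v for k, v in entry.fields.items()}
            else:
                fields = dict(entry.fields)
            out.append(Entry(entry_type, key, fields))
        return out


sample = [Entry("Misc", "CamelCase", {"Title": "A", "title": "B", "aUTHOR": "C"})]
default_result = NormalizeKeys().transform(sample)
print(default_result[0])
# Entry(entry_type='misc', key='CamelCase', fields={'title': 'B', 'author': 'C'})
```

Passing `entry_keys=True` would additionally fold `CamelCase` to `camelcase`, which is closer to how the bibtex binary compares entry keys.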

@tdegeus
Collaborator

tdegeus commented Apr 3, 2024

Thanks for a very detailed report and willingness to improve bibtexparser. Your suggestions seem fair, @MiWeiss. I guess for simplicity, we could indeed go for 2. The only thing is that I find it confusing to have camel-case options.

@kmccurley
Author

The NormalizeFieldKeys is very close to what I suggested in the third observation, but it makes an arbitrary decision to keep the last seen field key that collides. This is the same behavior as the (less commonly used) biblatex/biber option. By contrast, bibtex keeps the first key encountered. I'd suggest having a parameter in the constructor to choose between keeping the first and last colliding field key, so that the user could choose between bibtex and biber behavior. I don't particularly care which is default (but bibtex behavior is probably more common).

Another very tiny point about NormalizeFieldKeys is that if you use it on an older version of Python (<3.7) then it may reorder fields, because dict has only guaranteed insertion order since version 3.7. That doesn't concern me at all, but others who expect it to make as few modifications as possible might be surprised. I don't think I'd change the code just for this reason though.

I think it's a great idea to keep DuplicateBlockKeyBlocks, but my point was that biber and bibtex tools have different ideas of what constitutes a duplicate block. For some reason bibtex considers it a collision if the lower case entry keys collide, but biber treats them as distinct entries. Your approach seems to follow the biber behavior, but it might be useful to have this configurable. I need a tool that can emulate both biber and bibtex, so I have to use custom middleware to achieve that.

I don't have a strong opinion between your options 1) or 2). I regularly parse a file with tens of thousands of blocks, so I might opt for the higher-performance case.
