Splitter interprets any @ sign as new block start #488

Open
1 of 2 tasks
juthilo opened this issue Jul 26, 2024 · 0 comments

Describe the bug
The splitter's methods _move_to_comma_or_closing_curly_bracket and _move_to_closed_bracket each contain a check for unexpected block starts. Unfortunately, this interferes with the parsing of entries that contain the @ sign as raw text.

Reproducing

Version: 2.0.0b7

Code:
Parsing this example fails because of the @ in the title: a BlockAbortedException is raised and the block is added to failed_blocks.

import bibtexparser

test = bibtexparser.parse_string(
    """
    @inproceedings{DBLP:conf/cikm/EsuliM021,
      author       = {Andrea Esuli and Alejandro Moreo and Fabrizio Sebastiani},
      editor       = {Gao Cong and Maya Ramanath},
      title        = {LeQua @ {CLEF} 2022: {A} Shared Task for Evaluating Quantification Systems},
      booktitle    = {Proceedings of the {CIKM} 2021 Workshops co-located with 30th {ACM}
                      International Conference on Information and Knowledge Management {(CIKM}
                      2021), Gold Coast, Queensland, Australia, November 1-5, 2021},
      series       = {{CEUR} Workshop Proceedings},
      volume       = {3052},
      publisher    = {CEUR-WS.org},
      year         = {2021},
      url          = {https://ceur-ws.org/Vol-3052/abstract4.pdf},
      timestamp    = {Fri, 10 Mar 2023 16:22:33 +0100},
      biburl       = {https://dblp.org/rec/conf/cikm/EsuliM021.bib},
      bibsource    = {dblp computer science bibliography, https://dblp.org}
    }
    """
)
print(test.entries_dict['DBLP:conf/cikm/EsuliM021'])

Bibtex:

@inproceedings{DBLP:conf/cikm/EsuliM021,
      author       = {Andrea Esuli and Alejandro Moreo and Fabrizio Sebastiani},
      editor       = {Gao Cong and Maya Ramanath},
      title        = {LeQua @ {CLEF} 2022: {A} Shared Task for Evaluating Quantification Systems},
      booktitle    = {Proceedings of the {CIKM} 2021 Workshops co-located with 30th {ACM}
                      International Conference on Information and Knowledge Management {(CIKM}
                      2021), Gold Coast, Queensland, Australia, November 1-5, 2021},
      series       = {{CEUR} Workshop Proceedings},
      volume       = {3052},
      publisher    = {CEUR-WS.org},
      year         = {2021},
      url          = {https://ceur-ws.org/Vol-3052/abstract4.pdf},
      timestamp    = {Fri, 10 Mar 2023 16:22:33 +0100},
      biburl       = {https://dblp.org/rec/conf/cikm/EsuliM021.bib},
      bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Workaround
Monkey-patching the two methods by removing the @ check leads to a successful parse.

Remaining Questions (Optional)

  • I would be willing to contribute a PR to fix this issue.
  • This issue is a blocker, I'd be grateful for an early fix.

A comment in the code says that new blocks are identified by appearing after a newline. If that assumption is generally safe, I could remove the two checks altogether. The only other solution I can think of is replacing the "@" check with a match against a tuple of the most common entry types, e.g. startswith(("@article", "@book", "@proceedings", ...)). Let me know whether one of these works and I'll gladly prepare a PR.
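
To make the two options concrete, here is a minimal standalone sketch of both heuristics. It does not use or reproduce the splitter's actual code: the function names, the `KNOWN_BLOCK_TYPES` tuple, and the sample string are all hypothetical and only illustrate the idea.

```python
# Hypothetical whitelist for illustration; any real list would be incomplete.
KNOWN_BLOCK_TYPES = (
    "@article", "@book", "@inproceedings", "@proceedings",
    "@string", "@preamble", "@comment",
)


def at_starts_line(text: str, pos: int) -> bool:
    """Option 1: treat '@' as a block start only if nothing but
    whitespace precedes it on the same line."""
    if text[pos] != "@":
        return False
    line_start = text.rfind("\n", 0, pos) + 1  # 0 when there is no newline
    return text[line_start:pos].strip() == ""


def at_starts_known_type(text: str, pos: int) -> bool:
    """Option 2: treat '@' as a block start only if it begins one of
    the whitelisted block types (case-insensitive)."""
    return text[pos:pos + 20].lower().startswith(KNOWN_BLOCK_TYPES)


sample = (
    "@misc{a,\n"
    "  title = {LeQua @ {CLEF} 2022},\n"
    "}\n"
    "@book{b,\n"
    "  year = {2021}\n"
    "}"
)
positions = [i for i, ch in enumerate(sample) if ch == "@"]
print([at_starts_line(sample, p) for p in positions])
print([at_starts_known_type(sample, p) for p in positions])
```

On this sample, both heuristics ignore the @ inside the title. The whitelist also rejects the `@misc` entry because it is not in the tuple, which is the main drawback of that approach. The line-start check would in turn misfire if a wrapped field value happened to continue on a new line beginning with @.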
