Gaps in documentation #12

Open
lukavdplas opened this issue Mar 5, 2024 · 1 comment
Labels
documentation Improvements or additions to documentation help wanted Extra attention is needed

Comments

@lukavdplas
Contributor

I've tried to document the package as best I can in #11, but sometimes my own understanding falls short.

Here are some questions I had that the documentation should still answer:

  • How does the external_file option in the XML extractor work? Can we include an example?
  • The same goes for the secondary_tag option in the XML extractor. How can this be used?
  • I understand how to use the ExternalFile extractor, but not why. Since the file has to be specified during metadata extraction, why not just read the file at that stage? Can we add an example where this would be useful?

Lastly, the FilterAttribute extractor (a subclass of XML) sounds straightforward, but is there a difference between these two extractors?

extractor_1 = XML({'foo': 'bar'})
extractor_2 = FilterAttribute({'attribute': 'foo', 'value': 'bar'})

If so, what is it?

@lukavdplas lukavdplas added help wanted Extra attention is needed documentation Improvements or additions to documentation labels Mar 5, 2024
@BeritJanssen
Member

BeritJanssen commented Jul 3, 2024

The external_file argument was introduced for the sake of the dutchnewspapers corpora by some green programmer in the distant past. Each of these corpora has one .xml file per newspaper, containing the ids of every article, and then one .xml file for each article in the newspaper. There is quite possibly a more elegant way to achieve this: retrieving the bulk of the information for a document from one file, but other fine-grained information (url, category, ocr confidence) from another.
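
To make the two-file layout concrete, here is a minimal sketch of the pattern using only Python's standard library, not the package's own extractors. The element names, ids, and values are all made up for illustration and may not match the actual dutchnewspapers files.

```python
import xml.etree.ElementTree as ET

# Hypothetical newspaper-level file: one entry per article id,
# carrying the fine-grained metadata (url, category, ocr confidence).
NEWSPAPER_XML = """
<newspaper>
  <article id="a1">
    <url>https://example.org/a1</url>
    <category>news</category>
    <ocr>0.93</ocr>
  </article>
  <article id="a2">
    <url>https://example.org/a2</url>
    <category>weather</category>
    <ocr>0.87</ocr>
  </article>
</newspaper>
""".strip()

# Hypothetical article-level file: the main content of one document.
ARTICLE_XML = """
<article id="a2">
  <title>Weather report</title>
  <text>Rain is expected.</text>
</article>
""".strip()


def extract_document(article_xml: str, newspaper_xml: str) -> dict:
    """Combine fields from the article file with fields from the newspaper file."""
    article = ET.fromstring(article_xml)
    article_id = article.get("id")

    # Main information comes from the article's own file.
    document = {
        "id": article_id,
        "title": article.findtext("title"),
        "content": article.findtext("text"),
    }

    # Fine-grained information is looked up by id in the "external" newspaper file.
    index = ET.fromstring(newspaper_xml)
    entry = index.find(f"./article[@id='{article_id}']")
    if entry is not None:
        document["url"] = entry.findtext("url")
        document["category"] = entry.findtext("category")
        document["ocr_confidence"] = float(entry.findtext("ocr"))
    return document


document = extract_document(ARTICLE_XML, NEWSPAPER_XML)
```

In this sketch the article file is the primary source and the newspaper file plays the role of the external file; how the external_file argument resolves the second file and matches entries in practice is not shown here.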
