Use Case: Aggregate number of pages/canvases across multiple METS derived from search query #25
Comments
These types come from the METS
We currently don't have any statistics from the
to sum up the count of annotation elements. These are logical elements, not necessarily physical pages - mapping those would be outside the scope of this project, I believe. Would this answer the question?
Thanks @mikegerber. @cneud proposed something similar. Counting I assume there is no structured data (i.e. an integer number) for Sorry for the confusion.
(I'm replying to one aspect at a time, to structure this!)
The purpose of mods4pandas is to transform METS XML files into a table (more specifically, a pandas DataFrame; Excel and CSV output were added later as a byproduct) for analysis, as far as that can be done with a hierarchical format like METS / MODS. Here is the latest result for the Digitized Collection of the SBB: https://zenodo.org/record/7716032 Querying an API is not really the purpose of the program. That being said, I could try to figure out which API your query above is using; I honestly don't know at the moment.
Logical elements are not pages at all and don't even map to pages 1:1. For example, in the METS you linked,
Sorry if this seems pedantic; I'm just trying to point out potential misunderstandings because I want to understand the problem better :-)
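To make the logical/physical distinction concrete, here is a minimal sketch in Python that counts both kinds of `div` in a METS document. The inline XML is a made-up toy sample, not the file linked in this issue, and the helper `count_divs` is hypothetical. It assumes the common layout of one `structMap` with `TYPE="PHYSICAL"` whose page-level `div` elements carry `TYPE="page"`; real files may be structured differently.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical toy METS: three physical pages, one logical "annotation".
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page"/>
      <mets:div ID="PHYS_0002" TYPE="page"/>
      <mets:div ID="PHYS_0003" TYPE="page"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="annotation"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def count_divs(mets_xml: str, struct_type: str, div_type: str) -> int:
    """Count div elements of a given TYPE inside the structMap of a given TYPE."""
    root = ET.fromstring(mets_xml)
    path = f"mets:structMap[@TYPE='{struct_type}']//mets:div[@TYPE='{div_type}']"
    return len(root.findall(path, METS_NS))


physical_pages = count_divs(SAMPLE_METS, "PHYSICAL", "page")
annotations = count_divs(SAMPLE_METS, "LOGICAL", "annotation")
print(physical_pages, annotations)
```

The two counts are independent of each other, which is why the number of logical elements says nothing about the number of scanned images.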
TODOs for me:
Sorry for the confusion. So, although I'm interested in the physical pages of books, a digitized book might contain more images (or canvases): e.g. there's an image of the binding (which is not a page), or the same physical page was scanned twice by mistake (so 4 digital images represent only 2 physical pages), etc. I could live with this discrepancy, but I'd like to have an estimate of the number of "digital pages" (in the broad, data-constrained sense sketched above) of, say, all books in the SBB digital collection printed in 1666.
That is easily answered with the file above, here with pandas in Python: https://gist.github.com/mikegerber/72e57c847486163f46de94a71987ef5c But it's a bit unclear which numbers you really want. a. Number of annotations? - This can be implemented and is on the TODO above
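For readers who cannot open the gist, a rough sketch of this kind of count follows. It is not the code from the gist: the DataFrame is a tiny invented stand-in for the published table, and the column names `work_id`, `year_issued` and `n_images` are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the per-work table; all names and values are invented.
works = pd.DataFrame(
    {
        "work_id": ["A", "B", "C", "D"],
        "year_issued": [1666, 1666, 1700, 1666],
        "n_images": [120, 48, 300, 12],
    }
)

# Select the works issued in 1666 and add up their image counts.
in_1666 = works[works["year_issued"] == 1666]
total_images_1666 = int(in_1666["n_images"].sum())

print(f"{len(in_1666)} works, {total_images_1666} images")
```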
"b." was the most pressing issue, and the notebook is already very useful. I will see if I can manage to filter in such a way that it returns the works from my query above. Publication date 1666 was just an example. What I actually need is the total page count of works that have at least one "annotation" recorded, belong to certain subject groups (e.g. "Naturwissenschaft..."), and were published before 1800. The desired count covers all pages of those works, not only the pages that carry an annotation. Numbers for "a." and "c." would be even better, but in practice, I was told, the librarians have not recorded all "annotations", only the first few (<10) to occur in a book. So with the current data this is not overly meaningful (for me), but as the data grows it would be fantastic. Again, thank you so much!
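The combined filter described here could look roughly like the sketch below. It assumes a per-work table that already has one row per work with an annotation count, a subject group and an image count; the column names (`n_annotations`, `subject`, `year_issued`, `n_images`), the helper `total_images`, and all sample rows are hypothetical and would need to be mapped onto the real data.

```python
import pandas as pd

# Hypothetical per-work table; column names and values are invented.
works = pd.DataFrame(
    {
        "work_id": ["A", "B", "C", "D", "E"],
        "year_issued": [1666, 1750, 1790, 1850, 1700],
        "subject": [
            "Naturwissenschaften / Mathematik",
            "Naturwissenschaften / Mathematik",
            "Theologie",
            "Naturwissenschaften / Mathematik",
            "Naturwissenschaften / Mathematik",
        ],
        "n_annotations": [2, 0, 5, 3, 1],
        "n_images": [120, 48, 300, 12, 80],
    }
)


def total_images(df: pd.DataFrame, subjects: list[str], before_year: int) -> int:
    """Sum all images of works that match the three conditions."""
    mask = (
        (df["n_annotations"] >= 1)          # at least one annotation recorded
        & df["subject"].isin(subjects)      # belongs to one of the subject groups
        & (df["year_issued"] < before_year) # published before the cutoff year
    )
    # The sum covers every image of a matching work, not only annotated pages.
    return int(df.loc[mask, "n_images"].sum())


total = total_images(works, ["Naturwissenschaften / Mathematik"], 1800)
print(total)
```

Note that the annotation count is used only to select works; it does not enter the sum.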
How could I get the total number of pages/images/canvases from the METS files of all objects returned in a query like https://digital.staatsbibliothek-berlin.de/suche?queryString=type%3Aannotation%20date_issued%3A%3E1455%20date_issued%3A%3C1800&category=Naturwissenschaften%20%2F%20Mathematik
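To see which filters that URL encodes, it can be decoded with the Python standard library. This sketch only parses the URL string; it does not contact the server and makes no assumption about which API the search page uses.

```python
from urllib.parse import parse_qs, urlsplit

SEARCH_URL = (
    "https://digital.staatsbibliothek-berlin.de/suche"
    "?queryString=type%3Aannotation%20date_issued%3A%3E1455%20date_issued%3A%3C1800"
    "&category=Naturwissenschaften%20%2F%20Mathematik"
)

# parse_qs percent-decodes the values and returns a dict of lists.
params = parse_qs(urlsplit(SEARCH_URL).query)
query_terms = params["queryString"][0].split()
category = params["category"][0]

print(query_terms)
print(category)
```

The decoded terms are `type:annotation`, `date_issued:>1455` and `date_issued:<1800`, restricted to the category `Naturwissenschaften / Mathematik`.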