Curation rate estimation plan

Gaurav Vaidya edited this page Jun 27, 2018 · 13 revisions

This plan documents the methodologies we have used to quantify the rate at which we can curate phyloreferences in order to estimate how long it would take to curate a certain number of phyloreferences.

Current methodology

Prerequisites

  • Choose a set of papers from the list of papers to curate at the bottom of this page.
    • Initially, the curator may choose papers that have already been curated by someone else, since these are the only ones we know can be curated. However, curators must not curate a paper they are already familiar with or have curated before, as that would underestimate the time needed to understand and curate a paper.

Procedure before curation

  • Before starting to time themselves, the curator should confirm that:
    • The paper contains at least one clade definition.
    • The paper contains at least one phylogeny containing all the specifiers for at least one clade definition.

Procedure during curation

  1. Enter the metadata for the study, including title, citation and DOI.
    • The curator should not be documenting bugs while timing themselves. Another person could watch the curator and write down suggested user interface improvements.
  2. Curate the phyloreferences in any order using the Curation Tool.
    • The verbatim clade definition and every verbatim specifier should be entered for all phyloreferences. If possible, scientific names or specimen identifiers should also be added.
    • The verbatim specifier should include all the information included in the original description, including authority information, higher taxonomy, whether the definition points to a taxon or the type specimen of the taxon, and any other included information.
      • For example, when curating "Gnetum gnemon Linnaeus 1767 (Gnetophyta) and Pinus strobus Linnaeus 1753 (Coniferae)", two specifiers should be extracted: "Gnetum gnemon Linnaeus 1767 (Gnetophyta)" and "Pinus strobus Linnaeus 1753 (Coniferae)".
  3. Curate the phylogenies in any order using the Curation Tool.
    • If the reference phylogeny is available in digital format (e.g., a Newick or Nexus file), proceed to upload the phylogeny. If the phylogeny is not available digitally, first write to the author of the paper that publishes the phylogeny. If no response is received, proceed to manually transcribe the phylogeny into a digital format.
    • All phylogenies should be given a descriptive title (e.g., "Fig 3 from the paper", "Downloaded from TreeBase Study S2914", etc.) and should contain a Newick string that matches the phylogeny in the paper as closely as possible.
    • Only the phylogeny that shows where clade definitions are expected to resolve needs to be curated. Other phylogenies may be included if the curator believes that they will help test phyloreference resolution.
    • For each phyloreference, the curator should identify the node it is expected to resolve to, based on where the authors expected their clade definition to resolve on their phylogeny. Any differences should be noted in the curation notes field.
  4. Once all phylogenies and phyloreferences have been added, the curator should go through all phyloreferences to ensure that all specifiers that were expected to match are matching correctly. Additional taxonomic units may need to be added to the phylogeny to ensure that they match.
    • The curator should also ensure that the expected node for each phyloreference is set.
    • When a specifier does not match, the curator must click on the asterisk beside the specifier and set a reason for why this specifier does not match. Usually this will be because it is not present in any phylogeny, but any other reason can be provided here.
  5. Finally, the curator should note that they curated this PHYX file. Until we have a proper way to do this (see phyloref/curation-tool#26), they can leave a note in the curator notes fields for each phyloreference.
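
The check in step 4 can be illustrated outside the Curation Tool. The following is a minimal Python sketch, not the tool's actual matching logic: it assumes specifiers are compared to tip labels by exact name, and it uses a simplified Newick reader that ignores quoted labels containing punctuation. The function names `newick_labels` and `unmatched_specifiers` and the example tree are hypothetical.

```python
import re


def newick_labels(newick: str) -> set[str]:
    """Return the node labels found in a Newick string.

    Simplified on purpose: quoted labels that contain commas,
    parentheses or colons are not handled.
    """
    # Remove branch lengths such as ":0.12" or ":1e-3".
    without_lengths = re.sub(r":[0-9.eE+-]+", "", newick)
    # Labels are whatever remains between the structural characters.
    tokens = re.split(r"[(),;]", without_lengths)
    # In unquoted Newick labels, underscores stand for spaces.
    return {t.strip().strip("'").replace("_", " ") for t in tokens if t.strip()}


def unmatched_specifiers(specifiers: list[str], newick: str) -> list[str]:
    """List the specifier names that do not appear as labels in the phylogeny."""
    labels = newick_labels(newick)
    return [name for name in specifiers if name not in labels]


# Hypothetical example tree and specifier names.
tree = "((Gnetum_gnemon:1.0,Welwitschia_mirabilis:1.0):0.5,Pinus_strobus:1.5);"
missing = unmatched_specifiers(
    ["Gnetum gnemon", "Pinus strobus", "Ginkgo biloba"], tree
)
print(missing)
```

Any specifier reported here corresponds to one for which the curator would need to either add a taxonomic unit to the phylogeny or record a reason for the mismatch.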

Procedure after curation

  • The curator should document:
    • the paper curated,
    • time taken (both including and excluding time taken to obtain digital copies of the phylogenies),
    • number of phyloreferences completely curated,
    • number of phyloreferences incompletely curated, such as where a specifier is not shown on any phylogeny in the paper, and
    • any issues which might have slowed down curation.
  • The time taken must be accurate to within 10 minutes, and ideally should be accurate to the minute.
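
The documented values can be turned into a curation rate and a projected effort. The sketch below is one possible calculation, not an established part of this plan: the `CurationSession` record, its field names, and the session values are hypothetical, and it assumes that incompletely curated phyloreferences count towards the total.

```python
from dataclasses import dataclass


@dataclass
class CurationSession:
    paper: str
    minutes_total: float                # including time to obtain digital phylogenies
    minutes_excluding_retrieval: float  # excluding that time
    phylorefs_complete: int
    phylorefs_incomplete: int


def minutes_per_phyloref(sessions, include_retrieval=True):
    """Average minutes spent per phyloreference across all sessions."""
    minutes = sum(
        s.minutes_total if include_retrieval else s.minutes_excluding_retrieval
        for s in sessions
    )
    # Assumption: incomplete phyloreferences are counted as curated.
    phylorefs = sum(s.phylorefs_complete + s.phylorefs_incomplete for s in sessions)
    if phylorefs == 0:
        raise ValueError("no phyloreferences were curated")
    return minutes / phylorefs


def estimate_hours(sessions, target_phylorefs, include_retrieval=True):
    """Projected hours needed to curate target_phylorefs at the observed rate."""
    return minutes_per_phyloref(sessions, include_retrieval) * target_phylorefs / 60


# Hypothetical sessions, for illustration only.
sessions = [
    CurationSession("Paper A", 90, 75, 4, 1),
    CurationSession("Paper B", 60, 60, 5, 0),
]
print(minutes_per_phyloref(sessions))                          # 15.0
print(estimate_hours(sessions, 100))                           # 25.0
print(estimate_hours(sessions, 100, include_retrieval=False))  # 22.5
```

Reporting both figures keeps the cost of obtaining digital phylogenies separate from the cost of curation itself.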

List of papers to curate

Papers already curated

Papers that could not be curated

Papers that can be curated but cannot be validated

Papers to be curated