From d4aa1c0188ffb69df0c4c27156cf541768c82b9a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Timoth=C3=A9e=20Poisot?= Date: Wed, 9 Dec 2020 10:18:16 -0500 Subject: [PATCH] =?UTF-8?q?=F0=9F=93=96=20update=20the=20portal=20use-case?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/Project.toml | 1 + docs/src/portal.md | 87 ++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 85 insertions(+), 3 deletions(-) diff --git a/docs/Project.toml b/docs/Project.toml index 0b29ef2..80920e7 100644 --- a/docs/Project.toml +++ b/docs/Project.toml @@ -3,3 +3,4 @@ DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" DataFramesMeta = "1313f7d8-7da2-5740-9ea0-a2ca25f37964" Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6" +StringDistances = "88034a9c-02f8-509d-84a9-84ec65e18404" diff --git a/docs/src/portal.md b/docs/src/portal.md index 3aaf08e..1da2b48 100644 --- a/docs/src/portal.md +++ b/docs/src/portal.md @@ -13,11 +13,14 @@ We will download a list of species from figshare, which is given as a JSON file: using NCBITaxonomy using DataFrames using JSON +using StringDistances species_file = download("https://ndownloader.figshare.com/files/3299486") species = JSON.parsefile(species_file) ``` +## Cleaning up the portal names + There is are two things we want to do at this point: extract the species names from the file, and then validate that they are spelled correctly, or that they are the most recent taxonomic name according to NCBI. @@ -58,6 +61,8 @@ end first(cleanup, 5) ``` +## Looking at species with a name discrepancy + Finally, we can look at the codes for which there is a likely issue because the names do not match -- this can be because of new names, improper use of vernacular, or spelling issues: @@ -67,13 +72,13 @@ filter(r -> r.portal != r.name, cleanup) ``` Note that these results should *always* be manually curated. 
For example, two -species have been assigned to groups that are *obiously* wrong: +species have been assigned to groups that are *obviously* wrong: ```@example portal filter(r -> r.order ∈ ["Gentianales","Hemiptera"], cleanup) ``` -How can we fix this? +## Fixing the mis-identified species Well, the obvious choice here is *manual cleaning*. This is a good solution. Another thing that `NCBITaxonomy` offers is the ability to build a `namefinder` @@ -84,5 +89,81 @@ In this case, we know that the species are going to be vertebrates, so we can us the `vertebratefinder` function to restrict the search to these groups: ```@example portal -vertebratefinder(true)("Lizard") +vertebratefinder(true)("Lizard"; fuzzy=true) +``` + +However, this approach does not seem to work for the second group: + +```@example portal +vertebratefinder(true)("Perognathus hispidus"; fuzzy=true) +``` + +## The mystery of the hispid pocket mouse + +This one will not be solved by our approach, as it is an invalid name -- +*Perognathus hispidus* should actually be *Chaetodipus hispidus*. Here is the +list of issues that result in this name not being identifiable easily. First, +*Chaetodipus* is a valid name, for which *Perognathus* is not a synonym. So +searching by genus is not going to help. Second, there are a whole lot of +species that end with *hispidus*, and trying different string distances is not +going to help. We can try anyway: + +```@example portal +vertebratefinder(true)("Perognathus hispidus"; fuzzy=true, dist=DamerauLevenshtein) +``` + +This returns a valid taxon, but an incorrect one (the Olive-backed pocket +mouse). There is no obvious way to solve this problem. + +*Or is there?* + +To solve the issue with Lizards, we had to move away from `taxid`, and use +`vertebratefinder` to limit the scope of the search.
It would save some time to +use this for the entire portal dataset, so let's create a `portalnamesolver` +function: + +```@example portal +portalnamesolver = vertebratefinder(true) ``` + +It currently does *not* help with our example - but this is ok, as we can use +one of Julia's features to hard-code the solution: dispatching on values. +Because `portalnamesolver` is a singleton function (due to the way `namefinder` +works), we need to be explicit about which module we want to expand it from (the +`@__MODULE__` will get the appropriate value, which can be `Main` if you work +from the REPL, the Weave sandbox if you are generating a document, or your own +module if you structure your analysis this way): + +```@example portal +Env = @__MODULE__ +function Env.portalnamesolver(::Type{Val{Symbol("Perognathus hispidus")}}) + return ncbi"Chaetodipus hispidus" +end +``` + +This definition says "every time we call the `portalnamesolver` with a `Symbol` +containing this species name, return this species". We can call it with: + +```@example portal +portalnamesolver(Val{Symbol("Perognathus hispidus")}) +``` + +Note that this is *not* changing the behavior of our `portalnamesolver`, it is +simply adding a method: + +```@example portal +portalnamesolver("Lizards"; fuzzy=true) +``` + +At this point, we may want to update the very first loop, to use the +`portalnamesolver` throughout. + +## Wrapping-up + +This vignette illustrates how to go through a list of names, and match them +against the NCBI taxonomy. We have seen a number of functions from +`NCBITaxonomy`, including fuzzy string searching, using custom string +distances, limiting the taxonomic scope of the search, and finally using +value-based dispatch to fix the unfixable. The last step can be automated a lot +by relying on Julia's existing code generation techniques, but this goes beyond +the scope of this vignette. \ No newline at end of file