📖 update the portal use-case
tpoisot committed Dec 9, 2020
1 parent 2d6a626 commit d4aa1c0
Showing 2 changed files with 85 additions and 3 deletions.
1 change: 1 addition & 0 deletions docs/Project.toml
@@ -3,3 +3,4 @@ DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
DataFramesMeta = "1313f7d8-7da2-5740-9ea0-a2ca25f37964"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
StringDistances = "88034a9c-02f8-509d-84a9-84ec65e18404"
87 changes: 84 additions & 3 deletions docs/src/portal.md
@@ -13,11 +13,14 @@ We will download a list of species from figshare, which is given as a JSON file:
using NCBITaxonomy
using DataFrames
using JSON
using StringDistances
species_file = download("https://ndownloader.figshare.com/files/3299486")
species = JSON.parsefile(species_file)
```

## Cleaning up the portal names

There are two things we want to do at this point: extract the species names
from the file, and then validate that they are spelled correctly, or that they
are the most recent taxonomic name according to NCBI.
@@ -58,6 +61,8 @@ end
first(cleanup, 5)
```

## Looking at species with a name discrepancy

Finally, we can look at the codes for which there is a likely issue because the
names do not match -- this can be because of new names, improper use of
vernacular, or spelling issues:
@@ -67,13 +72,13 @@ filter(r -> r.portal != r.name, cleanup)
```

Note that these results should *always* be manually curated. For example, two
species have been assigned to groups that are *obviously* wrong:

```@example portal
filter(r -> r.order ∈ ["Gentianales","Hemiptera"], cleanup)
```

How can we fix this?
## Fixing the mis-identified species

Well, the obvious choice here is *manual cleaning*. This is a good solution.
Another thing that `NCBITaxonomy` offers is the ability to build a `namefinder`
@@ -84,5 +89,81 @@ In this case, we know that the species are going to be vertebrates, so we can use
the `vertebratefinder` function to restrict the search to these groups:

```@example portal
vertebratefinder(true)("Lizard")
vertebratefinder(true)("Lizard"; fuzzy=true)
```
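
To see why restricting the scope helps, here is a minimal sketch in plain Julia. It does not use `NCBITaxonomy`: the table `toy_names` and the helper `toyfinder` are made up for illustration, and the real `namefinder` may work quite differently. The only point is that a closure built over a smaller pool of names cannot return a match from the wrong group.

```julia
# Made-up toy table; these rows are NOT taken from the NCBI taxonomy.
toy_names = [
    (name = "lizard bug", group = :insects),
    (name = "lizard vine", group = :plants),
    (name = "lizards", group = :vertebrates),
]

# Hypothetical helper: build a search function restricted to some groups.
function toyfinder(table, groups)
    pool = [row.name for row in table if row.group in groups]
    # The returned closure only ever sees the restricted pool.
    return query -> begin
        hits = filter(n -> occursin(lowercase(query), n), pool)
        isempty(hits) ? nothing : first(hits)
    end
end

anyfinder = toyfinder(toy_names, (:insects, :plants, :vertebrates))
vertfinder = toyfinder(toy_names, (:vertebrates,))

anyfinder("Lizard")   # first hit comes from an unrelated group
vertfinder("Lizard")  # only vertebrate names are considered
```

With the unrestricted finder, the first substring hit is an insect name; with the restricted one, only the vertebrate entry can be returned.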

However, this approach does not seem to work for the second group:

```@example portal
vertebratefinder(true)("Perognathus hispidus"; fuzzy=true)
```

## The mystery of the hispid pocket mouse

This one will not be solved by our approach, as it is an invalid name --
*Perognathus hispidus* should actually be *Chaetodipus hispidus*. Here is the
list of issues that make this name hard to identify. First,
*Chaetodipus* is a valid name, for which *Perognathus* is not a synonym. So
searching by genus is not going to help. Second, there are a whole lot of
species that end with *hispidus*, and trying different string distances is not
going to help. We can try:

```@example portal
vertebratefinder(true)("Perognathus hispidus"; fuzzy=true, dist=DamerauLevenshtein)
```

This returns a valid taxon, but an incorrect one (the Olive-backed pocket
mouse). There is no obvious way to solve this problem.
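
The reason a string distance cannot rescue this name can be sketched without any package. The code below is a hand-written Levenshtein distance, not the `StringDistances` implementation, and the two-name candidate pool is an assumption made for illustration (the real search runs over many more names). The outdated name shares its whole genus with the wrong species, so it sits closer to it than to the correct one.

```julia
# Plain Levenshtein distance, computed with two rows of the DP table.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))          # distance from "" to each prefix of b
    for i in eachindex(s)
        curr = similar(prev)
        curr[1] = i                      # distance from a[1:i] to ""
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Return the candidate with the smallest distance to the query.
closest(query, candidates) =
    candidates[argmin([levenshtein(query, c) for c in candidates])]

# Hypothetical candidate pool, for illustration only.
candidates = ["Chaetodipus hispidus", "Perognathus fasciatus"]
best = closest("Perognathus hispidus", candidates)
```

Here `best` is the name sharing the genus, not the taxonomically correct one, which is the same failure mode as above.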

*Or is there?*

To solve the issue with Lizards, we had to move away from `taxid`, and use
`vertebratefinder` to limit the scope of the search. It would save some time to
use this for the entire portal dataset, so let's create a `portalnamesolver`
function:

```@example portal
portalnamesolver = vertebratefinder(true)
```

It currently does *not* help with our example, but this is ok, as we can use
one of Julia's features to hard-code the solution: dispatching on values.
Because `portalnamesolver` is a singleton function (due to the way `namefinder`
works), we need to be explicit about which module we want to extend it in.
`@__MODULE__` will give the appropriate value, which can be `Main` if you work
from the REPL, the Weave sandbox if you are generating a document, or your own
module if you structure your analysis this way:

```@example portal
Env = @__MODULE__
function Env.portalnamesolver(::Type{Val{Symbol("Perognathus hispidus")}})
    return ncbi"Chaetodipus hispidus"
end
```

This definition says "every time we call `portalnamesolver` with the `Val` type
built from a `Symbol` of this species name, return this species". We can call it
with:

```@example portal
portalnamesolver(Val{Symbol("Perognathus hispidus")})
```

Note that this is *not* changing the behavior of our `portalnamesolver`, it is
simply adding a method:

```@example portal
portalnamesolver("Lizards"; fuzzy=true)
```
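
The same behavior can be checked on a toy function, independent of `NCBITaxonomy`. In this sketch `toysolver` is a made-up stand-in, defined as an ordinary generic function (so no module qualification is needed, unlike the case above). Counting the methods before and after shows that the `Val` definition adds a second method and leaves the string method untouched.

```julia
# Toy stand-in for the finder; it is not part of NCBITaxonomy.
toysolver(name::AbstractString) = "searched for: " * name

before = length(methods(toysolver))

# Add a method that only matches one specific Val type.
toysolver(::Type{Val{Symbol("Perognathus hispidus")}}) = "Chaetodipus hispidus"

after = length(methods(toysolver))

fixed = toysolver(Val{Symbol("Perognathus hispidus")})  # hard-coded method
usual = toysolver("Lizards")                            # string method, unchanged
```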

At this point, we may want to update the very first loop, to use the
`portalnamesolver` throughout.
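
One possible shape for such a loop is sketched below, again on toy stand-ins rather than the real finder: `toysolver`, `solve`, and `portal_names` are hypothetical names. The wrapper asks `applicable` whether a hard-coded `Val` method exists for a name, and falls back to the regular search otherwise.

```julia
# Toy stand-ins; none of these names come from NCBITaxonomy.
toysolver(name::AbstractString) = "searched for: " * name
toysolver(::Type{Val{Symbol("Perognathus hispidus")}}) = "Chaetodipus hispidus"

# Prefer a hard-coded method when one exists, otherwise use the search.
function solve(name::AbstractString)
    key = Val{Symbol(name)}
    return applicable(toysolver, key) ? toysolver(key) : toysolver(name)
end

portal_names = ["Perognathus hispidus", "Lizards"]
resolved = [solve(n) for n in portal_names]
```

Building a `Val` type from a runtime string means dispatch is resolved dynamically, which is fine for a list of a few dozen names but is a design choice to revisit for large datasets.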

## Wrapping-up

This vignette illustrates how to go through a list of names, and match them
against the NCBI taxonomy. We have seen a number of functions from
`NCBITaxonomy`, including fuzzy string searching, using custom string
distances, limiting the taxonomic scope of the search, and finally using
value-based dispatch to fix the unfixable. The last step can be largely
automated by relying on Julia's existing code generation techniques, but this
goes beyond the scope of this vignette.
