Commit

Lots of changes. Added clinspacy_init() and changed how the package loads. Uses miniconda by default but the user can configure.
Singh authored and Singh committed Aug 21, 2020
1 parent a873190 commit f73fa35
Showing 13 changed files with 774 additions and 139 deletions.
287 changes: 256 additions & 31 deletions R/clinspacy.R

Large diffs are not rendered by default.

57 changes: 33 additions & 24 deletions R/cui2vec_data.R
@@ -1,41 +1,50 @@
 #' Cui2vec concept embeddings
 #'
-#' This dataset contains sample medical transcriptions for various medical specialties.
+#' This dataset contains Unified Medical Language System (UMLS) concept embeddings from
+#' Andrew Beam's \href{https://github.com/beamandrew/cui2vec}{cui2vec R package}. There are
+#' 500 embeddings included for each concept.
 #'
-#' Acknowledgements
+#' Citation
 #'
-#' This data was scraped from mtsamples.com by Tara Boyle and is made available
-#' under a CC0: Public Domain license.
+#' Beam, A.L., Kompa, B., Schmaltz, A., Fried, I., Griffin, W, Palmer, N.P., Shi, X.,
+#' Cai, T., and Kohane, I.S., 2019. Clinical Concept Embeddings Learned from Massive
+#' Sources of Multimodal Medical Data. arXiv preprint arXiv:1804.01486.
 #'
-#' @format A data frame with 4999 rows and 6 variables:
+#' License
+#'
+#' This data is made available under a
+#' \href{https://creativecommons.org/licenses/by/4.0/}{CC BY 4.0 license}. The only change
+#' made to the original dataset is the renaming of columns.
+#'
+#' @format A data frame with 109053 rows and 501 variables:
 #' \describe{
-#' \item{note_id}{A unique identifier for each note}
-#' \item{description}{A description or chief concern}
-#' \item{medical_specialty}{Medical specialty of the note}
-#' \item{sample_name}{mtsamples.com note name}
-#' \item{transcription}{Transcription of note text}
-#' \item{keywords}{Keywords}
+#' \item{cui}{A Unified Medical Language System (UMLS) Concept Unique Identifier (CUI)}
+#' \item{emb_001}{Concept embedding vector #1}
+#' \item{emb_002}{Concept embedding vector #2}
+#' \item{...}{...}
+#' \item{emb_500}{Concept embedding vector #500}
 #' }
-#' @source \url{https://www.kaggle.com/tboyle10/medicaltranscriptions/data}
+#' @source \url{https://figshare.com/s/00d69861786cd0156d81}
 'cui2vec_embeddings'

 #' Cui2vec concept definitions
 #'
-#' This dataset contains sample medical transcriptions for various medical specialties.
+#' This dataset contains definitions for the Unified Medical Language System (UMLS)
+#' Concept Unique Identifiers (CUIs). These come from Andrew Beam's
+#' \href{https://github.com/beamandrew/cui2vec}{cui2vec R package}.
 #'
-#' Acknowledgements
+#' License
 #'
-#' This data was scraped from mtsamples.com by Tara Boyle and is made available
-#' under a CC0: Public Domain license.
+#' This data is made available under a
+#' \href{https://github.com/beamandrew/cui2vec/blob/master/LICENSE.md}{MIT license}. The data
+#' is copyrighted in 2019 by Benjamin Kompa, Andrew Beam, and Allen Schmaltz. The only change
+#' made to the original dataset is the renaming of columns.
 #'
-#' @format A data frame with 4999 rows and 6 variables:
+#' @format A data frame with 3053795 rows and 3 variables:
 #' \describe{
-#' \item{note_id}{A unique identifier for each note}
-#' \item{description}{A description or chief concern}
-#' \item{medical_specialty}{Medical specialty of the note}
-#' \item{sample_name}{mtsamples.com note name}
-#' \item{transcription}{Transcription of note text}
-#' \item{keywords}{Keywords}
+#' \item{cui}{A Unified Medical Language System (UMLS) Concept Unique Identifier (CUI)}
+#' \item{semantic_type}{Semantic type of the CUI}
+#' \item{definition}{Definition of the CUI}
 #' }
-#' @source \url{https://www.kaggle.com/tboyle10/medicaltranscriptions/data}
+#' @source \url{https://github.com/beamandrew/cui2vec}
 'cui2vec_definitions'
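
The `cui2vec_embeddings` documentation in this file describes a wide table with one `cui` column and 500 `emb_*` columns. As a rough illustration of how a table in that shape might be used, the sketch below computes the cosine similarity between two concepts. It runs on a made-up toy data frame with only three embedding columns; the CUIs, the values, and the helper name `cui_cosine` are all hypothetical and not part of clinspacy or cui2vec.

```r
# Toy stand-in for cui2vec_embeddings: same column naming (cui, emb_001, ...),
# but only 3 made-up embedding columns instead of 500.
toy_embeddings <- data.frame(
  cui     = c("C0000001", "C0000002", "C0000003"),  # hypothetical CUIs
  emb_001 = c(1, 2, 0),
  emb_002 = c(0, 0, 1),
  emb_003 = c(1, 2, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cosine similarity between the embedding
# vectors of two CUIs stored as rows of a wide data frame.
cui_cosine <- function(embeddings, cui_a, cui_b) {
  emb_cols <- grep("^emb_", names(embeddings), value = TRUE)
  a <- unlist(embeddings[embeddings$cui == cui_a, emb_cols])
  b <- unlist(embeddings[embeddings$cui == cui_b, emb_cols])
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Vectors pointing in the same direction versus orthogonal vectors.
sim_parallel   <- cui_cosine(toy_embeddings, "C0000001", "C0000002")
sim_orthogonal <- cui_cosine(toy_embeddings, "C0000001", "C0000003")
```

Selecting columns by the `emb_` prefix keeps the helper independent of the number of embedding dimensions, so the same call should work on a table with all 500 columns, assuming the column names follow the documented pattern.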
7 changes: 5 additions & 2 deletions R/mtsamples.R
@@ -4,8 +4,11 @@
 #'
 #' Acknowledgements
 #'
-#' This data was scraped from mtsamples.com by Tara Boyle and is made available
-#' under a CC0: Public Domain license.
+#' This data was scraped from \href{https://mtsamples.com}{https://mtsamples.com} by Tara Boyle.
+#'
+#' License
+#' This data is made available under a
+#' \href{https://creativecommons.org/share-your-work/public-domain/cc0/}{CC0: Public Domain license}.
 #'
 #' @format A data frame with 4999 rows and 6 variables:
 #' \describe{
36 changes: 32 additions & 4 deletions README.Rmd
@@ -29,20 +29,26 @@ You can install the GitHub version of clinspacy with:
 remotes::install_github('ML4LHS/clinspacy', INSTALL_opts = '--no-multiarch')
 ```

-## Example
+## Examples

 ```{r}
 library(clinspacy)
 clinspacy('This patient has diabetes and CKD stage 3 but no HTN.')
+clinspacy('This patient is taking omeprazole, Protonix, and lisinopril 10 mg. He has diabetes.',
+          semantic_types = 'Disease or Syndrome')
+clinspacy('This patient is taking omeprazole, Protonix, and lisinopril 10 mg. He has diabetes.',
+          semantic_types = 'Pharmacologic Substance')
 ```

 ## Using the mtsamples dataset

 ```{r}
 data(mtsamples)
-str(mtsamples[1:5,])
+mtsamples[1:5,]
 ```


@@ -51,8 +57,30 @@ str(mtsamples[1:5,])
 This function binds columns containing concept unique identifiers with which scispacy has 99% confidence of being present with values containing frequencies. Negated concepts, as identified by negspacy's NegEx implementation, are ignored and do not count towards the frequencies.

 ```{r}
-mtsamples_with_cuis = bind_clinspacy(mtsamples[1:5,], text = 'description')
+bind_clinspacy(mtsamples[1:5, 1:2],
+               text = 'description')
-str(mtsamples_with_cuis)
+bind_clinspacy(mtsamples[1:5, 1:2],
+               text = 'description',
+               semantic_types = 'Diagnostic Procedure')
 ```

+## Binding Concept Embeddings to a Data Frame

+```{r}
+bind_clinspacy_embeddings(mtsamples[1:5, 1:2],
+                          text = 'description',
+                          num_embeddings = 5)
+bind_clinspacy_embeddings(mtsamples[1:5, 1:2],
+                          text = 'description',
+                          num_embeddings = 5,
+                          semantic_types = 'Diagnostic Procedure')
+```

+# UMLS CUI definitions

+```{r}
+data(cui2vec_definitions)
+head(cui2vec_definitions)
+```
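
The README text in this diff describes `bind_clinspacy()` as binding one frequency column per detected concept while ignoring negated mentions. The base R sketch below illustrates that idea only. It is not clinspacy's implementation: the `entities` table, its `negated` flag, and the CUIs are made-up stand-ins for what the NLP pipeline would produce, and the 99% confidence filter is not modeled.

```r
# Hypothetical input data frame, one row per note.
notes <- data.frame(
  note_id     = 1:3,
  description = c("note a", "note b", "note c"),
  stringsAsFactors = FALSE
)

# Hypothetical per-mention output: one row per detected concept mention.
entities <- data.frame(
  note_id = c(1, 1, 1, 2, 3),
  cui     = c("C0000001", "C0000001", "C0000002", "C0000002", "C0000001"),
  negated = c(FALSE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Drop negated mentions so they do not count towards the frequencies.
kept <- entities[!entities$negated, ]

# Count mentions per note and CUI; the factor levels keep notes with no
# remaining mentions as all-zero rows.
counts <- table(
  note_id = factor(kept$note_id, levels = notes$note_id),
  cui     = kept$cui
)

# Bind one frequency column per CUI onto the original data frame.
freq_cols <- as.data.frame.matrix(counts)
notes_with_cuis <- cbind(notes, freq_cols)
```

In this toy data, note 3 contains only a negated mention, so it keeps its row but gets zeros in every concept column, which matches the behavior the README describes for negated concepts.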
