
Biomedical pre-trained word embeddings #28

Open
gbrokos opened this issue May 7, 2018 · 13 comments
@gbrokos
gbrokos commented May 7, 2018

We (AUEB's NLP group: http://nlp.cs.aueb.gr/) recently released word embeddings pre-trained on text from 27 million biomedical articles from the MEDLINE/PubMed Baseline 2018.

Two versions of word embeddings are provided, both in Word2Vec's C binary format:
200-dimensional: https://archive.org/download/pubmed2018_w2v_200D.tar/pubmed2018_w2v_200D.tar.gz
400-dimensional: https://archive.org/download/pubmed2018_w2v_400D.tar/pubmed2018_w2v_400D.tar.gz

Each .tar.gz file contains a folder with the pre-trained model and a readme file which you can also find here:
https://archive.org/download/pubmed2018_w2v_200D.tar/README.txt
The readme file contains details, statistics and license information for this dataset.

We would be happy to contribute this dataset to the gensim-data project. Let me know if you need any additional information or changes to the files' format.

Code example: Load and use the 200D pre-trained model.

$ tar -zxvf pubmed2018_w2v_200D.tar.gz
$ python3
>>> from gensim.models import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format('pubmed2018_w2v_200D/pubmed2018_w2v_200D.bin', binary=True)
>>> word_vectors.most_similar(positive=['dna'])
[('deoxyribonucleic', 0.7673395872116089), ('dnas', 0.7249159216880798), ('dnarna', 0.72159743309021), ('dsdna', 0.68665611743927), ('ndna', 0.6813312768936157), ('checkpoint-blind', 0.6774483919143677), ('ssdna', 0.677314043045044), ('multi-dna', 0.6761660575866699), ('o⁶-methylguanine', 0.670427680015564), ('mtdna', 0.6684218645095825)]
@piskvorky
Owner

Thanks @gbrokos! That's definitely useful.

What we also need is a clear description of the preprocessing (especially since this is the biomedical domain, where good tokenization / phrase detection is important). How can users of your dataset match this preprocessing, in order to look up words?

The license seems a bit limiting. What is the reason not to allow commercial use?

@gbrokos
Author

gbrokos commented May 9, 2018

For text preprocessing we used the "bioclean" lambda function defined in the code below. Originally this was included in the toolkit.py script that accompanies the word embeddings of the BioASQ challenge. I just removed the surrounding ' '.join() to avoid re-joining the tokens with spaces after splitting.

Here is a python code example:

import re

# Cleaning used for the BioASQ embeddings: drop quotes and slashes,
# delete punctuation, lowercase, then split on whitespace.
# Note: ':-\[' inside the character class forms a range that also covers
# uppercase letters, which is harmless since the text is lowercased first.
bioclean = lambda t: re.sub(r'[.,?;*!%^&_+():-\[\]{}]', '', t.replace('"', '').replace('/', '').replace('\\', '').replace("'", '').strip().lower()).split()

tokens = bioclean('This is a sentence to preprocess and tokenize!')
print(tokens)

Output:
['this', 'is', 'a', 'sentence', 'to', 'preprocess', 'and', 'tokenize']

Other than that, we followed the exact workflow described in the readme file's preprocessing section.
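So to look up words from raw text in the embeddings, you should run the same cleaning step first. A minimal sketch, assuming word_vectors has been loaded with KeyedVectors as in the first comment:

# Apply the same preprocessing before vocabulary lookups.
for token in bioclean('DNA repair in E. coli'):
    if token in word_vectors:  # skip out-of-vocabulary tokens
        print(token, word_vectors[token][:3])  # first 3 of 200 dimensions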

Regarding the license, the MEDLINE/PubMed Terms and Conditions declare that "some PubMed/MEDLINE abstracts may be protected by copyright." We are still not sure whether this affects word embeddings produced from this dataset.

MEDLINE/PubMed Terms and Conditions: https://www.nlm.nih.gov/databases/download/terms_and_conditions.html

@zhqu1148980644

Downloading failure. Please upload your file to another place.

@piskvorky
Owner

CC @mpenkov

@gbrokos
Author

gbrokos commented May 7, 2019

Downloading failure. Please upload your file to another place.

The links in the original comment have been updated. It should work now.

@prabhatM

@gbrokos

Thank you for the pretrained file.

Why do simple disorders like "breast cancer" and "heart attack" show as out of vocabulary? PubMed must have references to such common disorders!

@gbrokos
Author

gbrokos commented May 18, 2019

Hi, the preprocessing and tokenization of the text was done as described above. Word2vec was trained on the words resulting from this process, not on bi-grams like "breast cancer" or "heart attack". However, word embeddings for the uni-grams "breast", "cancer", "heart" and "attack" do exist.
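If you need an approximate vector for such phrases, one common heuristic (an approximation, not part of the released model) is to average the uni-gram vectors; gensim's most_similar performs a similar averaging over normalized vectors when given several positive words:

import numpy as np

# Rough phrase vector: the mean of the uni-gram embeddings.
phrase_vec = np.mean([word_vectors['breast'], word_vectors['cancer']], axis=0)

# most_similar averages multiple positive words in the same spirit.
print(word_vectors.most_similar(positive=['breast', 'cancer'], topn=5))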

@prabhatM

Hi,
I was testing pubmed_word2vec_2018 in one of my projects and got a long list of OOV words:
oov_words = ['craniofaciofrontodigital', 'neurolepticinduced', 'cerabrooculonasal','inhalationinduced', 'cooccurrent', 'papillonlèfevre', 'nephroticnephritic', 'atretocephalus', 'seablue', 'unverrichtlundborg', 'portulacca', 'faceinduced', 'hexachlorine', 'twolevel', 'charcotmarietooth', 'dysphagocytosis', 'copperassociated', 'hugelbergwelander']

At the rate I am getting OOV words after processing only a small chunk of my data, I would expect roughly 100x this list.

I guess one needs a text file, not the .bin file, to add to the vocabulary, right?
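For reference, coverage can be checked directly against the loaded vectors; a sketch, assuming word_vectors is loaded as in the first comment:

# Count how many of the listed tokens are in the released vocabulary.
covered = [w for w in oov_words if w in word_vectors]
print(len(covered), 'of', len(oov_words), 'tokens found in vocabulary')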

@prabhatM

Hi,
Is it possible to have access to the .vec file for some retraining experimentation?

Regards

Prabhat
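A note on this: the released .bin file can be converted to the plain-text .vec format locally with gensim (a minimal sketch below), though continuing training would still require the full model, which is not part of the release.

from gensim.models import KeyedVectors

# Load the binary release and re-save it in the plain-text word2vec
# format (a header line, then one word and its vector per line).
wv = KeyedVectors.load_word2vec_format(
    'pubmed2018_w2v_200D/pubmed2018_w2v_200D.bin', binary=True)
wv.save_word2vec_format('pubmed2018_w2v_200D.vec', binary=False)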

@mpenkov mpenkov self-assigned this Jun 11, 2019
@dkajtoch

dkajtoch commented Sep 1, 2019

Do you include the full model with both input and output layers (center and context word embeddings)?

@romanegloo

FYI, the vocabulary size is 2,665,547, and the 100 most common words are:

['the', 'of', 'and', 'in', 'to', 'a', 'with', 'for', 'was', 'were', 'is', 'by', 'that', 'on', 'patients', 'as', 'from', 'this', 'or', 'an', 'are', 'at', 'be', 'we', 'study', 'results', 'not', 'these', 'cells', 'after', 'between', 'have', 'which', 'treatment', 'than', 'using', 'cell', 'but', 'been', 'has', 'group', 'during', 'p', 'both', 'two', 'may', 'it', 'their', 'also', 'had', 'all', 'more', 'used', 'no', 'disease', 'can', 'clinical', 'activity', 'analysis', 'data', '1', 'methods', 'expression', 'protein', 'effects', 'effect', 'increased', '2', 'associated', 'levels', 'compared', 'significantly', 'studies', 'other', 'human', 'significant', 'cancer', 'found', 'one', 'its', 'different', 'high', 'showed', 'use', 'control', 'there', 'risk', 'however', 'years', 'when', 'into', 'time', 'our', 'most', 'only', '3', 'gene', 'cases', 'blood', 'health']
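These figures can be reproduced from the loaded vectors. A sketch using the gensim 3.x API current at the time (in gensim 4.x, the equivalents are len(word_vectors.key_to_index) and word_vectors.index_to_key[:100]):

# word2vec C-format files store the vocabulary in descending frequency
# order, so the first 100 entries are the most common words.
print(len(word_vectors.vocab))        # 2665547
print(word_vectors.index2word[:100])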

@mariask2

Thanks a lot for creating this model!
I'm using it in this Kaggle Notebook:
https://www.kaggle.com/mariaskeppstedt/a-risk-factor-logistic-regression-model
I'm referencing this web page in the Notebook. Would you like me to reference something else, e.g., a paper about it?

@gbrokos
Author

gbrokos commented Apr 30, 2020

Hi @mariask2, glad you found it useful!
If still possible, please cite the publication mentioned at the end of the readme file:
https://archive.org/download/pubmed2018_w2v_200D.tar/README.txt
