Pubmed_mining

My foray into text mining data from PubMed.

##To Do:

- MeSH Headings and Keywords will need to be inspected for run-on words
- Use n-grams (e.g. "stem cell" instead of "cell" and "stem") -- bigram tokenizer initiated
- Create dictionary of relevant terms
- Fix stemCompletion2 code -- a package update may have broken it
- Add topic model river plots to shiny
- Add Grant-PMID network visualization and analytics
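
As an illustration of the n-gram item above, here is a minimal bigram tokenizer sketch. It is written in plain Python rather than the R code this repository uses, and the function name `bigrams` and the sample sentence are hypothetical, chosen only to show why "stem cell" is more informative as one term than as two.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, each joined by a single space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


# Hypothetical sample text, not taken from the repository's data.
text = "stem cell research uses stem cell lines"
tokens = text.split()

# Count how often each two-word phrase occurs.
bigram_counts = Counter(bigrams(tokens))
print(bigram_counts.most_common(1))
```

Counting pairs this way keeps the phrase together, so "stem cell" is tallied as a unit instead of inflating the separate counts for "stem" and "cell".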

##Usage:

The script takes PubMed data in XML form and extracts the abstract for each citation. Abstracts are then processed in what seems to be a pretty standard way (removing numbers and punctuation, then stemming). Stems are completed, and then some basic frequencies and associations are computed. Lastly, three graphics are generated: a word cloud, a dendrogram, and a graph of the most frequently occurring words.
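
The extraction and cleaning steps described above can be sketched roughly as follows. This is a stdlib-only Python illustration, not the repository's R implementation: the sample XML is a hypothetical, heavily trimmed record that only mimics PubMed's `PubmedArticle` / `AbstractText` nesting, the `STOP_WORDS` set is a tiny stand-in, and stemming is left out because the standard library has no stemmer.

```python
import re
import string
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily trimmed PubMed-style export; real records carry many more fields.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <Abstract>
          <AbstractText>Stem cells were cultured for 14 days.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <Abstract>
          <AbstractText>Cultured cells, and the stem cells, were compared.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Tiny illustrative stop word list (an assumption, not the repository's stopwords.txt).
STOP_WORDS = {"the", "and", "for", "were", "was", "of", "in", "a"}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def extract_abstracts(xml_text):
    """Return one abstract string per PubmedArticle element."""
    root = ET.fromstring(xml_text)
    abstracts = []
    for article in root.iter("PubmedArticle"):
        # An abstract may be split across several AbstractText sections.
        parts = ["".join(node.itertext()) for node in article.iter("AbstractText")]
        abstracts.append(" ".join(parts))
    return abstracts


def clean(text):
    """Lowercase, drop digits and punctuation, then remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TO_SPACE)
    return [word for word in text.split() if word not in STOP_WORDS]


abstracts = extract_abstracts(SAMPLE_XML)
term_counts = Counter(token for abstract in abstracts for token in clean(abstract))
print(term_counts.most_common(3))
```

The resulting `term_counts` is the kind of frequency table that a word cloud or a most-frequent-terms graph would be drawn from.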

##Qualitative Performance Notes:

XML reading and traversing seems memory efficient and fast. Whatever problem I encountered previously has been resolved with better functions.

tm_map calls seem relatively speedy. Stop word removal and stemming are by far slower than lowercasing and number removal. Stem completion is very slow; distributing the task helps, but a large corpus may need to be moved to a larger machine. However, memory usage has been reasonable throughout the transformation process.
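
To make the stem completion step concrete, here is a minimal Python sketch of one common heuristic: replace each stem with the most frequent dictionary word that begins with it. This is an assumption about how such a step can be structured, not a description of the repository's stemCompletion2 code. The function name `build_completion_table` and the sample counts are hypothetical. The sketch resolves each distinct stem once and reuses the result, which is one possible way to cut repeated work; the set of distinct stems could also be split across workers.

```python
from collections import Counter


def build_completion_table(stems, dictionary_counts):
    """Map each distinct stem to its most frequent dictionary word that starts with it."""
    table = {}
    for stem in set(stems):
        candidates = [
            (count, word)
            for word, count in dictionary_counts.items()
            if word.startswith(stem)
        ]
        if candidates:
            # Highest count wins; ties fall back to alphabetical order of the word.
            table[stem] = max(candidates)[1]
    return table


# Hypothetical word frequencies from the unstemmed corpus.
dictionary_counts = Counter({"cultured": 5, "culture": 2, "cells": 7, "cell": 3})
stems = ["cultur", "cell", "cultur", "xyz"]

table = build_completion_table(stems, dictionary_counts)

# Stems with no match in the dictionary are left unchanged.
completed = [table.get(stem, stem) for stem in stems]
print(completed)
```

Each stem still requires a scan over the dictionary, so the cost grows with both the number of distinct stems and the dictionary size, which is consistent with this step dominating the run time on a large corpus.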

