MicroRNA results are available in analysis.ipynb, viewable here
DNAm results are available in analysis-dnam.ipynb, viewable here
This project uses virtualenv to create isolated Python environments.
- Download isoforms from 17 different classes of cancer from TCGA
- Put all samples of the same type into a matrix using rptashkin's TCGA_miRNASeq_Matrix (rows are features; columns are samples)
- Merge matrices
- Transpose
- Randomize, split labels
- Select features based on low NA-values
- Put all samples of the same type into a matrix using rptashkin's TCGA_miRNASeq_Matrix (rows are features; columns are samples)
- Merge matrices
- Transpose
- Test random forest, KNN, and SVM baselines
- Visualize Keras tuning data from cluster
- Attempt cross validation
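
The merge, transpose, and randomize/split-labels steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in the notebooks: the nested-dict layout, the function names `merge_and_transpose` and `shuffle_and_split_labels`, and the toy cancer labels and miRNA IDs are all assumptions made for the example.

```python
import random


def merge_and_transpose(matrices):
    """Merge per-class matrices and flip them so samples become rows.

    `matrices` is a hypothetical layout: {class_label: {feature: {sample_id: value}}},
    i.e. each per-class matrix has features as rows and samples as columns.
    Features missing from a class are filled with None (NA).
    """
    # Union of all feature names across classes, in a stable order.
    features = sorted({f for m in matrices.values() for f in m})
    rows = []
    for label, matrix in matrices.items():
        samples = sorted({s for column in matrix.values() for s in column})
        for sample in samples:
            values = [matrix.get(f, {}).get(sample) for f in features]
            rows.append((sample, label, values))
    return features, rows


def shuffle_and_split_labels(rows, seed=0):
    """Shuffle sample rows reproducibly, then separate values (X) from labels (y)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    X = [values for _, _, values in shuffled]
    y = [label for _, label, _ in shuffled]
    return X, y


# Toy input: two classes, three samples, three features in total.
toy = {
    "BRCA": {
        "hsa-mir-21": {"s1": 5.0, "s2": 7.0},
        "hsa-mir-155": {"s1": 1.0, "s2": 2.0},
    },
    "LUAD": {
        "hsa-mir-21": {"s3": 3.0},
        "hsa-let-7a": {"s3": 9.0},
    },
}

features, rows = merge_and_transpose(toy)
X, y = shuffle_and_split_labels(rows, seed=42)
```

Fixing the seed keeps the shuffle reproducible, which makes baseline comparisons across runs easier to interpret.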
- Download 27k Illumina samples from TCGA using TCGA2STAT
- Get data from TCGA using tcga2stat.R
- Select features based on low NA-values
- Select for high variability (20-80 percentile)
- Merge samples into one data matrix
- Randomize, split labels
- Baseline models to gauge accuracy before feature selection
- Tune NNet hyperparameters
- Visualize tuning data
- Does feature selection improve random forest model?
- Does feature selection improve NNet model?
- Scaling (0,1)
- Try KNN and SVM baselines
- High variability feature selection
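
As a rough illustration of the NA filtering, variability selection, and (0,1) scaling steps listed above, the sketch below uses only the Python standard library. It assumes one possible reading of "20-80 percentile" (keep features whose variance lies between the 20th and 80th percentiles of all feature variances); the function names, the `max_na_fraction` threshold, and the toy probe IDs are hypothetical and not taken from the notebooks.

```python
import statistics


def drop_high_na_features(X, names, max_na_fraction=0.5):
    """Keep features (columns) whose fraction of missing (None) values is low."""
    n = len(X)
    keep = [j for j in range(len(names))
            if sum(row[j] is None for row in X) / n <= max_na_fraction]
    return [[row[j] for j in keep] for row in X], [names[j] for j in keep]


def column_variance(X, j):
    """Population variance of column j, ignoring missing values."""
    observed = [row[j] for row in X if row[j] is not None]
    return statistics.pvariance(observed) if observed else 0.0


def select_variance_band(X, names, lo_pct=20, hi_pct=80):
    """Keep features whose variance lies between two percentiles of all variances."""
    variances = [column_variance(X, j) for j in range(len(names))]
    cuts = statistics.quantiles(variances, n=100, method="inclusive")
    lo, hi = cuts[lo_pct - 1], cuts[hi_pct - 1]
    keep = [j for j, v in enumerate(variances) if lo <= v <= hi]
    return [[row[j] for j in keep] for row in X], [names[j] for j in keep]


def min_max_scale(X):
    """Scale each feature to [0, 1]; constant columns map to 0.0, None is kept."""
    scaled = [list(row) for row in X]
    for j in range(len(X[0])):
        observed = [row[j] for row in X if row[j] is not None]
        lo, hi = min(observed), max(observed)
        span = hi - lo
        for row in scaled:
            if row[j] is not None:
                row[j] = (row[j] - lo) / span if span else 0.0
    return scaled


# Toy beta-value matrix: rows are samples, columns are probes.
names = ["cg1", "cg2", "cg3", "cg4", "cg5", "cg6"]
X = [
    [0.5, 0.1, 0.0, 0.0, None, 0.2],
    [0.5, 0.2, 0.4, 1.0, None, 0.8],
    [0.5, 0.1, 0.0, 0.0, None, 0.2],
    [0.5, 0.2, 0.4, 1.0, 0.3, 0.8],
]

X_na, names_na = drop_high_na_features(X, names)      # drops the mostly-missing cg5
X_sel, kept_names = select_variance_band(X_na, names_na)
scaled = min_max_scale(X_sel)
```

On real 27k-probe data the same steps would more naturally be expressed with array libraries; the plain-Python version is only meant to make the order of operations explicit.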
- Process methylation data
- Import additional metastatic datasets
- Attempt on non-TCGA datasets
- Tang, W. et al. (2017). Tumor origin detection with tissue-specific miRNA and DNA methylation markers. Bioinformatics, 34:3. [https://doi.org/10.1093/bioinformatics/btx622](https://doi.org/10.1093/bioinformatics/btx622)
- Zhuang, J. et al. (2012). A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform. BMC Bioinformatics, 13:59. [https://doi.org/10.1186/1471-2105-13-59](https://doi.org/10.1186/1471-2105-13-59)
This work utilizes resources supported by the National Science Foundation's Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign.