
Issue/237 #241

Open: wants to merge 53 commits into dev
Conversation

NickEdwards7502 (Collaborator)

Major issues and features addressed in this update

  • VariantSpark's python wrapper has been refactored to create Random Forest models from a standalone class

    • Previously, in the non-hail VariantSpark release, the model was initialised and trained from the context of importance analyses. This did not seem appropriate for supporting future releases
    • A new scala function was created to return a trained RandomForest model without using hail
    • Files updated/created
      • python/varspark/rfmodel.py
      • python/varspark/core.py
      • python/varspark/__init__.py
      • src/main/scala/au/csiro/variantspark/api/GetRFModel.scala
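The decoupling described above might look like the following minimal sketch: the random forest is a standalone class, and the importance analysis is derived from a fitted model rather than the model being created inside the analysis. All class and method names here are illustrative placeholders, not the actual varspark API.

```python
class ImportanceAnalysis:
    """Placeholder analysis object created *from* a fitted model."""
    def __init__(self, model):
        self.model = model

class RFModel:
    """Standalone random forest model class (illustrative only)."""
    def __init__(self, n_trees=100):
        self.n_trees = n_trees
        self.fitted = False

    def fit(self, features, labels):
        # The real wrapper delegates training to the Scala RandomForest
        # through the JVM gateway; here we only flag the model as fitted.
        self.fitted = True
        return self

    def importance_analysis(self):
        # Analyses are created from the model's context, not vice versa.
        if not self.fitted:
            raise ValueError("call fit() before requesting an importance analysis")
        return ImportanceAnalysis(self)

model = RFModel(n_trees=500).fit(features=[[0, 1], [1, 0]], labels=[0, 1])
analysis = model.importance_analysis()
```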
  • A non-hail export model function was created

    • This function processes trees in batches to avoid out-of-memory (OOM) errors with very large models
    • Files created
      • src/main/scala/au/csiro/variantspark/api/ExportModel.scala
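The batching idea can be sketched as follows: rather than collecting the whole forest at once, trees are fetched and serialised a slice at a time, so at most one batch is materialised in memory. The `fetch_batch` callback stands in for pulling a slice of trees from the JVM; all names are assumptions, not the actual varspark implementation.

```python
import json
import os
import tempfile

def export_forest(fetch_batch, n_trees, path, batch_size=100):
    """Stream a forest to a JSON array one batch at a time (illustrative)."""
    with open(path, "w") as out:
        out.write("[")
        written = 0
        for start in range(0, n_trees, batch_size):
            # Only batch_size trees are held in memory at any point.
            for tree in fetch_batch(start, min(batch_size, n_trees - start)):
                if written:
                    out.write(",")
                json.dump(tree, out)
                written += 1
        out.write("]")

# Toy stand-in for the JVM-side forest: 250 tiny trees fetched in slices.
trees = [{"id": i, "depth": 3} for i in range(250)]
path = os.path.join(tempfile.gettempdir(), "forest_sketch.json")
export_forest(lambda s, n: trees[s:s + n], len(trees), path, batch_size=100)
```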
  • The FeatureSource class, which provides wrapper functionalities for initialising genotype data for model training, has been moved to a standalone class

    • For better separation of concerns, this class is now imported to the core python wrapper
    • head(nrows, ncols) allows the first n rows and columns to be viewed as a pandas DataFrame
    • Files updated/created
      • python/varspark/featuresource.py
      • python/varspark/core.py
      • src/main/scala/au/csiro/variantspark/input/FeatureSource.scala
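The `head(nrows, ncols)` behaviour can be sketched like this: only the first `nrows` features, and the first `ncols` samples of each, are collected, rather than the whole matrix. In the real wrapper the slice is handed to a pandas DataFrame; the data layout here is an assumption for illustration.

```python
def head(feature_rows, nrows=10, ncols=10):
    """Collect only the first nrows features and ncols values of each
    (illustrative sketch, not the actual varspark method)."""
    sliced = []
    for i, (label, values) in enumerate(feature_rows):
        if i >= nrows:
            break  # stop early instead of scanning every feature
        sliced.append((label, values[:ncols]))
    return sliced

features = [("snp_1", [0, 1, 2, 0]), ("snp_2", [1, 1, 0, 2]), ("snp_3", [2, 0, 0, 1])]
preview = head(features, nrows=2, ncols=3)
```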
  • Covariate support was extended

    • Covariate sources can be created from transposed or non-transposed .csv or .txt files, with optional per-feature type specification
    • From there, covariates can be unioned with a genotype FeatureSource and passed to the model as training data
    • Since covariate sources are initialised from the same FeatureSource wrapper class and are also of type RDD[Feature], they support head() as well
    • Note that feature and covariate sources can be unioned multiple times
    • Local FDR was updated to remove non-genotype information from manhattan plotting
    • Files updated/created
      • python/varspark/core.py
      • src/main/scala/au/csiro/variantspark/api/VSContext.scala
      • src/main/scala/au/csiro/variantspark/input/CsvStdFeatureSource.scala
      • src/main/scala/au/csiro/variantspark/input/UnionedFeatureSource.scala
      • python/varspark/lfdrvsnohail.py
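The union step above can be illustrated with plain dicts (feature name mapped to per-sample values) standing in for the real RDD[Feature] sources; the function name and the sample-count check are assumptions, not the actual varspark API. Because the result has the same shape as its inputs, it can be unioned again, matching the note above.

```python
def union_feature_sources(genotypes, covariates):
    """Union two feature sources after checking that every feature covers
    the same number of samples (illustrative sketch)."""
    lengths = {len(v) for v in list(genotypes.values()) + list(covariates.values())}
    if len(lengths) > 1:
        raise ValueError("feature sources cover different numbers of samples")
    merged = dict(genotypes)
    merged.update(covariates)
    return merged

genotypes = {"snp_1": [0, 1, 2], "snp_2": [1, 1, 0]}
covariates = {"age": [63, 48, 71], "bmi": [24.1, 30.5, 22.8]}
training = union_feature_sources(genotypes, covariates)
# A unioned source can itself be unioned with further covariates.
training = union_feature_sources(training, {"sex": [0, 1, 0]})
```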
  • Importance analyses were moved to a standalone python wrapper class

    • Importance analyses are now created from the context of a random forest model
    • Functionality remains largely the same, with a few changes
      • Both important_variables() and variable_importance() are now returned as pandas DataFrames
      • Split counts are now included in the DataFrame returned by variable_importance() (required for Local FDR calculations)
      • Optional parameter precision supports rounding for variable_importance()
      • Optional parameter normalized indicates whether to normalise importances for both functions
    • Files updated/created
      • python/varspark/importanceanalysis.py
      • python/varspark/core.py
      • src/main/scala/au/csiro/variantspark/api/ImportanceAnalysis.scala
      • src/main/scala/au/csiro/variantspark/api/AnalyticsFunctions.scala
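The `precision` and `normalized` parameters described above behave roughly as in this sketch, which returns plain tuples where the real wrapper returns a pandas DataFrame; the signature is an assumption, not the actual varspark method.

```python
def variable_importance(importances, split_counts, normalized=True, precision=None):
    """Sketch of the optional parameters described above: normalisation so
    importances sum to 1, rounding to `precision` decimals, and a split
    count column per variable (illustrative only)."""
    total = sum(importances.values()) if normalized else 1.0
    rows = []
    for variable, importance in importances.items():
        value = importance / total if normalized else importance
        if precision is not None:
            value = round(value, precision)
        rows.append((variable, value, split_counts.get(variable, 0)))
    return rows

rows = variable_importance({"snp_1": 3.0, "snp_2": 1.0},
                           {"snp_1": 12, "snp_2": 4},
                           precision=2)
```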
  • Moved the LFDR file to the non-hail python directory

    • Created a function for Manhattan plotting of LFDR-derived p-values
    • Files removed/created
      • python/varspark/hail/lfdrvs.py
      • python/varspark/lfdrvs.py
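The core transformation behind a Manhattan plot is converting p-values to the -log10 scale used on the y-axis; the sketch below shows only that step, with the actual scatter plot (e.g. via matplotlib, coloured by chromosome) omitted. The function name is illustrative, not the actual varspark API.

```python
import math

def neg_log10(pvalues):
    """Transform LFDR-derived p-values to the -log10 scale plotted on a
    Manhattan plot's y-axis (illustrative sketch)."""
    return [-math.log10(p) for p in pvalues]

# Smaller p-values produce taller points on the plot.
heights = neg_log10([0.05, 1e-3, 5e-8])
```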
  • Updated all test cases according to the above changes

    • Files updated/removed/created
      • src/test/scala/au/csiro/variantspark/api
        • /CommonPairwiseOperationTest.scala
        • /ImportanceApiTest.scala
      • src/test/scala/au/csiro/variantspark/misc
        • /ReproducibilityTest.scala
        • /CovariateReproducibilityTest.scala
      • src/test/scala/au/csiro/variantspark/test
        • /TestSparkContext.scala
      • python/varspark/test
        • /test_core.py
        • /test_hail.py
        • /test_pvalues_calculation.py
      • src/test/scala/au/csiro/variantspark/work/hail
        • /HailApiApp.scala
  • Removed all files used exclusively in hail version

    • python/varspark/hail
      • __init__.py
      • context.py
      • hail.py
      • methods.py
      • plot.py
    • src/main/scala/au/csiro/variantspark/hail/methods
      • RFModel.scala
  • Removed hail installation from pom.xml

FEAT: Implemented RF class method for fitting the model

FEAT: Implemented RF class method for obtaining importance analysis
from a fitted RF

FEAT: Implemented RF class method for returning oob error

FEAT: Implemented RF class method for obtaining FDR
from a fitted model

FEAT: Implemented RF class method for exporting forest to JSON

REFACTOR: Make RF model available at package level

CHORE: Added type checking to all methods
REFACTOR: Removed FeatureSource and
ImportanceAnalysis classes from core

REFACTOR: Added FeatureSource import so features
can be returned as a class instantiation
REFACTOR: Removed imp analysis and model training

FEAT: Added conversion from feature to RDD (python)

FEAT: Added conversion from feature to RDD (scala)

CHORE: Added type checking
separate wrapper file (#237)

REFACTOR: Updated important_variables and variable_importance
methods to convert to pandas DataFrames
REFACTOR: Removed model training from object instantiation and
updated class to accept a model as a parameter

REFACTOR: Added normalisation as an optional parameter for
variable importance methods

FEAT: Updated variableImportance method to include splitCount in its return value,
as it is required for local FDR analysis, and to pass it back to the python context
from the importAnalysis method of AnalyticsFunctions (#237)
FIX: Update export function to process trees in batches
instead of collecting the whole forest as a map, which
led to OOM errors on large forests
REFACTOR: Refactor to mirror changes to python wrapper

FEAT: Include FDR calculation in unit test
FEAT: Implement function for manhattan plotting negative log p values
FEAT: Add wrapper class for importing covariates

FEAT: Add wrapper class for unioning features and covariates
REFACTOR: Include covariate filtering in manhattan plot function

STYLE: Format with black (#237)
FEAT: Add functions for importing std and transposed CSVs

FEAT: Add function for unioning features and covariates
REFACTOR: Remove python component of converting Feature RDD to pandas

FEAT: Add RDD slice to DF function
REFACTOR: Remove conversion of whole RDD to DataFrame

FEAT: Add function for slicing rows and columns and converting to DF
@NickEdwards7502 added the enhancement, dependencies, java, and python labels on Oct 2, 2024
@NickEdwards7502 self-assigned this on Oct 2, 2024
@NickEdwards7502 marked this pull request as ready for review on October 2, 2024 07:10
* .bgz loader function implemented by Christina
* Update python wrapper to include imputation strategy parameter

* Update scala API to pass imputation strategy to VCFFeatureSource

* Create functions to handle mode and zero imputation strategies

* Added imputation strategy to test cases

* Added imputation strategy to FeatureSource cli

* Remove sparkPar from test cases due to changes in class signature

* Updated DefVariantToFeatureConverterTest to use zeros imputation
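The two imputation strategies mentioned above can be sketched on a per-variant vector of genotype calls. Here -1 marks a missing call; the function names and the missing-value encoding are assumptions for illustration, not the actual VariantSpark implementation.

```python
from collections import Counter

def impute_zeros(calls, missing=-1):
    # "zeros" strategy: treat missing calls as homozygous reference (0).
    return [0 if c == missing else c for c in calls]

def impute_mode(calls, missing=-1):
    # "mode" strategy: replace missing calls with the most frequent
    # observed call for that variant.
    observed = [c for c in calls if c != missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if c == missing else c for c in calls]
```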