reduce 200x200000 into 200x1000 #23
Comments
Hi @avilella, MDR can perform feature construction to compress some number of features down to a single feature. Theoretically, MDR could do so with thousands of features; practically, MDR works best when passed only up to about 5 features. As such, a common practice with MDR is to exhaustively evaluate all MDR models up to n-way and keep only the best k, where n and k are defined by the user. In your case, k=1000 and maybe n=2 (for example). With 200,000 features, MDR would have to evaluate ~19,999,900,000 two-way models, which is likely outside your computational budget. For that reason, we've developed some feature selection algorithms in the scikit-rebate package that may be better for your use case. The scikit-rebate algorithms can scan your dataset, assign a feature importance score to every feature (in terms of its ability to predict the outcome, potentially through interactions with other features), and select a subset of, say, 1000 features. From there, MDR can more reasonably be used in the way I describe above to explicitly construct new, condensed features from the remaining 1000 features. Hope that helps.
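To make the model-count arithmetic above concrete, here is a minimal sketch using only the Python standard library. The function name `n_way_model_count` is hypothetical and not part of scikit-mdr or scikit-rebate; it just counts how many distinct n-feature combinations an exhaustive search would have to score.

```python
from math import comb


def n_way_model_count(n_features: int, n: int) -> int:
    """Number of distinct n-way feature combinations (hypothetical helper).

    An exhaustive n-way MDR search would fit one model per combination,
    so this is the number of models to evaluate at exactly order n.
    """
    return comb(n_features, n)


# Two-way search over the full 200,000-feature dataset.
full_search = n_way_model_count(200_000, 2)

# Two-way search after feature selection has reduced the set to 1,000 features.
reduced_search = n_way_model_count(1_000, 2)

print(full_search)     # 19999900000
print(reduced_search)  # 499500
```

Filtering down to 1000 features first shrinks the two-way search from about 20 billion models to under half a million, which is why a feature selection pass before MDR is the more practical route here.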
Beautiful! I will try it!
Great. I should note that scikit-rebate may take a while to run on a dataset with 200k features, but there is a
Hi, I have a ChIP-seq style dataset of RPKM values that I want to reduce from 200x200000 into 200x1000, so that I only end up with 1000 variables at the end of the MDR process, for my 200 records.
What would be the recommended way to use scikit-mdr for this task?