No data preprocessing for SorelNet? #30

Open
MariaRigaki opened this issue Mar 29, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@MariaRigaki

MariaRigaki commented Mar 29, 2022

In the Sorel-20M repository, train_network() in train.py calls get_generator(), which initializes the Generator class, which in turn uses the Dataset class, which uses the LMDBReader class. LMDBReader has a function called features_postproc_func which, as far as I understand, applies a logarithmic transformation to the EMBER features before they are used. This chain is not followed when training the LGB model, where the EMBER features are read directly from the numpy arrays and no preprocessing is applied (as expected).
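To make the kind of transformation discussed above concrete, here is a minimal sketch of a sign-preserving logarithmic scaling. It is an illustration only: the name signed_log_transform and the exact formula sign(x) * log(1 + |x|) are assumptions, not a copy of features_postproc_func from the Sorel-20M repository, which should be checked directly.

```python
import math


def signed_log_transform(features):
    """Hypothetical helper: compress each value as sign(x) * log(1 + |x|).

    Assumption: this mirrors the general idea of a logarithmic feature
    post-processing step; it is not taken from the Sorel-20M code.
    """
    return [math.copysign(math.log1p(abs(x)), x) for x in features]


# Illustrative values only, not real EMBER features.
raw = [0.0, 1.0, -1.0, 1000.0]
scaled = signed_log_transform(raw)
```

A transform like this leaves zeros unchanged and shrinks large magnitudes, so a network trained on transformed inputs would see values on a very different scale if it were fed raw features at inference time.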

Looking at the code in secml_malware, I see that the EMBER features are fed directly to the neural network without any preprocessing, and I'm wondering whether this step should be added to the feature extractor.

As a side note, in my testing of the Sorel models and data, if I don't apply the features_postproc_func I get really bad results with the pretrained sorel nets, so I think this is needed.

@zangobot
Collaborator

Thank you for opening the issue!
I will investigate. I naively thought that the models were trained on the plain EMBER features. I will have a look!

@zangobot zangobot added the bug Something isn't working label Mar 29, 2022
@zangobot
Collaborator

Small update: I included the feature post-processing function in the feature extractor (thank you for pointing it out!). I still have to bulk test it on a larger dataset to see whether the performance matches what is described in the paper, but this is already a step forward.
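As a rough illustration of what placing a post-processing step inside a feature extractor can look like, the sketch below wraps an extraction function so that its output always passes through the transform before reaching a model. Every name here (PostprocessingExtractor, toy_extract, signed_log_transform) is hypothetical and does not reflect the actual secml_malware implementation.

```python
import math
from typing import Callable, List


def signed_log_transform(features: List[float]) -> List[float]:
    """Hypothetical post-processing: sign(x) * log(1 + |x|) per feature."""
    return [math.copysign(math.log1p(abs(x)), x) for x in features]


def toy_extract(data: bytes) -> List[float]:
    """Stand-in for a real extractor: file length and number of zero bytes."""
    return [float(len(data)), float(data.count(0))]


class PostprocessingExtractor:
    """Hypothetical wrapper that chains extraction and post-processing."""

    def __init__(
        self,
        extract: Callable[[bytes], List[float]],
        postprocess: Callable[[List[float]], List[float]],
    ) -> None:
        self.extract = extract
        self.postprocess = postprocess

    def __call__(self, data: bytes) -> List[float]:
        # The model only ever sees post-processed features.
        return self.postprocess(self.extract(data))


extractor = PostprocessingExtractor(toy_extract, signed_log_transform)
features = extractor(b"\x00\x00abc")
```

Keeping the transform inside the extractor means callers cannot accidentally skip it, which is the failure mode reported at the top of this thread.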
