Tons of garbage on opensnp #559
Comments
Hey @chaplin89, thanks for getting in touch and for that list! In our pre-parsing of uploaded files we already try to unzip files and get rid of the "wrong" files (i.e. everything that doesn't look like a "correct" genotyping file); see snpr/app/workers/preparsing.rb, line 114 in 0a1d2aa.
I'll have a think about how we can keep a better eye on it!
I'm wondering what happens in that In any case, the way I would do this in Python is probably to launch a As a side note, I see there's a EDIT
Hey, not sure if you're aware but there's really a lot of garbage there, as OpenSNP is probably not checking what users are uploading.
Here's a normalized list of file types I've found in your db:
I was curious about the EXEs; at least they don't seem to contain viruses. One of them is from a tool called "MyHeritage Family Builder Genealogy Software" and all the rest are called "23andme to FASTA".
It shouldn't be too hard to clean it up and to add some checks after people upload something. I did this analysis using the Linux file utility; I think it could probably be done on the server side as well? Watch out for command injection if you do. A neat improvement would be to have all the files in the same format. I'm attaching a list of files with their format: file_type.csv
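For what it's worth, here is a rough Python sketch of what a server-side check with the file utility could look like without opening the door to command injection. This is not how snpr does it; the function names and the allow-list of MIME types are made up for illustration, and the exact MIME strings file reports can differ between versions. The main point is passing the arguments as a list (so no shell ever sees the user-supplied filename) and putting "--" before the path.

```python
import os
import subprocess

# Hypothetical allow-list (an assumption, not OpenSNP's actual policy):
# plain-text genotyping files plus common archive types.
ALLOWED_MIME_TYPES = {"text/plain", "application/zip", "application/gzip"}


def build_file_command(path):
    """Return the argument list for the `file` utility.

    The path is passed as a single list element, so shell metacharacters
    in an uploaded filename are never interpreted. The "--" ends option
    parsing, so a name starting with "-" is treated as a path.
    """
    return ["file", "--brief", "--mime-type", "--", os.fspath(path)]


def detect_mime_type(path):
    """Ask `file` for the MIME type of one uploaded file."""
    result = subprocess.run(
        build_file_command(path),
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
        timeout=10,           # do not hang on a pathological upload
    )
    return result.stdout.strip()


def looks_acceptable(mime_type):
    """Return True if the detected type is on the (hypothetical) allow-list."""
    return mime_type in ALLOWED_MIME_TYPES
```

An upload handler could then call looks_acceptable(detect_mime_type(path)) and reject or quarantine anything that fails, instead of storing it next to the genotyping files.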
Also the phenotype section doesn't seem very well monitored as someone created a "naked body phenotype" to use it to share a naked picture of himself. Not sure about the scientific value of that lol