exclude duplicate UniProt proteomes and relax protein count filter #251

Open
realmarcin opened this issue Sep 18, 2024 · 1 comment

@realmarcin
Contributor

I downloaded proteome lists through the UniProt query and download links. This gives a smaller set of reference + other proteomes that, according to the UI, does not include duplicated proteomes. The number is similar to what we are ingesting now. I checked this data and there are some very low and very high outlier counts. Our > 1000 protein filter was aggressive and excluded a lot of data (we used it because we had too many proteomes, but most of those are duplicates).

Todo:

@realmarcin
Contributor Author

Currently in kg-microbe-function, NCBITaxon:1898103 has 40 linked proteomes, and according to UniProt, UP000235024 is the reference proteome:

UP000235024

D SELECT subject
FROM edges
WHERE object = 'NCBITaxon:1898103' AND subject LIKE 'Proteomes:%';
┌───────────────────────┐
│ subject │
│ varchar │
├───────────────────────┤
│ Proteomes:UP000235024 │
│ Proteomes:UP000264384 │
│ Proteomes:UP000267091 │
│ Proteomes:UP000273204 │
│ Proteomes:UP000318886 │
│ Proteomes:UP000321170 │
│ Proteomes:UP000321575 │
│ Proteomes:UP000323901 │
│ Proteomes:UP000473006 │
│ Proteomes:UP000486387 │
│ Proteomes:UP000517624 │
│ Proteomes:UP000520548 │
│ Proteomes:UP000540725 │
│ Proteomes:UP000555766 │
│ Proteomes:UP000664379 │
│ Proteomes:UP000664772 │
│ Proteomes:UP000664804 │
│ Proteomes:UP000674046 │
│ Proteomes:UP000696334 │
│ Proteomes:UP000697603 │
│ Proteomes:UP000698183 │
│ Proteomes:UP000714254 │
│ Proteomes:UP000720119 │
│ Proteomes:UP000725274 │
│ Proteomes:UP000725850 │
│ Proteomes:UP000731274 │
│ Proteomes:UP000733931 │
│ Proteomes:UP000736971 │
│ Proteomes:UP000739626 │
│ Proteomes:UP000745925 │
│ Proteomes:UP000746928 │
│ Proteomes:UP000748029 │
│ Proteomes:UP000753692 │
│ Proteomes:UP000757370 │
│ Proteomes:UP000759488 │
│ Proteomes:UP000771717 │
│ Proteomes:UP000782948 │
│ Proteomes:UP000784857 │
│ Proteomes:UP000808344 │
│ Proteomes:UP000811369 │
├───────────────────────┤
│ 40 rows │
└───────────────────────┘

The four files in the gdrive link above cover the bacteria/archaea and reference/other proteome subsets. This is the better list to ingest because it excludes duplicates. I can verify that the reference proteome ID for this NCBITaxon is in the bacteria reference TSV file:

grep UP000235024 *
proteomes_AND_superkingdom_Bacteria_AND_2024_09_18_reference.tsv:UP000235024 Rhodocyclaceae bacterium 1898103 2909 C:78.7%[S:78.0%,D:0.7%],F:0.5%,M:20.7%,n:569 Standard GCA_002863805.1

So I think the easiest solution would be to take these four TSVs and use their proteome ID column for lookup and filtering. If a proteome ID in our download is found in the list, we ingest it; otherwise we reject it.
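A minimal sketch of that lookup, not the actual kg-microbe transform code. It assumes the proteome ID is the first tab-separated column of each file (as in the grep output above) and that IDs look like `UP` followed by nine digits; the function names `load_allowed_ids` and `should_ingest` are hypothetical.

```python
import csv
import re

# Assumption: UniProt proteome IDs look like "UP" followed by nine digits.
PROTEOME_ID = re.compile(r"^UP\d{9}$")


def load_allowed_ids(tsv_paths):
    """Collect proteome IDs from the first column of each TSV file."""
    allowed = set()
    for path in tsv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                # Header rows and blank lines do not match the ID pattern,
                # so they are skipped without special handling.
                if row and PROTEOME_ID.match(row[0].strip()):
                    allowed.add(row[0].strip())
    return allowed


def should_ingest(proteome_id, allowed):
    """Return True if the proteome is in the allowlist.

    Accepts either a bare ID ("UP000235024") or a prefixed CURIE
    ("Proteomes:UP000235024").
    """
    bare_id = proteome_id.split(":", 1)[-1]
    return bare_id in allowed
```

With the four downloaded files passed to `load_allowed_ids`, `should_ingest("Proteomes:UP000235024", allowed)` would be expected to return `True`, and any proteome absent from the lists would be rejected.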

We could also try lowering the protein count threshold: > 100 and < 10000 proteins would be ideal, but we would need to check whether this would actually increase the size of the transform relative to what we had. We currently have 17147 proteomes with > 1000 proteins (but many are duplicates that are neither Reference nor Other), and the total across these 4 files is 50793, but most are small, with < 1000 proteins.
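To check how a different threshold would change the size of the ingest, something like the following could count the proteomes that fall inside a given range. This is only a sketch: it assumes each TSV has a header row with a column named `Protein count`, which may not match the actual download, and `count_proteomes_in_range` is a hypothetical helper.

```python
import csv


def count_proteomes_in_range(tsv_paths, lower, upper, count_column="Protein count"):
    """Return (kept, total): rows with lower < protein count < upper, and all rows.

    Assumption: every file has a header row containing `count_column`.
    A missing column raises KeyError rather than being silently ignored.
    """
    kept = 0
    total = 0
    for path in tsv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                total += 1
                try:
                    protein_count = int(row[count_column])
                except (TypeError, ValueError):
                    # Rows with an empty or non-numeric count are not kept.
                    continue
                if lower < protein_count < upper:
                    kept += 1
    return kept, total
```

Running it once with `(1000, float("inf"))` and once with `(100, 10000)` on the same four files would show how many proteomes each filter keeps.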

@hrshdhgd
