Proteomics
- User has a combined protein abundance-like file (combined_protein.tsv; see https://fragpipe.nesvilab.org/docs/tutorial_fragpipe_outputs.html#combined_proteintsv)
- PK: This may not be the exact output; the FragPipe lab sent us a related but different version of this output
- CT: This is the format we should expect (except we will request that the row names are gene symbols rather than gene|uniprot_id)
- We create a Python CLI tool / API library function that submits a batch request to filter the protein abundance file using a Bystro annotation.
- This requires the user to first authenticate using our existing Python API for authentication
- The batch submission requires:
- user id
- path to the protein abundance file, or a multipart file upload stream of the protein abundance file if it is local (in v1 it is OK to support only one of these)
- the job id
- basename for the output
- the query string
- The batch submission API function will follow the existing create_job function, with the payload containing the filtering job basename and the query string, and the files array containing the list of files (in our case a single file)
- The job is processed through our API server, which submits it via beanstalkd to our cluster. The cluster filters the protein abundance file, writes a new protein abundance file to disk, and returns the path to that file to the user.
- We re-use the code (duplication is OK for v1) from bystro/search/save/handler.py for scrolling through the annotation index
- We pull down the annotation index as in bystro/search/save/handler.py, and add the protein abundance values
- We persist these values on EFS and return a path to the results
- We create a Python CLI/API tool/library function to allow the user to pull down the filtered protein abundance results using the user id and the job basename.
- Will require authentication just like item 2. See get_job for another reference point.
- In this function we'll want to either persist the file stream from the API server to a local file (if the user specifies a local file path) or write it to stdout. A sketch of both the submit and fetch functions is given below.
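Below is a minimal sketch of the submit-and-fetch flow described above, modeled on the existing create_job/get_job pattern. The host, endpoint paths, payload field names, and the auth_header argument are illustrative assumptions rather than the actual API.

```python
# Hypothetical sketch of the submit-and-fetch flow, modeled on the existing
# create_job/get_job functions. The host, endpoint paths, payload field names,
# and the auth_header argument are illustrative assumptions, not the real API.
import json
import sys

import requests

API_URL = "https://bystro.example.org"  # placeholder host


def submit_proteomics_filter(
    auth_header: dict, user_id: str, job_id: str, basename: str, query: str,
    abundance_path: str,
) -> dict:
    """Submit a batch request to filter a protein abundance file with a Bystro query."""
    payload = {
        "job": json.dumps(
            {
                "user_id": user_id,
                "job_id": job_id,
                "basename": basename,
                "query": query,
            }
        )
    }
    with open(abundance_path, "rb") as fh:
        files = [("files", (abundance_path, fh))]  # a single file in v1
        response = requests.post(
            f"{API_URL}/api/jobs/proteomics_filter/",  # hypothetical endpoint
            headers=auth_header,
            data=payload,
            files=files,
            timeout=300,
        )
    response.raise_for_status()
    return response.json()


def fetch_filtered_abundance(
    auth_header: dict, user_id: str, basename: str, out_path: str | None = None
) -> None:
    """Stream the filtered protein abundance results to a local file or stdout."""
    response = requests.get(
        f"{API_URL}/api/jobs/proteomics_filter/{user_id}/{basename}",  # hypothetical endpoint
        headers=auth_header,
        stream=True,
        timeout=300,
    )
    response.raise_for_status()
    out = open(out_path, "wb") if out_path else sys.stdout.buffer
    try:
        for chunk in response.iter_content(chunk_size=1 << 20):
            out.write(chunk)
    finally:
        if out_path:
            out.close()
```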
- Example user flow: I query exonic (gnomad.genomes.af:<0.1 || gnomad.exomes.af:<0.1)
- I accumulate heterozygotes, homozygotes, missingGenos and refSeq.name2
- I filter my protein abundance table, restricting it to genes found in refSeq.name2 and zeroing out any samples that are reference or missing (or otherwise recording their zygosity/missingness), because I want to know whether my protein dosage reflects normal variation or pQTL or other abundance-shifting mutations in the gene body. A sketch of this filtering step follows.
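A rough sketch of that filtering step, assuming the accumulated annotation results are held as plain Python mappings and that the abundance matrix uses gene symbols as its row index and sample IDs as its columns (per the row-name request above); names and data shapes are illustrative.

```python
# Illustrative-only sketch: restrict the abundance matrix to genes hit by the
# query and zero out samples that are reference or missing for each gene,
# keeping a record of why each value was zeroed.
import pandas as pd

# Accumulated while scrolling the annotation index (heterozygotes, homozygotes,
# missingGenos, refSeq.name2), collapsed here into two illustrative mappings:
carriers_by_gene: dict[str, set[str]] = {}  # e.g. {"GENE1": {"sample1", "sample3"}}
missing_by_gene: dict[str, set[str]] = {}   # e.g. {"GENE1": {"sample2"}}

# Protein abundance matrix: rows are gene symbols, columns are sample IDs
# (assumes non-sample metadata columns have already been dropped).
abundance = pd.read_csv("combined_protein.tsv", sep="\t", index_col=0)

# 1. Keep only genes returned by the query (refSeq.name2 values).
filtered = abundance.loc[[g for g in abundance.index if g in carriers_by_gene]].copy()

# 2. Zero out samples that are reference or missing for each gene, and record
#    zygosity/missingness so a zero can later be interpreted correctly.
status = pd.DataFrame("ref", index=filtered.index, columns=filtered.columns)
for gene in filtered.index:
    carriers = carriers_by_gene.get(gene, set())
    missing = missing_by_gene.get(gene, set())
    for sample in filtered.columns:
        if sample in missing:
            status.loc[gene, sample] = "missing"
        elif sample in carriers:
            status.loc[gene, sample] = "carrier"
    non_carriers = [s for s in filtered.columns if s not in carriers]
    filtered.loc[gene, non_carriers] = 0.0

# `filtered` now holds abundances only for variant carriers; `status` records
# whether each zeroed value reflects a reference genotype or missing data.
```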
A proteomic dataset could be analyzed either in isolation, or jointly with a genomic dataset that describes the same samples. For example, a researcher might ask: "which proteins show the greatest difference in expression between cases and controls?" which would require only proteomic data to answer. Or a researcher could ask: "what is the expression of protein X in samples with any missense mutation in gene Y?" which would require the integration of proteomic and genomic data.
For joint genomic/proteomic analysis, we must be able to join an annotation file to a proteomics dataset on sample_ids. In order for this join to take place, two criteria must be met:
- The user must have permission to access each file.
- The sample_ids must belong to the same universe of discourse. That is, because sample_ids are not necessarily globally unique identifiers, the sample_ids in the annotation file must refer to the same entities as the sample_ids in the proteomics data. If this condition is not met (for example if "sample1" in the annotation file refers to Alice but "sample1" in the proteomics data refers to Bob), then joining the files on sample_id would produce a physically meaningless result.
We therefore must have a concept of a universe of discourse--a set of datasets in which a given sample-id always refers to the same entity-- in order to make joins meaningful.
Within a given universe of discourse, any two datasets may not always have the same set of sample-ids present. The datasets may not even have any sample-ids in common, in which case the join simply results in the empty set but is still physically meaningful.
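To make the join concrete, here is a toy example; the column names and values are made up, and the merge is only meaningful when both tables come from the same study / universe of discourse.

```python
# Toy illustration of joining annotation-derived per-sample data with proteomics
# data on sample_id; all column names and values below are made up.
import pandas as pd

annotation_samples = pd.DataFrame(
    {"sample_id": ["sample1", "sample2"], "has_missense_in_gene_Y": [True, False]}
)
proteomics_samples = pd.DataFrame(
    {"sample_id": ["sample2", "sample3"], "protein_X_abundance": [11.2, 9.8]}
)

# Only meaningful if both tables belong to the same study / universe of discourse.
joined = annotation_samples.merge(proteomics_samples, on="sample_id", how="inner")
# Disjoint sample_ids would simply yield an empty (but still meaningful) result.
print(joined)
```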
First the user must upload a proteomics dataset. We describe how this is currently done, along with a sketch of a proposed amendment.
Presently (9/14/2023) proteomics upload is represented as a triplet:
(proteomics_dataset_type, protein_abundance_file, experiment_annotation_file).
The proteomics_dataset_type is a string describing the experimental protocol and file format. Currently the only supported value is "fragpipe-TMT", representing tandem mass tag experiments as processed by fragpipe.
The protein_abundance_file is the filename of the protein abundance file the user wants to upload. This file is a tsv containing an abundance matrix of dimensions (proteins x samples) along with other metadata columns.
The experiment_annotation_file is a tsv file containing metadata about samples, such as their experimental condition, replicate number, and so on.
The proteomics CLI currently expects to make a POST request to the endpoint /api/jobs/upload_protein/.
The POST request takes the following data payload:
payload = {
    "job": mjson.encode(
        {
            "protein_abundance_file": protein_abundance_filename,
            "experiment_annotation_file": experiment_annotation_filename,
            "proteomics_dataset_type": "fragpipe-TMT",  # we currently only support this format
        }
    )
}
and wraps it in a POST request along with the files to be uploaded. For further details see proteomics/proteomics_cli.py.
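For concreteness, the request might be assembled roughly as follows; the host, auth header, example filenames, and the use of the stdlib json module in place of mjson are assumptions, and proteomics/proteomics_cli.py remains the authoritative implementation.

```python
# Illustrative assembly of the current upload request. The host, auth header,
# and the use of the stdlib json module (in place of mjson) are assumptions;
# see proteomics/proteomics_cli.py for the actual implementation.
import json

import requests

url = "https://bystro.example.org/api/jobs/upload_protein/"  # placeholder host
auth_header = {"Authorization": "Bearer <access token>"}  # from the existing auth flow

payload = {
    "job": json.dumps(
        {
            "protein_abundance_file": "combined_protein.tsv",
            "experiment_annotation_file": "experiment_annotation.tsv",
            "proteomics_dataset_type": "fragpipe-TMT",
        }
    )
}

with open("combined_protein.tsv", "rb") as abundance, open(
    "experiment_annotation.tsv", "rb"
) as annotation:
    files = [
        ("files", ("combined_protein.tsv", abundance)),
        ("files", ("experiment_annotation.tsv", annotation)),
    ]
    response = requests.post(url, headers=auth_header, data=payload, files=files)
response.raise_for_status()
```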
In light of discussion between AK and PON on 9/14/2023, resulting in the findings in the above section "How do proteomics datasets relate to genetic annotations?", the upload POST request should be amended to include the following:
- protein_abundance_file (as before)
- experiment_annotation_file (as before)
- proteomics_dataset_type (as before)
- A name used to refer to the proteomics dataset, i.e. to the triplet (protein_abundance_file, experiment_annotation_file, proteomics_dataset_type). Why can't we just use the triplet itself as that identifier? Not 100% certain of this, but: not all triplets are valid identifiers, only the ones the user has uploaded, so we want to prevent the possibility of invalidly mixing and matching abundance files, annotation files, and dataset type labels from different experiments, which wouldn't be physically meaningful. Moreover, if we refer to a dataset by the triplet itself, then the user has to keep track of that unwieldy name, and the user will probably have a clearer and more informative name for the dataset than the concatenation of its constituent filenames.
- A study name, used to denote the universe of discourse in which the sample-ids co-refer: all datasets that can be joined or compared must belong to the same study. I'm not sure if "study_name" is the most consistent name for this concept; "workspace_name", "sample_universe", or similar might be better.
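A sketch of what the amended payload could look like; the field names for the two new items are proposals only.

```python
# Sketch of the amended upload payload. The field names for the two new items
# ("proteomics_dataset_name", "study_name") are proposals, not settled API.
import json

payload = {
    "job": json.dumps(
        {
            "protein_abundance_file": "combined_protein.tsv",           # as before
            "experiment_annotation_file": "experiment_annotation.tsv",  # as before
            "proteomics_dataset_type": "fragpipe-TMT",                  # as before
            "proteomics_dataset_name": "tmt_run_2023_09",  # user-chosen dataset identifier
            "study_name": "my_study",  # universe of discourse for sample_ids
        }
    )
}
```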
The response to the upload request should indicate:
- Success
- Failure
- If the upload fails, we want to report some reason for the failure (standard error codes or free text?)
(as before)
(as before)
(as before)
(as before)
(as before)
Note that this describes the simplest API where the request blocks until success or failure. A more sophisticated design might immediately return a response acknowledging the receipt of the submission, allowing the user to poll for the current status of the upload by querying against the study name.
Note also that I think we want to avoid returning a path to the uploaded file, which is an implementation detail we should probably hide in favor of referring to the dataset by its name (so that, among other things, internal filenames don't become part of our API that we have to preserve backwards compatibility against).
All actions are authenticated through the Bystro-web backend.
General workflow: the user submits a request. Possible actions are 1) upload proteomics data, 2) associate proteomics data with one or more genetics datasets, 3) filter the proteomics dataset, and 4) fetch a particular proteomics result. Filtering the proteomics dataset generates a new resource (a result on the 'search' property of the proteomics document), which has an associated dataframe (Arrow IPC format) written to disk (API V1) or S3 (API V2), plus metadata uniquely identifying the result (a unique name with the date and time appended). That dataframe can be retrieved using Arrow IPC or pandas. In order to fetch the data, a pre-signed URL must be generated. We should therefore make a pre-signed-URL-generating endpoint, ideally generic to the kind of data (i.e. it works for annotation results as well); a retrieval sketch follows.
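A sketch of how a client might retrieve such a result once a pre-signed-URL endpoint exists; the endpoint path, response shape, and result name are assumptions.

```python
# Hypothetical client-side retrieval of a filtering result via a pre-signed URL.
# The endpoint path, response shape, and result name are assumptions.
import io

import pyarrow.ipc as ipc
import requests

auth_header = {"Authorization": "Bearer <access token>"}

# 1. Ask the backend for a pre-signed URL for a named result (hypothetical endpoint).
meta = requests.get(
    "https://bystro.example.org/api/jobs/proteomics/results/my_filter_2023-09-14/url",
    headers=auth_header,
    timeout=60,
).json()

# 2. Download the Arrow IPC payload from the pre-signed URL and load it.
raw = requests.get(meta["presigned_url"], timeout=300).content
table = ipc.open_file(io.BytesIO(raw)).read_all()  # use ipc.open_stream for the stream format
df = table.to_pandas()
```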
- API V1: no AWS S3 support
- API V2: AWS S3 support