
Failure to extend header parameters from tables/output/annotation_fields for 'vembrane table --header "{params.config[header]}" "{params.config[expr]}' #333

Open
delpropo opened this issue Oct 22, 2024 · 3 comments

Comments

@delpropo

In table.smk, the rule vembrane_table calls get_vembrane_config to get the parameters for the vembrane header table. This should add the annotation fields from config.yaml.

Example from config.yaml

tables:
  activate: true
  output:
    annotation_fields:
      - BIOTYPE
      - LoFtool
      - REVEL
    event_prob: true
    observations: true
    short_observations: true
  generate_excel: false

get_vembrane_config appends the fields with the following code, the fields being BIOTYPE, LoFtool, and REVEL in this example.

    annotation_fields = get_annotation_fields_for_tables(wildcards)
    append_items(
        annotation_fields,
        rename_ann_fields,
        "ANN['{}']".format,
        lambda x: x.lower(),
    )
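To make the mapping concrete, here is a minimal sketch of what this step produces, assuming append_items simply pairs each formatted expression with its lowercased header (the helper below is a hypothetical stand-in, not the workflow's actual implementation):

```python
# Hypothetical stand-in for the workflow's append_items helper: pair each
# vembrane expression (e.g. "ANN['BIOTYPE']") with a lowercased column header.
def append_items(fields, target, expr_fmt, header_fmt):
    for field in fields:
        target[expr_fmt(field)] = header_fmt(field)

rename_ann_fields = {}
append_items(
    ["BIOTYPE", "LoFtool", "REVEL"],  # annotation_fields from config.yaml
    rename_ann_fields,
    "ANN['{}']".format,
    lambda x: x.lower(),
)
# rename_ann_fields now maps "ANN['BIOTYPE']" -> "biotype",
# "ANN['LoFtool']" -> "loftool", "ANN['REVEL']" -> "revel"
```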

sort_order is created and contains "ANN['SYMBOL']", "ANN['IMPACT']", etc., but it is a limited list. It is extended with additional values, but the result is still not a complete list of all values available as vembrane ANN types.

Complete list of custom ANN types:
https://github.com/vembrane/vembrane/blob/main/docs/ann_types.md

Finally, sorted_columns_dict is created to return the values used in the vembrane_table rule. This removes any values that were added to columns_dict but are not in sort_order. It would remove BIOTYPE and LoFtool, but not REVEL, since REVEL is in sort_order.

    # sort columns, keeping only those in sort_order
    sorted_columns_dict = {k: columns_dict[k] for k in sort_order if k in columns_dict}
    join_items = ", ".join
    return {
        "expr": join_items(sorted_columns_dict.keys()),
        "header": join_items(sorted_columns_dict.values()),
    }
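The dropping behaviour can be reproduced in isolation with the snippet below (column contents and the truncated sort_order are illustrative; only REVEL of the three configured fields appears in sort_order, so BIOTYPE and LoFtool are silently discarded):

```python
# Minimal reproduction: any configured column missing from sort_order is
# silently dropped from the expr/header parameters passed to vembrane table.
columns_dict = {
    "ANN['SYMBOL']": "symbol",
    "ANN['BIOTYPE']": "biotype",   # configured, but not in sort_order
    "ANN['LoFtool']": "loftool",   # configured, but not in sort_order
    "ANN['REVEL']": "revel",
}
sort_order = ["ANN['SYMBOL']", "ANN['IMPACT']", "ANN['REVEL']"]  # abridged

# sort columns, keeping only those in sort_order
sorted_columns_dict = {k: columns_dict[k] for k in sort_order if k in columns_dict}
join_items = ", ".join
params = {
    "expr": join_items(sorted_columns_dict.keys()),
    "header": join_items(sorted_columns_dict.values()),
}
# params["expr"] is "ANN['SYMBOL'], ANN['REVEL']" -- BIOTYPE and LoFtool are gone
```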

For a variants.fdr-controlled.tsv file, the final TSV output consists only of the following columns and cannot be extended.

symbol
impact
hgvsp
hgvsc
consequence
clinical significance
gnomad genome af
exon
revel
chromosome
position
reference allele
alternative allele
protein position
protein alteration (short)
canonical
mane_plus_clinical
prob: absent
prob: artifact
prob: variant
patient: allele frequency
patient: read depth
patient: short ref observations
patient: short alt observations
patient: observations
id
gene
hgvsg
feature
end position

@dlaehnemann
Contributor

Thanks for digging into this and for taking the time to report this thoroughly!

We recently changed how this works, and it seems we haven't gotten to a clean solution yet. REVEL and LoFtool should automagically work (as in: be included in the datavzrd report table) when they are specified via the annotations: vep: final_calls: plugins: list in the config.yaml. For other (more standard) ANN entries like BIOTYPE, I'm not sure how one should request those in the config.yaml, because we:

  1. Are looking to probably deprecate the whole tables: entry.
  2. Want to have a stable sort order of entries.
  3. Don't want to over-cram the default report table with all possible annotations, but rather focus on something like "the most informative in most cases"™.

So do you have a clear idea of what you would want in this case, and what you would want in a more general case? Maybe this can help us figure out how exactly we want to eventually resolve this.

@delpropo
Author

vembrane table creates multiple lines for each variant based on annotation information. Because the files can be so large, I read the TSV, turn it into a Zarr store, and then process it in sections while also doing some initial filtering and grouping. I don't generally use the config.yaml file for filtering because I have to change the criteria so often. I have had to use it to remove some chromosomal locations that caused downstream processing issues, as described in this article about problematic regions.

There can be multiple annotations for a single variant, which means some variants have dozens of lines after vembrane is used to create the table. To get around this, I group by each individual variant for the file (subject or family) and include all columns that are unique for each variant ("chromosome", "position", "allele", max_af, etc.). This ensures that each unique variant has its own line. Any additional annotation that has multiple values is aggregated into a list.
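As a rough illustration of that collapse step (not my actual code; column names follow the TSV headers above), a pandas version looks like this:

```python
import pandas as pd

# Toy vembrane-style table: chr1:100 A>T has two annotation rows (two
# transcripts), chr2:200 C>G has one.
df = pd.DataFrame({
    "chromosome": ["chr1", "chr1", "chr2"],
    "position": [100, 100, 200],
    "alternative allele": ["T", "T", "G"],
    "consequence": ["missense_variant", "intron_variant", "stop_gained"],
})

# Group on the variant-identifying columns; aggregate multi-valued
# annotation columns into lists so each variant ends up on one line.
key = ["chromosome", "position", "alternative allele"]
collapsed = df.groupby(key, as_index=False).agg({"consequence": list})
# collapsed has 2 rows; chr1:100 carries both consequences in one list
```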

I currently process data using the flow chart method below. It can require allocating a lot of memory due to the size of the tsv files, but that is about the only issue.

I had thought about some alternative methods but those would have used the vcf file. The vcf files are already formatted for vembrane and I just needed to get the data out. The vembrane table rule already did this, and I can easily reformat using config.yaml.

graph TD
    A[BCF File] -->|vembrane| B[TSV File]
    B -->|xarray| C[Zarr Folder]          
    C -->|reread| H[Process by chromosome]
    H -->|filter, groupby | D[Processed Chr 1]
    H -->|filter, groupby| E[Processed Chr ..]
    D -->|aggregate| F[Save to TSV]
    E -->|aggregate| F[Save to TSV]   
    F -->|join| G[Final TSV File]
    I[Additional TSV files] -->|join| G[Final TSV File]


I think this answers your questions? I attached a copy of one of my config.yaml files for reference since I use a lot more columns than just BIOTYPE. I have also included a test file as an example of the final data exported.

config_test_file.zip

@dlaehnemann
Contributor

A couple of thoughts on how you might get most of this done within the existing workflow:

  1. I think you could do all of your filtering before the vembrane table creation, by providing custom filters in the config.yaml here:
    impact_filter: "ANN['IMPACT'] == 'HIGH'"

    These can use arbitrary Python expressions, basically anything you can do with a vembrane filter command. As vembrane does this by streaming through the file, this should not require a lot of RAM. This will probably not include getting rid of multiple ANN entries per variant (basically per transcript), but that is something that later scripts in the workflow should take care of. Which brings me to:
  2. Have you previously used the report generated by snakemake --report report.zip? It should contain interactive views of the variant sets that you specify via the calling: fdr-control: events: definitions in config.yaml. You can run this command once the workflow has successfully completed; unpack the resulting ZIP file, extract the contained directory, and open the main index.html file. The contained interactive table views (generated with datavzrd) of the variants are what we tailor the vembrane table command to in the workflow.
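On point 1, since vembrane filter expressions are plain Python evaluated per annotation record, a stricter custom filter might look like the following (the field name gnomADg_AF is an assumption here; check your VEP annotation for the exact key):

```python
# A vembrane-style filter expression: keep HIGH-impact annotations with a
# rare gnomAD genome allele frequency. gnomADg_AF is an assumed field name.
expr = "ANN['IMPACT'] == 'HIGH' and (ANN['gnomADg_AF'] or 0.0) < 0.01"

# Rough sketch of per-record evaluation; vembrane itself handles typing,
# missing values, and streaming over the VCF/BCF.
record = {"ANN": {"IMPACT": "HIGH", "gnomADg_AF": 0.0003}}
keep = eval(expr, {}, record)  # True: HIGH impact and AF below 1%
```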

So if you haven't, maybe you can look into the in-workflow vembrane filtering and the report, and see if this meets some of your needs and if you need more things amended in that setup?

Also, if you have suggestions on making any of what I explained here clearer when deploying and configuring the workflow, please let me know. (Useful) documentation is hard...
