Commit
updated scripts/README.md with linkml_trimmer details
Puja Trivedi authored and Puja Trivedi committed Mar 11, 2024
1 parent 7c3d4f6 commit fb015e0
Showing 1 changed file with 16 additions and 17 deletions.
@@ -1,29 +1,28 @@
-# gfftranslator.py
-
-gfftranslator.py is a Python script that generates GeneAnnotation objects from data stored in gff files.
+# linkml_trimmer.py
 
+linkml_trimmer returns a trimmed version of a linkml model.
 ## Usage
 
 ```python
-from bkbit.scripts.gfftranslator import gff_to_gene_annotation
+# Step 1: import YamlTrimmer
+from bkbit.scripts.linkml_trimmer import YamlTrimmer
 
-# input_fname is the name of the input csv file
-# Note: example input data can be found on Allen Teams under Knowledge Graph files. "20230412_subset_genome_annotation.csv"
-input_fname = 'XXX.csv'
+# Step 2: initialize YamlTrimmer Object with a linkml model
+trimmed_model = YamlTrimmer(path_to_linkml_model)
 
-# data_dir is the directory path where the input csv file exists
-data_dir = ' XXX/XXX/'
+# Step 3: define the classes, slots, and enums that should be included in the trimmed model
+classes = [...] # List of classes to keep
+slots = [...] # List of slots to keep
+enums = [...] # List of enums to keep
 
-# output_dir is the directory path where all of the generated output files will be saved
-# Note: if output_dir does not exist, gff_to_gene_annotation will create the directory
-output_dir = 'XXX/XXX/'
+# Step 4: call the trim_model function with the selected classes/slots/enums
+# Note: only classes is a required parameter. slots and enums are optional
+trimmed_model.trim_model(classes, slots, enums)
 
-gff_to_gene_annotation(input_fname, data_dir, output_dir)
+# Step 5: call the serialize function to produce trimmed linkml model
+trimmed_model.serialize()
 ```
 
 ## Notes
 
-1. Input csv file
-a. Each row in the csv file contains a url to the .gff file as well as additional attributes to describe the dataset. The csv file must contain the following columns: authority, label, taxon_local_unique_identifier, version, gene_identifier_prefix, url.
-2. Generated files
-a. For each .gff file 3 files will be generated: (i) The raw data downloaded from the url provided will be saved as a csv file in the 'data_dir' directory provided. (ii) The parsed and cleaned data will be saved as a csv file in the 'output_dir' directory provided. (iii) The initialized GeneAnnotation objects will be saved as a list of json dictionaries in a json file in the 'output_dir' directory provided.
+1. To produce bican_biolink.yaml call trim_model with classes = ['gene', 'genome', 'organism taxon', 'thing with taxon', 'material sample', 'procedure', 'entity', 'activity', 'named thing']
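
To illustrate what trimming a model down to a list of classes can involve, here is a minimal sketch that works on a plain dictionary instead of a real LinkML schema. It is not the bkbit implementation: the `trim` function, the toy model, and the rule of also keeping `is_a` ancestors and the slots listed by kept classes are all assumptions made for this example.

```python
def trim(model, classes, slots=None, enums=None):
    """Return a copy of `model` restricted to the requested elements.

    Hypothetical helper (not part of bkbit): keeps the requested classes,
    their `is_a` ancestors, every slot those classes list, and any
    explicitly requested slots and enums.
    """
    keep_classes = set()
    pending = list(classes)
    while pending:
        name = pending.pop()
        if name in keep_classes or name not in model["classes"]:
            continue
        keep_classes.add(name)
        # Follow the parent link so kept classes stay resolvable.
        parent = model["classes"][name].get("is_a")
        if parent:
            pending.append(parent)

    keep_slots = set(slots or [])
    for name in keep_classes:
        keep_slots.update(model["classes"][name].get("slots", []))

    keep_enums = set(enums or [])
    return {
        "classes": {k: v for k, v in model["classes"].items() if k in keep_classes},
        "slots": {k: v for k, v in model["slots"].items() if k in keep_slots},
        "enums": {k: v for k, v in model["enums"].items() if k in keep_enums},
    }


# Toy model invented for this example; names are illustrative only.
toy_model = {
    "classes": {
        "entity": {"slots": ["id"]},
        "named thing": {"is_a": "entity", "slots": ["name"]},
        "gene": {"is_a": "named thing", "slots": ["symbol"]},
        "disease": {"is_a": "named thing", "slots": ["severity"]},
    },
    "slots": {"id": {}, "name": {}, "symbol": {}, "severity": {}},
    "enums": {"SeverityEnum": {}},
}

# Only `classes` is passed, mirroring the note that slots and enums are optional.
trimmed = trim(toy_model, classes=["gene"])
print(sorted(trimmed["classes"]))
print(sorted(trimmed["slots"]))
```

In this sketch, asking for `gene` alone also retains `named thing` and `entity`, while `disease`, its `severity` slot, and the unused enum are dropped. Whether `YamlTrimmer` resolves dependencies in the same way should be checked against the bkbit source.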