Commit
updated scripts/README.md with linkml_trimmer details
Puja Trivedi authored and Puja Trivedi committed Mar 11, 2024
1 parent 7c3d4f6 commit fb015e0
Showing 1 changed file with 16 additions and 17 deletions.
33 changes: 16 additions & 17 deletions bkbit/scripts/README.md
@@ -1,29 +1,28 @@
-# gfftranslator.py
-
-gfftranslator.py is a Python script that generates GeneAnnotation objects from data stored in gff files.
+# linkml_trimmer.py

+linkml_trimmer returns a trimmed version of a linkml model.
 ## Usage

 ```python
-from bkbit.scripts.gfftranslator import gff_to_gene_annotation
+# Step 1: import YamlTrimmer
+from bkbit.scripts.linkml_trimmer import YamlTrimmer

-# input_fname is the name of the input csv file
-# Note: example input data can be found on Allen Teams under Knowledge Graph files. "20230412_subset_genome_annotation.csv"
-input_fname = 'XXX.csv'
+# Step 2: initialize YamlTrimmer Object with a linkml model
+trimmed_model = YamlTrimmer(path_to_linkml_model)

-# data_dir is the directory path where the input csv file exists
-data_dir = ' XXX/XXX/'
+# Step 3: define the classes, slots, and enums that should be included in the trimmed model
+classes = [...] # List of classes to keep
+slots = [...] # List of slots to keep
+enums = [...] # List of enums to keep

-# output_dir is the directory path where all of the generated output files will be saved
-# Note: if output_dir does not exist, gff_to_gene_annotation will create the directory
-output_dir = 'XXX/XXX/'
+# Step 4: call the trim_model function with the selected classes/slots/enums
+# Note: only classes is a required parameter. slots and enums are optional
+trimmed_model.trim_model(classes, slots, enums)

-gff_to_gene_annotation(input_fname, data_dir, output_dir)
+# Step 5: call the serialize function to produce trimmed linkml model
+trimmed_model.serialize()
 ```

 ## Notes

-1. Input csv file
-a. Each row in the csv file contains a url to the .gff file as well as additional attributes to describe the dataset. The csv file must contain the following columns: authority, label, taxon_local_unique_identifier, version, gene_identifier_prefix, url.
-2. Generated files
-a. For each .gff file 3 files will be generated: (i) The raw data downloaded from the url provided will be saved as a csv file in the 'data_dir' directory provided. (ii) The parsed and cleaned data will be saved as a csv file in the 'output_dir' directory provided. (iii) The initialized GeneAnnotation objects will be saved as a list of json dictionaries in a json file in the 'output_dir' directory provided.
+1. To produce bican_biolink.yaml call trim_model with classes = ['gene', 'genome', 'organism taxon', 'thing with taxon', 'material sample', 'procedure', 'entity', 'activity', 'named thing']
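
To illustrate what "trimming" a model to a list of classes can mean, here is a minimal sketch on a plain Python dictionary. It is not the `YamlTrimmer` implementation: the helper `trim_schema`, the toy schema layout, and the policy of keeping every slot used by a kept class (and every enum used as a range by a kept slot) are all assumptions made for illustration. The actual rules applied by `trim_model` may differ.

```python
def trim_schema(schema, classes, slots=(), enums=()):
    """Return a reduced copy of a toy schema (hypothetical helper, not bkbit code).

    Only `classes` is required; `slots` and `enums` are optional extras to keep.
    """
    kept_classes = {name: schema["classes"][name] for name in classes}

    # Keep explicitly requested slots plus every slot used by a kept class.
    slot_names = set(slots)
    for definition in kept_classes.values():
        slot_names.update(definition.get("slots", []))
    kept_slots = {
        name: schema["slots"][name]
        for name in sorted(slot_names)
        if name in schema["slots"]
    }

    # Keep explicitly requested enums plus enums used as the range of a kept slot.
    enum_names = set(enums)
    for definition in kept_slots.values():
        if definition.get("range") in schema["enums"]:
            enum_names.add(definition["range"])
    kept_enums = {
        name: schema["enums"][name]
        for name in sorted(enum_names)
        if name in schema["enums"]
    }

    return {"classes": kept_classes, "slots": kept_slots, "enums": kept_enums}


# Toy schema invented for this example; it is not the Biolink model.
toy_schema = {
    "classes": {
        "named thing": {"slots": ["id", "category"]},
        "gene": {"is_a": "named thing", "slots": ["symbol"]},
        "disease": {"is_a": "named thing", "slots": ["severity"]},
    },
    "slots": {
        "id": {"range": "string"},
        "category": {"range": "string"},
        "symbol": {"range": "string"},
        "severity": {"range": "SeverityEnum"},
    },
    "enums": {"SeverityEnum": {"permissible_values": ["mild", "severe"]}},
}

trimmed = trim_schema(toy_schema, classes=["gene", "named thing"])
```

In this toy case the `disease` class, its `severity` slot, and `SeverityEnum` are dropped because nothing in the kept classes refers to them.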
