mark some pins.

scripts

retrieve_genome_assembly.py

Download a genome assembly, initially the human reference genome GRCh38 (called hg38 by Dfam), from the UCSC Genome Browser.

Open question: should the genome sequences be split into smaller files?

The data will look like the following:

sequence format: chr1: 1 - 100000, NNNNNN
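A minimal sketch of how a chromosome sequence could be cut into records of the form shown above. The function names, the 100,000-base window size, and the choice of 1-based inclusive coordinates are assumptions made for illustration; they are not taken from `retrieve_genome_assembly.py`.

```python
def split_sequence(chrom, sequence, window=100_000):
    """Yield (chrom, start, end, subsequence) tuples.

    Coordinates are 1-based and inclusive (an assumption, chosen to match
    the "chr1: 1 - 100000" example). The last window may be shorter.
    """
    for offset in range(0, len(sequence), window):
        chunk = sequence[offset:offset + window]
        yield chrom, offset + 1, offset + len(chunk), chunk


def format_record(chrom, start, end, subsequence):
    """Render one window in the 'chr1: 1 - 100000, NNNNNN' layout."""
    return f"{chrom}: {start} - {end}, {subsequence}"
```

Writing each window to its own file would be one way to answer the file-splitting question, at the cost of many small files per chromosome.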

retrieve_annotation.py

Download repeat annotations from Dfam and generate a subset of the annotations by selecting the desired repeat family or subtype. The selected annotations are saved as the repeat boundaries to create the sequence segmentation dataset.

label format: chr1 start end subtype
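The subset selection can be sketched as follows, assuming the annotations have already been parsed into `(chrom, start, end, subtype)` tuples. The real Dfam files carry more columns, and the parsing step is not shown; `select_annotations`, `format_label`, and the set-based lookup are hypothetical helpers, not the contents of `retrieve_annotation.py`.

```python
def select_annotations(records, wanted_subtypes):
    """Keep only records whose subtype is in wanted_subtypes.

    records: iterable of (chrom, start, end, subtype) tuples (assumed layout).
    Returns the kept records sorted by chromosome name and start position.
    """
    wanted = set(wanted_subtypes)
    kept = [rec for rec in records if rec[3] in wanted]
    return sorted(kept, key=lambda rec: (rec[0], rec[1]))


def format_label(chrom, start, end, subtype):
    """Render one label row in the 'chr1 start end subtype' layout."""
    return f"{chrom} {start} {end} {subtype}"
```

Matching on an explicit set of subtype names avoids guessing a family from a name prefix, since members of one family do not necessarily share one.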

generate_dataset.py

Dataset generation script with arguments for the different dataset generation subtasks. We first choose the LTR family as the training dataset. LTR elements vary in length from 100 bp to 5 kb, so if training does not work well, this length variation is one of the possible causes we should consider.
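As an illustration of how repeat boundaries could become segmentation targets, the sketch below turns boundaries into one 0/1 label per base of a sequence window, and filters boundaries by length. Both helpers, the binary labeling scheme, and the 1-based inclusive coordinates are assumptions; `generate_dataset.py` may encode labels differently.

```python
def filter_by_length(boundaries, min_len=100, max_len=5000):
    """Keep (start, end) pairs whose inclusive length is within the range.

    The defaults mirror the 100 bp to 5 kb range noted for LTR elements.
    """
    return [(s, e) for s, e in boundaries if min_len <= e - s + 1 <= max_len]


def label_window(window_start, window_end, boundaries):
    """Return one label per base in the window: 1 inside a repeat, else 0.

    All coordinates are 1-based and inclusive (assumed). Boundaries that
    extend past the window are clipped to it.
    """
    labels = [0] * (window_end - window_start + 1)
    for start, end in boundaries:
        lo = max(start, window_start)
        hi = min(end, window_end)
        for pos in range(lo, hi + 1):  # empty when there is no overlap
            labels[pos - window_start] = 1
    return labels
```

Inspecting the length distribution of the kept boundaries before training is a cheap way to check whether length variation is likely to matter.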

utils.py

Project library module.
