Skip to content

Latest commit

 

History

History
276 lines (187 loc) · 18.7 KB

README.md

File metadata and controls

276 lines (187 loc) · 18.7 KB

VirusRecom: Detecting recombination of viral lineages using information theory

1. Download and install

VirusRecom is developed based on Python 3, and you can get and install the VirusRecom in a variety of ways.

1.1. pip method (recommend)

virusrecom has been distributed to the standard library of PyPI (https://pypi.org/project/virusrecom/), and the latest version can be easily installed by the tool pip.

Firstly, download Python3 (https://www.python.org/), and install Python3 and pip tool, then,

pip install virusrecom
virusrecom -h

1.2. Or conda method

virusrecom has been distributed to bioconda (https://anaconda.org/bioconda/virusrecom), and the latest version can be installed using the tool conda.

# (1) add bioconda origin
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# (2) install virusrecom
## (i) create a separate environment for virusrecom (recommend)
conda create -n virusrecom_env python=3.7  # python >=3.5 but != 3.8  
conda activate virusrecom_env
conda install virusrecom  # or "conda install bioconda::virusrecom"

## (ii) or installation without creating separate environment (slow)
conda install virusrecom  # or "conda install bioconda::virusrecom"

# (3) view the help documentation
virusrecom -h

1.3. Or local installation

In addition to the pip and conda methods, you can also install virusrecom manually using the file setup.py.

Firstly, download this repository, then, run:

python setup.py install
virusrecom -h

1.4. Or directly run the source code

virusrecom can also be run using the source code without installation. First, download this repository, then, install the required python environment of virusrecom:

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

finally, run virusrecom by the file main.py. Please view the help documentation by python main.py -h.

1.5. Or use the binary files

For the two earlier release packages (versions v1.0 and v1.1), you can also directly run the binary files of virusrecom without installation. The binary files are provided at https://github.com/ZhijianZhou01/virusrecom/releases.

In general, the executable file of virusrecom is located at the main folder. Then, running the virusrecom.exe (windows system) or virusrecom (Linux or MacOS system) to start. If you could not get permission to run virusrecom on Linux system or MacOS system, you could change permissions by chmod -R 775 Directory or chmod -R 777 Directory.

2. Getting help

virusrecom is a command-line-interface program, users can get help documentation of the software by entering virusrecom -h or virusrecom --help .

Tip: since version 1.1, virusrecom optimizes the parameters of input-file, which is slightly different from virusrecom v1.0.

The simple help documentation of virusrecom v1.1.3 is as follows.

Parameter Description
-h, --help Show this help message and exit.
-a ALIGNMENT Aligned sequence file (*.fasta). Note, each sequence name requires containing lineage mark.
-ua UNALIGNMENT Unaligned (non-alignment) sequence file (*.fasta). Note, each sequence name requires containing lineage mark.
-at ALIGN_TOOL Program used for multiple sequence alignments (MSA).
-iwic INPUT_WIC Using the already obtained WIC values of reference lineages directly by a *.csv input-file.
-q QUERY Name of query lineage (usually potential recombinant), such as ‘-q xxxx’. Besides, ‘-q auto’ can scan all lineages as potential recombinant in turn.
-l LINEAGES Path of a text-file containing multiple lineage marks.
-g GAP Reserve sites containing gaps(-) in analyses? ‘-g y’ means to reserve, and ‘-g n’ means to delete.
-m METHOD Method for site scanning. ‘-m p’ uses polymorphic sites only, ‘-m a’ uses all the sites.
-w WINDOW Number of nucleotides sites per sliding window. Note: if the ‘-m p’ has been used, -w refers to the number of polymorphic sites per windows.
-s STEP Step size of the sliding window. Note: if the ‘-m p’ has been used, -s refers to the number of polymorphic sites per jump.
-mr MAX_REGION The maximum allowed recombination region. Note: if the ‘-m p’ method has been used, it refers the maximum number of polymorphic sites contained in a recombinant region.
-cp PERCENTAGE The cutoff threshold of proportion (cp, default: 0.9) used for searching recombination regions when mWIC/EIC >= cp, the maximum value of cp is 1.
-cu CUMULATIVE Simply using the max cumulative WIC of all sites to identify the major parent. Off by default. If required, specify ‘-cu y.
-b BREAKPOINT Possible breakpoint scan of recombination. ‘-b y’ means yes, ‘-b n’ means no. Note: this option only takes effect when ‘-m p’ has been specified.
-bw BREAKWIN The window size (default: 200) used for breakpoint scan. The step size is fixed at 1. Note: this option only takes effect when ‘-m p -b y’ has been specified.
-t THREAD Number of threads (or cores) for calculations, default: 4.
-y Y_START Starting value (default: 0) of the Y-axis in plot diagram.
-le LEGEND The location of the legend, the default is adaptive. '-le r' indicates placed on the right.
-owic ONLY_WIC Only calculate site WIC value. Off by default. If required, please specify ‘-owic y’.
-e ENGRAVE Engraves file name to sequence names in batches. By specifying a directory containing one or multiple sequence files (*.fasta).
-en EXPORT_NAME Export all sequence name of a *.fasta file.
-o OUTDIR Output directory to store all results.
--block BLOCK_SIZE Specifies the maximum number of sites per sub-block, different sub-blocks in sequence file will be sequentially loaded to calculate WIC. Default: 40000 sites.
--no_wic_fig Do not draw the image of WICs.
--no_mwic_fig Do not draw the image of mWICs.

For detailed documentation, please refer to Manual of VirusRecom v1.1.3

For more information about the algorithm of virusrecom, please refer to the publication of virusrecom.

3. Example of usage

The sequences data for test in the documentation was stored at https://github.com/ZhijianZhou01/virusrecom/tree/main/example.

Note, the recombination_test_data.zip in directory example is used for virusrecom v1.0, not virusrecom v1.1+.

In this demonstration, the test data is from the the recombination_test_data_v1.1.zip provided in the directory example.

3.1. Aligned input-sequences

If the input sequence-data has been aligned, and it should be loaded via the -a parameter. Multiple sequence alignments (MSA) can be pre-completed by many programs, this is not introduced. Now, let's focus on the directory aligned_input_sequences in the file recombination_test_data_v1.1.zip.

(1) An aligned sequence-file named alignment_lineages_data.fasta, which including multiple sequences from the query lineage and other reference lineages.

(2) A text-file named reference_lineages_name.txt, which including the names (marks) of these reference lineages.

     reference_lineage_1
     reference_lineage_2
     reference_lineage_3
     reference_lineage_4
     reference_lineage_5
     reference_lineage_6
     reference_lineage_7
     reference_lineage_8
     reference_lineage_9

Note, these marks of reference lineages should also appear in sequence names of the file alignment_lineages_data.fasta. The mark of each reference lineage should be unique, otherwise, there will be duplicate matches in subsequent analysis.

Before running the command of VirusRecom, let's think about the search strategy for recombination events. Firstly, we use only polymorphic sites considering that sequences from these lineages are highly similar, which means that the parameter -m p needs to be specified. Secondly, we do not consider gap-containing sites in this test and use the parameter -g n. Instead, if you consider these gap sites, you need to use the parameter -g y. Next, in the first run, let's try first with a window size of 100 and a step size of 20. Of note the value of “size” at this time represents the number of polymorphic sites because the -m p parameter has been specified. For the two parameters -cp and -mr, we use the default value of 0.9 and 1000 in this test. Finally, we specify a folder to save the results by parameter -o.

Then, switch the current directory to aligned_input_sequences, and run the following command (an example) to detect recombination events in query lineage:

virusrecom -a alignment_lineages_data.fasta -q query_recombinant -l reference_lineages_name.txt -g n -m p -w 100 -s 20 -o outdir

Note: (1) if the current directory is not switched to aligned_input_sequences, the file and directory path in command need the absolute paths instead of relative paths. (2) the string “query_recombinant” in command is the corresponding mark of query lineage in the file alignment_lineages_data.fasta.

After the run is complete, in the directory outdir, there are three subdirectories and two aggregated reports:

outdir.png

(1) In the directory run_record, if -g n is specified, and the file Record_of_deleted_gap_sites_*.txt containing all the gap sites will be created. Besides, If -m p is specified, and the file Record_of_same_sites_in_aligned_sequence*.txt containing all the same sites will be created.

(2) In the directory WICs_of_sites, the file *_site_WIC_from_lineages.pdf, *_site_WIC_from_lineages.xlsx and the file *_site_WIC.csv are used to record the WIC value of each site.

(3) In the directory WICs_of_slide_window, the file *_mWIC_from_lineages.xlsx and the file *_mWIC_from_lineages.pdf are used to record the mean WIC of each sliding window.

recombination_step4.png

The user can fine-tune the window size and step size according to the density of points in the generated graph. In general, very dense points means that the noise is too high and the window size can be increased appropriately in next scan.

In addition to the three sub-directories above, VirusRecom provides two summary files. The file Possible_recombination_event_conciseness.txt only retains results of recombination events with p-values less than 0.05.

Possible major parent: reference_lineage_1(global mWIC: 1.8976186779157704)

Other possible parents and significant recombination regions (p<0.05):
reference_lineage_2	7237 to 11539(mWIC: 1.9553354371515168), p_value: 7.831109305531908e-06	

Significance test of recombinant regions using Mann-Whitney-U test with two-tailed probabilities, p-value less than 0.05 indicates a significant difference.

In this output report, the major parent of query lineage was reference_lineage_1 and the minor parent was reference_lineage_2, and the recombination region was site 7237 to 11539 and the p-value was 7.83e-06. The identified recombination event was relatively close to the actual (from site 7333 to 11473 in the genome), and the error of the recombination boundary is also acceptable.

In fact, Possible_recombination_event_conciseness.txt is interpretations of the recombination information contained in *_mWIC_from_lineages.pdf. Although VirusRecom shows a good balance between precision and recall in simulated data, false positive or false negatives sometimes occur. Therefore, for the identification results from VirusRecom, users can make own judgment.

Besides, the output file Possible_recombination_event_detailed.txt shows those results with p-values greater than 0.05. Tip: recombination events with p-values over 0.001 are less reliable.

If -b y is specified, then VirusRecom will perform the search of recombination breakpoint and plot. For example:

virusrecom -a alignment_lineages_data.fasta -q query_recombinant -l reference_lineages_name.txt -g n -m p -w 100 -s 20 -b y -bw 200 -o outdir

Tip: (1) -b y only takes effect when -m p has been specified. (2) the step size of breakpoint search is fixed to 1.

The negative logarithm of p-value in each site is in the file *_-lg(p-value)_for_potential_breakpoint.pdf and the file *_-lg(p-value)_for_potential_breakpoint.xlsx.

breakpoint.jpg

The highest peak (the highest −lgP value) indicated the possible recombination breakpoint.

3.2. Unaligned input-sequences

VirusRecom can also handle unaligned input-sequences. In this case, multiple sequence alignment is performed by calling external program. In virusrecom v1.1, mafft, muscle, and clustal-omega is supported. It is worth mentioning that VirusRecom call them from the system path, so they need to be installed on the machine beforehand.

For the example data in directory unaligned_input_sequences, run the following command:

virusrecom -ua unalignment_lineages_data.fas -at mafft -q query_recombinant -l reference_lineages_name.txt -g n -m p -w 100 -s 20 -o outdir

Note: (1) -at mafft means to call mafft in the system path, and the alignment strategy is auto. Besides, using -at muscle to call muscle and using -at clustalo to call clustal-omega. (2) the string query_recombinant in command is the corresponding mark of query lineage in the file unalignment_lineages_data.fas.

The interpretation of the output result is consistent with section 3.1.

3.3. Non-lineage data

In VirusRecom, the reference lineage is allowed to contain only one single sequence. Under this condition, mWIC value of the fragment is essentially a multiple of shared identity. If -g n is used in the calculation, the mWIC is twice as large as shared identity. If -g y is used in the calculation, the mWIC is $\log_2{5}$ as large as shared identity.

Of noted, for recombination analysis without lineage data, the additional feature is only recommended for non-highly similar sequences and the user can use it to draw an identity point map.

The test data is in directory non_lineage_data of the file recombination_test_data_v1.1.zip.

The Delta-CoV HNU1-1 is a known recombinant from SpCoV HKU17-USA and ThCoV HKU12, and the break points were identified at genome positions nt 21017 and 25056, which is jointly identified and confirmed by RDP3 and Simplot by Wang et al., 2022.

Considering that they are not highly similar sequences, we use all sites (-m a) in the alignment. Then, we use a larger window value, and run following command:

virusrecom -a alns.fasta -q HNU1-1 -l alns_seq_taxon.txt -g n -m a -w 800 -s 100 -cp 0.7 -mr 6000 -le r -o output

The mWIC from reference lineages is as follows:

hnu1-1.jpg

Note, because each “lineage” contains only one sequence and -g n is used in the example, the mWIC in the picture is actually twice the size of “sequence identity”.

The possible recombination event identified by VirusRecom is as follows:

Possible major parent: HKU17-USA(global mWIC: 1.5914816042426252)  

Other possible parents and significant recombination regions (p<0.05): 
HKU12	20720 to 25297(mWIC: 1.8039433490697028), p_value: 2.783880536189705e-204

The possible major parent of HNU1-1 is HKU17-USA and minor parent is HKU12, and the recombination region is about 20720-25297 nt in the alignment.

4. Common questions

4.1. Default values of parameter

For the value of a parameter, if not specified, the software uses the default value. However, the default value is not suitable for all data. In addition to window size (-w) and step size (-s) of sliding window, values of -cp and -mr also require users to adjust based on the data.

When VirusRecom runs, the value of each parameter is printed printed on the screen and you can check them. What is more, users should try different values in multiple runs, which will effectively reduce false positives and false negatives.

4.2. How to set the appropriate window size and step size?

For the recombination analysis using polymorphic sites (-m p in virusrecom), the following is recommended based on our limited experience,

Number of polymorphic sites in alignment window size step size
polymorphic sites <= 2000 4% ~ 6% of all polymorphic sites 10% ~ 20% of the window size
polymorphic sites > 2000 >= 100 10% ~ 20% of the window size

Note, too large window size can't be used for the alignment with too few polymorphic sites.

4.3. How to mark lineage in sequence name?

Typically, this is part of the data preparation. In virusrecom v1.1, users can easily get it done via -e parameter. The -e parameter can engrave file-name to sequence names in batches. The example is as follows:

virusrecom -e input_directory -o outdir

Tip: The directory input_directory can contain multiple fasta files, and each fasta file can contain multiple sequences. After the running, finally, each sequence name will contain its file-name.

Therefore, if the file-name of fasta file is a lineage name, the lineage name can be written into the sequence name in batches.

4.4. How to change the color scheme in an image?

If you own programming skills, you can directly modify the order of the colors in the plt_corlor_list.py file. If not, you can use output matrix provided by VirusRecom, and they are usually suffixed with .xlsx or .csv.

5. Citation

Zhou ZJ, Yang CH, Ye SB, Yu XW, Qiu Y, Ge XY. VirusRecom: an information-theory-based method for recombination detection of viral lineages and its application on SARS-CoV-2. Brief Bioinform. 2023 Jan 19;24(1):bbac513. doi: 10.1093/bib/bbac513. PMID: 36567622.