Scripts for evaluating the probabilities of recombination and evolution of bNAbs, according to models learned on Ig sequences from either healthy or HIV-infected donors.


Cosimo Lupo © 2018-2023

Free use of the present code is granted under the terms of the GNU General Public License version 3 (GPLv3).


For any issue, question or bug, please write us an email.


The present code, companion of the manuscript "Probabilities of developing HIV-1 bNAb sequence features in uninfected and chronically infected individuals", is conceptually divided into two sections:

  1. ig, where we process B-Cell Receptor (BCR) repertoires from either healthy or chronically infected patients, grouped into three cohorts: "healthy_control", "hiv1", and "hcv". Sequences are first annotated through igBlast, then sorted and grouped by cohort. IGoR then infers their recombination and evolution statistics, producing cohort-specific models.

  2. bnabs, where we annotate bNAb sequences, analyze their features and relate them to their probability of generation and evolution, according to the cohort-specific models inferred in the ig section above.

Test datasets of 5,000 BCR heavy- and light-chain IgG sequences for two healthy donors are provided under ig/datasets/, so that the present code can be run directly to reproduce sequence annotation and model building as performed in the manuscript. Under bnabs/sequences, we provide the heavy- and light-chain sequences for the bNAbs analyzed in the manuscript, together with a summary of their features relevant for the present analysis (e.g. their neutralization score). IGoR models, inferred on the whole cohort of healthy patients, are also provided under the folder templates/igor_models/inferred, again for testing purposes.

All the scripts, written in Python3, can hence be run with the attached test datasets, with an expected single-core run time of approximately one hour. However, they rely on a local installation of the igBlast and IGoR software. See the sections below for installation and configuration details (expected installation time of approximately half an hour each). V(D)J templates for igBlast and standard IGoR models are also shipped with this code, under the folder templates.

Finally, though most of the Python packages used in the scripts are quite common (e.g. numpy or pandas), some others are less common and may not already be present in standard Python3 distributions. This may be the case for the following packages:

that can be pip-installed as usual.

Data availability

The complete set of FASTA files with quality-filtered and assembled IgG sequences (as described in the manuscript) can be downloaded from this link. They are grouped by chain, donor, and finally by cohort. For each sequence, the number of reads in which its UMI was initially found is reported as Nx in the header, together with the ID of the donor and some other useful information. E.g., >BZR6_100_t0_IgG_1:UID110963:G30292:N40:IGHG:HC specifies an IGHG isotype heavy-chain consensus sequence of 40 reads that comes from healthy control donor 100. This makes it possible to remove low-quality reads (N<3) at any desired step of the downstream analysis.
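
The read-count filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the repository: it assumes that header fields are colon-separated and that the read count is the only field made of "N" followed by digits, as in the example header. The function names read_count and filter_fasta are hypothetical.

```python
import re

# Matches the read-count field of a header, e.g. "N40" (assumed format).
_READS_FIELD = re.compile(r"^N(\d+)$")


def read_count(header):
    """Return the read count N stored in a FASTA header, or None if absent."""
    for field in header.lstrip(">").strip().split(":"):
        match = _READS_FIELD.match(field)
        if match:
            return int(match.group(1))
    return None


def filter_fasta(records, min_reads=3):
    """Keep (header, sequence) pairs supported by at least min_reads reads."""
    return [(header, seq) for header, seq in records
            if (read_count(header) or 0) >= min_reads]


example = ">BZR6_100_t0_IgG_1:UID110963:G30292:N40:IGHG:HC"
print(read_count(example))
```

Headers without a recognizable Nx field are treated here as having zero reads and are dropped; a different policy may be preferable depending on the dataset.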

Corresponding FASTQ files with raw NGS reads for all repertoires have been deposited at the Sequence Read Archive (SRA) with accession number SAMN29624595-713 (BioProject Accession Number PRJNA857338).

igBlast installation

igBlast is a powerful and versatile Ig annotation software. We used the following releases:

with V,D,J templates extracted from IMGT, formatted and attached to this release (under templates).

See the rest of this section for instructions on how to install igBlast and produce the templates in the desired format.

Installation details

Installation of the Blast command line tools

Go to the webpage: and follow the instructions. These tools are needed, e.g., for building the database afterwards. A better guide can be found at: (recommended).

Unwrap the tar.gz file through the command:

tar xvzf file_name

Installation of igBlast

Go to the webpage: for the main instructions.

Unwrap the .tar.gz file through the command:

tar xvzf file_name

Change the permissions of the downloaded directories and files, through the command:

chmod -R u+rw *

Setting paths

igBlast-related paths can be exported through commands like:

  • export PATH=$PATH:$HOME/igBlast/ncbi-blast-2.9.0+/bin:$HOME/igBlast/ncbi-igblast-1.13.0/bin

  • export BLASTDB=$HOME/igBlast/blastdb

  • export IGDATA=$HOME/igBlast

so as to be able to use igBlast commands (e.g. igblastn) independently of the current working directory. Otherwise, the full path of such commands (e.g. $HOME/igBlast/ncbi-igblast-1.13.0/bin/igblastn for the executable, and analogously for pointing at the internal igBlast dataset) has to be used.
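
Before running the scripts, it can be useful to verify that the exports above took effect. The following is a minimal sketch, assuming only the command and variable names mentioned in this section (igblastn, makeblastdb, IGDATA, BLASTDB); the helper name check_igblast_environment is hypothetical and not part of the repository.

```python
import os
import shutil


def check_igblast_environment(commands=("igblastn", "makeblastdb"),
                              variables=("IGDATA", "BLASTDB")):
    """Report which commands resolve on PATH and which variables are set."""
    report = {}
    for name in commands:
        # shutil.which returns the full path of the executable, or None
        report[name] = shutil.which(name)
    for name in variables:
        # os.environ.get returns None when the variable is not exported
        report[name] = os.environ.get(name)
    return report


for key, value in check_igblast_environment().items():
    print(f"{key}: {value if value else 'NOT FOUND'}")
```

Any entry reported as NOT FOUND means that the corresponding full path has to be given explicitly, as explained above.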

Making the database

IG databases can be downloaded from IMGT at:, in the section 'IG "V-REGION", "D-REGION", "J-REGION", "C-GENE exon" sets'. Ungapped germline genes should be downloaded from the column 'F+ORF+all P'. Each page should be opened, and its sequences copied and pasted into a new file in the germline tree.

  • Another huge database from IMGT can be downloaded at: but it includes several species in the same files, and for each of them, also several other genes that do not appear in the above database. Again, be careful to download ungapped genes from the 'F+ORF+all P' section.

  • Sequences too short can cause problems when creating the database; a workaround is to shorten the comments preceding the sequence. One could also keep just the name of the gene and nothing else. To do that on the IMGT database, use the following command: awk '{if(gsub(">",">")==1){split($0,a,"|"); print ">"a[2]}else{print $1}}' FILE_IN > FILE_OUT. This issue should have been fixed in newer versions of igBlast.

  • Otherwise (and more easily), run the following command contained in the Blast script ./ imgt_file > my_seq_file or, even better, rely on the following script (based on the one above) that both filters the IMGT nomenclature and pastes together different lines of the same sequence, putting a space between them: ./, each time choosing in the header of the script the kind of database you want to build.

  • The database can be built through the command makeblastdb -in database_file.fasta -parse_seqids -dbtype nucl for each of the three gene types (V, D, J).
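
For readers who prefer Python to awk, the header simplification above can be sketched as follows. This is an illustrative equivalent under stated assumptions, not the script shipped with igBlast: it treats any line starting with ">" as a header (the awk version tests for exactly one ">" in the line) and leaves headers without a "|" unchanged. The function names are hypothetical; the makeblastdb arguments are the ones quoted in the last bullet.

```python
def simplify_imgt_fasta(lines):
    """Keep only the gene name in headers and the first token of other lines."""
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line.split("|")
            # IMGT headers are "|"-separated; the gene name is the 2nd field.
            # Headers without "|" are kept as they are (a choice made here).
            cleaned.append(">" + fields[1] if len(fields) > 1 else line)
        else:
            tokens = line.split()
            cleaned.append(tokens[0] if tokens else "")
    return cleaned


def makeblastdb_command(fasta_path):
    """Argument list for the makeblastdb call (built here, not executed)."""
    return ["makeblastdb", "-in", fasta_path, "-parse_seqids", "-dbtype", "nucl"]


# Illustrative IMGT-style header followed by one sequence line
demo = [">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION", "caggttcagctg\n"]
print(simplify_imgt_fasta(demo))
```

The returned argument list can be passed to subprocess.run once for each of the V, D and J files.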

IGoR installation

After a first sequence annotation and quality filtering through igBlast, BCR sequences are ready to be analyzed with the IGoR software, which infers the statistics of V(D)J recombination processes, as well as of hyper-mutation, from sequencing data.

The underlying methodology and some biologically relevant results are described in the following paper:

  • Quentin Marcou, Thierry Mora, Aleksandra M. Walczak. "High-throughput immune repertoire analysis with IGoR". Nature Communications 9, 561 (2018).

We relied on a local installation of the 1.4.1 release.

See the rest of this section for instructions on how to install and use IGoR.

Installation details

The complete documentation of IGoR can be found here, with all the installation details, a list of known issues and suggested solutions, and an exhaustive usage guide.

The desired release of IGoR can be downloaded from GitHub.

On Linux platforms

Once in the IGoR root directory, three key commands have to be launched, one after the other, to install IGoR on Linux platforms:

./configure
make
make install

For a user-level installation, e.g. if the user has no administrator privileges, the flag --prefix has to be appended to the ./configure command. For example, in order to install IGoR under the user home directory: ./configure --prefix=$HOME

Then, make and make install steps can be executed smoothly as before.

At the end of the installation procedure, IGoR’s executable will appear under the igor_src folder.

In case of multiple IGoR versions, it is recommended to explicitly use the full path to the executable of the desired version, e.g. $HOME/igor_1.4.1/igor_src/igor. Otherwise, if correctly exported during the make install step, the igor command will be accessible from any location without any full path specification.

On MacOS platforms

The installation steps mentioned above rely on OpenMP-compatible compilers, such as GNU gcc. Unfortunately, the default Apple compiler, though still callable through the gcc command, does not belong to this class of compilers and hence will cause a fatal error.

The suggested workaround is to install Homebrew and then the GNU gcc compiler through it (gcc 7.x versions are recommended, as their compatibility with the IGoR release 1.4.1 used here is guaranteed):

brew install gcc@7

The freshly installed compiler will not overwrite the default Apple compiler, so the gcc command will still refer to the latter. Therefore, when launching the ./configure command for installing IGoR, further flags have to be passed to explicitly specify which compiler to use during the installation. Referring to releases 7.x of GNU gcc:

./configure CC="gcc-7" CXX="g++-7"

The make and make install commands can then be executed as before.
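
The Linux and MacOS variants described in this section differ only in the arguments passed to ./configure. As a compact summary, the sketch below assembles the three commands for either case. It only uses the flags quoted in this section (--prefix, CC, CXX); the function name igor_build_commands is hypothetical, and nothing is executed.

```python
import shlex


def igor_build_commands(prefix=None, cc=None, cxx=None):
    """Return the configure/make/make install commands as argument lists."""
    configure = ["./configure"]
    if prefix:
        # User-level installation, e.g. prefix set to the home directory
        configure.append(f"--prefix={prefix}")
    if cc:
        # Explicit C compiler, e.g. "gcc-7" on MacOS
        configure.append(f"CC={cc}")
    if cxx:
        # Explicit C++ compiler, e.g. "g++-7" on MacOS
        configure.append(f"CXX={cxx}")
    return [configure, ["make"], ["make", "install"]]


# Print the MacOS variant as shell-ready lines
for command in igor_build_commands(cc="gcc-7", cxx="g++-7"):
    print(shlex.join(command))
```

Each argument list can be run from the IGoR root directory, for instance with subprocess.run(command, cwd=igor_root, check=True), where igor_root is a placeholder for the local path.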

The ig section

There are two main scripts, and, respectively for sequence annotation through igBlast and IGoR evaluation and model inference.

The analysis run by each script can be customized through a dedicated .yaml config file. Among the available parameters, please note (and modify, if needed) the full paths of both the igBlast and IGoR executables.

The annotation step can be invoked as:


with settings fully customizable from the config_annotation.yaml file. The user can modify the path of the input data, choose the type of chain (heavy, kappa or lambda) and the cohort the data come from, and decide whether a given step of the script should be run (e.g., the user can choose to run only the core igBlast annotation, leaving the parsing of igBlast intermediate output for later).

Its output consists of:

  • a .igBlast_statistics file, csv-formatted, with annotation results for each sequence (e.g., best-scoring V template, number and position of point mutations, and so on);
  • a set of .csv and .fasta files, where annotated sequences are sorted, grouped and filtered according to customizable criteria (e.g., whether to include sequences with indels, whether to separate in-frame from out-of-frame sequences, and so on).

Both kinds of files are produced separately for each donor, and at the cohort level, by grouping together sequences from different donors belonging to the same cohort under examination.

The IGoR evaluation/inference step, launched through the command:


is again fully customizable from the config_igor.yaml file (e.g., the user can choose to run only the alignment step, leaving the inference or the final evaluation for later).

The output consists of a set of folders (aligns, evaluate, inference, output) containing IGoR results, as described in its documentation. In particular, the two files under the inference folder, final_parms.txt and final_marginals.txt, contain the inferred model to be used later for bNAb evaluation. For this purpose, they are also copied automatically to the templates/igor_models/inferred folder.

Also, a .IGoR_summary file is produced by combining, for each sequence, the results from igBlast annotation and IGoR evaluation. It is potentially useful for a deeper analysis at the single-sequence level, or to extract summary statistics at the donor or cohort level (e.g., the distribution of the fraction of point-mutated positions) by means of the auxiliary script
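
As an example of such a summary statistic, the distribution of the fraction of point-mutated positions could be computed as sketched below. This is not the repository's auxiliary script: the column names n_mutations and seq_length are hypothetical placeholders and must be replaced with the actual column names of the summary file; the toy table contains made-up values for illustration only.

```python
import csv
import io


def mutated_fractions(handle, n_mut_col="n_mutations", length_col="seq_length"):
    """Per-sequence fraction of point-mutated positions from a csv handle.

    The two column names are hypothetical placeholders.
    """
    fractions = []
    for row in csv.DictReader(handle):
        length = int(row[length_col])
        if length > 0:
            fractions.append(int(row[n_mut_col]) / length)
    return fractions


def histogram(values, n_bins=10):
    """Counts of values falling in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for value in values:
        # A value of exactly 1.0 is assigned to the last bin
        counts[min(int(value * n_bins), n_bins - 1)] += 1
    return counts


# Toy table with made-up values, only to show the expected layout
toy = io.StringIO("seq_id,n_mutations,seq_length\ns1,0,300\ns2,45,300\n")
print(histogram(mutated_fractions(toy)))
```

The same two functions can be applied per donor or per cohort by passing the corresponding file handle.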

The "hiv1" cohort can be further stratified by antiretroviral therapy (ART) treatment:

  • "hiv1_art_off"
  • "hiv1_art_on"

and by serum neutralization breadth:

  • "hiv1_non_neutralizers"
  • "hiv1_weak_neutralizers"
  • "hiv1_intermediate_neutralizers"
  • "hiv1_top_neutralizers"

Related IGoR models can be inferred on these sub-cohorts, by modifying the cohort parameter in the config_igor.yaml file.
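
The cohort and sub-cohort labels listed above can be collected in a small lookup, for instance to validate the value of the cohort parameter before launching a run. The labels are the ones given in this README; the helper parent_cohort and the grouping keys "art" and "neutralization" are hypothetical names introduced for this sketch.

```python
COHORTS = ("healthy_control", "hiv1", "hcv")

# Sub-cohorts of "hiv1", grouped by stratification criterion
HIV1_SUBCOHORTS = {
    "art": ("hiv1_art_off", "hiv1_art_on"),
    "neutralization": (
        "hiv1_non_neutralizers",
        "hiv1_weak_neutralizers",
        "hiv1_intermediate_neutralizers",
        "hiv1_top_neutralizers",
    ),
}


def parent_cohort(name):
    """Return the main cohort a label belongs to, or raise ValueError."""
    if name in COHORTS:
        return name
    for labels in HIV1_SUBCOHORTS.values():
        if name in labels:
            return "hiv1"
    raise ValueError(f"unknown cohort label: {name!r}")


print(parent_cohort("hiv1_art_on"))
```

Rejecting unknown labels early avoids starting a long inference run on a misspelled cohort name.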

The auxiliary script can be used to infer lineages in annotated datasets. This file also contains the RAxML commands used to reconstruct phylogenies and ancestral states in the largest lineages, and functions that analyze phylogenies to quantify their skewness.

The bnabs section

This section includes the bnabs_neutr.csv file, listing the 70 bNAbs analyzed, together with information about their binding site and their neutralization potency. This (and more) information can be retrieved from the CATNAP database.

In order to run this part of the analysis pipeline, one should provide the FASTA files for the 70 bNAbs of interest (nucleotide sequences are publicly available, e.g. on the CATNAP database linked above), sorting heavy, kappa and lambda chain sequences into three different .fasta files (named bnabs_seqs_HC.fasta, bnabs_seqs_KC.fasta, and bnabs_seqs_LC.fasta, respectively).
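
Sorting sequences into the three files can be sketched as follows. The file names are the ones given above; everything else is an assumption: the chain type of each sequence is taken as known and passed explicitly, the function name group_fasta_by_chain is hypothetical, and the record names and sequences in the example are placeholders, not real bNAb data.

```python
CHAIN_FILES = {
    "heavy": "bnabs_seqs_HC.fasta",
    "kappa": "bnabs_seqs_KC.fasta",
    "lambda": "bnabs_seqs_LC.fasta",
}


def group_fasta_by_chain(records):
    """Map each output file name to the FASTA text of its chain type.

    records: iterable of (name, chain, sequence), with chain in CHAIN_FILES.
    """
    texts = {file_name: "" for file_name in CHAIN_FILES.values()}
    for name, chain, sequence in records:
        if chain not in CHAIN_FILES:
            raise ValueError(f"unknown chain type: {chain!r}")
        texts[CHAIN_FILES[chain]] += f">{name}\n{sequence}\n"
    return texts


# Placeholder records, not real bNAb sequences
demo_records = [("example_H", "heavy", "ACGT"), ("example_K", "kappa", "GGCC")]
for file_name, text in group_fasta_by_chain(demo_records).items():
    print(file_name, len(text))
```

Each text can then be written to disk, e.g. with pathlib.Path(out_dir, file_name).write_text(text), where out_dir is the folder expected by the annotation config.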

Though already annotated, bNAbs can be re-annotated through the command:


in order to extract annotation results in the desired format and to prepare .csv and .fasta files for the following IGoR evaluation step. Again, through the config_annotation.yaml file, the user can customize the path of the input file, choose the type of chain to be analyzed, and so on.

Output files are analogous to those of the ig annotation step.

Finally, by means of IGoR, it is possible to evaluate the recombination and evolution probabilities of bNAbs, and correlate them with the neutralization score. Once the desired model has been chosen in the config_igor.yaml file (among the default IGoR ones or those inferred in the ig step on the three cohorts), the command:


returns the aforementioned probabilities for each bNAb. The output is of the same kind as that obtained in the ig step (apart from the inference folder, since here bNAbs are just evaluated according to some IGoR models, and not used for further model inference).

Finally, as for the ig step, a .IGoR_summary file is produced (under the bnabs/igor_bnabs_summary folder), combining for each bNAb the results from igBlast annotation and from IGoR evaluation according to a certain model, recombination and hyper-mutation probabilities, and neutralization properties. This is the dataset eventually used for the final bNAb analysis in the auxiliary script, i.e. the assessment of the correlation between the probability of bNAbs being generated and developed and their neutralization properties, whose result is stored in a .csv file under the bnabs/neutr_score_prediction folder.
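
To give an idea of what such a correlation assessment can look like, the sketch below computes a Spearman rank correlation between two lists of numbers, e.g. per-bNAb log-probabilities and neutralization scores. The choice of Spearman correlation is an assumption made here for illustration; this README does not state which measure the auxiliary script uses, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if an input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A rank-based measure is insensitive to monotonic transformations, so it gives the same result whether probabilities or their logarithms are used.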


