Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

v0.9.10 - 2024-02-27

Fixed

Progress reading display when reading from compressed files.
Change labeling routine to use broad overlaps when annotating genes with cluster tables (#15).

Changed

Bump supported polars dependency to v0.20.
Bump supported statsmodels dependency to v0.14.
Report identifier of sequences with uni-valued labels when training.

v0.9.9 - 2023-11-23

Added

Support for gzip, bzip2, lz4 and xz-compressed input files.

Fixed

Outdated use of pandas API in gecco cv command.

Changed

Bump pyhmmer dependency to v0.10.0.
Bump pyrodigal dependency to v3.0.0.
Make gecco cv output a gene table with a ground truth column.

v0.9.8 - 2023-06-09

Fixed

ClusterTable.from_clusters extracting cluster IDs in the wrong column.
Deprecation warnings in polars.read_csv and polars.write_csv with recent polars versions.
Deprecation warnings in importlib_resources with recent Python versions.

v0.9.7 - 2023-05-26

Added

Command line option to annotate proteins using bitscore cutoffs from HMMs.
Command line option to disentangle overlapping domains after HMM annotation.

Changed

Bump pyhmmer dependency to v0.8.0.
Bump pyrodigal dependency to v2.1.0.
Rewrite gecco.model to use polars for managing tabular data.
Replace pandas dependencies with polars.
Update gecco run to skip type classification for tasks without an assigned cluster type.

Fixed

Cluster.to_seq_record crashing when called on a cluster with types attribute unset.
Progress bar resetting when performing domain annotation with multiple HMMs.

Removed

Support for Python 3.7.

v0.9.6 - 2023-01-11

Added

Gene Ontology annotations to gecco.interpro local metadata.
Reference to Gene Ontology terms and derived functions to gecco.model.Domain objects.
Gene color based on predicted function in gecco.model.Gene.to_seq_feature.

Fixed

Missing gzip import in the CLI preventing usage of gzip-compressed inputs.
Invalid coordinates of domains found in reverse-strand genes.
Detection of entry points with importlib.metadata on older Python versions.

Changed

bgc_id columns of cluster tables are renamed cluster_id.
gecco.model.ProductType is renamed to gecco.model.ClusterType.
Bumped pyrodigal dependency to v2.0.
Bumped pyhmmer dependency to v0.7.

v0.9.5 - 2022-08-10

Added

gecco predict command to predict BGCs from an annotated genome.
Protein.with_seq function to assign a new sequence to a protein object.

Fixed

Issue with antiSMASH sideload JSON file generation in gecco run and gecco predict.
Make gecco.orf handle STOP codons consistently (#9).

v0.9.4 - 2022-05-31

Added

classes_ property to TypeClassifier to access the classes_ attribute of the TypeBinarizer.
Alternative ORF finder CDSFinder which simply extracts CDS features from input sequences (#8).
Support for annotating domains with "exclusive" HMMs to annotate genes with at most one HMM from the library.

Changed

ProductType is not restricted to MIBiG types anymore and can support any string as a base type identifier.
PyrodigalFinder now uses multiprocessing.pool.ThreadPool instead of custom thread code thanks to OrfFinder.find_genes reentrancy introduced in Pyrodigal v1.0.
PyrodigalFinder can now be used in single / non-meta mode from the API.
Bumped minimum rich version to 12.3 to use None total in progress bars when the size of an HMM library is unknown.

Fixed

Broken MyPy type annotations in the gecco.model and gecco.cli modules.

v0.9.3 - 2022-05-13

Changed

--format flag of gecco annotate and gecco run CLI commands is now lowercased before its value is passed to Bio.SeqIO.

Fixed

Genes with duplicate IDs being silently ignored in HMMER.run.

v0.9.2 - 2022-04-11

Added

Padding of short sequences with empty genes when predicting probabilities in ClusterCRF.

v0.9.1 - 2022-04-05

Changed

Make the genes.tsv and features.tsv tables contain all genes even when they come from a contig too short to be processed by the CRF sliding window.
Replaced the --force-clusters-tsv flag with a --force-tsv flag to force writing TSV tables even when no genes or clusters were found in gecco run or gecco annotate.

v0.9.1-alpha4 - 2022-03-31

Retrain internal model with:

$ python -m gecco -vv train --c1 0.4 --c2 0 --select 0.25 --window-size 20 \
         -f mibig-2.0.proG2.Pfam-v35.0.features.tsv \
         -c mibig-2.0.proG2.clusters.tsv \
         -g GECCO-data/data/embeddings/mibig-2.0.proG2.genes.tsv \
         -o models/v0.9.1-alpha4

v0.9.1-alpha3 - 2022-03-23

Added

gecco.model.GeneTable class to store gene coordinates independently of protein domains.

Changed

Refactored implementation of load and dump methods for Table classes into a dedicated base class.
gecco run and gecco annotate now output a gene table in addition to the feature and cluster tables.
gecco train expects a gene table instead of a GFF file for the gene coordinates.

v0.9.1-alpha2 - 2022-03-23

Fixed

TypeClassifier.trained not being able to read unknown types from type tables.

v0.9.1-alpha1 - 2022-03-20

Candidate release with support for a sliding window in the CRF prediction algorithm.

v0.8.10 - 2022-02-23

Fixed

--antismash-sideload flag of gecco run causing command to crash.

v0.8.9 - 2022-02-22

Removed

Prediction and support for the Other biosynthetic type of MIBiG clusters.

v0.8.8 - 2022-02-21

Fixed

ClusterRefiner filtering method for edge genes not working as intended.
gecco run and gecco annotate commands crashing on missing input files instead of nicely rendering the error.

v0.8.7 - 2022-02-18

Fixed

interpro.json metadata file not being included in distribution files.
Missing docstring for Protein.with_domains method.

Changed

Bump minimum scikit-learn version to v1.0 for Python 3.7+.

v0.8.6 - 2022-02-17 - YANKED

Added

CLI flag for enabling region masking for contigs processed by Prodigal.
CLI flag for controlling region distance used for edge distance filtering.

Changed

gecco.model.Gene and gecco.model.Protein are now immutable data classes.
Bump minimum pyrodigal version to v0.6.4 to use region masking.
Implement filtering for extracted clusters based on distance to the contig edge.
Store InterPro metadata file uncompressed for version-control integration.

Fixed

Mark BGC0000930 as Terpene in the type classifier data.
Progress bar messages are now in consistent format.

v0.8.5 - 2021-11-21

Added

Minimal compatibility support for running GECCO inside of Galaxy workflows.

v0.8.4 - 2021-09-26

Fixed

gecco convert gbk --format bigslice failing to run because of outdated code (#5).
gecco convert gbk --format bigslice not creating files with names conforming to BiG-SLiCE expected input.

Changed

Bump minimum pyrodigal version to v0.6.2 to use platform-accelerated code if supported.

v0.8.3-post1 - 2021-08-23

Fixed

Wrong default value for --threshold being shown in gecco run help message.

v0.8.3 - 2021-08-23

Changed

Default probability threshold for segmentation to 0.3 (from 0.4).

v0.8.2 - 2021-07-31

Fixed

gecco run crashing on Python 3.6 because of missing contextlib.nullcontext class.

Changed

gecco run and gecco annotate will not try to count the number of profiles when given an external HMM file with the --hmm flag.
PyHMMER.run now reports the p-value of each domain in addition to the e-value as a /note qualifier.

v0.8.1 - 2021-07-29

Changed

gecco run now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom --model.

Fixed

gecco reporting about using Pfam v33.1 while actually using v34.0 because of an outdated field in gecco/hmmer/Pfam.ini.

Added

Missing documentation for the strand attribute of gecco.model.Gene.

v0.8.0 - 2021-07-03

Changed

Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.
Bump minimum pyhmmer version to v0.4.0 to improve exception handling.
Bump minimum pyrodigal version to v0.5.0 to fix sequence decoding on some platforms.
Use p-values instead of e-values to filter domains obtained with HMMER.
gecco cv and gecco train now seed the RNG with a user-defined seed before shuffling rows of training data.

Fixed

Extraction of BGC compositions for the type predictor while training.
ClusterCRF.trained failing to open an external model.

Added

Domain.pvalue attribute to access the p-value of a domain annotation.
Mandatory pvalue column to FeatureTable objects.
Support for loading several feature tables in gecco train and gecco cv.
Warnings to ClusterCRF.fit when selecting uninformative features.
--correction flag to gecco train and gecco cv, to select a multiple testing correction method when computing p-values with Fisher's Exact Test.

Removed

Outdated gecco embed command.
Unused --truncate flag from the gecco train CLI.
Tigrfam domains, which do not improve performance on the new training data.

v0.7.0 - 2021-05-31

Added

Support for writing an AntiSMASH sideload JSON file after a gecco run workflow.
Code for converting GenBank files in BiG-SLiCE compatible format with the gecco convert subcommand.
Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.

Changed

Minimum Biopython version to v1.73 for compatibility with older bioinformatics tooling.
Internal domain composition shipped in gecco.types replaced with a newer composition array obtained directly from MIBiG files.

Removed

Outdated notice about -vvv verbosity level in the help message of the main gecco command.

v0.6.3 - 2021-05-10

Fixed

HMMER annotation not properly handling inputs with multiple contigs.
Some progress bar totals displaying as floats in the CLI.

Changed

PyHMMER now sets the Z and domZ values from the number of proteins given to the search pipeline.
gecco.cli delegates imports to make CLI more responsive.
pkg_resources has been replaced with importlib.resources and importlib.metadata where applicable.
multiprocessing.cpu_count has been replaced with os.cpu_count where applicable.

v0.6.2 - 2021-05-04

Fixed

gecco cv loto crashing because of outdated code.

Changed

Logging-style prompt will only display if GECCO is running with -vv flag.

Added

GECCO bioRxiv paper reference to Cluster.to_seq_record output record.

v0.6.1 - 2021-03-15

Fixed

Progress bar not being disabled by -q flag in CLI.
Fallback to using HMM name if accession is not available in PyHMMER.
Group genes by source contig and process them separately in PyHMMER to avoid bogus E-values.

Added

psutil dependency to get the number of physical CPU cores on the host machine.
Support for using an arbitrary mapping of positives to negatives in gecco embed.

Removed

Unused and outdated HMMER and DomainRow classes from gecco.hmmer.

v0.6.0 - 2021-02-28

Changed

Updated internal model with a cleaned-up version of the MIBiG-2.0 Pfam-33.1/Tigrfam-15.0 embedding.
Updated internal InterPro catalog.

Fixed

Features not being grouped together in gecco cv and gecco train when provided with a feature table where rows were not sorted by protein IDs.

v0.5.5 - 2021-02-28

Fixed

gecco cv bug causing only the last fold to be written.

v0.5.4 - 2021-02-28

Changed

Replaced verboselogs, coloredlogs and better-exceptions with rich.

Removed

tqdm training dependency.

Added

gecco annotate command to produce a feature table from a genomic file.
gecco embed to embed BGCs into non-BGC regions using feature tables.

v0.5.3 - 2021-02-21

Fixed

Coordinates of genes in output GenBank files.
Potential issue with the number of CPUs in PyHMMER.run.

Changed

Bump required pyrodigal version to v0.4.2 to fix buffer overflow.

v0.5.2 - 2021-01-29

Added

Support for downloading HMM files directly from GitHub releases assets.
Validation of filtered HMMs with MD5 checksum.

Fixed

Invalid coordinates of protein domains in GenBank output files.
gecco.interpro module not being added to wheel distribution.

Changed

Bump required pyhmmer version to v0.2.1.

v0.5.1 - 2021-01-15

Fixed

--hmm flag being ignored in gecco run command.
PyHMMER using HMM names instead of accessions, causing issues with Pfam HMMs.

v0.5.0 - 2021-01-11

Added

Explicit support for Python 3.9.

Changed

pyhmmer is used to annotate protein sequences instead of HMMER3 binary hmmsearch.
HMM files are stored in binary format to speedup parsing and reduce storage size.
tqdm is now a training-only dependency.
gecco cv now requires training dependencies.

v0.4.5 - 2020-11-23

Added

Additional fold column to cross-validation table output.

Changed

Use sequence ID instead of protein ID to extract type from cluster in gecco cv.
Install HMM data in pre-pressed format to make hmmsearch runs faster on short sequences.
gecco.orf was rewritten to extract genes from input sequences in parallel.

v0.4.4 - 2020-09-30

Added

gecco cv loto command to run LOTO cross-validation using BGC types for stratification.
header keyword argument to FeatureTable.dump and ClusterTable.dump to write the table without the column header, so that it can be appended to an existing table.
__getitem__ implementation for FeatureTable and ClusterTable that returns a single row or a sub-table from a table.

Fixed

gecco cv command now writes results iteratively instead of holding the tables for every fold in memory.

Changed

Bumped pandas training dependency to v1.0.

v0.4.3 - 2020-09-07

Fixed

GenBank files being written with invalid /cds feature type.

Changed

Blocked installation of Biopython v1.78 or newer as it removes Bio.Alphabet and breaks the current code.

v0.4.2 - 2020-08-07

Fixed

TypeClassifier.predict_types using inverse type probabilities when given several clusters to process.

v0.4.1 - 2020-08-07

Fixed

gecco run command crashing on input sequences not containing any genes.

v0.4.0 - 2020-08-06

Added

gecco.model.ProductType enum to model the biosynthetic class of a BGC.

Removed

pandas interaction from internal data model.
ClusterCRF code specific to cross-validation.

Changed

pandas, fisher and statsmodels dependencies are now optional.
gecco train command expects a cluster table in addition to the feature table to know the types of the input BGCs.

v0.3.0 - 2020-08-03

Changed

Replaced Nearest-Neighbours classifier with Random Forest to perform type prediction for candidate BGCs.
gecco.knn module was renamed to implementation-agnostic name gecco.types.

Fixed

Extraction of domain composition taking a long time in gecco train command.

Removed

--metric argument to the gecco run CLI command.

v0.2.2 - 2020-07-31

Changed

Domain and Gene can now carry qualifiers that are used when they are translated to a sequence feature.

Added

InterPro names, accessions, and HMMER e-value for each annotated domain in GenBank output files.

v0.2.1 - 2020-07-23

Fixed

Various potential crashes in ClusterRefiner code.

Removed

Unneeded feature dictionary filtering in ClusterCRF for models with Fisher Exact Test feature selection.

v0.2.0 - 2020-07-23

Fixed

pandas warning about unsorted columns in gecco run.

Removed

Gene.probability property, replaced by Gene.maximum_probability and Gene.average_probability properties to be explicit.

Changed

Internal model now uses Pfam and Tigrfam with the top 35% features selected with Fisher's Exact Test.
ClusterRefiner now removes genes on Cluster edges if they do not contain any domain annotation.

v0.1.1 - 2020-07-22

Added

ClusterCRF.predict_probabilities to annotate a list of Gene.

Changed

BGC probability is now stored at the Domain level instead of at the Gene level, independently of the feature extraction level used by the CRF.
ClusterKNN will use the model path provided to gecco run if any.

Docs

Added this changelog file to document changes in the code.
Added documentation to gecco submodules that were missing it.
Included the CHANGELOG.md file in the generated docs.

v0.1.0 - 2020-07-17

Initial release.

v0.0.1 - 2018-08-13

Proof-of-concept.
