All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
v0.9.10 - 2024-02-27
- Progress reading display when reading from compressed files.
- Change labeling routine to use broad overlaps when annotating genes with cluster tables (#15).
- Bump supported polars dependency to v0.20.
- Bump supported statsmodels dependency to v0.14.
- Report identifier of sequences with uni-valued labels when training.
v0.9.9 - 2023-11-23
- Support for gzip, bzip2, lz4 and xz-compressed input files.
- Outdated use of pandas API in gecco cv command.
- Bump pyhmmer dependency to v0.10.0.
- Bump pyrodigal dependency to v3.0.0.
- Make gecco cv output a gene table with a ground truth column.
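
As an illustration of the compressed-input support listed for this release, the sketch below shows one common way to open a file transparently by sniffing its magic bytes. It is not GECCO's actual code: the helper name open_maybe_compressed is hypothetical, and lz4 is left out because it is not part of the Python standard library.

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes of each supported format, mapped to the stdlib opener.
_MAGIC = {
    b"\x1f\x8b": gzip.open,       # gzip
    b"BZh": bz2.open,             # bzip2
    b"\xfd7zXZ\x00": lzma.open,   # xz
}


def open_maybe_compressed(path):
    """Open `path` as text, decompressing on the fly if a known magic is found.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, opener in _MAGIC.items():
        if head.startswith(magic):
            return opener(path, "rt")
    # No known magic: assume an uncompressed text file.
    return open(path, "r")
```

Detecting the format from the file content rather than the extension means a renamed or extension-less file is still read correctly.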
v0.9.8 - 2023-06-09
- ClusterTable.from_clusters extracting cluster IDs in the wrong column.
- Deprecation warnings in polars.read_csv and polars.write_csv with recent polars versions.
- Deprecation warnings in importlib_resources with recent Python versions.
v0.9.7 - 2023-05-26
- Command line option to annotate proteins using bitscore cutoffs from HMMs.
- Command line option to disentangle overlapping domains after HMM annotation.
- Bump pyhmmer dependency to v0.8.0.
- Bump pyrodigal dependency to v2.1.0.
- Rewrite gecco.model to use polars for managing tabular data.
- Replace pandas dependencies with polars.
- Update gecco run to skip type classification for tasks without an assigned cluster type.
- Cluster.to_seq_record crashing when called on a cluster with types attribute unset.
- Progress bar resetting when performing domain annotation with multiple HMMs.
- Support for Python 3.7.
v0.9.6 - 2023-01-11
- Gene Ontology annotations to gecco.interpro local metadata.
- Reference to Gene Ontology terms and derived functions to gecco.model.Domain objects.
- Gene color based on predicted function in gecco.model.Gene.to_seq_feature.
- Missing gzip import in the CLI preventing usage of gzip-compressed inputs.
- Invalid coordinates of domains found in reverse-strand genes.
- Detection of entry points with importlib.metadata on older Python versions.
- bgc_id columns of cluster tables are renamed cluster_id.
- gecco.model.ProductType is renamed to gecco.model.ClusterType.
- Bumped pyrodigal dependency to v2.0.
- Bumped pyhmmer dependency to v0.7.
v0.9.5 - 2022-08-10
- gecco predict command to predict BGCs from an annotated genome.
- Protein.with_seq function to assign a new sequence to a protein object.
- Issue with antiSMASH sideload JSON file generation in gecco run and gecco predict.
- Make gecco.orf handle STOP codons consistently (#9).
v0.9.4 - 2022-05-31
- classes_ property to TypeClassifier to access the classes_ attribute of the TypeBinarizer.
- Alternative ORF finder CDSFinder which simply extracts CDS features from input sequences (#8).
- Support for annotating domains with "exclusive" HMMs to annotate genes with at most one HMM from the library.
- ProductType is not restricted to MIBiG types anymore and can support any string as a base type identifier.
- PyrodigalFinder now uses multiprocessing.pool.ThreadPool instead of custom thread code thanks to OrfFinder.find_genes reentrancy introduced in Pyrodigal v1.0.
- PyrodigalFinder can now be used in single / non-meta mode from the API.
- Bumped minimum rich version to 12.3 to use None total in progress bars when the size of an HMM library is unknown.
- Broken MyPy type annotations in the gecco.model and gecco.cli modules.
v0.9.3 - 2022-05-13
- --format flag of gecco annotate and gecco run CLI commands is now made lowercase before giving value to Bio.SeqIO.
- Genes with duplicate IDs being silently ignored in HMMER.run.
v0.9.2 - 2022-04-11
- Padding of short sequences with empty genes when predicting probabilities in ClusterCRF.
v0.9.1 - 2022-04-05
- Make the genes.tsv and features.tsv tables contain all genes even when they come from a contig too short to be processed by the CRF sliding window.
- Replaced the --force-clusters-tsv flag with a --force-tsv flag to force writing TSV tables even when no genes or clusters were found in gecco run or gecco annotate.
v0.9.1-alpha4 - 2022-03-31
Retrain internal model with:
$ python -m gecco -vv train --c1 0.4 --c2 0 --select 0.25 --window-size 20 \
-f mibig-2.0.proG2.Pfam-v35.0.features.tsv \
-c mibig-2.0.proG2.clusters.tsv \
-g GECCO-data/data/embeddings/mibig-2.0.proG2.genes.tsv \
-o models/v0.9.1-alpha4
v0.9.1-alpha3 - 2022-03-23
- gecco.model.GeneTable class to store gene coordinates independently of protein domains.
- Refactored implementation of load and dump methods for Table classes into a dedicated base class.
- gecco run and gecco annotate now output a gene table in addition to the feature and cluster tables.
- gecco train expects a gene table instead of a GFF file for the gene coordinates.
v0.9.1-alpha2 - 2022-03-23
- TypeClassifier.trained not being able to read unknown types from type tables.
v0.9.1-alpha1 - 2022-03-20
Candidate release with support for a sliding window in the CRF prediction algorithm.
v0.8.10 - 2022-02-23
- --antismash-sideload flag of gecco run causing command to crash.
v0.8.9 - 2022-02-22
- Prediction and support for the Other biosynthetic type of MIBiG clusters.
v0.8.8 - 2022-02-21
- ClusterRefiner filtering method for edge genes not working as intended.
- gecco run and gecco annotate commands crashing on missing input files instead of nicely rendering the error.
v0.8.7 - 2022-02-18
- interpro.json metadata file not being included in distribution files.
- Missing docstring for Protein.with_domains method.
- Bump minimum scikit-learn version to v1.0 for Python 3.7+.
v0.8.6 - 2022-02-17 - YANKED
- CLI flag for enabling region masking for contigs processed by Prodigal.
- CLI flag for controlling region distance used for edge distance filtering.
- gecco.model.Gene and gecco.model.Protein are now immutable data classes.
- Bump minimum pyrodigal version to v0.6.4 to use region masking.
- Implement filtering for extracted clusters based on distance to the contig edge.
- Store InterPro metadata file uncompressed for version-control integration.
- Mark BGC0000930 as Terpene in the type classifier data.
- Progress bar messages are now in a consistent format.
v0.8.5 - 2021-11-21
- Minimal compatibility support for running GECCO inside of Galaxy workflows.
v0.8.4 - 2021-09-26
- gecco convert gbk --format bigslice failing to run because of outdated code (#5).
- gecco convert gbk --format bigslice not creating files with names conforming to BiG-SLiCE expected input.
- Bump minimum pyrodigal version to v0.6.2 to use platform-accelerated code if supported.
v0.8.3-post1 - 2021-08-23
- Wrong default value for --threshold being shown in gecco run help message.
v0.8.3 - 2021-08-23
- Default probability threshold for segmentation to 0.3 (from 0.4).
v0.8.2 - 2021-07-31
- gecco run crashing on Python 3.6 because of missing contextlib.nullcontext class.
- gecco run and gecco annotate will not try to count the number of profiles when given an external HMM file with the --hmm flag.
- PyHMMER.run now reports the p-value of each domain in addition to the e-value as a /note qualifier.
v0.8.1 - 2021-07-29
- gecco run now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom --model.
- gecco reporting about using Pfam v33.1 while actually using v34.0 because of an outdated field in gecco/hmmer/Pfam.ini.
- Missing documentation for the strand attribute of gecco.model.Gene.
v0.8.0 - 2021-07-03
- Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.
- Bump minimum pyhmmer version to v0.4.0 to improve exception handling.
- Bump minimum pyrodigal version to v0.5.0 to fix sequence decoding on some platforms.
- Use p-values instead of e-values to filter domains obtained with HMMER.
- gecco cv and gecco train now seed the RNG with a user-defined seed before shuffling rows of training data.
- Extraction of BGC compositions for the type predictor while training.
- ClusterCRF.trained failing to open an external model.
- Domain.pvalue attribute to access the p-value of a domain annotation.
- Mandatory pvalue column to FeatureTable objects.
- Support for loading several feature tables in gecco train and gecco cv.
- Warnings to ClusterCRF.fit when selecting uninformative features.
- --correction flag to gecco train and gecco cv, to select a multiple testing correction method when computing p-values with Fisher's Exact Test.
- Outdated gecco embed command.
- Unused --truncate flag from the gecco train CLI.
- Tigrfam domains, which are not improving performance on the new training data.
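
To make the switch from e-values to p-values in this release more concrete: in HMMER, the E-value of a hit is its p-value multiplied by the number of target sequences searched (Z), so the same hit gets a different E-value depending on how many proteins were in the search. The sketch below only illustrates that relationship; the function name and the numbers are made up for the example and are not taken from GECCO.

```python
def evalue_from_pvalue(pvalue: float, z: int) -> float:
    """E-value of a hit: expected number of false positives among z targets."""
    return pvalue * z


# The same domain hit (identical p-value) found in two inputs of
# different sizes. All values here are illustrative assumptions.
hit_pvalue = 1e-10
small_input = evalue_from_pvalue(hit_pvalue, z=500)      # 500 proteins searched
large_input = evalue_from_pvalue(hit_pvalue, z=50_000)   # 50,000 proteins searched

# The E-values differ by the ratio of input sizes, while the p-value is
# unchanged, so a fixed p-value cutoff filters both inputs consistently.
ratio = large_input / small_input
```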
v0.7.0 - 2021-05-31
- Support for writing an AntiSMASH sideload JSON file after a gecco run workflow.
- Code for converting GenBank files in BiG-SLiCE compatible format with the gecco convert subcommand.
- Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.
- Minimum Biopython version to v1.73 for compatibility with older bioinformatics tooling.
- Internal domain composition shipped in the gecco.types with newer composition array obtained directly from MIBiG files.
- Outdated notice about -vvv verbosity level in the help message of the main gecco command.
v0.6.3 - 2021-05-10
- HMMER annotation not properly handling inputs with multiple contigs.
- Some progress bar totals displaying as floats in the CLI.
- PyHMMER now sets the Z and domZ values from the number of proteins given to the search pipeline.
- gecco.cli delegates imports to make CLI more responsive.
- pkg_resources has been replaced with importlib.resources and importlib.metadata where applicable.
- multiprocessing.cpu_count has been replaced with os.cpu_count where applicable.
v0.6.2 - 2021-05-04
- gecco cv loto crashing because of outdated code.
- Logging-style prompt will only display if GECCO is running with -vv flag.
- GECCO bioRxiv paper reference to Cluster.to_seq_record output record.
v0.6.1 - 2021-03-15
- Progress bar not being disabled by -q flag in CLI.
- Fallback to using HMM name if accession is not available in PyHMMER.
- Group genes by source contig and process them separately in PyHMMER to avoid bogus E-values.
- psutil dependency to get the number of physical CPU cores on the host machine.
- Support for using an arbitrary mapping of positives to negatives in gecco embed.
- Unused and outdated HMMER and DomainRow classes from gecco.hmmer.
v0.6.0 - 2021-02-28
- Updated internal model with a cleaned-up version of the MIBiG-2.0 Pfam-33.1/Tigrfam-15.0 embedding.
- Updated internal InterPro catalog.
- Features not being grouped together in gecco cv and gecco train when provided with a feature table where rows were not sorted by protein IDs.
v0.5.5 - 2021-02-28
- gecco cv bug causing only the last fold to be written.
v0.5.4 - 2021-02-28
- Replaced verboselogs, coloredlogs and better-exceptions with rich.
- tqdm training dependency.
- gecco annotate command to produce a feature table from a genomic file.
- gecco embed to embed BGCs into non-BGC regions using feature tables.
v0.5.3 - 2021-02-21
- Coordinates of genes in output GenBank files.
- Potential issue with the number of CPUs in PyHMMER.run.
- Bump required pyrodigal version to v0.4.2 to fix buffer overflow.
v0.5.2 - 2021-01-29
- Support for downloading HMM files directly from GitHub releases assets.
- Validation of filtered HMMs with MD5 checksum.
- Invalid coordinates of protein domains in GenBank output files.
- gecco.interpro module not being added to wheel distribution.
- Bump required pyhmmer version to v0.2.1.
v0.5.1 - 2021-01-15
- --hmm flag being ignored in gecco run command.
- PyHMMER using HMM names instead of accessions, causing issues with Pfam HMMs.
v0.5.0 - 2021-01-11
- Explicit support for Python 3.9.
- pyhmmer is used to annotate protein sequences instead of HMMER3 binary hmmsearch.
- HMM files are stored in binary format to speed up parsing and reduce storage size.
- tqdm is now a training-only dependency.
- gecco cv now requires training dependencies.
v0.4.5 - 2020-11-23
- Additional fold column to cross-validation table output.
- Use sequence ID instead of protein ID to extract type from cluster in gecco cv.
- Install HMM data in pre-pressed format to make hmmsearch runs faster on short sequences.
- gecco.orf was rewritten to extract genes from input sequences in parallel.
v0.4.4 - 2020-09-30
- gecco cv loto command to run LOTO cross-validation using BGC types for stratification.
- header keyword argument to FeatureTable.dump and ClusterTable.dump to write the table without the column header, which allows appending to an existing table.
- __getitem__ implementation for FeatureTable and ClusterTable that returns a single row or a sub-table from a table.
- gecco cv command now writes results iteratively instead of holding the tables for every fold in memory.
- Bumped pandas training dependency to v1.0.
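
The header option and the iterative writing of cross-validation results described in this release fit together: once the header row can be skipped, each fold can be appended to the same file as soon as it is ready. The sketch below illustrates the idea with the standard csv module. The dump_rows function and the column names are hypothetical and do not reproduce the FeatureTable.dump or ClusterTable.dump signatures.

```python
import csv
import io


def dump_rows(rows, columns, fh, header=True):
    """Write `rows` (a list of dicts) as TSV to `fh`.

    Hypothetical helper: pass header=False to append to a table that
    already has its column header.
    """
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(columns)
    for row in rows:
        writer.writerow([row[c] for c in columns])


# Write two folds one after the other without keeping both in memory.
columns = ["sequence_id", "fold"]
buffer = io.StringIO()
dump_rows([{"sequence_id": "seq1", "fold": 0}], columns, buffer, header=True)
dump_rows([{"sequence_id": "seq2", "fold": 1}], columns, buffer, header=False)
```

Only the first call writes the column names, so the result is a single well-formed table.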
v0.4.3 - 2020-09-07
- GenBank files being written with invalid /cds feature type.
- Blocked installation of Biopython v1.78 or newer as it removes Bio.Alphabet and breaks the current code.
v0.4.2 - 2020-08-07
- TypeClassifier.predict_types using inverse type probabilities when given several clusters to process.
v0.4.1 - 2020-08-07
- gecco run command crashing on input sequences not containing any genes.
v0.4.0 - 2020-08-06
- gecco.model.ProductType enum to model the biosynthetic class of a BGC.
- pandas interaction from internal data model.
- ClusterCRF code specific to cross-validation.
- pandas, fisher and statsmodels dependencies are now optional.
- gecco train command expects a cluster table in addition to the feature table to know the types of the input BGCs.
v0.3.0 - 2020-08-03
- Replaced Nearest-Neighbours classifier with Random Forest to perform type prediction for candidate BGCs.
- gecco.knn module was renamed to implementation-agnostic name gecco.types.
- Extraction of domain composition taking a long time in gecco train command.
- --metric argument to the gecco run CLI command.
v0.2.2 - 2020-07-31
- Domain and Gene can now carry qualifiers that are used when they are translated to a sequence feature.
- InterPro names, accessions, and HMMER e-value for each annotated domain in GenBank output files.
v0.2.1 - 2020-07-23
- Various potential crashes in ClusterRefiner code.
- Unneeded feature dictionary filtering in ClusterCRF for models with Fisher Exact Test feature selection.
v0.2.0 - 2020-07-23
- pandas warning about unsorted columns in gecco run.
- Gene.probability property, replaced by Gene.maximum_probability and Gene.average_probability properties to be explicit.
- Internal model now uses Pfam and Tigrfam with the top 35% features selected with Fisher's Exact Test.
- ClusterRefiner now removes genes on Cluster edges if they do not contain any domain annotation.
v0.1.1 - 2020-07-22
- ClusterCRF.predict_probabilities to annotate a list of Gene.
- BGC probability is now stored at the Domain level instead of at the Gene level, independently of the feature extraction level used by the CRF.
- ClusterKNN will use the model path provided to gecco run if any.
- Added this changelog file to document changes in the code.
- Added documentation to gecco submodules missing some.
- Included the CHANGELOG.md file to the generated docs.
v0.1.0 - 2020-07-17
Initial release.
v0.0.1 - 2018-08-13
Proof-of-concept.