-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scheme creation #52
base: development
Are you sure you want to change the base?
Scheme creation #52
Conversation
…sel create` Signed-off-by: pkruczkiewicz <[email protected]>
Signed-off-by: pkruczkiewicz <[email protected]>
Signed-off-by: pkruczkiewicz <[email protected]>
biohansel/cli.py
Outdated
def cli(verbose): | ||
"""Subtype with a biohansel scheme or create a scheme for your organism of interest | ||
""" | ||
lvl = init_console_logger(verbose) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
init_console_logger does not currently return any value for lvl, was it supposed to return something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in d6d2270
logging.debug('Initialized logging with %s level', lvl) | ||
|
||
|
||
def check_between_0_and_1_inclusive(ctx: click.Context, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is ctx or param actually used in this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's part of the interface for Click validation callback
raise click.BadParameter('value needs to be between 0.0 and 1.0 inclusive!') | ||
|
||
|
||
def check_positive_number(ctx: click.Context, |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok yup that makes sense
biohansel/utils.py
Outdated
from .subtyping_params import SubtypingParams | ||
NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')} | ||
REGEX_FASTQ = re.compile(r'^(.+)\.(fastq|fq|fastqsanger)(\.gz)?$') | ||
REGEX_FASTA = re.compile(r'^.+\.(fasta|fa|fna|fas)(\.gz)?$') | ||
|
||
|
||
def does_file_exist(filepath: str, force: bool): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you are logging.warning in this function, is there any output that fills in {} after File, or does it just print out "File {} already exists..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out. Replaced with an f-string in 7017458
biohansel/cli.py
Outdated
subtyping_params=subtyping_params, | ||
scheme_subtype_counts=scheme_subtype_counts, | ||
n_threads=threads) | ||
logging.info('Generated %s subtyping results from %s contigs samples', len(contigs_results), len(input_contigs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps f strings would be good to keep things consistent, just a small thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and the f-string syntax is easier to read than the alternatives. Changed in c2fb9cf
@@ -1,14 +1,15 @@ | |||
# -*- coding: utf-8 -*- | |||
|
|||
import logging | |||
from typing import List, Any, Optional, Tuple, Union | |||
from typing import List, Any, Tuple, Union | |||
|
|||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"os" is a core module, perhaps it should come right after logging to maintain the proper import format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're all built-ins in that file actually, but they could definitely be organized a bit better. Changed in f1df3ad
`str.format`/`"%s" % (str)` replaced with f-strings
biohansel/utils.py
Outdated
from .subtyping_params import SubtypingParams | ||
LOG_FORMAT = '%(asctime)s %(levelname)s: %(message)s [in %(pathname)s:%(lineno)d]' | ||
|
||
NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small thing, but perhaps a more descriptive variable name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NT_SUB
is const nucleotide substitution dict for reverse complementing a nucleotide sequence. It's not really meant to be used by anything except the revcomp
function. It's in the module scope so it isn't re-initialized each time the revcomp
function is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 47fa20f. Added a comment as well.
biohansel/utils.py
Outdated
logging.info(f'Searching dir "{input_directory}" for FASTQ files') | ||
reads += collect_fastq_from_dir(input_directory) | ||
if paired_reads: | ||
for x in paired_reads: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small thing, but perhaps a more descriptive variable name for x?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 47fa20f
tests/subtype/test_metadata.py
Outdated
df_results = pd.read_table('tests/data/subtyping-results.tsv') | ||
df_md = read_metadata_table('tests/data/subtype-metadata.tsv') | ||
df_results = pd.read_table('tests/data/subtype/subtyping-results.tsv') | ||
df_md = read_metadata_table('tests/data/subtype/subtype-metadata.tsv') | ||
df_merged = merge_metadata_with_summary_results(df_results, df_md) | ||
assert np.all(df_md.columns.isin(df_merged.columns)) | ||
for i, r in df_merged.iterrows(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small thing, but perhaps a more descriptive variable name for i and r?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 9511966. The index was unused so renamed to _
tests/subtype/test_qc.py
Outdated
@@ -48,7 +47,7 @@ def test_intermediate_subtype(): | |||
avg_tile_coverage=37.04102564102564, | |||
qc_status=None, | |||
qc_message=None) | |||
df = pd.read_csv('tests/data/se_intermediate_subtype_df.csv') | |||
df = pd.read_csv('tests/data/subtype/se_intermediate_subtype_df.csv') | |||
p = init_subtyping_params(args=None, scheme=scheme) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small thing, but perhaps a more descriptive variable name for p?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 9511966
|
||
genome_name = 'test' | ||
|
||
|
||
def test_low_coverage(): | ||
scheme = 'heidelberg' | ||
fastq = 'tests/data/SRR1696752/SRR1696752.fastq' | ||
fastq = 'tests/data/subtype/SRR1696752/SRR1696752.fastq' | ||
st, df = subtype_reads(reads=fastq, genome_name=genome_name, scheme=scheme) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is "st" the standard for subtype for the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it should be fairly standard for a Subtype
object to just be named st
for brevity throughout the codebase.
|
||
|
||
def check_subtype_attrs(*sts): | ||
for a in attrs_to_check: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small thing, but perhaps a more descriptive variable name for a?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 9511966
biohansel/utils.py
Outdated
return version | ||
return None | ||
|
||
|
||
def collect_fastq_from_dir(input_directory: str) -> List[Union[str, Tuple[List[str], str]]]: | ||
fastqs = [] | ||
for x in os.listdir(input_directory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small thing, but perhaps a more descriptive variable name for x?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 51dafd0
biohansel/utils.py
Outdated
@@ -151,7 +128,6 @@ def collect_fasta_from_dir(input_directory: str) -> List[Tuple[str, str]]: | |||
return input_genomes | |||
|
|||
|
|||
NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')} | |||
|
|||
|
|||
def revcomp(s): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small thing, but perhaps a more descriptive variable name for "s"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also for "c"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, if the function returns a string, perhaps -> str is good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 47fa20f
biohansel/utils.py
Outdated
@@ -170,30 +146,67 @@ def is_gzipped(p: str) -> bool: | |||
return bool(re.match(r'^.+\.gz$', p)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small thing, but perhaps a more descriptive variable name for "p"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 47fa20f
Signed-off-by: pkruczkiewicz <[email protected]>
Signed-off-by: pkruczkiewicz <[email protected]>
biohansel/subtype/qc/checks.py
Outdated
from biohansel.subtype.qc.const import QC | ||
from biohansel.subtype.qc.utils import get_conflicting_tiles, get_num_pos_neg_tiles, get_mixed_subtype_tile_counts | ||
from biohansel.subtype.subtype import Subtype | ||
from biohansel.subtype.subtyping_params import SubtypingParams | ||
|
||
|
||
def is_overall_coverage_low(st: Subtype, df: pd.DataFrame, p: SubtypingParams) -> Tuple[Optional[str], Optional[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps a different name for "p"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 9511966
biohansel/subtype/qc/checks.py
Outdated
@@ -16,8 +16,8 @@ def is_overall_coverage_low(st: Subtype, df: pd.DataFrame, p: SubtypingParams) - | |||
or not st.is_fastq_input(): | |||
return None, None | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if having multiple return statements is appropriate? Perhaps, just have two variables that represent your return values, and just return those two variables right at the end of the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a matter of preference and what you feel comfortable doing. Multiple returns were used here to avoid lots of nesting (similar to using break
or continue
in a loop). Simple conditions are checked first to give an easy out before trying to do any more in-depth checks.
biohansel/subtype/qc/checks.py
Outdated
@@ -16,8 +16,8 @@ def is_overall_coverage_low(st: Subtype, df: pd.DataFrame, p: SubtypingParams) - | |||
or not st.is_fastq_input(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this above statement can be broken down into separate "if" statements without it being all in one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that condition wasn't very clear the way it was written, so I flipped the conditions to check for AND of positive things rather than OR of negatives. Also added a unit test for the function in 98cd34d
Signed-off-by: pkruczkiewicz <[email protected]>
|
||
from biohansel.subtype.qc.const import QC | ||
from biohansel.subtype.subtype import Subtype | ||
from biohansel.subtype.subtyping_params import SubtypingParams | ||
|
||
CHECKS = [is_missing_tiles, | ||
is_mixed_subtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
under the perform_quality_check function, the docstring for subtyping_params would need to be modified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in cc9e671
@@ -232,7 +233,7 @@ def is_maybe_intermediate_subtype(st: Subtype, df: pd.DataFrame, p: SubtypingPar | |||
num_pos_tiles, num_neg_tiles = get_num_pos_neg_tiles(st, df) | |||
obs = int(st.n_tiles_matching_all) | |||
exp = int(st.n_tiles_matching_all_expected) | |||
if (exp - obs) / exp <= p.max_perc_intermediate_tiles and conflicting_tiles.shape[0] == 0 and \ | |||
if (exp - obs) / exp <= params.max_intermediate_tiles and conflicting_tiles.shape[0] == 0 and \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this conditional statement is a bit long, perhaps breaking it off into smaller statements would be good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deconstructed the condition in cc9e671 to make it easier to see what's actually being tested.
from biohansel.subtype.subtype_stats import subtype_counts | ||
from biohansel.subtype.subtyping_params import SubtypingParams | ||
from biohansel.subtype.util import get_scheme_fasta, get_scheme_version, init_subtyping_params | ||
from biohansel.utils import find_inconsistent_subtypes | ||
|
||
|
||
def subtype_reads_samples(reads: List[Tuple[List[str], str]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for n_threads, is the number of threads always going to be 1 and not user-defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be user-defined. The default value is 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok yes my mistake
@@ -166,7 +167,7 @@ def parallel_query_contigs(input_genomes: List[Tuple[str, str]], | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you have df['scheme'], which value actually gets used? scheme_name or scheme, when both are defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, perhaps better variable names for refpositions and subtypes on line 138?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scheme_name
is used if both are defined. The order in the <var1> or <var2>
dictates which is used preferentially.
Better var names for refpositions
and subtypes
? Do you mean the x
and y
being used in the list comprehensions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok yeah that makes sense, yeah for 'x' and 'y'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually go with short var names for list and dict comprehensions since there's no chance that they can be accessed outside of the comprehension. I was hoping it was clear that the result from de-structuring Series.str.split
produces 2 elements and that what x
and y
are is clear from the name of the vars they end up in. The names can be changed if this is potentially confusing. It may be a good idea to extract this code into a separate function with support for custom scheme tile naming conventions since right now we're expecting a very specific format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh alright yeah that makes sense, I'm good either way, but I agree that putting it in another separate function would be useful
Signed-off-by: pkruczkiewicz <[email protected]>
Signed-off-by: pkruczkiewicz <[email protected]>
We have move the scheme creation work to a separate repo for the scheme creation tool BioCanon https://github.com/phac-nml/bioCanon |
To help make it more clear what's being changed or added for the scheme creation module, it would be a good idea to have a base branch to merge into so that's why
scheme-creation/_base
has been created.scheme-creation/<name of feature>
(e.g.scheme-creation/find-discriminatory-snvs
) and a pull request should be opened to merge intoscheme-creation/_base
instead ofdevelopment
.biohansel/subtype
vsbiohansel/create
,tests/subtype
vstests/create
,tests/data/subtype
vstests/data/create
) unless there is common function in some utility function.from biohansel.subtype.util import whatever
instead offrom util import whatever
; fully qualified path frombiohansel
)hansel
tobiohansel
to align with the name change frombio_hansel
tobiohansel
and to reduce confusion.