Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheme creation #52

Open
wants to merge 15 commits into
base: development
Choose a base branch
from
Open

Scheme creation #52

wants to merge 15 commits into from

Conversation

peterk87
Copy link
Contributor

@peterk87 peterk87 commented Jul 16, 2018

To help make it more clear what's being changed or added for the scheme creation module, it would be a good idea to have a base branch to merge into so that's why scheme-creation/_base has been created.

  • Scheme creation relevant feature or change branches should be named scheme-creation/<name of feature> (e.g. scheme-creation/find-discriminatory-snvs) and a pull request should be opened to merge into scheme-creation/_base instead of development.
  • Subtyping specific and scheme creation specific code, tests and test data will be separate (biohansel/subtype vs biohansel/create, tests/subtype vs tests/create, tests/data/subtype vs tests/data/create) unless there is common function in some utility function.
  • click will be used instead of argparse
  • absolute imports only (e.g. from biohansel.subtype.util import whatever instead of from util import whatever; fully qualified path from biohansel)
  • command-line application has been changed from hansel to biohansel to align with the name change from bio_hansel to biohansel and to reduce confusion.

biohansel/cli.py Outdated
def cli(verbose):
"""Subtype with a biohansel scheme or create a scheme for your organism of interest
"""
lvl = init_console_logger(verbose)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init_console_logger does not currently return any value for lvl, was it supposed to return something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d6d2270

logging.debug('Initialized logging with %s level', lvl)


def check_between_0_and_1_inclusive(ctx: click.Context,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is ctx or param actually used in this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's part of the interface for Click validation callback

raise click.BadParameter('value needs to be between 0.0 and 1.0 inclusive!')


def check_positive_number(ctx: click.Context,

This comment was marked as resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yup that makes sense

from .subtyping_params import SubtypingParams
NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')}
REGEX_FASTQ = re.compile(r'^(.+)\.(fastq|fq|fastqsanger)(\.gz)?$')
REGEX_FASTA = re.compile(r'^.+\.(fasta|fa|fna|fas)(\.gz)?$')


def does_file_exist(filepath: str, force: bool):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you are logging.warning in this function, is there any output that fills in {} after File, or does it just print out "File {} already exists..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. Replaced with an f-string in 7017458

biohansel/cli.py Outdated
subtyping_params=subtyping_params,
scheme_subtype_counts=scheme_subtype_counts,
n_threads=threads)
logging.info('Generated %s subtyping results from %s contigs samples', len(contigs_results), len(input_contigs))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps f strings would be good to keep things consistent, just a small thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and the f-string syntax is easier to read than the alternatives. Changed in c2fb9cf

@@ -1,14 +1,15 @@
# -*- coding: utf-8 -*-

import logging
from typing import List, Any, Optional, Tuple, Union
from typing import List, Any, Tuple, Union

import os
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"os" is a core module, perhaps it should come right after logging to maintain the proper import format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're all built-ins in that file actually, but they could definitely be organized a bit better. Changed in f1df3ad

`str.format`/`"%s" % (str)` replaced with f-strings
from .subtyping_params import SubtypingParams
LOG_FORMAT = '%(asctime)s %(levelname)s: %(message)s [in %(pathname)s:%(lineno)d]'

NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small thing, but perhaps a more descriptive variable name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NT_SUB is const nucleotide substitution dict for reverse complementing a nucleotide sequence. It's not really meant to be used by anything except the revcomp function. It's in the module scope so it isn't re-initialized each time the revcomp function is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 47fa20f. Added a comment as well.

logging.info(f'Searching dir "{input_directory}" for FASTQ files')
reads += collect_fastq_from_dir(input_directory)
if paired_reads:
for x in paired_reads:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small thing, but perhaps a more descriptive variable name for x?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 47fa20f

df_results = pd.read_table('tests/data/subtyping-results.tsv')
df_md = read_metadata_table('tests/data/subtype-metadata.tsv')
df_results = pd.read_table('tests/data/subtype/subtyping-results.tsv')
df_md = read_metadata_table('tests/data/subtype/subtype-metadata.tsv')
df_merged = merge_metadata_with_summary_results(df_results, df_md)
assert np.all(df_md.columns.isin(df_merged.columns))
for i, r in df_merged.iterrows():
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small thing, but perhaps a more descriptive variable name for i and r?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 9511966. The index was unused so renamed to _

@@ -48,7 +47,7 @@ def test_intermediate_subtype():
avg_tile_coverage=37.04102564102564,
qc_status=None,
qc_message=None)
df = pd.read_csv('tests/data/se_intermediate_subtype_df.csv')
df = pd.read_csv('tests/data/subtype/se_intermediate_subtype_df.csv')
p = init_subtyping_params(args=None, scheme=scheme)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small thing, but perhaps a more descriptive variable name for p?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 9511966


genome_name = 'test'


def test_low_coverage():
scheme = 'heidelberg'
fastq = 'tests/data/SRR1696752/SRR1696752.fastq'
fastq = 'tests/data/subtype/SRR1696752/SRR1696752.fastq'
st, df = subtype_reads(reads=fastq, genome_name=genome_name, scheme=scheme)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "st" the standard for subtype for the file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it should be fairly standard for a Subtype object to just be named st for brevity throughout the codebase.



def check_subtype_attrs(*sts):
for a in attrs_to_check:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small thing, but perhaps a more descriptive variable name for a?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 9511966

return version
return None


def collect_fastq_from_dir(input_directory: str) -> List[Union[str, Tuple[List[str], str]]]:
fastqs = []
for x in os.listdir(input_directory):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small thing, but perhaps a more descriptive variable name for x?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 51dafd0

@@ -151,7 +128,6 @@ def collect_fasta_from_dir(input_directory: str) -> List[Tuple[str, str]]:
return input_genomes


NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')}


def revcomp(s):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small thing, but perhaps a more descriptive variable name for "s"?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also for "c"?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, if the function returns a string, perhaps -> str is good?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 47fa20f

@@ -170,30 +146,67 @@ def is_gzipped(p: str) -> bool:
return bool(re.match(r'^.+\.gz$', p))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small thing, but perhaps a more descriptive variable name for "p"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 47fa20f

pkruczkiewicz added 2 commits August 9, 2018 14:36
Signed-off-by: pkruczkiewicz <[email protected]>
from biohansel.subtype.qc.const import QC
from biohansel.subtype.qc.utils import get_conflicting_tiles, get_num_pos_neg_tiles, get_mixed_subtype_tile_counts
from biohansel.subtype.subtype import Subtype
from biohansel.subtype.subtyping_params import SubtypingParams


def is_overall_coverage_low(st: Subtype, df: pd.DataFrame, p: SubtypingParams) -> Tuple[Optional[str], Optional[str]]:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps a different name for "p"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 9511966

@@ -16,8 +16,8 @@ def is_overall_coverage_low(st: Subtype, df: pd.DataFrame, p: SubtypingParams) -
or not st.is_fastq_input():
return None, None

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if having multiple return statements is appropriate? Perhaps, just have two variables that represent your return values, and just return those two variables right at the end of the function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a matter of preference and what you feel comfortable doing. Multiple returns were used here to avoid lots of nesting (similar to using break or continue in a loop). Simple conditions are checked first to give an easy out before trying to do any more in-depth checks.

@@ -16,8 +16,8 @@ def is_overall_coverage_low(st: Subtype, df: pd.DataFrame, p: SubtypingParams) -
or not st.is_fastq_input():
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this above statement can be broken down into separate "if" statements without it being all in one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that condition wasn't very clear the way it was written, so I flipped the conditions to check for AND of positive things rather than OR of negatives. Also added a unit test for the function in 98cd34d

Signed-off-by: pkruczkiewicz <[email protected]>

from biohansel.subtype.qc.const import QC
from biohansel.subtype.subtype import Subtype
from biohansel.subtype.subtyping_params import SubtypingParams

CHECKS = [is_missing_tiles,
is_mixed_subtype,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

under the perform_quality_check function, the docstring for subtyping_params would need to be modified

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in cc9e671

@@ -232,7 +233,7 @@ def is_maybe_intermediate_subtype(st: Subtype, df: pd.DataFrame, p: SubtypingPar
num_pos_tiles, num_neg_tiles = get_num_pos_neg_tiles(st, df)
obs = int(st.n_tiles_matching_all)
exp = int(st.n_tiles_matching_all_expected)
if (exp - obs) / exp <= p.max_perc_intermediate_tiles and conflicting_tiles.shape[0] == 0 and \
if (exp - obs) / exp <= params.max_intermediate_tiles and conflicting_tiles.shape[0] == 0 and \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this conditional statement is a bit long, perhaps breaking it off into smaller statements would be good?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deconstructed the condition in cc9e671 to make it easier to see what's actually being tested.

from biohansel.subtype.subtype_stats import subtype_counts
from biohansel.subtype.subtyping_params import SubtypingParams
from biohansel.subtype.util import get_scheme_fasta, get_scheme_version, init_subtyping_params
from biohansel.utils import find_inconsistent_subtypes


def subtype_reads_samples(reads: List[Tuple[List[str], str]],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for n_threads, is the number of threads always going to be 1 and not user-defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yes my mistake

@@ -166,7 +167,7 @@ def parallel_query_contigs(input_genomes: List[Tuple[str, str]],

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you have df['scheme'], which value actually gets used? scheme_name or scheme, when both are defined?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, perhaps better variable names for refpositions and subtypes on line 138?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scheme_name is used if both are defined. The order in the <var1> or <var2> dictates which is used preferentially.

Better var names for refpositions and subtypes? Do you mean the x and y being used in the list comprehensions?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yeah that makes sense, yeah for 'x' and 'y'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually go with short var names for list and dict comprehensions since there's no chance that they can be accessed outside of the comprehension. I was hoping it was clear that the result from de-structuring Series.str.split produces 2 elements and that what x and y are is clear from the name of the vars they end up in. The names can be changed if this is potentially confusing. It may be a good idea to extract this code into a separate function with support for custom scheme tile naming conventions since right now we're expecting a very specific format.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh alright yeah that makes sense, I'm good either way, but I agree that putting it in another separate function would be useful

@glabbe
Copy link
Collaborator

glabbe commented Nov 4, 2020

We have move the scheme creation work to a separate repo for the scheme creation tool BioCanon https://github.com/phac-nml/bioCanon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants