Scheme creation #52

peterk87 · 2018-07-16T16:51:41Z

To help make it more clear what's being changed or added for the scheme creation module, it would be a good idea to have a base branch to merge into so that's why scheme-creation/_base has been created.

Scheme creation relevant feature or change branches should be named scheme-creation/<name of feature> (e.g. scheme-creation/find-discriminatory-snvs) and a pull request should be opened to merge into scheme-creation/_base instead of development.
Subtyping specific and scheme creation specific code, tests and test data will be separate (biohansel/subtype vs biohansel/create, tests/subtype vs tests/create, tests/data/subtype vs tests/data/create) unless there is common function in some utility function.
click will be used instead of argparse
absolute imports only (e.g. from biohansel.subtype.util import whatever instead of from util import whatever; fully qualified path from biohansel)
command-line application has been changed from hansel to biohansel to align with the name change from bio_hansel to biohansel and to reduce confusion.

…sel create` Signed-off-by: pkruczkiewicz <[email protected]>

Signed-off-by: pkruczkiewicz <[email protected]>

gcttong · 2018-08-02T17:05:05Z

biohansel/cli.py

+def cli(verbose):
+    """Subtype with a biohansel scheme or create a scheme for your organism of interest
+    """
+    lvl = init_console_logger(verbose)


init_console_logger does not currently return any value for lvl, was it supposed to return something?

Fixed in d6d2270

gcttong · 2018-08-09T16:44:41Z

biohansel/cli.py

+    logging.debug('Initialized logging with %s level', lvl)
+
+
+def check_between_0_and_1_inclusive(ctx: click.Context,


is ctx or param actually used in this function?

It's part of the interface for Click validation callback

peterk87 · 2018-08-09T16:50:12Z

biohansel/cli.py

+        raise click.BadParameter('value needs to be between 0.0 and 1.0 inclusive!')
+
+
+def check_positive_number(ctx: click.Context,


See http://click.pocoo.org/6/options/#callbacks-for-validation

ok yup that makes sense

gcttong · 2018-08-09T18:23:31Z

biohansel/utils.py

-from .subtyping_params import SubtypingParams
+NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')}
+REGEX_FASTQ = re.compile(r'^(.+)\.(fastq|fq|fastqsanger)(\.gz)?$')
+REGEX_FASTA = re.compile(r'^.+\.(fasta|fa|fna|fas)(\.gz)?$')


 def does_file_exist(filepath: str, force: bool):


when you are logging.warning in this function, is there any output that fills in {} after File, or does it just print out "File {} already exists..."

Thanks for pointing that out. Replaced with an f-string in 7017458

gcttong · 2018-08-09T18:29:34Z

biohansel/cli.py

+                                                  subtyping_params=subtyping_params,
+                                                  scheme_subtype_counts=scheme_subtype_counts,
+                                                  n_threads=threads)
+        logging.info('Generated %s subtyping results from %s contigs samples', len(contigs_results), len(input_contigs))


Perhaps f strings would be good to keep things consistent, just a small thing

Yes and the f-string syntax is easier to read than the alternatives. Changed in c2fb9cf

gcttong · 2018-08-09T18:34:19Z

biohansel/utils.py

@@ -1,14 +1,15 @@
 # -*- coding: utf-8 -*-

 import logging
-from typing import List, Any, Optional, Tuple, Union
+from typing import List, Any, Tuple, Union

 import os


"os" is a core module, perhaps it should come right after logging to maintain the proper import format

They're all built-ins in that file actually, but they could definitely be organized a bit better. Changed in f1df3ad

`str.format`/`"%s" % (str)` replaced with f-strings

gcttong · 2018-08-09T18:53:55Z

biohansel/utils.py

-from .subtyping_params import SubtypingParams
+LOG_FORMAT = '%(asctime)s %(levelname)s: %(message)s [in %(pathname)s:%(lineno)d]'
+
+NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')}


just a small thing, but perhaps a more descriptive variable name?

NT_SUB is const nucleotide substitution dict for reverse complementing a nucleotide sequence. It's not really meant to be used by anything except the revcomp function. It's in the module scope so it isn't re-initialized each time the revcomp function is called.

Changed in 47fa20f. Added a comment as well.

gcttong · 2018-08-09T18:55:34Z

biohansel/utils.py

+        logging.info(f'Searching dir "{input_directory}" for FASTQ files')
+        reads += collect_fastq_from_dir(input_directory)
+    if paired_reads:
+        for x in paired_reads:


just a small thing, but perhaps a more descriptive variable name for x?

Changed in 47fa20f

gcttong · 2018-08-09T18:56:11Z

tests/subtype/test_metadata.py

-    df_results = pd.read_table('tests/data/subtyping-results.tsv')
-    df_md = read_metadata_table('tests/data/subtype-metadata.tsv')
+    df_results = pd.read_table('tests/data/subtype/subtyping-results.tsv')
+    df_md = read_metadata_table('tests/data/subtype/subtype-metadata.tsv')
    df_merged = merge_metadata_with_summary_results(df_results, df_md)
    assert np.all(df_md.columns.isin(df_merged.columns))
    for i, r in df_merged.iterrows():


just a small thing, but perhaps a more descriptive variable name for i and r?

Changed in 9511966. The index was unused so renamed to _

gcttong · 2018-08-09T18:56:35Z

tests/subtype/test_qc.py

@@ -48,7 +47,7 @@ def test_intermediate_subtype():
                 avg_tile_coverage=37.04102564102564,
                 qc_status=None,
                 qc_message=None)
-    df = pd.read_csv('tests/data/se_intermediate_subtype_df.csv')
+    df = pd.read_csv('tests/data/subtype/se_intermediate_subtype_df.csv')
    p = init_subtyping_params(args=None, scheme=scheme)


just a small thing, but perhaps a more descriptive variable name for p?

Changed in 9511966

gcttong · 2018-08-09T18:57:03Z

tests/subtype/test_qc.py


 genome_name = 'test'


 def test_low_coverage():
    scheme = 'heidelberg'
-    fastq = 'tests/data/SRR1696752/SRR1696752.fastq'
+    fastq = 'tests/data/subtype/SRR1696752/SRR1696752.fastq'
    st, df = subtype_reads(reads=fastq, genome_name=genome_name, scheme=scheme)


is "st" the standard for subtype for the file?

Yeah, it should be fairly standard for a Subtype object to just be named st for brevity throughout the codebase.

gcttong · 2018-08-09T18:58:12Z

tests/__init__.py

-
-
-def check_subtype_attrs(*sts):
-    for a in attrs_to_check:


just a small thing, but perhaps a more descriptive variable name for a?

Changed in 9511966

gcttong · 2018-08-09T18:58:47Z

biohansel/utils.py

-        return version
-    return None
-
-
 def collect_fastq_from_dir(input_directory: str) -> List[Union[str, Tuple[List[str], str]]]:
    fastqs = []
    for x in os.listdir(input_directory):


just a small thing, but perhaps a more descriptive variable name for x?

Changed in 51dafd0

gcttong · 2018-08-09T19:01:13Z

biohansel/utils.py

@@ -151,7 +128,6 @@ def collect_fasta_from_dir(input_directory: str) -> List[Tuple[str, str]]:
    return input_genomes


-NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')}


 def revcomp(s):


just a small thing, but perhaps a more descriptive variable name for "s"?

also for "c"?

also, if the function returns a string, perhaps -> str is good?

Changed in 47fa20f

gcttong · 2018-08-09T19:01:40Z

biohansel/utils.py

@@ -170,30 +146,67 @@ def is_gzipped(p: str) -> bool:
    return bool(re.match(r'^.+\.gz$', p))


just a small thing, but perhaps a more descriptive variable name for "p"?

Changed in 47fa20f

Signed-off-by: pkruczkiewicz <[email protected]>

gcttong · 2018-08-09T19:41:25Z

biohansel/subtype/qc/checks.py

+from biohansel.subtype.qc.const import QC
+from biohansel.subtype.qc.utils import get_conflicting_tiles, get_num_pos_neg_tiles, get_mixed_subtype_tile_counts
+from biohansel.subtype.subtype import Subtype
+from biohansel.subtype.subtyping_params import SubtypingParams


 def is_overall_coverage_low(st: Subtype, df: pd.DataFrame, p: SubtypingParams) -> Tuple[Optional[str], Optional[str]]:


perhaps a different name for "p"?

Changed in 9511966

gcttong · 2018-08-09T19:46:57Z

biohansel/subtype/qc/checks.py

@@ -16,8 +16,8 @@ def is_overall_coverage_low(st: Subtype, df: pd.DataFrame, p: SubtypingParams) -
            or not st.is_fastq_input():
        return None, None



not sure if having multiple return statements is appropriate? Perhaps, just have two variables that represent your return values, and just return those two variables right at the end of the function?

It's a matter of preference and what you feel comfortable doing. Multiple returns were used here to avoid lots of nesting (similar to using break or continue in a loop). Simple conditions are checked first to give an easy out before trying to do any more in-depth checks.

gcttong · 2018-08-09T19:47:49Z

biohansel/subtype/qc/checks.py

@@ -16,8 +16,8 @@ def is_overall_coverage_low(st: Subtype, df: pd.DataFrame, p: SubtypingParams) -
            or not st.is_fastq_input():


Perhaps this above statement can be broken down into separate "if" statements without it being all in one?

Yeah, that condition wasn't very clear the way it was written, so I flipped the conditions to check for AND of positive things rather than OR of negatives. Also added a unit test for the function in 98cd34d

Signed-off-by: pkruczkiewicz <[email protected]>

gcttong · 2018-08-09T20:18:18Z

biohansel/subtype/qc/__init__.py

-
+from biohansel.subtype.qc.const import QC
+from biohansel.subtype.subtype import Subtype
+from biohansel.subtype.subtyping_params import SubtypingParams

 CHECKS = [is_missing_tiles,
          is_mixed_subtype,


under the perform_quality_check function, the docstring for subtyping_params would need to be modified

Fixed in cc9e671

gcttong · 2018-08-09T20:20:06Z

biohansel/subtype/qc/checks.py

@@ -232,7 +233,7 @@ def is_maybe_intermediate_subtype(st: Subtype, df: pd.DataFrame, p: SubtypingPar
    num_pos_tiles, num_neg_tiles = get_num_pos_neg_tiles(st, df)
    obs = int(st.n_tiles_matching_all)
    exp = int(st.n_tiles_matching_all_expected)
-    if (exp - obs) / exp <= p.max_perc_intermediate_tiles and conflicting_tiles.shape[0] == 0 and \
+    if (exp - obs) / exp <= params.max_intermediate_tiles and conflicting_tiles.shape[0] == 0 and \


this conditional statement is a bit long, perhaps breaking it off into smaller statements would be good?

Deconstructed the condition in cc9e671 to make it easier to see what's actually being tested.

gcttong · 2018-08-09T20:22:02Z

biohansel/subtype/subtyper.py

+from biohansel.subtype.subtype_stats import subtype_counts
+from biohansel.subtype.subtyping_params import SubtypingParams
+from biohansel.subtype.util import get_scheme_fasta, get_scheme_version, init_subtyping_params
+from biohansel.utils import find_inconsistent_subtypes


 def subtype_reads_samples(reads: List[Tuple[List[str], str]],


for n_threads, is the number of threads always going to be 1 and not user-defined?

It can be user-defined. The default value is 1.

See https://github.com/phac-nml/biohansel/pull/52/files/9511966bb5c1825cc60001ba162d0f5045a8751a#diff-6bd3773d4492443bd26fad15adf979b0R184

Ok yes my mistake

gcttong · 2018-08-09T20:35:33Z

biohansel/subtype/subtyper.py

@@ -166,7 +167,7 @@ def parallel_query_contigs(input_genomes: List[Tuple[str, str]],



when you have df['scheme'], which value actually gets used? scheme_name or scheme, when both are defined?

Also, perhaps better variable names for refpositions and subtypes on line 138?

scheme_name is used if both are defined. The order in the <var1> or <var2> dictates which is used preferentially.

Better var names for refpositions and subtypes? Do you mean the x and y being used in the list comprehensions?

ok yeah that makes sense, yeah for 'x' and 'y'

I usually go with short var names for list and dict comprehensions since there's no chance that they can be accessed outside of the comprehension. I was hoping it was clear that the result from de-structuring Series.str.split produces 2 elements and that what x and y are is clear from the name of the vars they end up in. The names can be changed if this is potentially confusing. It may be a good idea to extract this code into a separate function with support for custom scheme tile naming conventions since right now we're expecting a very specific format.

oh alright yeah that makes sense, I'm good either way, but I agree that putting it in another separate function would be useful

Signed-off-by: pkruczkiewicz <[email protected]>

…shold

glabbe · 2020-11-04T14:11:15Z

We have move the scheme creation work to a separate repo for the scheme creation tool BioCanon https://github.com/phac-nml/bioCanon

pkruczkiewicz added 3 commits July 13, 2018 18:15

argparse -> click; bio_hansel -> biohansel; creating base for `biohan…

ab6bf7f

…sel create` Signed-off-by: pkruczkiewicz <[email protected]>

add create subcommand; restructure/refactor for different subcommands

1b6471a

Signed-off-by: pkruczkiewicz <[email protected]>

fix packaging and add ability to python setup.py test to run tests

79e29c9

Signed-off-by: pkruczkiewicz <[email protected]>

gcttong reviewed Aug 2, 2018

View reviewed changes

Update .travis.yml

68b5c7e