A Python tool to compare the contents of two tab-delimited files.

theiavalidate

Note: this repository is undergoing active development. Check back for updates.

Docker

We recommend using our Docker image to run this tool.

docker pull us-docker.pkg.dev/general-theiagen/theiagen/theiavalidate:0.1.0

Usage

usage: python3 theiavalidate.py table1 table2 [options]

This tool compares two tab-delimited files and outputs a report of the differences between the two files.

positional arguments:
  table1  the first table to compare
  table2  the second table to compare

optional arguments:
  -h, --help
          show this help message and exit
  -v, --version
          show program's version number and exit
  -c, --columns_to_compare
          a comma-separated list of columns to compare
          required for a successful run
  -m, --validation_criteria
          a tab-delimited file containing the validation criteria to check
  -l, --column_translation
          a tab-delimited file that links column names between the two tables
  -o, --output_prefix
          the output file name prefix
          do not include any spaces
  -n, --na_values
          the values that should be considered NA
          default values = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', '', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', 'None']
  --verbose
          increase stdout verbosity
  --debug
          increase stdout verbosity to debug; overwrites --verbose

Inputs Explained

See also the examples folder for example inputs.

Required: table1 and table2

These are the two TSV files that will be examined. The order of the tables does not matter.

CAUTION: both tables must contain exactly the same samples, identified by the values in the first column. If either table contains a sample that the other lacks, the script will fail.
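
Before running the tool, it can help to confirm that both tables list the same samples. The following is a minimal sketch using only the standard library; the helper names and the inline data are hypothetical and are not part of theiavalidate.

```python
import csv
import io


def sample_names(tsv_text):
    """Return the values in the first column of a TSV, skipping the header row."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return [row[0] for row in rows[1:] if row]


def sample_mismatches(table1_text, table2_text):
    """Return (only_in_table1, only_in_table2) as sorted lists of sample names."""
    names1 = set(sample_names(table1_text))
    names2 = set(sample_names(table2_text))
    return sorted(names1 - names2), sorted(names2 - names1)


# Hypothetical example data, not taken from the examples folder.
table1 = "sample\tassembly_length\nsampleA\t100\nsampleB\t200\n"
table2 = "sample\tassembly_length\nsampleA\t100\nsampleC\t300\n"

only1, only2 = sample_mismatches(table1, table2)
print(only1, only2)
```

If both lists are empty, the two tables contain the same sample names.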

Required: columns_to_compare

The columns_to_compare variable determines which columns will be examined. This is a comma-separated list, such as: "assembly_length,est_coverage,gambit_predicted_taxon". The order of the columns does not matter. All columns not listed will be ignored.

Optional: validation_criteria

An example validation_criteria.tsv file is shown below. The first column is the column name in the two tables. The second column is the validation criterion to use for that column. This file expects a header and is tab-delimited.

CAUTION: Any column named in this file must also be listed in columns_to_compare for its validation criterion to be applied.

column_name     validation_criteria
column1         EXACT
column2         SET
column3         0.01

Currently implemented validation criteria include:

validation_criteria explanation
EXACT The values in the two columns must be exactly the same; in this case [foo,bar] != [bar,foo]. When applied to columns referencing files, file contents will be compared to check if they are identical.
SET The values in the two columns must be the same set of values; in this case [foo,bar] == [bar,foo]. When applied to columns referencing files, the lines within the files will be sorted alphabetically before comparing.
<FLOAT> The values in the two columns must be within <FLOAT>*100 percent of each other; e.g., 0.3 -> a 30% difference is allowed.
IGNORE The values in the two columns are assumed to match; in this case foo == bar.
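
The four criteria above can be sketched as a single comparison function. This is an illustration only: the function name is hypothetical, and both the comma separator for list-like values and the percent-difference formula (difference relative to the larger magnitude) are assumptions, not theiavalidate's actual implementation.

```python
def values_match(value1, value2, criterion):
    """Return True if two cell values satisfy the given criterion."""
    if criterion == "IGNORE":
        # Values are assumed to match regardless of content.
        return True
    if criterion == "EXACT":
        return value1 == value2
    if criterion == "SET":
        # Assumption: list-like cells are comma-separated.
        return set(value1.split(",")) == set(value2.split(","))
    # Anything else is treated as a <FLOAT> threshold.
    threshold = float(criterion)
    a, b = float(value1), float(value2)
    if a == b:
        return True
    # Assumption: the difference is measured relative to the larger magnitude.
    return abs(a - b) / max(abs(a), abs(b)) <= threshold


print(values_match("foo,bar", "bar,foo", "EXACT"))  # order matters
print(values_match("foo,bar", "bar,foo", "SET"))    # order is ignored
print(values_match("100", "120", "0.3"))            # within 30%
print(values_match("foo", "bar", "IGNORE"))         # always passes
```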

Optional: column_translation

An example column_translation.tsv file is shown below. The first column is the column name in one table, and the second column is the corresponding column name in the other table. All columns with the name in the first column will be renamed to match the corresponding column name in the second column. This file has no header and is tab-delimited.

column_name1_table1    column_name1_table2
column_name2_table1    column_name2_table2
original_column_name   new_column_name

For example, if table1 has a column named column_name1_table1, it will be renamed to column_name1_table2 in all outputs and comparisons.
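
The renaming step can be pictured as a dictionary lookup over the table header. The sketch below uses only the standard library; the function names are hypothetical, and the tool itself may perform the renaming differently.

```python
import csv
import io


def load_translation(tsv_text):
    """Parse a two-column, headerless TSV into {old_name: new_name}."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def rename_columns(header, translation):
    """Rename any header entry found in the translation; leave others as-is."""
    return [translation.get(name, name) for name in header]


translation = load_translation(
    "column_name1_table1\tcolumn_name1_table2\n"
    "column_name2_table1\tcolumn_name2_table2\n"
)
header = ["sample", "column_name1_table1", "other_column"]
renamed = rename_columns(header, translation)
print(renamed)
```

Columns that do not appear in the first column of the translation file keep their original names.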

Optional: output_prefix

The output_prefix variable is a string that will prefix all output files. Do not include any whitespace. The default is theiavalidate.

Optional: na_values

The na_values variable is a list of values that should be considered NA by pandas. The default list differs from the default na_values list used by pandas because some outputs are legitimately "NA" and should not be treated as missing data. Only the values in this list will be replaced with pandas.NA or numpy.nan in the output files and comparisons.
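
The effect of the default list can be illustrated without pandas. In the sketch below, the list is copied from the usage text above, while the helper name is hypothetical and None stands in for the missing-value marker. In pandas, a custom list like this is usually supplied through the na_values and keep_default_na parameters of read_csv; whether theiavalidate does exactly that is an assumption.

```python
# Default NA values, copied from the usage text.
NA_VALUES = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A',
             'N/A', 'n/a', '', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan',
             '-nan', 'None']


def mask_na(cells, na_values=NA_VALUES):
    """Replace cells found in na_values with None; keep everything else."""
    return [None if cell in na_values else cell for cell in cells]


row = ["NA", "N/A", "", "42"]
masked = mask_na(row)
print(masked)
```

Because the string "NA" is not in the list, it is kept as a real value, while "N/A" and the empty string are treated as missing.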

Optional: verbose and debug

These two options increase the verbosity of the logging system to INFO and DEBUG, respectively. DEBUG produces far more output than INFO and may be excessive for non-debugging purposes. If both --debug and --verbose are present, --debug takes precedence. If no verbosity is specified, the logging level is set to ERROR.
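
The precedence described above can be sketched with argparse and the standard logging module. The function name is hypothetical, and this is not the tool's own argument parser.

```python
import argparse
import logging


def resolve_log_level(argv):
    """Map --verbose/--debug flags to a logging level; --debug wins."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--debug", action="store_true")
    args = parser.parse_args(argv)
    if args.debug:
        return logging.DEBUG
    if args.verbose:
        return logging.INFO
    return logging.ERROR


level = resolve_log_level(["--verbose", "--debug"])
logging.basicConfig(level=level)
print(logging.getLevelName(level))
```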

Outputs Explained

See also the examples folder for example outputs.

Alternatively, copy and paste the following command inside the Docker image to generate the example outputs.

theiavalidate.py \
  theiavalidate/examples/example-table1.tsv \
  theiavalidate/examples/example-table2.tsv \
  -c "assembly_length,gambit_predicted_taxon,amrfinderplus_amr_core_genes,extra_column" \
  -l theiavalidate/examples/example-column_translation.tsv \
  -m theiavalidate/examples/example-validation_criteria.tsv \
  -o example-output

filtered_<table1_name> and filtered_<table2_name>

These files are the original input files reduced to the columns specified in columns_to_compare, with columns renamed as specified in the column_translation.tsv file. They are provided so that the user can see which columns are being compared and can manually inspect the original data.

<output_prefix>_exact_differences.tsv

This file is a tab-delimited file containing all rows and columns specified in columns_to_compare. The only values in this file are the values that are not exactly the same between the two tables.
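
Conceptually, the exact comparison keeps only the cells whose values differ between the two tables. The sketch below illustrates the idea with plain dictionaries; the function name, the data, and the shape of the result are hypothetical and do not reflect the layout of the actual output file.

```python
def exact_differences(table1, table2, columns):
    """Return {sample: {column: (value1, value2)}} for cells that differ.

    Each table maps a sample name to a {column: value} dictionary.
    """
    differences = {}
    for sample, row1 in table1.items():
        row2 = table2[sample]
        for column in columns:
            if row1[column] != row2[column]:
                differences.setdefault(sample, {})[column] = (
                    row1[column],
                    row2[column],
                )
    return differences


# Hypothetical example data.
table1 = {
    "sampleA": {"assembly_length": "100", "est_coverage": "30"},
    "sampleB": {"assembly_length": "200", "est_coverage": "40"},
}
table2 = {
    "sampleA": {"assembly_length": "100", "est_coverage": "30"},
    "sampleB": {"assembly_length": "205", "est_coverage": "40"},
}

diffs = exact_differences(table1, table2, ["assembly_length", "est_coverage"])
print(diffs)
```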

<output_prefix>_validation_criteria_differences.tsv

NOTE: This file is only provided if a validation_criteria.tsv file is provided. This file is a tab-delimited file containing all rows and columns specified in columns_to_compare. The only values in this file are the values that do not meet the validation criteria specified in the validation_criteria.tsv file.

<output_prefix>_summary.html and <output_prefix>_summary.pdf

This file (available as both HTML and PDF) summarizes the differences between the two tables. It contains the following information:

  • the date theiavalidate.py was run
  • as rows, the columns specified in columns_to_compare
  • as columns:
    • the number of rows in table1 that have values
    • the number of rows in table2 that have values
    • the number of differences (exact match)
    • the corresponding validation criteria (if provided)
    • the number of samples failing the validation criteria

If a validation_criteria.tsv file was provided, definitions of the (currently implemented) validation criteria are provided at the bottom of the table.

<sample>_<column>_diff.txt

Shows the differing lines within mismatching files for a given sample and column. Each pair of mismatching files generates a separate file.
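
A line-level comparison of two files can be sketched with the standard difflib module. The exact format of the tool's diff files is not described here, so the unified-diff layout, the function name, and the file labels below are assumptions.

```python
import difflib


def diff_lines(text1, text2, name1="table1_file", name2="table2_file"):
    """Return unified-diff lines describing how text2 differs from text1."""
    return list(
        difflib.unified_diff(
            text1.splitlines(),
            text2.splitlines(),
            fromfile=name1,
            tofile=name2,
            lineterm="",
        )
    )


# Hypothetical file contents for one sample and column.
diff = diff_lines("geneA\ngeneB\n", "geneA\ngeneC\n")
print("\n".join(diff))
```

Lines starting with "-" appear only in the first file, and lines starting with "+" appear only in the second.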