Merge pull request #15 from biomedbigdata/shuffle_pheno
added phenotype shuffle option to epiJSON
juli-p authored Nov 8, 2023
2 parents 00fb972 + 424127b commit 33177a2
Showing 6 changed files with 105 additions and 1 deletion.
2 changes: 2 additions & 0 deletions .clangd
@@ -0,0 +1,2 @@
CompileFlags:
Add: [-std=c++20]
5 changes: 4 additions & 1 deletion .gitignore
@@ -128,4 +128,7 @@ docker/data
__pycache__


test_out
test_out
debug_scripts
.cache
compile_commands.json
7 changes: 7 additions & 0 deletions README.md
@@ -37,6 +37,7 @@
- [Selecting the input](#selecting-the-input)
- [Filter selection](#filter-selection)
- [Output files](#output-files)
- [Shuffling the phenotype](#shuffling-the-phenotype)
- [Parameters for calculate_scores](#parameters-for-calculate_scores)
- [specify GWAS data](#specify-gwas-data)
- [specify the SNP set data](#specify-the-snp-set-data)
@@ -641,6 +642,12 @@ There is the special case where one might want to override the phenotype only fo
--make-macoed
```

### Shuffling the phenotype
The tool epiJSON can also shuffle the phenotype vector after processing all filters. This feature can be activated with the following flag:
```bash
--shuffle-phenotype
```

## Parameters for calculate_scores
With calculate_scores the user can apply the statistical models mentioned in *Local search* to custom data. The tool reads the GWAS data in JSON_EPIGEN format (can be created with epiJSON from various common file formats) and a list of SNP sets in CSV format. It calculates the requested scores for the sets and outputs everything in the same CSV format that NeEDL uses for its results.

50 changes: 50 additions & 0 deletions src/jobs/PlinkShufflePhenotype.cpp
@@ -0,0 +1,50 @@
//
// Created by juli on 08.11.23.
//

#include "PlinkShufflePhenotype.hpp"
#include "../util/TimeLogger.hpp"

namespace epi {

PlinkShufflePhenotype::PlinkShufflePhenotype(std::string input_path, std::string output_path) {
this->input_path = input_path;
this->output_path = output_path;
}


void PlinkShufflePhenotype::run(std::shared_ptr<DataModel> data) {
TimeLogger logger("shuffle phenotype");

Logger::logLine("Read fam file");
CSVParser ind_parser;
ind_parser.parse(input_path + ".fam", ' ');
if (ind_parser.num_columns() < 6) ind_parser.parse(input_path + ".fam", '\t');

// create a list of indices and shuffle them -> output file will get the phenotypes in that order
std::vector<size_t> index_list;
index_list.reserve(ind_parser.num_rows());
for(size_t i = 0; i < ind_parser.num_rows();++i) {
index_list.push_back(i);
}
std::shuffle(index_list.begin(), index_list.end(), data->random_device[omp_get_thread_num()]);


std::ofstream pheno_file(output_path + ".fam");

for (size_t i = 0; i < ind_parser.num_rows(); ++i) {
for (size_t col = 0; col < 5; ++col) {
pheno_file << ind_parser.cell(i, col) << '\t';
}
pheno_file << ind_parser.cell(index_list[i], 5) << '\n';
}
pheno_file.close();

// copy bim and bed file
std::filesystem::copy_file(input_path + ".bed", output_path + ".bed");
std::filesystem::copy_file(input_path + ".bim", output_path + ".bim");


logger.stop();
}
}
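
The permutation step in `PlinkShufflePhenotype::run` can be looked at in isolation. The following is a minimal standalone sketch, not the project's code: `FamRow` and `shuffle_phenotypes` are hypothetical names, the rows are assumed to be already parsed into six string columns, and a plain `std::mt19937` stands in for the per-thread generator taken from `DataModel`. It shows the same idea as above: shuffle a list of row indices, then write each row's first five columns together with the phenotype of the row the shuffled index points to.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// Hypothetical type: one parsed .fam row
// (family ID, individual ID, father, mother, sex, phenotype).
using FamRow = std::vector<std::string>;

// Hypothetical helper: returns a copy of `rows` in which only the phenotype
// column (index 5) is permuted across samples. Columns 0-4 stay with their row.
std::vector<FamRow> shuffle_phenotypes(const std::vector<FamRow>& rows,
                                       std::mt19937& rng) {
    // Every row must provide the six .fam columns before we index into it.
    for (const FamRow& row : rows) {
        assert(row.size() >= 6 && "a .fam row needs six columns");
    }

    // Build the identity permutation 0..n-1 and shuffle it.
    std::vector<std::size_t> index_list(rows.size());
    std::iota(index_list.begin(), index_list.end(), std::size_t{0});
    std::shuffle(index_list.begin(), index_list.end(), rng);

    // Row i keeps its own identifiers but receives the phenotype of row index_list[i].
    std::vector<FamRow> shuffled = rows;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        shuffled[i][5] = rows[index_list[i]][5];
    }
    return shuffled;
}
```

Because the phenotypes are only reassigned through a permutation, the number of cases and controls is unchanged; only the link between genotype and phenotype is broken. Seeding the generator explicitly (for example `std::mt19937 rng(42);`) makes a shuffled dataset reproducible.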
28 changes: 28 additions & 0 deletions src/jobs/PlinkShufflePhenotype.hpp
@@ -0,0 +1,28 @@
//
// Created by juli on 08.11.23.
//

#ifndef GENEPISEEKER_PLINKSHUFFLEPHENOTYPE_HPP
#define GENEPISEEKER_PLINKSHUFFLEPHENOTYPE_HPP

#include "Job.hpp"

namespace epi {

class PlinkShufflePhenotype : public Job {
public:
PlinkShufflePhenotype(std::string input_path, std::string output_path);
void run(std::shared_ptr<DataModel> data) override;

private:
std::string input_path;
std::string output_path;
};

} // epi

#ifdef HEADER_ONLY
#include "PlinkShufflePhenotype.cpp"
#endif

#endif //GENEPISEEKER_PLINKSHUFFLEPHENOTYPE_HPP
14 changes: 14 additions & 0 deletions test/model/src/epiJSON.cpp
@@ -36,6 +36,7 @@
#include "../../../src/jobs/PlinkConvertPlinkBinToMACOEDInput.hpp"
#include "../../../src/jobs/PlinkConvertJsonToPlink.hpp"
#include "../../../src/jobs/PlinkConvertPlinkBinToLINDENInput.hpp"
#include "../../../src/jobs/PlinkShufflePhenotype.hpp"

using namespace epi;

@@ -52,6 +53,9 @@ int main(int argc, char **argv) {
std::vector<std::string> override_phenotypes;
app.add_option("--override-phenotype", override_phenotypes, "This option can optionally be set to override the phenotype contained in the input file. This is helpful if cases and controls are in separate files. If used, it needs to be specified once for every input file. If you want to use this feature only for some input files, set to 'NO' to use the phenotype in the input file. Please use the same notation as used in .fam files (1 = control, 2 = case, -9/0 = missing, or numeric data for categorical/quantitative data)");

bool shuffle_phenotype = false;
app.add_flag("--shuffle-phenotype", shuffle_phenotype, "If this flag is set, the phenotype of all samples is shuffled.");

std::string output_directory;
app.add_option("--output-directory", output_directory,
"Output directory where all intermediate and final results should be written to. This should be empty prior to starting epiJSON.")->required();
@@ -290,12 +294,22 @@ int main(int argc, char **argv) {
current_input_file = outfile;
}

// optionally shuffle the phenotypes of all individuals
if (shuffle_phenotype) {
std::string outfile = output_directory + "shuffled_pheno";
seq.add(std::make_shared<epi::PlinkShufflePhenotype>(current_input_file, outfile));
seq.add(std::make_shared<epi::PlinkCollectDatasetStats>(outfile, phenotype, "shuffle_phenotype"));
seq.add(std::make_shared<epi::PlinkRemoveTempFiles>(current_input_file, std::vector<std::string>{".bim", ".bed", ".fam" }));
current_input_file = outfile;
}


// process disease SNPs if provided
if (!disease_snps_file.empty()) {
seq.add(std::make_shared<epi::DiseaseSNPReader>(disease_snps_file, current_input_file));
}


// create requested output
if (make_all_formats) {
make_json = make_bim_bed_fam = make_ped_map = make_tped_tfam = make_vcf = make_macoed = make_linden = true;