Merge pull request #45 from jaebeom-kim/master
Updates for revision
jaebeom-kim authored Dec 1, 2023
2 parents 50ce7dd + e5d3d47 commit a94e4f1
Showing 18 changed files with 225 additions and 310 deletions.
82 changes: 47 additions & 35 deletions README.md
@@ -16,7 +16,7 @@ In addition, it can classify reads against a database of any size as long as it
- Generate `taxonomyDB` during `build` and load it during `classify` workflow for faster loading of taxonomy information.
- Support gzipped FASTA/FASTQ files in `add-to-library` and `classify` workflows.
- Low-complexity filtering in the `build` workflow by default with `--mask-prob 0.9`.
## Installation
### Precompiled binaries
```
@@ -81,20 +81,25 @@ metabuli classify read_1.fna read_2.fna dbdir outdir jobid
# Single-end
metabuli classify --seq-mode 1 read.fna dbdir outdir jobid
# Long-read
metabuli classify --seq-mode 3 read.fna dbdir outdir jobid
* Important parameters:
-  --threads : The number of CPU-cores used (all by default)
--threads : The number of threads used (all by default)
--max-ram : The maximum RAM usage. (128 GiB by default)
-  --min-score : The minimum score to be classified (0.15 for precision mode)
-  --min-sp-score : The minimum score to be classified at or below species rank. (0.5 for precision mode)
--min-score : The minimum score to be classified
--min-sp-score : The minimum score to be classified at or below species rank.
--taxonomy-path: Directory where the taxonomy dump files are stored. (DBDIR/taxonomy by default)
--reduced-aa : 0. Use 20 alphabets or 1. Use 15 alphabets to encode amino acids.
Give the same value used for DB creation.
--accession-level : Set 1 to use accession level classification (0 by default).
It is available when the DB is also built with accession level taxonomy.
-  * Values of --min-score and --min-sp-score for precision mode are optimized only for short reads.
-  * We don't recommend using them for long reads.
```
- Parameters for precision mode (Metabuli-P):
- Illumina short reads: `--min-score 0.15 --min-sp-score 0.5`
- PacBio HiFi reads: `--min-score 0.07 --min-sp-score 0.3`
- PacBio Sequel II reads: `--min-score 0.005`
- ONT reads: `--min-score 0.008`

This will generate three result files: `JobID_classifications.tsv`, `JobID_report.tsv`, and `JobID_krona.html`.
#### JobID_classifications.tsv
@@ -103,16 +108,14 @@
3. Taxonomy identifier
4. Effective read length
5. DNA level identity score
-6. Amino-acid level identity score
-7. Total Hamming distance
-8. Classification Rank
-9. List of "taxID : k-mer match count"
6. Classification Rank
7. List of "taxID : k-mer match count"

```
# Example
-1 read_1 2688 294 0.627551 0.806122 35 subspecies 2688:65
-1 read_2 2688 294 0.816327 1 36 subspecies 2688:78
-0 read_3 0 294 0 0 0 no rank
1 read_1 2688 294 0.627551 subspecies 2688:65
1 read_2 2688 294 0.816327 subspecies 2688:78
0 read_3 0 294 0 no rank
```
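
For downstream scripts, each row of the new seven-column format can be split on tabs. The helper below is a minimal illustrative sketch, not part of Metabuli itself; the function name is ours, and unclassified reads may carry fewer fields than classified ones.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one tab-separated row of JobID_classifications.tsv into its
// fields (see the column list above). Unclassified reads may have no
// match list, so the caller should check fields.size().
std::vector<std::string> splitRow(const std::string &line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, '\t')) {
        fields.push_back(field);
    }
    return fields;  // fields[4] = DNA identity score, fields[5] = rank
}
```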

#### JobID_report.tsv
@@ -246,27 +249,36 @@ This will generate **diffIdx**, **info**, **split**, and **taxID_list** and some


## Example
-```
-
-Classifying RNA-seq reads from a COVID-19 patient to identify the culprit variant.
-The whole process must take less than 10 mins using a personal machine.
-
-1. Download RefSeq Virus DB (1.5 GiB)
-  metabuli databases RefSeq_virus refseq_virus tmp
-2. Download an RNA-seq result (SRR14484345) from this link
-  - https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR14484345&display=data-access
-3. Classify the reads using metabuli
-  metabuli classify SRR14484345_1.fq SRR14484345_2.fq refseq_virus RESULT_DIR JOB_ID --max-ram RAM_SIZE
-4. Check RESULT_DIR/JOB_ID_report.tsv
-  - Find a section like the example below
-  ...
-  92.1346 509945 492302 no rank 2697049 Severe acute respiratory syndrome coronavirus 2
-  3.1174 17254 17254 subspecies 3000001 SARS-CoV-2 beta
-  0.0558 309 309 subspecies 3000000 SARS-CoV-2 alpha
-  0.0065 36 36 subspecies 3000004 SARS-CoV-2 omicron
-  0.0045 25 25 subspecies 3000003 SARS-CoV-2 gamma
-  0.0034 19 19 subspecies 3000002 SARS-CoV-2 delta
-...
-```
#### 1. Download RefSeq Virus DB (1.5 GiB)
`metabuli databases RefSeq_virus refseq_virus tmp`

#### 2. Download an RNA-seq result (SRR14484345)
Option 1. Download using SRA Toolkit
```
fasterq-dump --split-files SRR14484345
```
Option 2. Download from a web browser in FASTQ format
- link: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR14484345&display=download
- If the downloaded file includes both R1 and R2, use the following command to split it.
```
cat SRR14484345.fastq | paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > SRR14484345_1.fq) | cut -f 5-8 | tr "\t" "\n" > SRR14484345_2.fq
```

#### 3. Classify the reads using metabuli
```
metabuli classify SRR14484345_1.fq SRR14484345_2.fq refseq_virus RESULT_DIR JOB_ID --max-ram RAM_SIZE
```
#### 4. Check RESULT_DIR/JOB_ID_report.tsv
Find a section like the example below
```
92.1796 510194 489403 no rank 2697049 Severe acute respiratory syndrome coronavirus 2
3.4290 18979 18979 subspecies 3000001 SARS-CoV-2 beta
0.2488 1377 1377 subspecies 3000003 SARS-CoV-2 gamma
0.0459 254 254 subspecies 3000000 SARS-CoV-2 alpha
0.0284 157 157 subspecies 3000004 SARS-CoV-2 omicron
0.0043 24 24 subspecies 3000002 SARS-CoV-2 delta
```
1 change: 1 addition & 0 deletions src/LocalCommandDeclarations.h
@@ -7,6 +7,7 @@ extern int updataDB(int argc, const char **argv, const Command& command);
extern int classify(int argc, const char **argv, const Command& command);
extern int filter(int argc, const char **argv, const Command& command);
extern int grade(int argc, const char **argv, const Command& command);
extern int gradeByCladeSize(int argc, const char **argv, const Command& command);
extern int seqHeader2TaxId(int argc, const char **argv, const Command& command);
extern int addToLibrary(int argc, const char **argv, const Command& command);
extern int applyThreshold(int argc, const char **argv, const Command& command);
1 change: 1 addition & 0 deletions src/commons/BitManipulateMacros.h
@@ -12,6 +12,7 @@
#define GET_4_BITS(num) (0XfU & (num))
#define GET_2_BITS(num) (signed char)(0x3U & (num))
#define GET_2_BITS_INT(num) (int)(0x3U & (num))
#define GET_24_BITS_UINT(num) (unsigned int)(0xffffffU & (num))
#define GET_2_BITS_FL(num) (float)(0x3U & (num))
#define GET_2_BITS_2(num) (signed char)(0xCu & (num))
#define GET_2_BITS_3(num) (signed char)(0x30u & (num))
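
The new `GET_24_BITS_UINT` macro masks everything but the lowest 24 bits; KmerMatcher.cpp (below) uses it to keep the DNA part of a matched target k-mer. A minimal sanity check of the mask, with an arbitrary example value of our own:

```cpp
#include <cassert>
#include <cstdint>

// Copy of the new macro: keep only the lowest 24 bits of a value.
#define GET_24_BITS_UINT(num) (unsigned int)(0xffffffU & (num))

int main() {
    uint64_t targetKmer = 0xABCDEF123456ULL;  // arbitrary example value
    // Only the right 24 bits survive the mask.
    assert(GET_24_BITS_UINT(targetKmer) == 0x123456U);
    return 0;
}
```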
7 changes: 0 additions & 7 deletions src/commons/Classifier.cpp
@@ -108,13 +108,6 @@ void Classifier::startClassify(const LocalParameters &par) {
processedSeqCnt += queryReadSplit[splitIdx].end - queryReadSplit[splitIdx].start;
cout << "The number of processed sequences: " << processedSeqCnt << " (" << (double) processedSeqCnt / (double) numOfSeq << ")" << endl;

// for (size_t i = 0; i < matchBuffer.startIndexOfReserve; i++) {
// cout << matchBuffer.buffer[i].queryId << " " << matchBuffer.buffer[i].splitIdx << " " <<
// matchBuffer.buffer[i].targetSplitIdx << " " << matchBuffer.buffer[i].targetId << " " <<
// genusTaxIdList[matchBuffer.buffer[i].targetId] << " " << speciesTaxIdList[matchBuffer.buffer[i].targetId] << " "
// << matchBuffer.buffer[i].position << " " << (int) matchBuffer.buffer[i].hamming << " " << taxIdList[matchBuffer.buffer[i].targetId] << endl;
// }

// Write classification results
reporter->writeReadClassification(queryList);
}
131 changes: 0 additions & 131 deletions src/commons/FileMerger.cpp
@@ -393,137 +393,6 @@ void FileMerger::mergeTargetFiles(const LocalParameters & par, int numOfSplits)
cout<<diffIdxSplitFileName<<endl;
}



/// It updates the target database. Most of this function is the same as 'mergeTargetFiles'; only some additional tasks are added to handle seqID.
/// It has not been tested yet (2020.01.28).
//void FileMerger::updateTargetDatabase(std::vector<char*> diffIdxFileNames, std::vector<char*> infoFileNames, vector<int> & taxListAtRank, vector<int> & taxIdList, const int & seqIdOffset) {
// size_t totalKmerCnt = 0;
// size_t writtenKmerCnt = 0;
//
// ///Files to write on & buffers to fill them
// FILE * mergedDiffFile = fopen(mergedDiffFileName, "wb");
// FILE * mergedInfoFile = fopen(mergedInfoFileName, "wb");
// uint16_t * diffBuffer = (uint16_t *)malloc(sizeof(uint16_t) * kmerBufSize); size_t diffBufferIdx = 0;
// TargetKmerInfo * infoBuffer = (TargetKmerInfo *)malloc(sizeof(TargetKmerInfo) * kmerBufSize); size_t infoBufferIdx = 0;
//
//
// ///Prepare files to merge
// size_t numOfSplitFiles = diffIdxFileNames.size();
// size_t numOfincompletedFiles = numOfSplitFiles;
// uint64_t lookingKmers[numOfSplitFiles];
// TargetKmerInfo lookingInfos[numOfSplitFiles];
// size_t diffFileIdx[numOfSplitFiles];
// memset(diffFileIdx, 0, sizeof(diffFileIdx));
// size_t infoFileIdx[numOfSplitFiles];
// memset(infoFileIdx, 0, sizeof(infoFileIdx));
// size_t maxIdxOfEachFiles[numOfSplitFiles];
// struct MmapedData<uint16_t> *diffFileList = new struct MmapedData<uint16_t>[numOfSplitFiles];
// struct MmapedData<TargetKmerInfo> *infoFileList = new struct MmapedData<TargetKmerInfo>[numOfSplitFiles];
// for (size_t file = 0; file < numOfSplitFiles; file++) {
// diffFileList[file] = mmapData<uint16_t>(diffIdxFileNames[file]);
// infoFileList[file] = mmapData<TargetKmerInfo>(infoFileNames[file]);
// maxIdxOfEachFiles[file] = diffFileList[file].fileSize / sizeof(uint16_t);
// }
//
// ///TODO check the for loop condition; size
// ///Fix sequence IDs of new k-mers
// for (int i = 1; i < numOfSplitFiles; i++){
// size_t size = (infoFileList[i].fileSize - 1) / sizeof(TargetKmerInfo);
// for(int j = 0; j < size + 1; j++){
// infoFileList[i].data[j].sequenceID += seqIdOffset;
// }
// }
//
// /// get the first k-mer to write
// for(size_t file = 0; file < numOfSplitFiles; file++){
// lookingKmers[file] = getNextKmer(0, diffFileList[file], diffFileIdx[file]);
// lookingInfos[file] = infoFileList[file].data[0];
// infoFileIdx[file] ++;
// }
//
// size_t idxOfMin = smallest(lookingKmers, lookingInfos, taxListAtRank, numOfSplitFiles);
// uint64_t lastWrittenKmer = 0;
// uint64_t lastKmer = lookingKmers[idxOfMin];
// TargetKmerInfo lastInfo = lookingInfos[idxOfMin];
//
// ///write first k-mer
// cre->getDiffIdx(0, lastKmer, mergedDiffFile, diffBuffer, diffBufferIdx);
// lastWrittenKmer = lastKmer;
// cre->writeInfo(&lastInfo, mergedInfoFile, infoBuffer, infoBufferIdx);;
// writtenKmerCnt++;
//
// int endFlag = 0;
// int hasSeenOtherStrains;
// int hasSeenOtherGenus;
// /// Check whether the end part works correctly
// while(true){
// ///update last k-mer
// lastKmer = lookingKmers[idxOfMin];
// lastInfo = lookingInfos[idxOfMin];
//
// ///update looking k-mers
// lookingKmers[idxOfMin] = getNextKmer(lastKmer, diffFileList[idxOfMin], diffFileIdx[idxOfMin]);
// lookingInfos[idxOfMin] = infoFileList[idxOfMin].data[infoFileIdx[idxOfMin]];
// infoFileIdx[idxOfMin] ++;
// if( diffFileIdx[idxOfMin] > maxIdxOfEachFiles[idxOfMin] ){
// lookingKmers[idxOfMin] = UINT64_MAX;
// numOfincompletedFiles--;
// if(numOfincompletedFiles == 0) break;
// }
// totalKmerCnt ++;
// idxOfMin = smallest(lookingKmers, lookingInfos, taxListAtRank, numOfSplitFiles);
//
// hasSeenOtherStrains = 0;
// hasSeenOtherGenus = 0;
//
// while(taxListAtRank[lastInfo.sequenceID] == taxListAtRank[lookingInfos[idxOfMin].sequenceID]){
// if(lastKmer != lookingKmers[idxOfMin]) break;
//
// hasSeenOtherStrains += (taxIdList[lastInfo.sequenceID] != taxIdList[lookingInfos[idxOfMin].sequenceID]);
//
// lookingKmers[idxOfMin] = getNextKmer(lastKmer, diffFileList[idxOfMin], diffFileIdx[idxOfMin]);
// lookingInfos[idxOfMin] = infoFileList[idxOfMin].data[infoFileIdx[idxOfMin]];
// infoFileIdx[idxOfMin] ++;
// totalKmerCnt ++;
//
// if(diffFileIdx[idxOfMin] > maxIdxOfEachFiles[idxOfMin]){
// lookingKmers[idxOfMin] = UINT64_MAX;
// numOfincompletedFiles--;
// if(numOfincompletedFiles == 0){
// endFlag = 1;
// break;
// }
// }
// idxOfMin = smallest(lookingKmers, lookingInfos, taxListAtRank, numOfSplitFiles);
// }
//
// lastInfo.redundancy = (hasSeenOtherStrains > 0 || lastInfo.redundancy);
// cre->getDiffIdx(lastWrittenKmer, lastKmer, mergedDiffFile, diffBuffer, diffBufferIdx);
// lastWrittenKmer = lastKmer;
// cre->writeInfo(&lastInfo, mergedInfoFile, infoBuffer, infoBufferIdx);
// writtenKmerCnt++;
//
// if(endFlag == 1) break;
// }
//
// cre->flushInfoBuf(infoBuffer, mergedInfoFile, infoBufferIdx);
// cre->flushKmerBuf(diffBuffer, mergedDiffFile, diffBufferIdx);
//
// free(diffBuffer);
// free(infoBuffer);
// fclose(mergedDiffFile);
// fclose(mergedInfoFile);
// for(size_t file = 0; file < numOfSplitFiles; file++){
// munmap(diffFileList[file].data, diffFileList[file].fileSize + 1);
// munmap(infoFileList[file].data, infoFileList[file].fileSize + 1);
// }
//
// cout<<"Creating target DB is done"<<endl;
// cout << "Total k-mer count : " << totalKmerCnt << endl;
// cout<<"Written k-mer count : " << writtenKmerCnt << endl;
//}

uint64_t FileMerger::getNextKmer(uint64_t lookingTarget, const struct MmapedData<uint16_t> & diffList, size_t & idx)
{
uint16_t fragment = 0;
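
For context, `getNextKmer` walks the 16-bit difference index to recover the next absolute k-mer from the gap to the previous one. The sketch below shows one plausible decoding scheme, assuming each fragment carries 15 payload bits with the high bit flagging the final fragment of a gap; the flag convention and function name are assumptions, not Metabuli's verified implementation.

```cpp
#include <cstdint>
#include <cstddef>

// Hedged sketch of variable-length delta decoding for a difference
// index like Metabuli's diffIdx: the gap to the previous k-mer is
// stored as 15-bit fragments, and we assume the high bit of a
// fragment marks the last piece of the current gap.
uint64_t decodeNextKmer(uint64_t previousKmer, const uint16_t *diffList, size_t &idx) {
    uint64_t diff = 0;
    uint16_t fragment = diffList[idx++];
    while ((fragment & 0x8000u) == 0) {  // continuation: flag bit unset
        diff = (diff << 15u) | fragment;
        fragment = diffList[idx++];
    }
    diff = (diff << 15u) | (fragment & 0x7FFFu);
    return previousKmer + diff;
}
```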
20 changes: 15 additions & 5 deletions src/commons/KmerMatcher.cpp
@@ -1,8 +1,10 @@
#include "KmerMatcher.h"
#include "BitManipulateMacros.h"
#include "IndexCreator.h"
#include "Kmer.h"
#include "Mmap.h"
#include <ostream>
#include <vector>

KmerMatcher::KmerMatcher(const LocalParameters & par,
NcbiTaxonomy * taxonomy) {
@@ -231,6 +233,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
std::vector<uint8_t> selectedHammingSum;
std::vector<size_t> selectedMatches;
std::vector<uint16_t> selectedHammings;
std::vector<uint32_t> selectedDnaEncodings;
size_t posToWrite;

int currMatchNum;
@@ -293,6 +296,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
matches[matchCnt] = {queryKmerList[j].info,
candidateKmerInfos[idx].sequenceID,
taxId2speciesId[candidateKmerInfos[idx].sequenceID],
selectedDnaEncodings[k],
selectedHammings[k],
selectedHammingSum[k],
(bool) candidateKmerInfos[idx].redundancy};
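
Judging from the brace-initializers in this hunk, the `Match` record gained a DNA-encoding member between the species ID and the Hamming fields. A hypothetical reconstruction follows; every member name except `dnaEncoding` is inferred from the initializer expressions, not copied from Metabuli's headers.

```cpp
#include <cstdint>

using TaxID = int;  // assumed alias for taxonomy identifiers

// Hypothetical sketch of the Match record, inferred from the order of
// the initializer lists in this diff; not Metabuli's real definition.
struct MatchSketch {
    uint64_t queryInfo;        // queryKmerList[j].info
    TaxID    targetId;         // candidateKmerInfos[idx].sequenceID
    TaxID    speciesId;        // taxId2speciesId[...]
    uint32_t dnaEncoding;      // new: right 24 bits of the target k-mer
    uint16_t rightEndHamming;  // selectedHammings[k]
    uint8_t  hammingSum;       // selectedHammingSum[k]
    bool     redundancy;       // candidateKmerInfos[idx].redundancy
};
```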
@@ -303,11 +307,12 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
selectedMatches.clear();
selectedHammingSum.clear();
selectedHammings.clear();
selectedDnaEncodings.clear();

// Reuse the candidate target k-mers to compare in DNA level if queries are the same at amino acid level but not at DNA level
if (currentQueryAA == AminoAcidPart(queryKmerList[j].ADkmer)) {
compareDna(queryKmerList[j].ADkmer, candidateTargetKmers, selectedMatches,
-selectedHammingSum, selectedHammings,queryKmerList[j].info.frame);
selectedHammingSum, selectedHammings, selectedDnaEncodings, queryKmerList[j].info.frame);
currMatchNum = selectedMatches.size();

// If local buffer is full, copy them to the shared buffer.
@@ -334,6 +339,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
matches[matchCnt] = {queryKmerList[j].info,
candidateKmerInfos[idx].sequenceID,
taxId2speciesId[candidateKmerInfos[idx].sequenceID],
selectedDnaEncodings[k],
selectedHammings[k],
selectedHammingSum[k],
(bool) candidateKmerInfos[idx].redundancy};
@@ -401,7 +407,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI

// Compare the current query and the loaded target k-mers and select
compareDna(currentQuery, candidateTargetKmers, selectedMatches, selectedHammingSum,
-selectedHammings, queryKmerList[j].info.frame);
selectedHammings, selectedDnaEncodings, queryKmerList[j].info.frame);

// If local buffer is full, copy them to the shared buffer.
currMatchNum = selectedMatches.size();
@@ -429,6 +435,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
matches[matchCnt] = {queryKmerList[j].info,
candidateKmerInfos[idx].sequenceID,
taxId2speciesId[candidateKmerInfos[idx].sequenceID],
selectedDnaEncodings[k],
selectedHammings[k],
selectedHammingSum[k],
(bool) candidateKmerInfos[idx].redundancy};
@@ -495,8 +502,9 @@ void KmerMatcher::compareDna(uint64_t query,
std::vector<uint64_t> &targetKmersToCompare,
std::vector<size_t> &selectedMatches,
std::vector<uint8_t> &selectedHammingSum,
-std::vector<uint16_t> &selectedHammings, uint8_t frame) {
std::vector<uint16_t> &selectedHammings,
std::vector<uint32_t> &selectedDnaEncodings,
uint8_t frame) {
size_t size = targetKmersToCompare.size();
auto *hammingSums = new uint8_t[size + 1];
uint8_t currentHammingSum;
@@ -516,11 +524,13 @@ void KmerMatcher::compareDna(uint64_t query,
if (hammingSums[h] <= min(minHammingSum * 2, 7)) {
selectedMatches.push_back(h);
selectedHammingSum.push_back(hammingSums[h]);
-if (frame < 3) {
if (frame < 3) { // Frame of query k-mer
selectedHammings.push_back(getHammings(query, targetKmersToCompare[h]));
} else {
selectedHammings.push_back(getHammings_reverse(query, targetKmersToCompare[h]));
}
// Store right 24 bits of the target k-mer in selectedDnaEncodings
selectedDnaEncodings.push_back(GET_24_BITS_UINT(targetKmersToCompare[h]));
}
}
delete[] hammingSums;
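
Taken together, this hunk makes `compareDna` record an extra output per accepted candidate: the 24-bit DNA encoding. Below is a condensed, self-contained sketch of that acceptance loop; the Hamming helper is a simplified stand-in rather than Metabuli's real scoring, and the per-position Hamming vector is omitted for brevity.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified stand-in for Metabuli's Hamming scoring: count differing
// 2-bit nucleotide codes in the low 24 bits (an assumption for this
// sketch, not the real per-codon metric).
static uint8_t hammingSum(uint64_t query, uint64_t target) {
    uint32_t diff = static_cast<uint32_t>((query ^ target) & 0xFFFFFFu);
    uint8_t sum = 0;
    for (int i = 0; i < 12; ++i) {  // 12 two-bit codes in 24 bits
        sum += ((diff >> (2 * i)) & 0x3u) ? 1 : 0;
    }
    return sum;
}

// Mirror of the acceptance test above: keep every candidate whose
// Hamming sum is within min(2 * best, 7), and record its index, score,
// and 24-bit DNA encoding.
void selectMatches(uint64_t query,
                   const std::vector<uint64_t> &targets,
                   std::vector<size_t> &selectedMatches,
                   std::vector<uint8_t> &selectedHammingSum,
                   std::vector<uint32_t> &selectedDnaEncodings) {
    std::vector<uint8_t> sums(targets.size());
    uint8_t best = UINT8_MAX;
    for (size_t i = 0; i < targets.size(); ++i) {
        sums[i] = hammingSum(query, targets[i]);
        best = std::min(best, sums[i]);
    }
    const int cutoff = std::min(best * 2, 7);
    for (size_t i = 0; i < targets.size(); ++i) {
        if (sums[i] <= cutoff) {
            selectedMatches.push_back(i);
            selectedHammingSum.push_back(sums[i]);
            selectedDnaEncodings.push_back(
                static_cast<uint32_t>(targets[i] & 0xFFFFFFu));
        }
    }
}
```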
