Merge pull request #45 from jaebeom-kim/master
Updates for revision
jaebeom-kim authored Dec 1, 2023
2 parents 50ce7dd + e5d3d47 commit a94e4f1
Showing 18 changed files with 225 additions and 310 deletions.
82 changes: 47 additions & 35 deletions README.md
@@ -16,7 +16,7 @@ In addition, it can classify reads against a database of any size as long as it
- Generate `taxonomyDB` during `build` and load it during `classify` workflow for faster loading of taxonomy information.
- Support gzipped FASTA/FASTQ files in `add-to-library` and `classify` workflows.
- Low-complexity filtering in the `build` workflow by default with `--mask-prob 0.9`.
## Installation
### Precompiled binaries
```
@@ -81,20 +81,25 @@ metabuli classify read_1.fna read_2.fna dbdir outdir jobid
# Single-end
metabuli classify --seq-mode 1 read.fna dbdir outdir jobid
# Long-read
metabuli classify --seq-mode 3 read.fna dbdir outdir jobid
* Important parameters:
-  --threads : The number of CPU-cores used (all by default)
--threads : The number of threads used (all by default)
--max-ram : The maximum RAM usage. (128 GiB by default)
-  --min-score : The minimum score to be classified (0.15 for precision mode)
-  --min-sp-score : The minimum score to be classified at or below species rank. (0.5 for precision mode)
--min-score : The minimum score to be classified
--min-sp-score : The minimum score to be classified at or below species rank.
--taxonomy-path: Directory where the taxonomy dump files are stored. (DBDIR/taxonomy by default)
--reduced-aa : 0. Use 20 alphabets or 1. Use 15 alphabets to encode amino acids.
Give the same value used for DB creation.
--accession-level : Set 1 to use accession level classification (0 by default).
It is available when the DB is also built with accession level taxonomy.
-  * Values of --min-score and --min-sp-score for precision mode are optimized only for short reads.
-  * We don't recommend using them for long reads.
```
- Parameters for precision mode (Metabuli-P):
- Illumina short reads: `--min-score 0.15 --min-sp-score 0.5`
- PacBio HiFi reads: `--min-score 0.07 --min-sp-score 0.3`
- PacBio Sequel II reads: `--min-score 0.005`
- ONT reads: `--min-score 0.008`

This will generate three result files: `JobID_classifications.tsv`, `JobID_report.tsv`, and `JobID_krona.html`.
#### JobID_classifications.tsv
@@ -103,16 +108,14 @@
3. Taxonomy identifier
4. Effective read length
5. DNA level identity score
-6. Amino-acid level identity score
-7. Total Hamming distance
-8. Classification Rank
-9. List of "taxID : k-mer match count"
6. Classification Rank
7. List of "taxID : k-mer match count"

```
# Example
-1 read_1 2688 294 0.627551 0.806122 35 subspecies 2688:65
-1 read_2 2688 294 0.816327 1 36 subspecies 2688:78
-0 read_3 0 294 0 0 0 no rank
1 read_1 2688 294 0.627551 subspecies 2688:65
1 read_2 2688 294 0.816327 subspecies 2688:78
0 read_3 0 294 0 no rank
```
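
For downstream scripts, each row of the new seven-column format can be split on tabs. The helper below is a minimal illustrative sketch, not part of Metabuli itself; the function name is ours, and unclassified reads may carry fewer fields than classified ones.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one tab-separated row of JobID_classifications.tsv into its
// fields (see the column list above). Unclassified reads may have no
// match list, so the caller should check fields.size().
std::vector<std::string> splitRow(const std::string &line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, '\t')) {
        fields.push_back(field);
    }
    return fields;  // fields[4] = DNA identity score, fields[5] = rank
}
```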

#### JobID_report.tsv
@@ -246,27 +249,36 @@ This will generate **diffIdx**, **info**, **split**, and **taxID_list** and some


## Example
-```
-
-Classifying RNA-seq reads from a COVID-19 patient to identify the culprit variant.
-The whole process must take less than 10 mins using a personal machine.
-
-1. Download RefSeq Virus DB (1.5 GiB)
-  metabuli databases RefSeq_virus refseq_virus tmp
-2. Download an RNA-seq result (SRR14484345) from this link
-  - https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR14484345&display=data-access
-3. Classify the reads using metabuli
-  metabuli classify SRR14484345_1.fq SRR14484345_2.fq refseq_virus RESULT_DIR JOB_ID --max-ram RAM_SIZE
-4. Check RESULT_DIR/JOB_ID_report.tsv
-  - Find a section like the example below
-  ...
-  92.1346 509945 492302 no rank 2697049 Severe acute respiratory syndrome coronavirus 2
-  3.1174 17254 17254 subspecies 3000001 SARS-CoV-2 beta
-  0.0558 309 309 subspecies 3000000 SARS-CoV-2 alpha
-  0.0065 36 36 subspecies 3000004 SARS-CoV-2 omicron
-  0.0045 25 25 subspecies 3000003 SARS-CoV-2 gamma
-  0.0034 19 19 subspecies 3000002 SARS-CoV-2 delta
-...
-```
#### 1. Download RefSeq Virus DB (1.5 GiB)
`metabuli databases RefSeq_virus refseq_virus tmp`

#### 2. Download an RNA-seq result (SRR14484345)
Option 1. Download using SRA Toolkit
```
fasterq-dump --split-files SRR14484345
```
Option 2. Download from a web browser in FASTQ format
- link: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR14484345&display=download
- If the downloaded file includes both R1 and R2, use the following command to split it.
```
cat SRR14484345.fastq | paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > SRR14484345_1.fq) | cut -f 5-8 | tr "\t" "\n" > SRR14484345_2.fq
```

#### 3. Classify the reads using metabuli
```
metabuli classify SRR14484345_1.fq SRR14484345_2.fq refseq_virus RESULT_DIR JOB_ID --max-ram RAM_SIZE
```
#### 4. Check RESULT_DIR/JOB_ID_report.tsv
Find a section like the example below
```
92.1796 510194 489403 no rank 2697049 Severe acute respiratory syndrome coronavirus 2
3.4290 18979 18979 subspecies 3000001 SARS-CoV-2 beta
0.2488 1377 1377 subspecies 3000003 SARS-CoV-2 gamma
0.0459 254 254 subspecies 3000000 SARS-CoV-2 alpha
0.0284 157 157 subspecies 3000004 SARS-CoV-2 omicron
0.0043 24 24 subspecies 3000002 SARS-CoV-2 delta
```
1 change: 1 addition & 0 deletions src/LocalCommandDeclarations.h
@@ -7,6 +7,7 @@ extern int updataDB(int argc, const char **argv, const Command& command);
extern int classify(int argc, const char **argv, const Command& command);
extern int filter(int argc, const char **argv, const Command& command);
extern int grade(int argc, const char **argv, const Command& command);
extern int gradeByCladeSize(int argc, const char **argv, const Command& command);
extern int seqHeader2TaxId(int argc, const char **argv, const Command& command);
extern int addToLibrary(int argc, const char **argv, const Command& command);
extern int applyThreshold(int argc, const char **argv, const Command& command);
1 change: 1 addition & 0 deletions src/commons/BitManipulateMacros.h
@@ -12,6 +12,7 @@
#define GET_4_BITS(num) (0XfU & (num))
#define GET_2_BITS(num) (signed char)(0x3U & (num))
#define GET_2_BITS_INT(num) (int)(0x3U & (num))
#define GET_24_BITS_UINT(num) (unsigned int)(0xffffffU & (num))
#define GET_2_BITS_FL(num) (float)(0x3U & (num))
#define GET_2_BITS_2(num) (signed char)(0xCu & (num))
#define GET_2_BITS_3(num) (signed char)(0x30u & (num))
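
The new `GET_24_BITS_UINT` macro masks everything but the lowest 24 bits; KmerMatcher.cpp (below) uses it to keep the DNA part of a matched target k-mer. A minimal sanity check of the mask, with an arbitrary example value of our own:

```cpp
#include <cassert>
#include <cstdint>

// Copy of the new macro: keep only the lowest 24 bits of a value.
#define GET_24_BITS_UINT(num) (unsigned int)(0xffffffU & (num))

int main() {
    uint64_t targetKmer = 0xABCDEF123456ULL;  // arbitrary example value
    // Only the right 24 bits survive the mask.
    assert(GET_24_BITS_UINT(targetKmer) == 0x123456U);
    return 0;
}
```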
7 changes: 0 additions & 7 deletions src/commons/Classifier.cpp
@@ -108,13 +108,6 @@ void Classifier::startClassify(const LocalParameters &par) {
processedSeqCnt += queryReadSplit[splitIdx].end - queryReadSplit[splitIdx].start;
cout << "The number of processed sequences: " << processedSeqCnt << " (" << (double) processedSeqCnt / (double) numOfSeq << ")" << endl;

// for (size_t i = 0; i < matchBuffer.startIndexOfReserve; i++) {
// cout << matchBuffer.buffer[i].queryId << " " << matchBuffer.buffer[i].splitIdx << " " <<
// matchBuffer.buffer[i].targetSplitIdx << " " << matchBuffer.buffer[i].targetId << " " <<
// genusTaxIdList[matchBuffer.buffer[i].targetId] << " " << speciesTaxIdList[matchBuffer.buffer[i].targetId] << " "
// << matchBuffer.buffer[i].position << " " << (int) matchBuffer.buffer[i].hamming << " " << taxIdList[matchBuffer.buffer[i].targetId] << endl;
// }

// Write classification results
reporter->writeReadClassification(queryList);
}
131 changes: 0 additions & 131 deletions src/commons/FileMerger.cpp
@@ -393,137 +393,6 @@ void FileMerger::mergeTargetFiles(const LocalParameters & par, int numOfSplits)
cout<<diffIdxSplitFileName<<endl;
}



/// It updates the target database. Most of this function is the same as 'mergeTargetFiles'; only some additional tasks are added to handle seqID.
/// It has not been tested yet (2020.01.28).
//void FileMerger::updateTargetDatabase(std::vector<char*> diffIdxFileNames, std::vector<char*> infoFileNames, vector<int> & taxListAtRank, vector<int> & taxIdList, const int & seqIdOffset) {
// size_t totalKmerCnt = 0;
// size_t writtenKmerCnt = 0;
//
// ///Files to write on & buffers to fill them
// FILE * mergedDiffFile = fopen(mergedDiffFileName, "wb");
// FILE * mergedInfoFile = fopen(mergedInfoFileName, "wb");
// uint16_t * diffBuffer = (uint16_t *)malloc(sizeof(uint16_t) * kmerBufSize); size_t diffBufferIdx = 0;
// TargetKmerInfo * infoBuffer = (TargetKmerInfo *)malloc(sizeof(TargetKmerInfo) * kmerBufSize); size_t infoBufferIdx = 0;
//
//
// ///Prepare files to merge
// size_t numOfSplitFiles = diffIdxFileNames.size();
// size_t numOfincompletedFiles = numOfSplitFiles;
// uint64_t lookingKmers[numOfSplitFiles];
// TargetKmerInfo lookingInfos[numOfSplitFiles];
// size_t diffFileIdx[numOfSplitFiles];
// memset(diffFileIdx, 0, sizeof(diffFileIdx));
// size_t infoFileIdx[numOfSplitFiles];
// memset(infoFileIdx, 0, sizeof(infoFileIdx));
// size_t maxIdxOfEachFiles[numOfSplitFiles];
// struct MmapedData<uint16_t> *diffFileList = new struct MmapedData<uint16_t>[numOfSplitFiles];
// struct MmapedData<TargetKmerInfo> *infoFileList = new struct MmapedData<TargetKmerInfo>[numOfSplitFiles];
// for (size_t file = 0; file < numOfSplitFiles; file++) {
// diffFileList[file] = mmapData<uint16_t>(diffIdxFileNames[file]);
// infoFileList[file] = mmapData<TargetKmerInfo>(infoFileNames[file]);
// maxIdxOfEachFiles[file] = diffFileList[file].fileSize / sizeof(uint16_t);
// }
//
// ///TODO check the for loop condition; size
// ///Fix sequence IDs of new k-mers
// for (int i = 1; i < numOfSplitFiles; i++){
// size_t size = (infoFileList[i].fileSize - 1) / sizeof(TargetKmerInfo);
// for(int j = 0; j < size + 1; j++){
// infoFileList[i].data[j].sequenceID += seqIdOffset;
// }
// }
//
// /// get the first k-mer to write
// for(size_t file = 0; file < numOfSplitFiles; file++){
// lookingKmers[file] = getNextKmer(0, diffFileList[file], diffFileIdx[file]);
// lookingInfos[file] = infoFileList[file].data[0];
// infoFileIdx[file] ++;
// }
//
// size_t idxOfMin = smallest(lookingKmers, lookingInfos, taxListAtRank, numOfSplitFiles);
// uint64_t lastWrittenKmer = 0;
// uint64_t lastKmer = lookingKmers[idxOfMin];
// TargetKmerInfo lastInfo = lookingInfos[idxOfMin];
//
// ///write first k-mer
// cre->getDiffIdx(0, lastKmer, mergedDiffFile, diffBuffer, diffBufferIdx);
// lastWrittenKmer = lastKmer;
// cre->writeInfo(&lastInfo, mergedInfoFile, infoBuffer, infoBufferIdx);;
// writtenKmerCnt++;
//
// int endFlag = 0;
// int hasSeenOtherStrains;
// int hasSeenOtherGenus;
// /// Check whether the end part works correctly
// while(true){
// ///update last k-mer
// lastKmer = lookingKmers[idxOfMin];
// lastInfo = lookingInfos[idxOfMin];
//
// ///update looking k-mers
// lookingKmers[idxOfMin] = getNextKmer(lastKmer, diffFileList[idxOfMin], diffFileIdx[idxOfMin]);
// lookingInfos[idxOfMin] = infoFileList[idxOfMin].data[infoFileIdx[idxOfMin]];
// infoFileIdx[idxOfMin] ++;
// if( diffFileIdx[idxOfMin] > maxIdxOfEachFiles[idxOfMin] ){
// lookingKmers[idxOfMin] = UINT64_MAX;
// numOfincompletedFiles--;
// if(numOfincompletedFiles == 0) break;
// }
// totalKmerCnt ++;
// idxOfMin = smallest(lookingKmers, lookingInfos, taxListAtRank, numOfSplitFiles);
//
// hasSeenOtherStrains = 0;
// hasSeenOtherGenus = 0;
//
// while(taxListAtRank[lastInfo.sequenceID] == taxListAtRank[lookingInfos[idxOfMin].sequenceID]){
// if(lastKmer != lookingKmers[idxOfMin]) break;
//
// hasSeenOtherStrains += (taxIdList[lastInfo.sequenceID] != taxIdList[lookingInfos[idxOfMin].sequenceID]);
//
// lookingKmers[idxOfMin] = getNextKmer(lastKmer, diffFileList[idxOfMin], diffFileIdx[idxOfMin]);
// lookingInfos[idxOfMin] = infoFileList[idxOfMin].data[infoFileIdx[idxOfMin]];
// infoFileIdx[idxOfMin] ++;
// totalKmerCnt ++;
//
// if(diffFileIdx[idxOfMin] > maxIdxOfEachFiles[idxOfMin]){
// lookingKmers[idxOfMin] = UINT64_MAX;
// numOfincompletedFiles--;
// if(numOfincompletedFiles == 0){
// endFlag = 1;
// break;
// }
// }
// idxOfMin = smallest(lookingKmers, lookingInfos, taxListAtRank, numOfSplitFiles);
// }
//
// lastInfo.redundancy = (hasSeenOtherStrains > 0 || lastInfo.redundancy);
// cre->getDiffIdx(lastWrittenKmer, lastKmer, mergedDiffFile, diffBuffer, diffBufferIdx);
// lastWrittenKmer = lastKmer;
// cre->writeInfo(&lastInfo, mergedInfoFile, infoBuffer, infoBufferIdx);
// writtenKmerCnt++;
//
// if(endFlag == 1) break;
// }
//
// cre->flushInfoBuf(infoBuffer, mergedInfoFile, infoBufferIdx);
// cre->flushKmerBuf(diffBuffer, mergedDiffFile, diffBufferIdx);
//
// free(diffBuffer);
// free(infoBuffer);
// fclose(mergedDiffFile);
// fclose(mergedInfoFile);
// for(size_t file = 0; file < numOfSplitFiles; file++){
// munmap(diffFileList[file].data, diffFileList[file].fileSize + 1);
// munmap(infoFileList[file].data, infoFileList[file].fileSize + 1);
// }
//
// cout<<"Creating target DB is done"<<endl;
// cout << "Total k-mer count : " << totalKmerCnt << endl;
// cout<<"Written k-mer count : " << writtenKmerCnt << endl;
//}

uint64_t FileMerger::getNextKmer(uint64_t lookingTarget, const struct MmapedData<uint16_t> & diffList, size_t & idx)
{
uint16_t fragment = 0;
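
For context, `getNextKmer` walks the 16-bit difference index to recover the next absolute k-mer from the gap to the previous one. The sketch below shows one plausible decoding scheme, assuming each fragment carries 15 payload bits with the high bit flagging the final fragment of a gap; the flag convention and function name are assumptions, not Metabuli's verified implementation.

```cpp
#include <cstdint>
#include <cstddef>

// Hedged sketch of variable-length delta decoding for a difference
// index like Metabuli's diffIdx: the gap to the previous k-mer is
// stored as 15-bit fragments, and we assume the high bit of a
// fragment marks the last piece of the current gap.
uint64_t decodeNextKmer(uint64_t previousKmer, const uint16_t *diffList, size_t &idx) {
    uint64_t diff = 0;
    uint16_t fragment = diffList[idx++];
    while ((fragment & 0x8000u) == 0) {  // continuation: flag bit unset
        diff = (diff << 15u) | fragment;
        fragment = diffList[idx++];
    }
    diff = (diff << 15u) | (fragment & 0x7FFFu);
    return previousKmer + diff;
}
```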
20 changes: 15 additions & 5 deletions src/commons/KmerMatcher.cpp
@@ -1,8 +1,10 @@
#include "KmerMatcher.h"
#include "BitManipulateMacros.h"
#include "IndexCreator.h"
#include "Kmer.h"
#include "Mmap.h"
#include <ostream>
#include <vector>

KmerMatcher::KmerMatcher(const LocalParameters & par,
NcbiTaxonomy * taxonomy) {
@@ -231,6 +233,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
std::vector<uint8_t> selectedHammingSum;
std::vector<size_t> selectedMatches;
std::vector<uint16_t> selectedHammings;
std::vector<uint32_t> selectedDnaEncodings;
size_t posToWrite;

int currMatchNum;
@@ -293,6 +296,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
matches[matchCnt] = {queryKmerList[j].info,
candidateKmerInfos[idx].sequenceID,
taxId2speciesId[candidateKmerInfos[idx].sequenceID],
selectedDnaEncodings[k],
selectedHammings[k],
selectedHammingSum[k],
(bool) candidateKmerInfos[idx].redundancy};
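
Judging from the brace-initializers in this hunk, the `Match` record gained a DNA-encoding member between the species ID and the Hamming fields. A hypothetical reconstruction follows; every member name except `dnaEncoding` is inferred from the initializer expressions, not copied from Metabuli's headers.

```cpp
#include <cstdint>

using TaxID = int;  // assumed alias for taxonomy identifiers

// Hypothetical sketch of the Match record, inferred from the order of
// the initializer lists in this diff; not Metabuli's real definition.
struct MatchSketch {
    uint64_t queryInfo;        // queryKmerList[j].info
    TaxID    targetId;         // candidateKmerInfos[idx].sequenceID
    TaxID    speciesId;        // taxId2speciesId[...]
    uint32_t dnaEncoding;      // new: right 24 bits of the target k-mer
    uint16_t rightEndHamming;  // selectedHammings[k]
    uint8_t  hammingSum;       // selectedHammingSum[k]
    bool     redundancy;       // candidateKmerInfos[idx].redundancy
};
```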
@@ -303,11 +307,12 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
selectedMatches.clear();
selectedHammingSum.clear();
selectedHammings.clear();
selectedDnaEncodings.clear();

// Reuse the candidate target k-mers to compare in DNA level if queries are the same at amino acid level but not at DNA level
if (currentQueryAA == AminoAcidPart(queryKmerList[j].ADkmer)) {
compareDna(queryKmerList[j].ADkmer, candidateTargetKmers, selectedMatches,
-selectedHammingSum, selectedHammings,queryKmerList[j].info.frame);
selectedHammingSum, selectedHammings, selectedDnaEncodings, queryKmerList[j].info.frame);
currMatchNum = selectedMatches.size();

// If local buffer is full, copy them to the shared buffer.
@@ -334,6 +339,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
matches[matchCnt] = {queryKmerList[j].info,
candidateKmerInfos[idx].sequenceID,
taxId2speciesId[candidateKmerInfos[idx].sequenceID],
selectedDnaEncodings[k],
selectedHammings[k],
selectedHammingSum[k],
(bool) candidateKmerInfos[idx].redundancy};
@@ -401,7 +407,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI

// Compare the current query and the loaded target k-mers and select
compareDna(currentQuery, candidateTargetKmers, selectedMatches, selectedHammingSum,
-selectedHammings, queryKmerList[j].info.frame);
selectedHammings, selectedDnaEncodings, queryKmerList[j].info.frame);

// If local buffer is full, copy them to the shared buffer.
currMatchNum = selectedMatches.size();
@@ -429,6 +435,7 @@ querySplits, queryKmerList, matchBuffer, cout, targetDiffIdxFileName, numOfDiffI
matches[matchCnt] = {queryKmerList[j].info,
candidateKmerInfos[idx].sequenceID,
taxId2speciesId[candidateKmerInfos[idx].sequenceID],
selectedDnaEncodings[k],
selectedHammings[k],
selectedHammingSum[k],
(bool) candidateKmerInfos[idx].redundancy};
@@ -495,8 +502,9 @@ void KmerMatcher::compareDna(uint64_t query,
std::vector<uint64_t> &targetKmersToCompare,
std::vector<size_t> &selectedMatches,
std::vector<uint8_t> &selectedHammingSum,
-std::vector<uint16_t> &selectedHammings, uint8_t frame) {
std::vector<uint16_t> &selectedHammings,
std::vector<uint32_t> &selectedDnaEncodings,
uint8_t frame) {
size_t size = targetKmersToCompare.size();
auto *hammingSums = new uint8_t[size + 1];
uint8_t currentHammingSum;
@@ -516,11 +524,13 @@ void KmerMatcher::compareDna(uint64_t query,
if (hammingSums[h] <= min(minHammingSum * 2, 7)) {
selectedMatches.push_back(h);
selectedHammingSum.push_back(hammingSums[h]);
-if (frame < 3) {
if (frame < 3) { // Frame of query k-mer
selectedHammings.push_back(getHammings(query, targetKmersToCompare[h]));
} else {
selectedHammings.push_back(getHammings_reverse(query, targetKmersToCompare[h]));
}
// Store right 24 bits of the target k-mer in selectedDnaEncodings
selectedDnaEncodings.push_back(GET_24_BITS_UINT(targetKmersToCompare[h]));
}
}
delete[] hammingSums;
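
Taken together, this hunk makes `compareDna` record an extra output per accepted candidate: the 24-bit DNA encoding. Below is a condensed, self-contained sketch of that acceptance loop; the Hamming helper is a simplified stand-in rather than Metabuli's real scoring, and the per-position Hamming vector is omitted for brevity.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified stand-in for Metabuli's Hamming scoring: count differing
// 2-bit nucleotide codes in the low 24 bits (an assumption for this
// sketch, not the real per-codon metric).
static uint8_t hammingSum(uint64_t query, uint64_t target) {
    uint32_t diff = static_cast<uint32_t>((query ^ target) & 0xFFFFFFu);
    uint8_t sum = 0;
    for (int i = 0; i < 12; ++i) {  // 12 two-bit codes in 24 bits
        sum += ((diff >> (2 * i)) & 0x3u) ? 1 : 0;
    }
    return sum;
}

// Mirror of the acceptance test above: keep every candidate whose
// Hamming sum is within min(2 * best, 7), and record its index, score,
// and 24-bit DNA encoding.
void selectMatches(uint64_t query,
                   const std::vector<uint64_t> &targets,
                   std::vector<size_t> &selectedMatches,
                   std::vector<uint8_t> &selectedHammingSum,
                   std::vector<uint32_t> &selectedDnaEncodings) {
    std::vector<uint8_t> sums(targets.size());
    uint8_t best = UINT8_MAX;
    for (size_t i = 0; i < targets.size(); ++i) {
        sums[i] = hammingSum(query, targets[i]);
        best = std::min(best, sums[i]);
    }
    const int cutoff = std::min(best * 2, 7);
    for (size_t i = 0; i < targets.size(); ++i) {
        if (sums[i] <= cutoff) {
            selectedMatches.push_back(i);
            selectedHammingSum.push_back(sums[i]);
            selectedDnaEncodings.push_back(
                static_cast<uint32_t>(targets[i] & 0xFFFFFFu));
        }
    }
}
```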
