change count thresholds to be fractions instead of integers #63

kelly-sovacool · 2024-08-20T17:07:45Z

Previously, ccbrpipeliner used fractions, which were more portable across groups of different sizes.

current code straight from nidap:

Lines 247 to 300 in 79e612e

    
    remove_low_count_genes <- function(counts_matrix, sample_metadata,
                                       gene_names_column,
                                       groups_column,
                                       use_cpm_counts_to_filter = TRUE,
                                       Use_Group_Based_Filtering = FALSE,
                                       Minimum_Count_Value_to_be_Considered_Nonzero = 8,
                                       Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 7,
                                       Minimum_Number_of_Samples_with_Nonzero_Counts_in_a_Group = 3) {
      value <- NULL
      df <- counts_matrix
      df <- df[stats::complete.cases(df), ]
      ## duplicate Rows should be removed in Clean_Raw_Counts template
      # df %>% dplyr::group_by(.data[[gene_names_column]]) %>% summarise_all(sum) %>% as.data.frame() -> df
      # print(paste0("Number of features before filtering: ", nrow(df)))
      ## USE CPM Transformation
      if (use_cpm_counts_to_filter == TRUE) {
        trans.df <- df
        trans.df[, -1] <- edgeR::cpm(as.matrix(df[, -1]))
        counts_label <- "Filtered Counts (CPM)"
      } else {
        trans.df <- df
        counts_label <- "Filtered Counts"
      }
      if (Use_Group_Based_Filtering == TRUE) {
        rownames(trans.df) <- trans.df[, gene_names_column]
        trans.df[, gene_names_column] <- NULL
        counts <- trans.df >= Minimum_Count_Value_to_be_Considered_Nonzero # boolean matrix
        tcounts <- as.data.frame(t(counts))
        colnum <- dim(counts)[1] # number of genes
        tcounts <- merge(sample_metadata[groups_column], tcounts, by = "row.names")
        tcounts$Row.names <- NULL
        melted <- reshape2::melt(tcounts, id.vars = groups_column)
        tcounts.tot <- dplyr::summarise(dplyr::group_by_at(melted, c(groups_column, "variable")), sum = sum(value))
        tcounts.group <- tcounts.tot %>%
          tidyr::pivot_wider(names_from = "variable", values_from = "sum")
        colSums(tcounts.group[(1:colnum + 1)] >= Minimum_Number_of_Samples_with_Nonzero_Counts_in_a_Group) >= 1 -> tcounts.keep
        df.filt <- trans.df[tcounts.keep, ]
        df.filt %>% tibble::rownames_to_column(gene_names_column) -> df.filt
      } else {
        trans.df$isexpr1 <- rowSums(as.matrix(trans.df[, -1]) > Minimum_Count_Value_to_be_Considered_Nonzero) >= Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total
        df.filt <- as.data.frame(trans.df[trans.df$isexpr1, ])
      }
      # colnames(df.filt)[colnames(df.filt)==gene_names_column] <- "Gene"
      # print(paste0("Number of features after filtering: ", nrow(df.filt)))
      return(df.filt)
    }

kelly-sovacool · 2024-08-20T17:22:58Z

@phoman14 do you have any thoughts on this?

phoman14 · 2024-08-20T19:26:22Z

do you mean that if a dataframe has 14 samples then Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 0.5 instead of Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 7?

kelly-sovacool · 2024-08-20T19:46:45Z

> do you mean that if a dataframe has 14 samples then Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 0.5 instead of Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 7?

Yes exactly. This is @kopardev's suggestion.
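
To make the 14-sample example concrete, here is a minimal sketch of how a fractional threshold could be turned into a sample count inside the filter. `fraction_to_min_samples` is a hypothetical helper, not something in the current code, and rounding up with `ceiling()` is an assumption; rounding to nearest or down would be equally defensible and would need to be agreed on.

```r
# Hypothetical helper (not part of the current package): convert a fractional
# threshold into the minimum number of samples it implies.
fraction_to_min_samples <- function(fraction, n_samples) {
  stopifnot(is.numeric(fraction), all(fraction >= 0), all(fraction <= 1))
  # Rounding up is an assumption. The small tolerance keeps floating-point
  # noise from pushing an exact product up to the next integer.
  ceiling(fraction * n_samples - 1e-9)
}

# Whole-dataset threshold: 0.5 of 14 samples is 7 samples.
fraction_to_min_samples(0.5, 14)

# Per-group thresholds: the same fraction scales with each group's size
# (here 2 samples for group A and 5 for group B).
group_sizes <- c(A = 3, B = 10)
fraction_to_min_samples(0.5, group_sizes)
```

With a fixed count of 3, group A would need every sample to pass while group B would need fewer than a third, which is the portability problem the fraction avoids.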

phoman14 · 2024-08-20T19:52:15Z

I think this is a fine way to do it.
In my head it is easier to specify the exact number, but if we make the input a fraction we could always calculate the fraction upstream from the exact number.
The other consideration is the input format. We should include an error check to make sure the input is in the correct format. I could see a user not understanding the format and entering any of the following: 0.5, 50%, or 50.
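
One possible shape for that error check, as a rough sketch only: `check_fraction` and its `arg_name` argument are hypothetical names, and the choice to reject `"50%"` and `50` outright (rather than silently converting them) is an assumption that would need discussion.

```r
# Hypothetical validator: accept only a single numeric fraction in [0, 1].
check_fraction <- function(x, arg_name = "threshold") {
  if (!is.numeric(x) || length(x) != 1 || is.na(x)) {
    stop(arg_name, " must be a single number between 0 and 1, e.g. 0.5",
         call. = FALSE)
  }
  if (x < 0 || x > 1) {
    stop(arg_name, " must be a fraction between 0 and 1 (got ", x, "); ",
         "for 50% enter 0.5", call. = FALSE)
  }
  x
}

check_fraction(0.5)         # valid, returned unchanged
try(check_fraction("50%"))  # error: not numeric
try(check_fraction(50))     # error: outside [0, 1]
```

Failing loudly seems safer than guessing here, since a bare `50` could mean 50 percent or 50 samples. Note that a value of exactly `1` would pass this check as "100% of samples", so a user who meant "1 sample" would not be caught.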

kopardev added the reneeTools RepoName label Aug 20, 2024
