change count thresholds to be fractions instead of integers #63

Open
kelly-sovacool opened this issue Aug 20, 2024 · 4 comments
Labels
reneeTools RepoName

Comments

@kelly-sovacool
Member

Previously, ccbrpipeliner used fractions, which were more portable across groups of different sizes.

Current code, straight from NIDAP:

reneeTools/R/filter.R

Lines 247 to 300 in 79e612e

remove_low_count_genes <- function(counts_matrix, sample_metadata,
                                   gene_names_column,
                                   groups_column,
                                   use_cpm_counts_to_filter = TRUE,
                                   Use_Group_Based_Filtering = FALSE,
                                   Minimum_Count_Value_to_be_Considered_Nonzero = 8,
                                   Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 7,
                                   Minimum_Number_of_Samples_with_Nonzero_Counts_in_a_Group = 3) {
  value <- NULL
  df <- counts_matrix
  df <- df[stats::complete.cases(df), ]
  ## duplicate Rows should be removed in Clean_Raw_Counts template
  # df %>% dplyr::group_by(.data[[gene_names_column]]) %>% summarise_all(sum) %>% as.data.frame() -> df
  # print(paste0("Number of features before filtering: ", nrow(df)))
  ## USE CPM Transformation
  if (use_cpm_counts_to_filter == TRUE) {
    trans.df <- df
    trans.df[, -1] <- edgeR::cpm(as.matrix(df[, -1]))
    counts_label <- "Filtered Counts (CPM)"
  } else {
    trans.df <- df
    counts_label <- "Filtered Counts"
  }
  if (Use_Group_Based_Filtering == TRUE) {
    rownames(trans.df) <- trans.df[, gene_names_column]
    trans.df[, gene_names_column] <- NULL
    counts <- trans.df >= Minimum_Count_Value_to_be_Considered_Nonzero # boolean matrix
    tcounts <- as.data.frame(t(counts))
    colnum <- dim(counts)[1] # number of genes
    tcounts <- merge(sample_metadata[groups_column], tcounts, by = "row.names")
    tcounts$Row.names <- NULL
    melted <- reshape2::melt(tcounts, id.vars = groups_column)
    tcounts.tot <- dplyr::summarise(dplyr::group_by_at(melted, c(groups_column, "variable")), sum = sum(value))
    tcounts.group <- tcounts.tot %>%
      tidyr::pivot_wider(names_from = "variable", values_from = "sum")
    colSums(tcounts.group[(1:colnum + 1)] >= Minimum_Number_of_Samples_with_Nonzero_Counts_in_a_Group) >= 1 -> tcounts.keep
    df.filt <- trans.df[tcounts.keep, ]
    df.filt %>% tibble::rownames_to_column(gene_names_column) -> df.filt
  } else {
    trans.df$isexpr1 <- rowSums(as.matrix(trans.df[, -1]) > Minimum_Count_Value_to_be_Considered_Nonzero) >= Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total
    df.filt <- as.data.frame(trans.df[trans.df$isexpr1, ])
  }
  # colnames(df.filt)[colnames(df.filt)==gene_names_column] <- "Gene"
  # print(paste0("Number of features after filtering: ", nrow(df.filt)))
  return(df.filt)
}
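
To illustrate the portability point, here is a rough sketch of how a single fractional threshold could translate into a different minimum sample count for each group. The helper `min_samples_per_group()` and its arguments are hypothetical and not part of reneeTools; rounding up with `ceiling()` is an assumption about the desired behavior.

```r
# Hypothetical helper (not in reneeTools): turn one fractional threshold
# into a per-group minimum number of samples.
min_samples_per_group <- function(groups, min_fraction) {
  group_sizes <- table(groups)
  # Subtract a tiny tolerance so floating-point noise (e.g. 0.7 * 10)
  # does not push an exact product up to the next integer.
  min_counts <- ceiling(as.vector(group_sizes) * min_fraction - 1e-9)
  stats::setNames(as.integer(min_counts), names(group_sizes))
}

# Two groups of very different sizes share the same fraction.
groups <- c(rep("control", 3), rep("treated", 10))
per_group <- min_samples_per_group(groups, 0.5)
print(per_group)
```

With a fixed integer such as 3, the small group would need every sample to pass while the large group would need under a third; with a fraction, each group is held to the same proportion.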

@kopardev kopardev added the reneeTools RepoName label Aug 20, 2024
@kelly-sovacool
Member Author

@phoman14 do you have any thoughts on this?

@phoman14
Collaborator

Do you mean that if a data frame has 14 samples, then Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 0.5 instead of Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 7?

@kelly-sovacool
Member Author

Do you mean that if a data frame has 14 samples, then Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 0.5 instead of Minimum_Number_of_Samples_with_Nonzero_Counts_in_Total = 7?

Yes exactly. This is @kopardev's suggestion.
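
As a minimal sketch of the 14-sample example, assuming the fraction is rounded up to a whole number of samples: the helper name `fraction_to_min_samples()` and the toy matrix are made up for illustration, and the `>=` comparison follows the group-based branch of the current code (the non-group branch uses `>`).

```r
# Hypothetical helper (not in reneeTools): fraction of samples -> sample count.
fraction_to_min_samples <- function(n_samples, min_fraction) {
  # Small tolerance guards against floating-point noise before rounding up.
  as.integer(ceiling(n_samples * min_fraction - 1e-9))
}

# The case from this thread: 14 samples at 0.5 is the old integer default of 7.
n_for_14 <- fraction_to_min_samples(14, 0.5)

# Toy counts matrix (genes x samples) to show the threshold in use.
counts <- matrix(
  c(10,  0, 9, 12,
     0,  0, 8,  1,
    20, 15, 0, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4))
)
min_count <- 8      # minimum count to be considered nonzero
min_fraction <- 0.5 # fraction of samples that must reach min_count

min_samples <- fraction_to_min_samples(ncol(counts), min_fraction)
keep <- rowSums(counts >= min_count) >= min_samples
print(counts[keep, , drop = FALSE])
```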

@phoman14
Collaborator

I think this is a fine way to do it.
In my head it is easier to specify the exact number, but if we make the input a fraction we could always calculate the fraction upstream from the exact number.
The other consideration is the input format. We should include an error check to make sure the input is in the correct format. I could see a user not understanding the format and entering any of the following: 0.5, 50%, or 50.
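
One possible shape for that error check, plus the upstream count-to-fraction conversion, is sketched below. Both `validate_fraction()` and `count_to_fraction()` are hypothetical names, not existing reneeTools functions, and the error wording is only a suggestion.

```r
# Hypothetical input check (not in reneeTools): accept only a single
# number in [0, 1], so "50%" and 50 are rejected with a helpful message.
validate_fraction <- function(x, arg_name = "threshold") {
  if (!is.numeric(x) || length(x) != 1 || is.na(x)) {
    stop(arg_name, " must be a single number between 0 and 1, e.g. 0.5",
         call. = FALSE)
  }
  if (x < 0 || x > 1) {
    stop(arg_name, " must be between 0 and 1; got ", x,
         ". To require 50% of samples, use 0.5.", call. = FALSE)
  }
  x
}

# Hypothetical upstream helper: derive the fraction from an exact count.
count_to_fraction <- function(min_samples, n_samples) {
  validate_fraction(min_samples / n_samples, "min_samples / n_samples")
}

ok <- validate_fraction(0.5)           # passes through unchanged
from_count <- count_to_fraction(7, 14) # exact count of 7 out of 14 samples

# Inputs in the wrong format raise an error instead of being silently used.
bad_percent <- try(validate_fraction("50%"), silent = TRUE)
bad_whole <- try(validate_fraction(50), silent = TRUE)
```

One ambiguity this check cannot resolve is a value of exactly 1, which could mean "one sample" or "all samples"; the documentation would need to state that inputs are always read as fractions.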
