Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failure to reproduce significant perturbations in Norman et al. #25

Open
wconnell opened this issue Jul 1, 2024 · 1 comment
Open

Comments

@wconnell
Copy link

wconnell commented Jul 1, 2024

I was unable to recover any perturbations with significant effects using the e-distance and corresponding test in the Norman et al. dataset following paper methods and suggested baselines. Following from the preprint manuscript:

As a baseline for significant perturbations, as defined by the E-test, we suggest at least 300 cells per perturbation (Fig 5B) and 1000 average UMI counts per cell (Fig 5D) as an experimental guideline for distinguishable perturbations.

And additional details from methods section:

Cells were kept if they had a minimum of 1000 UMI counts, and genes with a minimum of 50 cells. 2000 highly variable genes were selected using scanpy.pp.find_variable_genes with flavor ‘seurat_v3’. We normalized the count matrix using scanpy.pp.normalize_total and log-transformed the data using scanpy.pp.log1p; We did not z-scale the data. Next, we computed PCA based on the highly variable genes. The E-distances were computed in that PCA space using 50 components and Euclidean distance. To avoid problems due to different numbers of cells per perturbation, we subsampled each dataset such that all perturbations had the same number of cells. We removed all perturbations with fewer than 50 cells and then subsampled to the number of cells in the smallest perturbation left after filtering.

With min_cells=300 (cell per pert), subsampling/filtering the perturbation conditions resulted in 153 conditions (including control) with 300 total cells. The control cells condition is also subsampled. Upon usage of the e-distance test, 0 passed adjusted significance. I also tried this with min_cells=50 (cell per pert), with the same result of 0 passing significance. This is in contrast to Supp Table 4.

Can you please clarify this discrepancy? Pointing to the code would help. Thanks.

Here is the code I used -

import scanpy as sc
from scperturb import *

def filter_and_subsample(adata, perturbation_key, min_cells=300):
    # Remove perturbations with fewer than n cells
    perturbation_counts = adata.obs[perturbation_key].value_counts()
    valid_perturbations = perturbation_counts[perturbation_counts >= min_cells].index
    adata = adata[adata.obs[perturbation_key].isin(valid_perturbations)]
    
    # Subsample each perturbation to the number of cells in the smallest perturbation left
    min_cells_in_perturbation = min(adata.obs[perturbation_key].value_counts())
    subsampled_indices = []
    
    for pert in valid_perturbations:
        pert_indices = adata.obs[adata.obs[perturbation_key] == pert].sample(min_cells_in_perturbation, random_state=42).index
        subsampled_indices.extend(pert_indices)
    
    adata_subsampled = adata[subsampled_indices]
    
    return adata_subsampled

adata = sc.read_h5ad("data/NormanWeissman2019_filtered.h5ad")

adata = adata[adata.X.sum(axis=1) >= 1000]

sc.pp.filter_genes(adata, min_cells=50)

# or use min_cells=50
adata = filter_and_subsample(adata, 'perturbation', min_cells=300)

sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat_v3')
adata = adata[:, adata.var['highly_variable']]

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

sc.tl.pca(adata, n_comps=50)

estats = edist_to_control(adata, obs_key='perturbation', control='control', 
                          obsm_key='X_pca', dist='sqeuclidean', )
df = etest(adata, obs_key='perturbation', obsm_key='X_pca', dist='sqeuclidean', control='control', alpha=0.05, runs=1000, n_jobs=-1)
df['significant_adj'].sum()
@tessadgreen
Copy link
Collaborator

Hi @wconnell

The code for Figure 5 is located here.

There were some changes made when switching to the final Nature Methods version of the paper, most notably the introduction of a sample correction. Here's an access link for the final version of the paper/figure.

Let me know if this material isn't enough to help and I'll take a closer look at what you posted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants