Swan has several analysis options to use.
- Differential expression tests
- Isoform switching / differential isoform expression
- Combining metadata columns
- Exon skipping and intron retention
- More differential expression
import swan_vis as swan
sg = swan.read('data/swan.p')
Read in graph from data/swan.p
Swan's old differential gene and transcript expression tests using diffxpy
have now been deprecated as it seems that the library is unsupported. I recommend that users interested in running differential gene or transcript expression test either use PyDESeq2 or Scanpy's rank_genes_groups
test, which both support the AnnData format for simple compatibility with Swan's AnnData expression representation.
Please read the PyDESeq2 documentation for details on how to use one of the SwanGraph AnnData objects to obtain differential expression results. Below is an example on how to find differentially expressed transcripts between cell lines.
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import numpy as np
adata = sg.adata.copy()
# PyDESeq2 currently doesn't support column names with underscores, so change that
adata.obs.rename({'cell_line': 'cellline'}, axis=1, inplace=True)
obs_col = 'cellline'
threads = 8
# densify matrix
adata.X = np.array(adata.X.todense())
# run test
dds = DeseqDataSet(adata=adata,
design_factors=obs_col,
n_cpus=threads,
refit_cooks=True)
dds.deseq2()
stat_res = DeseqStats(dds,
n_cpus=threads)
stat_res.summary()
df = stat_res.results_df
Fitting size factors...
... done in 0.00 seconds.
Fitting dispersions...
... done in 34.09 seconds.
Fitting dispersion trend curve...
... done in 12.01 seconds.
Fitting MAP dispersions...
... done in 13.97 seconds.
Fitting LFCs...
... done in 4.86 seconds.
Refitting 0 outliers.
Running Wald tests...
... done in 2.36 seconds.
Log2 fold change & Wald test p-value: cellline hffc6 vs hepg2
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
tid | ||||||
TALONT000296400 | 3.497574 | 2.416436 | 1.472616 | 1.640914 | 0.100815 | 0.193840 |
ENST00000591581.1 | 0.162115 | 0.624956 | 5.002765 | 0.124922 | 0.900585 | NaN |
ENST00000546893.5 | 8.173445 | -0.053334 | 0.793179 | -0.067241 | 0.946390 | 0.968527 |
ENST00000537289.1 | 5.140216 | -1.248175 | 0.978927 | -1.275044 | 0.202294 | 0.328286 |
ENST00000258382.9 | 2.012104 | -0.011819 | 1.493866 | -0.007912 | 0.993687 | NaN |
... | ... | ... | ... | ... | ... | ... |
ENST00000506914.1 | 1.022692 | 0.947588 | 2.272813 | 0.416923 | 0.676735 | NaN |
ENST00000571080.1 | 0.473263 | -2.824409 | 3.426473 | -0.824291 | 0.409774 | NaN |
ENST00000378615.7 | 1.334873 | -1.199810 | 1.790781 | -0.669993 | 0.502862 | NaN |
ENST00000409586.7 | 0.338615 | 1.464364 | 4.023896 | 0.363917 | 0.715920 | NaN |
ENST00000370278.3 | 1.328549 | -1.201246 | 1.962358 | -0.612144 | 0.540443 | NaN |
34814 rows × 6 columns
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
tid | ||||||
TALONT000296400 | 3.497574 | 2.416436 | 1.472616 | 1.640914 | 0.100815 | 0.193840 |
ENST00000591581.1 | 0.162115 | 0.624956 | 5.002765 | 0.124922 | 0.900585 | NaN |
ENST00000546893.5 | 8.173445 | -0.053334 | 0.793179 | -0.067241 | 0.946390 | 0.968527 |
ENST00000537289.1 | 5.140216 | -1.248175 | 0.978927 | -1.275044 | 0.202294 | 0.328286 |
ENST00000258382.9 | 2.012104 | -0.011819 | 1.493866 | -0.007912 | 0.993687 | NaN |
Isoform switching / differential isoform expression (DIE) testing is implemented according to the strategy in Joglekar et. al., 2021. DIE can roughly be described as finding statistically significant changes in isoform expression between two conditions along with a change in percent isoform usage per gene.
Pairwise comparisons can be set up using different columns in the metadata that was added to the SwanGraph with the obs_col
and obs_conditions
arguments.
# look at valid metadata options
sg.adata.obs
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
cell_line | replicate | dataset | total_counts | |
---|---|---|---|---|
dataset | ||||
hepg2_1 | hepg2 | 1 | hepg2_1 | 499647.0 |
hepg2_2 | hepg2 | 2 | hepg2_2 | 848447.0 |
hffc6_1 | hffc6 | 1 | hffc6_1 | 761493.0 |
hffc6_2 | hffc6 | 2 | hffc6_2 | 787967.0 |
hffc6_3 | hffc6 | 3 | hffc6_3 | 614921.0 |
# find genes that exhibit DIE between HFFc6 and HepG2
obs_col = 'cell_line'
obs_conditions = ['hepg2', 'hffc6']
die_table, die_results = sg.die_gene_test(obs_col=obs_col,
obs_conditions=obs_conditions,
verbose=True)
Testing for DIE for each gene: 100%|██████████| 14684/14684 [03:50<00:00, 123.69it/s]
The resultant table contains an entry for each gene with the p value (p_val
), adjusted p value (adj_p_val
), and change in percent isoform usage for the top two isoforms (dpi
), as well as the identities of the top isoforms involved in the switch. Exact details on these calculations can be found in Joglekar et. al., 2021.
die_table.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000130175 | 0.206667 | 39.264992 | TALONT000296399 | NaN | 39.264992 | NaN | TALONT000296400 | ENST00000589838.5 | -20.116056 | -10.638298 | 0.469420 | PRKCSH |
1 | ENSG00000130202 | 0.000367 | 11.893251 | ENST00000252485.8 | ENST00000252483.9 | 9.684967 | 2.208281 | TALONT000406668 | ENST00000591581.1 | -11.473083 | -0.420168 | 0.003560 | NECTIN2 |
2 | ENSG00000111371 | 0.680435 | 9.401713 | ENST00000398637.9 | ENST00000439706.5 | 8.547012 | 0.854701 | ENST00000546893.5 | NaN | -9.401711 | NaN | 0.886243 | SLC38A1 |
3 | ENSG00000181924 | 0.028195 | 9.452934 | ENST00000537289.1 | ENST00000545127.1 | 7.298603 | 2.154328 | ENST00000355693.4 | NaN | -9.452934 | NaN | 0.122619 | COA4 |
4 | ENSG00000163468 | 0.255788 | 0.680048 | ENST00000295688.7 | TALONT000476055 | 0.366966 | 0.277693 | ENST00000368259.6 | ENST00000489870.1 | -0.568503 | -0.111545 | 0.525066 | CCT3 |
Differential isoform expression testing results are stored automatically in sg.adata.uns
, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:
# die_iso - isoform level differential isoform expression test results
uns_key = swan.make_uns_key('die',
obs_col=obs_col,
obs_conditions=obs_conditions)
test = sg.adata.uns[uns_key]
test.head(2)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000130175 | 0.206667 | 39.264992 | TALONT000296399 | NaN | 39.264992 | NaN | TALONT000296400 | ENST00000589838.5 | -20.116056 | -10.638298 | 0.46942 | PRKCSH |
1 | ENSG00000130202 | 0.000367 | 11.893251 | ENST00000252485.8 | ENST00000252483.9 | 9.684967 | 2.208281 | TALONT000406668 | ENST00000591581.1 | -11.473083 | -0.420168 | 0.00356 | NECTIN2 |
Swan comes with an easy way to filter your DIE test results based on adjusted p value and dpi thresholds.
test = sg.get_die_genes(obs_col=obs_col, obs_conditions=obs_conditions,
p=0.05, dpi=10)
test.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ENSG00000130202 | 3.674234e-04 | 11.893251 | ENST00000252485.8 | ENST00000252483.9 | 9.684967 | 2.208281 | TALONT000406668 | ENST00000591581.1 | -11.473083 | -0.420168 | 3.560035e-03 | NECTIN2 |
8 | ENSG00000025156 | 1.857411e-03 | 35.714287 | ENST00000452194.5 | ENST00000465214.2 | 25.714287 | 10.000000 | ENST00000368455.8 | NaN | -35.714287 | NaN | 1.405048e-02 | HSF2 |
20 | ENSG00000105254 | 4.779408e-38 | 25.419458 | ENST00000585910.5 | TALONT000366329 | 20.394853 | 4.702971 | ENST00000221855.7 | ENST00000589996.5 | -25.016304 | -0.403154 | 7.343219e-36 | TBCB |
23 | ENSG00000105379 | 1.802782e-307 | 81.136333 | ENST00000354232.8 | NaN | 81.136333 | NaN | ENST00000309244.8 | ENST00000596253.1 | -80.857935 | -0.278394 | 1.551113e-304 | ETFB |
28 | ENSG00000148180 | 0.000000e+00 | 89.319039 | ENST00000373818.8 | TALONT000419680 | 85.245850 | 4.073189 | TALONT000418752 | ENST00000373808.8 | -68.603180 | -7.768958 | 0.000000e+00 | GSN |
Swan also now automatically tracks transcription start site (TSS) and transcription end site (TES) usage, and can find genes that exhibit DIE on the basis of their starts or ends. To do this, use the kind
argument to die_gene_test
.
# find genes that exhibit DIE for TSSs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tss',
obs_col=obs_col,
obs_conditions=obs_conditions,
verbose=True)
die_table.head()
Testing for DIE for each gene: 100%|█████████▉| 14674/14684 [02:18<00:00, 162.60it/s]
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000000419 | 0.249561 | 10.002187 | ENSG00000000419_2 | ENSG00000000419_1 | 9.906052 | 0.096133 | ENSG00000000419_3 | ENSG00000000419_4 | -7.218704 | -2.783483 | 0.536906 | DPM1 |
1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_5 | NaN | 9.093739 | NaN | ENSG00000001461_1 | ENSG00000001461_3 | -5.543442 | -1.775148 | 1.000000 | NIPAL3 |
2 | ENSG00000001497 | 0.891496 | 3.846153 | ENSG00000001497_1 | NaN | 3.846146 | NaN | ENSG00000001497_2 | NaN | -3.846153 | NaN | 1.000000 | LAS1L |
3 | ENSG00000001630 | 0.286866 | 2.580168 | ENSG00000001630_3 | NaN | 2.580168 | NaN | ENSG00000001630_2 | ENSG00000001630_1 | -1.549232 | -1.030928 | 0.582636 | CYP51A1 |
4 | ENSG00000002330 | 0.184635 | 12.817431 | ENSG00000002330_2 | ENSG00000002330_1 | 11.868484 | 0.948944 | ENSG00000002330_4 | ENSG00000002330_3 | -12.603584 | -0.213847 | 0.454053 | BAD |
To access the die results on the tss level, use die_kind='tss'
as input to make_uns_key()
.
# die_iso - TSS level differential isoform expression test results
uns_key = swan.make_uns_key('die',
obs_col=obs_col,
obs_conditions=obs_conditions,
die_kind='tss')
test = sg.adata.uns[uns_key]
test.head(2)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000000419 | 0.249561 | 10.002187 | ENSG00000000419_2 | ENSG00000000419_1 | 9.906052 | 0.096133 | ENSG00000000419_3 | ENSG00000000419_4 | -7.218704 | -2.783483 | 0.536906 | DPM1 |
1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_5 | NaN | 9.093739 | NaN | ENSG00000001461_1 | ENSG00000001461_3 | -5.543442 | -1.775148 | 1.000000 | NIPAL3 |
And provide the kind='tss'
option to get_die_genes()
when trying to filter your test results.
test = sg.get_die_genes(kind='tss', obs_col=obs_col,
obs_conditions=obs_conditions,
p=0.05, dpi=10)
test.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | ENSG00000003402 | 7.338098e-15 | 84.848486 | ENSG00000003402_3 | ENSG00000003402_7 | 77.272728 | 7.575758 | ENSG00000003402_1 | ENSG00000003402_5 | -31.717173 | -22.222223 | 4.390279e-13 | CFLAR |
11 | ENSG00000003436 | 3.812087e-06 | 34.072281 | ENSG00000003436_1 | ENSG00000003436_3 | 28.073771 | 4.564083 | ENSG00000003436_4 | NaN | -34.072281 | NaN | 7.064168e-05 | TFPI |
16 | ENSG00000004487 | 1.135407e-03 | 75.714287 | ENSG00000004487_1 | NaN | 75.714287 | NaN | ENSG00000004487_2 | NaN | -75.714285 | NaN | 1.020405e-02 | KDM1A |
33 | ENSG00000006282 | 4.544048e-03 | 19.819595 | ENSG00000006282_2 | ENSG00000006282_1 | 13.679245 | 6.140350 | ENSG00000006282_3 | NaN | -19.819595 | NaN | 3.236475e-02 | SPATA20 |
43 | ENSG00000007376 | 1.256379e-04 | 27.898551 | ENSG00000007376_3 | ENSG00000007376_1 | 23.454107 | 4.444445 | ENSG00000007376_5 | ENSG00000007376_7 | -14.130434 | -7.512077 | 1.525134e-03 | RPUSD1 |
For TESs, use kind='tes'
as input to die_genes_test()
, die_kind='tes'
to make_uns_key()
, and kind='tes'
to get_die_genes()
.
# find genes that exhibit DIE for TESs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tes',
obs_col='cell_line',
obs_conditions=['hepg2', 'hffc6'])
die_table.head()
Testing for DIE for each gene: 100%|██████████| 14684/14684 [02:19<00:00, 105.51it/s]
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000000419 | 1.000000 | 0.096133 | ENSG00000000419_2 | NaN | 0.096133 | NaN | ENSG00000000419_1 | NaN | -0.096130 | NaN | 1.000000 | DPM1 |
1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_3 | NaN | 9.093739 | NaN | ENSG00000001461_4 | ENSG00000001461_1 | -5.543442 | -1.775148 | 1.000000 | NIPAL3 |
2 | ENSG00000001630 | 0.286866 | 2.580168 | ENSG00000001630_2 | NaN | 2.580168 | NaN | ENSG00000001630_1 | ENSG00000001630_3 | -1.549232 | -1.030928 | 0.618077 | CYP51A1 |
3 | ENSG00000002330 | 0.184635 | 12.817431 | ENSG00000002330_1 | ENSG00000002330_4 | 11.868484 | 0.948944 | ENSG00000002330_2 | ENSG00000002330_3 | -12.603584 | -0.213847 | 0.480860 | BAD |
4 | ENSG00000002549 | 0.679148 | 0.694543 | ENSG00000002549_2 | NaN | 0.694542 | NaN | ENSG00000002549_1 | NaN | -0.694543 | NaN | 0.926889 | LAP3 |
# die_iso - TES level differential isoform expression test results
uns_key = swan.make_uns_key('die',
obs_col=obs_col,
obs_conditions=obs_conditions,
die_kind='tes')
test = sg.adata.uns[uns_key]
test.head(2)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000000419 | 1.000000 | 0.096133 | ENSG00000000419_2 | NaN | 0.096133 | NaN | ENSG00000000419_1 | NaN | -0.096130 | NaN | 1.0 | DPM1 |
1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_3 | NaN | 9.093739 | NaN | ENSG00000001461_4 | ENSG00000001461_1 | -5.543442 | -1.775148 | 1.0 | NIPAL3 |
test = sg.get_die_genes(kind='tes', obs_col=obs_col,
obs_conditions=obs_conditions,
p=0.05, dpi=10)
test.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | ENSG00000003402 | 7.338098e-15 | 84.848486 | ENSG00000003402_5 | ENSG00000003402_6 | 77.272728 | 7.575758 | ENSG00000003402_9 | ENSG00000003402_7 | -31.717173 | -22.222223 | 5.296535e-13 | CFLAR |
10 | ENSG00000003436 | 2.772096e-07 | 32.637854 | ENSG00000003436_1 | ENSG00000003436_4 | 22.082711 | 10.555142 | ENSG00000003436_2 | ENSG00000003436_5 | -30.435921 | -1.818182 | 7.003007e-06 | TFPI |
14 | ENSG00000004487 | 1.135407e-03 | 75.714287 | ENSG00000004487_2 | NaN | 75.714287 | NaN | ENSG00000004487_1 | NaN | -75.714285 | NaN | 1.141621e-02 | KDM1A |
32 | ENSG00000006282 | 6.858641e-03 | 19.819595 | ENSG00000006282_2 | NaN | 19.819595 | NaN | ENSG00000006282_1 | NaN | -19.819595 | NaN | 4.915014e-02 | SPATA20 |
60 | ENSG00000010278 | 3.191221e-22 | 39.024387 | ENSG00000010278_2 | NaN | 39.024387 | NaN | ENSG00000010278_3 | ENSG00000010278_1 | -38.605976 | -0.418410 | 3.486194e-20 | CD9 |
Finally, for transcriptomes generated with Cerberus, Swan automatically tracks intron chain usage, and you can perform intron chain switching analysis with kind='ic'
as input to the die_gene_test()
function.
sg_brain = swan.read('swan_modelad.p')
# find genes that exhibit DIE for ICs between genotypes
die_table, die_results = sg_brain.die_gene_test(kind='ic',
obs_col='genotype',
obs_conditions=['b6n', '5xfad'])
die_table.head()
Read in graph from swan_modelad.p
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENCODEMG000055991 | 4.100695e-01 | 6.071428 | ENCODEMG000055991_2 | ENCODEMG000055991_3 | 3.571428 | 2.500000 | ENCODEMG000055991_1 | NaN | -6.071426 | NaN | 0.707992 | ENCODEMG000055991 |
1 | ENCODEMG000055998 | 6.190162e-01 | 9.722223 | ENCODEMG000055998_2 | NaN | 9.722223 | NaN | ENCODEMG000055998_1 | NaN | -9.722218 | NaN | 0.836200 | ENCODEMG000055998 |
2 | ENCODEMG000056718 | 8.081718e-07 | 6.289159 | NaN | NaN | 2.329770 | 1.953835 | NaN | NaN | -3.339438 | -2.949721 | 0.000020 | ENCODEMG000056718 |
3 | ENCODEMG000056804 | 2.107047e-03 | 27.863049 | ENCODEMG000056804_1 | NaN | 27.863047 | NaN | ENCODEMG000056804_2 | NaN | -27.863049 | NaN | 0.021510 | ENCODEMG000056804 |
4 | ENCODEMG000063411 | 7.699701e-01 | 12.500000 | ENCODEMG000063411_1 | NaN | 12.500000 | NaN | ENCODEMG000063411_2 | NaN | -12.500000 | NaN | 0.919808 | ENCODEMG000063411 |
What if none of the metadata columns you have summarize the comparisons you want to make? What if I want to find differentially expressed genes or transcripts, or find isoform-switching genes between hffc6 replicate 3 and hepg2 replicate 1?
sg.adata.obs.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
cell_line | replicate | dataset | total_counts | |
---|---|---|---|---|
dataset | ||||
hepg2_1 | hepg2 | 1 | hepg2_1 | 499647.0 |
hepg2_2 | hepg2 | 2 | hepg2_2 | 848447.0 |
hffc6_1 | hffc6 | 1 | hffc6_1 | 761493.0 |
hffc6_2 | hffc6 | 2 | hffc6_2 | 787967.0 |
hffc6_3 | hffc6 | 3 | hffc6_3 | 614921.0 |
Let's ignore for a moment the fact that the dataset
column does effectively capture both replicate as well as cell line metadata. This may not always be the case with more complex datasets. Swan has a function to concatenate columns together and add them as an additional column to the metadata tables. Use the following code to generate a new column that concatenates as many preexisting metadata columns as you wish:
col_name = sg.add_multi_groupby(['cell_line', 'replicate'])
print(col_name)
print(sg.adata.obs.head())
cell_line_replicate
cell_line replicate dataset total_counts cell_line_replicate
dataset
hepg2_1 hepg2 1 hepg2_1 499647.0 hepg2_1
hepg2_2 hepg2 2 hepg2_2 848447.0 hepg2_2
hffc6_1 hffc6 1 hffc6_1 761493.0 hffc6_1
hffc6_2 hffc6 2 hffc6_2 787967.0 hffc6_2
hffc6_3 hffc6 3 hffc6_3 614921.0 hffc6_3
The added column in col_name
can then be used as the obs_col
input to die_gene_test()
as follows:
obs_col = col_name
obs_conditions = ['hffc6_3', 'hepg2_1']
die_table, die_results = sg.die_gene_test(kind='iso',
obs_col=obs_col,
obs_conditions=obs_conditions)
Swan can detect novel (unannotated) exon skipping and intron retention events.
To obtain a dataframe of novel exon skipping events, run the following code:
# returns a DataFrame of genes, transcripts, and specific edges in
# the SwanGraph with novel exon skipping events
es_df = sg.find_es_genes(verbose=True)
Testing each novel edge for exon skipping: 0%| | 0/855 [00:00<?, ?it/s]
Analyzing 855 intronic edges for ES
Testing each novel edge for exon skipping: 100%|██████████| 855/855 [1:26:06<00:00, 6.13s/it]
Found 529 novel es events in 149 transcripts.
es_df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
gid | tid | edge_id | |
---|---|---|---|
0 | ENSG00000157916.19 | TALONT000218256 | 952616 |
0 | ENSG00000122406.13 | TALONT000425229 | 952716 |
0 | ENSG00000224093.5 | TALONT000434035 | 952720 |
0 | ENSG00000224093.5 | TALONT000434035 | 952720 |
0 | ENSG00000224093.5 | TALONT000434035 | 952720 |
To obtain a list of genes containing novel intron retention events, run the following code:
# returns a DataFrame of genes, transcripts, and specific edges in
# the SwanGraph with novel intron retaining events
ir_df = sg.find_ir_genes(verbose=True)
0%| | 0/1186 [00:00<?, ?it/s]�[A
Testing each novel edge for intron retention: 0%| | 0/1186 [00:00<?, ?it/s]�[A
Analyzing 1186 exonic edges for IR
Testing each novel edge for intron retention: 100%|██████████| 1186/1186 [1:59:16<00:00, 5.97s/it]�[A
Found 35 novel ir events in 27 transcripts.
You can pass gene IDs from es_df
into gen_report()
or plot_graph()
to visualize where these exon-skipping events are in gene reports or gene summary graphs respectively.