Skip to content

Latest commit

 

History

History
1868 lines (1637 loc) · 41.7 KB

analysis_tools.md

File metadata and controls

1868 lines (1637 loc) · 41.7 KB

Analysis tools

Swan has several analysis options to use.

import swan_vis as swan

sg = swan.read('data/swan.p')
Read in graph from data/swan.p

Differential expression tests

Swan's old differential gene and transcript expression tests using diffxpy have now been deprecated as it seems that the library is unsupported. I recommend that users interested in running differential gene or transcript expression test either use PyDESeq2 or Scanpy's rank_genes_groups test, which both support the AnnData format for simple compatibility with Swan's AnnData expression representation.

Using PyDESeq2

Please read the PyDESeq2 documentation for details on how to use one of the SwanGraph AnnData objects to obtain differential expression results. Below is an example on how to find differentially expressed transcripts between cell lines.

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import numpy as np

adata = sg.adata.copy()

# PyDESeq2 currently doesn't support column names with underscores, so change that
adata.obs.rename({'cell_line': 'cellline'}, axis=1, inplace=True)
obs_col = 'cellline'

threads = 8

# densify matrix
adata.X = np.array(adata.X.todense())

# run test
dds = DeseqDataSet(adata=adata,
               design_factors=obs_col,
               n_cpus=threads,
               refit_cooks=True)
dds.deseq2()
stat_res = DeseqStats(dds,
                  n_cpus=threads)
stat_res.summary()

df = stat_res.results_df
Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 34.09 seconds.

Fitting dispersion trend curve...
... done in 12.01 seconds.

Fitting MAP dispersions...
... done in 13.97 seconds.

Fitting LFCs...
... done in 4.86 seconds.

Refitting 0 outliers.

Running Wald tests...
... done in 2.36 seconds.

Log2 fold change & Wald test p-value: cellline hffc6 vs hepg2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
baseMean log2FoldChange lfcSE stat pvalue padj
tid
TALONT000296400 3.497574 2.416436 1.472616 1.640914 0.100815 0.193840
ENST00000591581.1 0.162115 0.624956 5.002765 0.124922 0.900585 NaN
ENST00000546893.5 8.173445 -0.053334 0.793179 -0.067241 0.946390 0.968527
ENST00000537289.1 5.140216 -1.248175 0.978927 -1.275044 0.202294 0.328286
ENST00000258382.9 2.012104 -0.011819 1.493866 -0.007912 0.993687 NaN
... ... ... ... ... ... ...
ENST00000506914.1 1.022692 0.947588 2.272813 0.416923 0.676735 NaN
ENST00000571080.1 0.473263 -2.824409 3.426473 -0.824291 0.409774 NaN
ENST00000378615.7 1.334873 -1.199810 1.790781 -0.669993 0.502862 NaN
ENST00000409586.7 0.338615 1.464364 4.023896 0.363917 0.715920 NaN
ENST00000370278.3 1.328549 -1.201246 1.962358 -0.612144 0.540443 NaN

34814 rows × 6 columns

df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
baseMean log2FoldChange lfcSE stat pvalue padj
tid
TALONT000296400 3.497574 2.416436 1.472616 1.640914 0.100815 0.193840
ENST00000591581.1 0.162115 0.624956 5.002765 0.124922 0.900585 NaN
ENST00000546893.5 8.173445 -0.053334 0.793179 -0.067241 0.946390 0.968527
ENST00000537289.1 5.140216 -1.248175 0.978927 -1.275044 0.202294 0.328286
ENST00000258382.9 2.012104 -0.011819 1.493866 -0.007912 0.993687 NaN

Isoform switching / Differential isoform expression testing

Isoform switching / differential isoform expression (DIE) testing is implemented according to the strategy in Joglekar et. al., 2021. DIE can roughly be described as finding statistically significant changes in isoform expression between two conditions along with a change in percent isoform usage per gene.

Pairwise comparisons can be set up using different columns in the metadata that was added to the SwanGraph with the obs_col and obs_conditions arguments.

# look at valid metadata options
sg.adata.obs
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
cell_line replicate dataset total_counts
dataset
hepg2_1 hepg2 1 hepg2_1 499647.0
hepg2_2 hepg2 2 hepg2_2 848447.0
hffc6_1 hffc6 1 hffc6_1 761493.0
hffc6_2 hffc6 2 hffc6_2 787967.0
hffc6_3 hffc6 3 hffc6_3 614921.0
# find genes that exhibit DIE between HFFc6 and HepG2
obs_col = 'cell_line'
obs_conditions = ['hepg2', 'hffc6']
die_table, die_results = sg.die_gene_test(obs_col=obs_col,
                                          obs_conditions=obs_conditions,
                                          verbose=True)
Testing for DIE for each gene: 100%|██████████| 14684/14684 [03:50<00:00, 123.69it/s]

The resultant table contains an entry for each gene with the p value (p_val), adjusted p value (adj_p_val), and change in percent isoform usage for the top two isoforms (dpi), as well as the identities of the top isoforms involved in the switch. Exact details on these calculations can be found in Joglekar et. al., 2021.

die_table.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
0 ENSG00000130175 0.206667 39.264992 TALONT000296399 NaN 39.264992 NaN TALONT000296400 ENST00000589838.5 -20.116056 -10.638298 0.469420 PRKCSH
1 ENSG00000130202 0.000367 11.893251 ENST00000252485.8 ENST00000252483.9 9.684967 2.208281 TALONT000406668 ENST00000591581.1 -11.473083 -0.420168 0.003560 NECTIN2
2 ENSG00000111371 0.680435 9.401713 ENST00000398637.9 ENST00000439706.5 8.547012 0.854701 ENST00000546893.5 NaN -9.401711 NaN 0.886243 SLC38A1
3 ENSG00000181924 0.028195 9.452934 ENST00000537289.1 ENST00000545127.1 7.298603 2.154328 ENST00000355693.4 NaN -9.452934 NaN 0.122619 COA4
4 ENSG00000163468 0.255788 0.680048 ENST00000295688.7 TALONT000476055 0.366966 0.277693 ENST00000368259.6 ENST00000489870.1 -0.568503 -0.111545 0.525066 CCT3

Differential isoform expression testing results are stored automatically in sg.adata.uns, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:

# die_iso - isoform level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col,
                            obs_conditions=obs_conditions)
test = sg.adata.uns[uns_key]
test.head(2)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
0 ENSG00000130175 0.206667 39.264992 TALONT000296399 NaN 39.264992 NaN TALONT000296400 ENST00000589838.5 -20.116056 -10.638298 0.46942 PRKCSH
1 ENSG00000130202 0.000367 11.893251 ENST00000252485.8 ENST00000252483.9 9.684967 2.208281 TALONT000406668 ENST00000591581.1 -11.473083 -0.420168 0.00356 NECTIN2

Swan comes with an easy way to filter your DIE test results based on adjusted p value and dpi thresholds.

test = sg.get_die_genes(obs_col=obs_col, obs_conditions=obs_conditions,
                       p=0.05, dpi=10)
test.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
1 ENSG00000130202 3.674234e-04 11.893251 ENST00000252485.8 ENST00000252483.9 9.684967 2.208281 TALONT000406668 ENST00000591581.1 -11.473083 -0.420168 3.560035e-03 NECTIN2
8 ENSG00000025156 1.857411e-03 35.714287 ENST00000452194.5 ENST00000465214.2 25.714287 10.000000 ENST00000368455.8 NaN -35.714287 NaN 1.405048e-02 HSF2
20 ENSG00000105254 4.779408e-38 25.419458 ENST00000585910.5 TALONT000366329 20.394853 4.702971 ENST00000221855.7 ENST00000589996.5 -25.016304 -0.403154 7.343219e-36 TBCB
23 ENSG00000105379 1.802782e-307 81.136333 ENST00000354232.8 NaN 81.136333 NaN ENST00000309244.8 ENST00000596253.1 -80.857935 -0.278394 1.551113e-304 ETFB
28 ENSG00000148180 0.000000e+00 89.319039 ENST00000373818.8 TALONT000419680 85.245850 4.073189 TALONT000418752 ENST00000373808.8 -68.603180 -7.768958 0.000000e+00 GSN

Swan also now automatically tracks transcription start site (TSS) and transcription end site (TES) usage, and can find genes that exhibit DIE on the basis of their starts or ends. To do this, use the kind argument to die_gene_test.

# find genes that exhibit DIE for TSSs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tss',
                                          obs_col=obs_col,
                                          obs_conditions=obs_conditions,
                                          verbose=True)
die_table.head()
Testing for DIE for each gene: 100%|█████████▉| 14674/14684 [02:18<00:00, 162.60it/s]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
0 ENSG00000000419 0.249561 10.002187 ENSG00000000419_2 ENSG00000000419_1 9.906052 0.096133 ENSG00000000419_3 ENSG00000000419_4 -7.218704 -2.783483 0.536906 DPM1
1 ENSG00000001461 0.852914 9.093739 ENSG00000001461_5 NaN 9.093739 NaN ENSG00000001461_1 ENSG00000001461_3 -5.543442 -1.775148 1.000000 NIPAL3
2 ENSG00000001497 0.891496 3.846153 ENSG00000001497_1 NaN 3.846146 NaN ENSG00000001497_2 NaN -3.846153 NaN 1.000000 LAS1L
3 ENSG00000001630 0.286866 2.580168 ENSG00000001630_3 NaN 2.580168 NaN ENSG00000001630_2 ENSG00000001630_1 -1.549232 -1.030928 0.582636 CYP51A1
4 ENSG00000002330 0.184635 12.817431 ENSG00000002330_2 ENSG00000002330_1 11.868484 0.948944 ENSG00000002330_4 ENSG00000002330_3 -12.603584 -0.213847 0.454053 BAD

To access the die results on the tss level, use die_kind='tss' as input to make_uns_key().

# die_iso - TSS level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col,
                            obs_conditions=obs_conditions,
                            die_kind='tss')
test = sg.adata.uns[uns_key]
test.head(2)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
0 ENSG00000000419 0.249561 10.002187 ENSG00000000419_2 ENSG00000000419_1 9.906052 0.096133 ENSG00000000419_3 ENSG00000000419_4 -7.218704 -2.783483 0.536906 DPM1
1 ENSG00000001461 0.852914 9.093739 ENSG00000001461_5 NaN 9.093739 NaN ENSG00000001461_1 ENSG00000001461_3 -5.543442 -1.775148 1.000000 NIPAL3

And provide the kind='tss' option to get_die_genes() when trying to filter your test results.

test = sg.get_die_genes(kind='tss', obs_col=obs_col,
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
10 ENSG00000003402 7.338098e-15 84.848486 ENSG00000003402_3 ENSG00000003402_7 77.272728 7.575758 ENSG00000003402_1 ENSG00000003402_5 -31.717173 -22.222223 4.390279e-13 CFLAR
11 ENSG00000003436 3.812087e-06 34.072281 ENSG00000003436_1 ENSG00000003436_3 28.073771 4.564083 ENSG00000003436_4 NaN -34.072281 NaN 7.064168e-05 TFPI
16 ENSG00000004487 1.135407e-03 75.714287 ENSG00000004487_1 NaN 75.714287 NaN ENSG00000004487_2 NaN -75.714285 NaN 1.020405e-02 KDM1A
33 ENSG00000006282 4.544048e-03 19.819595 ENSG00000006282_2 ENSG00000006282_1 13.679245 6.140350 ENSG00000006282_3 NaN -19.819595 NaN 3.236475e-02 SPATA20
43 ENSG00000007376 1.256379e-04 27.898551 ENSG00000007376_3 ENSG00000007376_1 23.454107 4.444445 ENSG00000007376_5 ENSG00000007376_7 -14.130434 -7.512077 1.525134e-03 RPUSD1

For TESs, use kind='tes' as input to die_genes_test(), die_kind='tes' to make_uns_key(), and kind='tes' to get_die_genes().

# find genes that exhibit DIE for TESs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tes',
                                          obs_col='cell_line',
                                          obs_conditions=['hepg2', 'hffc6'])
die_table.head()
Testing for DIE for each gene: 100%|██████████| 14684/14684 [02:19<00:00, 105.51it/s]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
0 ENSG00000000419 1.000000 0.096133 ENSG00000000419_2 NaN 0.096133 NaN ENSG00000000419_1 NaN -0.096130 NaN 1.000000 DPM1
1 ENSG00000001461 0.852914 9.093739 ENSG00000001461_3 NaN 9.093739 NaN ENSG00000001461_4 ENSG00000001461_1 -5.543442 -1.775148 1.000000 NIPAL3
2 ENSG00000001630 0.286866 2.580168 ENSG00000001630_2 NaN 2.580168 NaN ENSG00000001630_1 ENSG00000001630_3 -1.549232 -1.030928 0.618077 CYP51A1
3 ENSG00000002330 0.184635 12.817431 ENSG00000002330_1 ENSG00000002330_4 11.868484 0.948944 ENSG00000002330_2 ENSG00000002330_3 -12.603584 -0.213847 0.480860 BAD
4 ENSG00000002549 0.679148 0.694543 ENSG00000002549_2 NaN 0.694542 NaN ENSG00000002549_1 NaN -0.694543 NaN 0.926889 LAP3
# die_iso - TES level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col,
                            obs_conditions=obs_conditions,
                            die_kind='tes')
test = sg.adata.uns[uns_key]
test.head(2)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
0 ENSG00000000419 1.000000 0.096133 ENSG00000000419_2 NaN 0.096133 NaN ENSG00000000419_1 NaN -0.096130 NaN 1.0 DPM1
1 ENSG00000001461 0.852914 9.093739 ENSG00000001461_3 NaN 9.093739 NaN ENSG00000001461_4 ENSG00000001461_1 -5.543442 -1.775148 1.0 NIPAL3
test = sg.get_die_genes(kind='tes', obs_col=obs_col,
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
9 ENSG00000003402 7.338098e-15 84.848486 ENSG00000003402_5 ENSG00000003402_6 77.272728 7.575758 ENSG00000003402_9 ENSG00000003402_7 -31.717173 -22.222223 5.296535e-13 CFLAR
10 ENSG00000003436 2.772096e-07 32.637854 ENSG00000003436_1 ENSG00000003436_4 22.082711 10.555142 ENSG00000003436_2 ENSG00000003436_5 -30.435921 -1.818182 7.003007e-06 TFPI
14 ENSG00000004487 1.135407e-03 75.714287 ENSG00000004487_2 NaN 75.714287 NaN ENSG00000004487_1 NaN -75.714285 NaN 1.141621e-02 KDM1A
32 ENSG00000006282 6.858641e-03 19.819595 ENSG00000006282_2 NaN 19.819595 NaN ENSG00000006282_1 NaN -19.819595 NaN 4.915014e-02 SPATA20
60 ENSG00000010278 3.191221e-22 39.024387 ENSG00000010278_2 NaN 39.024387 NaN ENSG00000010278_3 ENSG00000010278_1 -38.605976 -0.418410 3.486194e-20 CD9

Finally, for transcriptomes generated with Cerberus, Swan automatically tracks intron chain usage, and you can perform intron chain switching analysis with kind='ic' as input to the die_gene_test() function.

sg_brain = swan.read('swan_modelad.p')

# find genes that exhibit DIE for ICs between genotypes
die_table, die_results = sg_brain.die_gene_test(kind='ic',
                                                obs_col='genotype',
                                                obs_conditions=['b6n', '5xfad'])
die_table.head()
Read in graph from swan_modelad.p
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid p_val dpi pos_iso_1 pos_iso_2 pos_iso_1_dpi pos_iso_2_dpi neg_iso_1 neg_iso_2 neg_iso_1_dpi neg_iso_2_dpi adj_p_val gname
0 ENCODEMG000055991 4.100695e-01 6.071428 ENCODEMG000055991_2 ENCODEMG000055991_3 3.571428 2.500000 ENCODEMG000055991_1 NaN -6.071426 NaN 0.707992 ENCODEMG000055991
1 ENCODEMG000055998 6.190162e-01 9.722223 ENCODEMG000055998_2 NaN 9.722223 NaN ENCODEMG000055998_1 NaN -9.722218 NaN 0.836200 ENCODEMG000055998
2 ENCODEMG000056718 8.081718e-07 6.289159 NaN NaN 2.329770 1.953835 NaN NaN -3.339438 -2.949721 0.000020 ENCODEMG000056718
3 ENCODEMG000056804 2.107047e-03 27.863049 ENCODEMG000056804_1 NaN 27.863047 NaN ENCODEMG000056804_2 NaN -27.863049 NaN 0.021510 ENCODEMG000056804
4 ENCODEMG000063411 7.699701e-01 12.500000 ENCODEMG000063411_1 NaN 12.500000 NaN ENCODEMG000063411_2 NaN -12.500000 NaN 0.919808 ENCODEMG000063411

Combining metadata columns

What if none of the metadata columns you have summarize the comparisons you want to make? What if I want to find differentially expressed genes or transcripts, or find isoform-switching genes between hffc6 replicate 3 and hepg2 replicate 1?

sg.adata.obs.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
cell_line replicate dataset total_counts
dataset
hepg2_1 hepg2 1 hepg2_1 499647.0
hepg2_2 hepg2 2 hepg2_2 848447.0
hffc6_1 hffc6 1 hffc6_1 761493.0
hffc6_2 hffc6 2 hffc6_2 787967.0
hffc6_3 hffc6 3 hffc6_3 614921.0

Let's ignore for a moment the fact that the dataset column does effectively capture both replicate as well as cell line metadata. This may not always be the case with more complex datasets. Swan has a function to concatenate columns together and add them as an additional column to the metadata tables. Use the following code to generate a new column that concatenates as many preexisting metadata columns as you wish:

col_name = sg.add_multi_groupby(['cell_line', 'replicate'])

print(col_name)
print(sg.adata.obs.head())
cell_line_replicate
        cell_line replicate  dataset  total_counts cell_line_replicate
dataset                                                               
hepg2_1     hepg2         1  hepg2_1      499647.0             hepg2_1
hepg2_2     hepg2         2  hepg2_2      848447.0             hepg2_2
hffc6_1     hffc6         1  hffc6_1      761493.0             hffc6_1
hffc6_2     hffc6         2  hffc6_2      787967.0             hffc6_2
hffc6_3     hffc6         3  hffc6_3      614921.0             hffc6_3

The added column in col_name can then be used as the obs_col input to die_gene_test() as follows:

obs_col = col_name
obs_conditions = ['hffc6_3', 'hepg2_1']


die_table, die_results = sg.die_gene_test(kind='iso',
                                          obs_col=obs_col,
                                          obs_conditions=obs_conditions)

Exon skipping and intron retention

Swan can detect novel (unannotated) exon skipping and intron retention events.

To obtain a dataframe of novel exon skipping events, run the following code:

# returns a DataFrame of genes, transcripts, and specific edges in
# the SwanGraph with novel exon skipping events
es_df = sg.find_es_genes(verbose=True)
Testing each novel edge for exon skipping:   0%|          | 0/855 [00:00<?, ?it/s]

Analyzing 855 intronic edges for ES


Testing each novel edge for exon skipping: 100%|██████████| 855/855 [1:26:06<00:00,  6.13s/it]

Found 529 novel es events in 149 transcripts.
es_df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid tid edge_id
0 ENSG00000157916.19 TALONT000218256 952616
0 ENSG00000122406.13 TALONT000425229 952716
0 ENSG00000224093.5 TALONT000434035 952720
0 ENSG00000224093.5 TALONT000434035 952720
0 ENSG00000224093.5 TALONT000434035 952720

To obtain a list of genes containing novel intron retention events, run the following code:

# returns a DataFrame of genes, transcripts, and specific edges in
# the SwanGraph with novel intron retaining events
ir_df = sg.find_ir_genes(verbose=True)
  0%|          | 0/1186 [00:00<?, ?it/s]�[A
Testing each novel edge for intron retention:   0%|          | 0/1186 [00:00<?, ?it/s]�[A

Analyzing 1186 exonic edges for IR
Testing each novel edge for intron retention: 100%|██████████| 1186/1186 [1:59:16<00:00,  5.97s/it]�[A

Found 35 novel ir events in 27 transcripts.

You can pass gene IDs from es_df into gen_report() or plot_graph() to visualize where these exon-skipping events are in gene reports or gene summary graphs respectively.