Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about GO terms in clusterProfiler #720

Open
shuaizh117 opened this issue Aug 22, 2024 · 3 comments
Open

Question about GO terms in clusterProfiler #720

shuaizh117 opened this issue Aug 22, 2024 · 3 comments

Comments

@shuaizh117
Copy link

I was confused by GO terms popped out in my GO analysis using enrichGO. For example, "GO:0141091: transforming growth factor beta receptor superfamily signaling pathway" was reported as one of the top overrepresented GO in my sig DEGs. And below is the gene list "HSPA1A/TNFAIP6/FST/ZBTB7A/HES1/ITGB8/TGFB2/MYOCD/LATS2/THBS1/GREM1" enriched to this term. However, when I refer to PANTHER regarding GO:0141091, I only found TGFB2 in this GO term. If the other genes do not belong to this GO:0141091, then how did this GO get enriched?? I think I'm still not very clear about the behind scene story of how enrichGO perform the calculation.

sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] viridis_0.6.5 viridisLite_0.4.2
[3] topGO_2.56.0 SparseM_1.81
[5] GO.db_3.19.1 graph_1.82.0
[7] pathview_1.44.0 enrichplot_1.24.0
[9] clusterProfiler_4.12.3 ggpubr_0.6.0
[11] corrplot_0.92 annotables_0.2.0
[13] RColorBrewer_1.1-3 ggrepel_0.9.5
[15] pheatmap_1.0.12 lubridate_1.9.3
[17] forcats_1.0.0 stringr_1.5.1
[19] dplyr_1.1.4 purrr_1.0.2
[21] readr_2.1.5 tidyr_1.3.1
[23] tibble_3.2.1 tidyverse_2.0.0
[25] ggplot2_3.5.1 org.Hs.eg.db_3.19.1
[27] AnnotationDbi_1.66.0 DESeq2_1.44.0
[29] SummarizedExperiment_1.34.0 Biobase_2.64.0
[31] MatrixGenerics_1.16.0 matrixStats_1.3.0
[33] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
[35] IRanges_2.38.0 S4Vectors_0.42.0
[37] BiocGenerics_0.50.0

loaded via a namespace (and not attached):
[1] rstudioapi_0.16.0 jsonlite_1.8.8 magrittr_2.0.3
[4] farver_2.1.1 rmarkdown_2.26 fs_1.6.4
[7] zlibbioc_1.50.0 vctrs_0.6.5 memoise_2.0.1
[10] RCurl_1.98-1.14 ggtree_3.12.0 rstatix_0.7.2
[13] htmltools_0.5.8.1 S4Arrays_1.4.0 broom_1.0.5
[16] SparseArray_1.4.3 gridGraphics_0.5-1 plyr_1.8.9
[19] httr2_1.0.1 cachem_1.0.8 igraph_2.0.3
[22] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.7-0
[25] R6_2.5.1 fastmap_1.1.1 gson_0.1.0
[28] GenomeInfoDbData_1.2.12 digest_0.6.31 aplot_0.2.2
[31] colorspace_2.1-0 patchwork_1.2.0 RSQLite_2.3.6
[34] labeling_0.4.3 timechange_0.3.0 fansi_1.0.4
[37] httr_1.4.7 polyclip_1.10-6 abind_1.4-5
[40] compiler_4.4.0 bit64_4.0.5 withr_3.0.0
[43] backports_1.4.1 BiocParallel_1.38.0 carData_3.0-5
[46] DBI_1.2.2 ggforce_0.4.2 ggsignif_0.6.4
[49] MASS_7.3-60.2 rappdirs_0.3.3 DelayedArray_0.30.1
[52] HDO.db_0.99.1 tools_4.4.0 ape_5.8
[55] scatterpie_0.2.2 glue_1.6.2 nlme_3.1-164
[58] GOSemSim_2.30.0 grid_4.4.0 shadowtext_0.1.3
[61] reshape2_1.4.4 fgsea_1.30.0 generics_0.1.3
[64] gtable_0.3.5 tzdb_0.4.0 hms_1.1.3
[67] data.table_1.15.4 car_3.1-2 tidygraph_1.3.1
[70] utf8_1.2.3 XVector_0.44.0 pillar_1.9.0
[73] yulab.utils_0.1.6 limma_3.60.0 splines_4.4.0
[76] tweenr_2.0.3 treeio_1.28.0 lattice_0.22-6
[79] bit_4.0.5 tidyselect_1.2.1 locfit_1.5-9.9
[82] Biostrings_2.72.0 knitr_1.46 gridExtra_2.3
[85] xfun_0.43 graphlayouts_1.1.1 statmod_1.5.0
[88] KEGGgraph_1.64.0 stringi_1.8.4 UCSC.utils_1.0.0
[91] lazyeval_0.2.2 ggfun_0.1.4 yaml_2.3.7
[94] evaluate_0.23 codetools_0.2-20 ggraph_2.2.1
[97] qvalue_2.36.0 Rgraphviz_2.48.0 ggplotify_0.1.2
[100] cli_3.6.2 munsell_0.5.1 Rcpp_1.0.12
[103] png_0.1-8 XML_3.99-0.16.1 parallel_4.4.0
[106] blob_1.2.4 DOSE_3.30.0 bitops_1.0-7
[109] tidytree_0.4.6 scales_1.3.0 crayon_1.5.2
[112] rlang_1.1.3 cowplot_1.1.3 fastmatch_1.1-4
[115] KEGGREST_1.44.0

@ifitsok
Copy link

ifitsok commented Aug 27, 2024

I came across the same problem, does anyone know what's going on?

The enrichment result was got; but when I used the genes enriched to look up to their original GO ID, I just couldn't find the GO IDs that the software provided from the original GO IDs of the genes.

@ifitsok
Copy link

ifitsok commented Aug 27, 2024

Hey bro,

https://www.ebi.ac.uk/QuickGO/. Maybe you can go to this website and search your GO IDs, the hitten GOID might be the parent term of the annotated GOID for your genes. (Although there is some exclusion).

@guidohooiveld
Copy link

guidohooiveld commented Aug 28, 2024

enrichGO() (and other GO-based analyses) uses the Gene Ontology information present in the Bioconductor package GO.db. (here).

To see which genes have been annotated to a specific GO category in the GO.db package you can for example do:
(note the use of keytype = "GOALL", and not (just) "GO"!)

> library(org.Hs.eg.db)
> library(GO.db)
> 
> ## query
> results <- AnnotationDbi::select(org.Hs.eg.db, keys=c("GO:0141091"), columns = c('ENTREZID','SYMBOL'), keytype = "GOALL")
'select()' returned 1:many mapping between keys and columns
> 
> dim(results)
[1] 592   5
> head(results)
       GOALL EVIDENCEALL ONTOLOGYALL ENTREZID SYMBOL
1 GO:0141091         IEA          BP       25   ABL1
2 GO:0141091         IBA          BP       90  ACVR1
3 GO:0141091         IDA          BP       90  ACVR1
4 GO:0141091         IMP          BP       90  ACVR1
5 GO:0141091         ISS          BP       90  ACVR1
6 GO:0141091         IBA          BP       91 ACVR1B
> 
> ## remove duplicates 
> results <- results[!duplicated(results[,"ENTREZID"]), ]
> 
> dim(results)
[1] 388   5
> head(results)
        GOALL EVIDENCEALL ONTOLOGYALL ENTREZID SYMBOL
1  GO:0141091         IEA          BP       25   ABL1
2  GO:0141091         IBA          BP       90  ACVR1
6  GO:0141091         IBA          BP       91 ACVR1B
10 GO:0141091         IBA          BP       92 ACVR2A
13 GO:0141091         IBA          BP       93 ACVR2B
17 GO:0141091         IBA          BP       94 ACVRL1
> 

So 388 genes have been annotated to this particular GO category.

Manually confirm that all enriched genes are in these as well? Note that last gene is not! (I just added GAPDH as example).
Answer: YES!

> enriched.genes <- c("HSPA1A","TNFAIP6","FST","ZBTB7A","HES1","ITGB8",
+                     "TGFB2","MYOCD","LATS2","THBS1","GREM1","GAPDH")
> enriched.genes %in% results[,"SYMBOL"]
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
>

As said, enrichGO uses GOALL, so also taking into account the parent-child relationships, but maybe PANTHER does not? Therefore only directly annotated genes are retrieved?
In this context, see also this recent discussion: #697 (comment)

Also note that although the last release of PANTHER (=19) is from June 2024, the underlying GO annotations used for this release seem to be from May 2023 (thus a year old).
https://www.pantherdb.org/news/news20240620.jsp

Yet, the source of the GO.db is 2024-Mar12.

> GO.db
GODb object:
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2024-01-17
| Db type: GODb
| package: AnnotationDbi
| DBSCHEMA: GO_DB
| GOEGSOURCEDATE: 2024-Mar12
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| DBSCHEMAVERSION: 2.1

Please see: help('select') for usage information
> 

Both could explain your observation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants