Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gene name in cnetplot--with own GO annotation data #275

Open
kersng opened this issue May 23, 2024 · 2 comments
Open

gene name in cnetplot--with own GO annotation data #275

kersng opened this issue May 23, 2024 · 2 comments

Comments

@kersng
Copy link

kersng commented May 23, 2024

Hi, I have my own GO annotation data with multiple treatment groups for comparison, and so I did the comparecluster with 'enricher' function. I would like to visualize the result in cnetplot with gene name. Do you know how to change the gene ID into the gene name in this case? I tried the 'setReadable' function but it was not working since my microorganism is not from the Org database.

f=compareCluster(GeneID~group+treatment, data = big_df, fun="enricher", TERM2GENE = go_gene, TERM2NAME = go_terms, pvalueCutoff = 0.05, pAdjustMethod = "BH")

Thank you! :)

@guidohooiveld
Copy link

guidohooiveld commented Jun 26, 2024

In case you were not able to solve this yet:

The function setReadable is indeed normally used to do this, but as you said it depends on an OrgDb. Yet, the nice thing of open-source software is that... the source (code) is open to everyone!

So you could lookup the code of setReadable and tweak it according to your use case. See below for an example. Note that setReadable is part of DOSE.
https://github.com/YuLab-SMU/DOSE/blob/devel/R/setReadable.R

Key is line 53 of setReadable:
https://github.com/YuLab-SMU/DOSE/blob/98301d9860df116cd89852c660cd0fd132efac2e/R/setReadable.R#L53
gn <- EXTID2NAME(OrgDb, genes, keyType))

In this line 53 the gene ids (= genes) are converted in symbols, and stored in object gn. Note that gn is just a simple character vector (with names that are the gene ids!).

Thus: if you create such character vector, independent from an OrgDb, then you are set!

To illustrate the above let's get started by generating a compareClusterResult using the example human dataset; note that I generate a compareClusterResult for KEGG-based gene sets, but in order to show the concept that doesn't matter. The essence is that the input are entrezid, that are then converted into gene symbols.

> ## load libraries
> library(clusterProfiler)
> library(org.Hs.eg.db)
> 
> ##load example data
> data(gcSample)
> 
> ## run enrichKEGG on all clusters
> ck <- compareCluster(geneCluster = gcSample, fun = enrichKEGG)
> 
> ## check output
> ck
#
# Result of Comparing 8 gene clusters 
#
#.. @fun         enrichKEGG 
#.. @geneClusters       List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result      'data.frame':   76 obs. of  15 variables:
 $ Cluster       : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
 $ category      : chr  "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
 $ subcategory   : chr  "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
 $ ID            : chr  "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
 $ Description   : chr  "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
 $ GeneRatio     : chr  "23/410" "8/410" "18/410" "9/195" ...
 $ BgRatio       : chr  "203/8842" "38/8842" "158/8842" "89/8842" ...
 $ RichFactor    : num  0.113 0.211 0.114 0.101 0.057 ...
 $ FoldEnrichment: num  2.44 4.54 2.46 4.59 2.59 ...
 $ zScore        : num  4.59 4.82 4.07 5.1 4.18 ...
 $ pvalue        : num  6.28e-05 2.86e-04 3.65e-04 1.42e-04 3.05e-04 ...
 $ p.adjust      : num  0.0198 0.0383 0.0383 0.0358 0.0358 ...
 $ qvalue        : num  0.0184 0.0357 0.0357 0.0328 0.0328 ...
 $ geneID        : chr  "4067/3383/7128/1869/890/1871/578/864/637/9641/6891/355/9134/5971/916/956/6850/7187/3551/919/4734/958/6772" "100/6891/3932/973/916/925/958/64421" "991/1869/890/1871/701/990/10926/9088/8317/9700/9134/1029/2810/699/11200/23594/8555/4173" "7057/3339/1299/3695/1101/3679/3910/3696/3693" ...
 $ Count         : int  23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#..   X1: 0 
#..   X2: 3 
#..   X3: 3 
#..   X4: 21 
#..   X5: 9 
#..   X6: 1 
#..   X7: 18 
#..   X8: 21 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

>

The challenge is now to be able to manually convert the gene ids into symbols.

To do this a character vector is needed that 'links' the gene ids to symbols (i.e. my object id2symbol). Note that the names of the vector are the gene ids.

Note also that since the example dataset are human entrezids, the human OrgDb can be used to generate such character vector. This is what I do now. In your case you should generate such id-to-symbol vector manually.

> ## create id-to-symbol mapping vector for all gene ids
> ## first extract all gene ids from annotation database
> all.genes <- keys(org.Hs.eg.db)
> 
> ## then map all.genes (entrezid) to symbols
> ## note that the rownames of id2symbol are the (input) entrezids
> ## within the function setReadable 'id2symbol' corresponds to the output of the function 'EXTID2NAME', but then for all genes
> ## 'gn' is generated on line 53 (gn <- EXTID2NAME(OrgDb, genes, keyType) )
> id2symbol <-  mapIds(org.Hs.eg.db, keys=all.genes, column="SYMBOL", keytype="ENTREZID")
'select()' returned 1:1 mapping between keys and columns
> 
> class(id2symbol)
[1] "character"
> head(id2symbol)
      1       2       3       9      10      11 
 "A1BG"   "A2M" "A2MP1"  "NAT1"  "NAT2"  "NATP" 
> tail(id2symbol)
  133395148   133395149   133395150   133834869   134006662   134198642 
   "SKP1P4"    "AMANZI"   "LNCARGI"     "MLDHR" "SSU72-AS1"      "PSS3" 
>

Continue by running step-by-step the relevant parts of the code of setReadable:

> ## extract all gene ids from result object
> ## note that function geneInCategory is part of DOSE
> ## line 38 code
> gc <- geneInCategory(ck)
> 
> ## check whether compareCluster is based on ORA or GSEA analysis
> ## in case of latter, column named "core_enrichment" is present in result object
> ## lines 42-48
> if ("core_enrichment" %in% colnames(as.data.frame(ck))) {
+             geneslist <- ck@geneClusters
+             names(geneslist) <- NULL
+             genes <- unique(names(unlist(geneslist)))
+         } else {
+             genes <- unique(unlist(ck@geneClusters))
+         }     
>

Now the key step: modify line 53 to use id2symbol for all genes in the result object.

> # gn <- EXTID2NAME(org.Hs.eg.db, genes, keyType) ## original code line 53
> gn <- id2symbol[genes]

Continue with code.

> ## lines 57-69
> gc2 <- list()
> k <- 1
> for(i in seq_len(length(gc))) {
+     for(j in seq_len(length(gc[[i]]))) {
+         gc2[[k]] <- gc[[i]][[j]]
+         names(gc2)[k] <- paste(names(gc)[[i]], names(gc[[i]])[j], sep="-")
+         k <- k + 1
+     }
+ }
> gc <- gc2
> gc <- lapply(gc, function(i) gn[i])
> res <- ck@compareClusterResult
> gc <- gc[paste(res$Cluster, res$ID, sep= "-")]

> ## prepare converted ids to be put on result object
> ## line 81
> geneID <- sapply(gc, paste0, collapse="/")
>
> ## line 86
> res$geneID <- unlist(geneID)
>
> ## complete the output that added to the returned object
> ## lines 88-92
> ck@gene2Symbol <- gn
> ck@keytype <- keyType
> ck@readable <- TRUE
> ck@compareClusterResult <- res
> 

Check:

> ck
#
# Result of Comparing 8 gene clusters 
#
#.. @fun         enrichKEGG 
#.. @geneClusters       List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result      'data.frame':   76 obs. of  15 variables:
 $ Cluster       : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
 $ category      : chr  "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
 $ subcategory   : chr  "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
 $ ID            : chr  "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
 $ Description   : chr  "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
 $ GeneRatio     : chr  "23/410" "8/410" "18/410" "9/195" ...
 $ BgRatio       : chr  "203/8842" "38/8842" "158/8842" "89/8842" ...
 $ RichFactor    : num  0.113 0.211 0.114 0.101 0.057 ...
 $ FoldEnrichment: num  2.44 4.54 2.46 4.59 2.59 ...
 $ zScore        : num  4.59 4.82 4.07 5.1 4.18 ...
 $ pvalue        : num  6.28e-05 2.86e-04 3.65e-04 1.42e-04 3.05e-04 ...
 $ p.adjust      : num  0.0198 0.0383 0.0383 0.0358 0.0358 ...
 $ qvalue        : num  0.0184 0.0357 0.0357 0.0328 0.0328 ...
 $ geneID        : chr  "LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1" "ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C" "CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4" "THBS1/HSPG2/COL9A3/ITGB7/CHAD/ITGA7/LAMA4/ITGB8/ITGB5" ...
 $ Count         : int  23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#..   X1: 0 
#..   X2: 3 
#..   X3: 3 
#..   X4: 21 
#..   X5: 9 
#..   X6: 1 
#..   X7: 18 
#..   X8: 21 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> as.data.frame(ck)[1:3,]
  Cluster           category               subcategory       ID
1      X2     Human Diseases Infectious disease: viral hsa05169
2      X2     Human Diseases            Immune disease hsa05340
3      X2 Cellular Processes     Cell growth and death hsa04110
                   Description GeneRatio  BgRatio RichFactor FoldEnrichment
1 Epstein-Barr virus infection    23/410 203/8842  0.1133005       2.443422
2     Primary immunodeficiency     8/410  38/8842  0.2105263       4.540180
3                   Cell cycle    18/410 158/8842  0.1139241       2.456869
    zScore       pvalue   p.adjust     qvalue
1 4.587614 6.281651e-05 0.01978720 0.01844822
2 4.822302 2.860603e-04 0.03828932 0.03569832
3 4.074427 3.646602e-04 0.03828932 0.03569832
                                                                                                                         geneID
1 LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1
2                                                                                     ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C
3                          CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4
  Count
1    23
2     8
3    18
> 


Compare with output from setReadable()

> ck.again <- compareCluster(geneCluster = gcSample, fun = enrichKEGG)
> ck.readable <- setReadable(ck.again, org.Hs.eg.db, "ENTREZID")
> 
> ck.readable
#
# Result of Comparing 8 gene clusters 
#
#.. @fun         enrichKEGG 
#.. @geneClusters       List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
#...Result      'data.frame':   76 obs. of  15 variables:
 $ Cluster       : Factor w/ 8 levels "X1","X2","X3",..: 2 2 2 3 3 3 4 4 4 4 ...
 $ category      : chr  "Human Diseases" "Human Diseases" "Cellular Processes" "Environmental Information Processing" ...
 $ subcategory   : chr  "Infectious disease: viral" "Immune disease" "Cell growth and death" "Signaling molecules and interaction" ...
 $ ID            : chr  "hsa05169" "hsa05340" "hsa04110" "hsa04512" ...
 $ Description   : chr  "Epstein-Barr virus infection" "Primary immunodeficiency" "Cell cycle" "ECM-receptor interaction" ...
 $ GeneRatio     : chr  "23/410" "8/410" "18/410" "9/195" ...
 $ BgRatio       : chr  "203/8842" "38/8842" "158/8842" "89/8842" ...
 $ RichFactor    : num  0.113 0.211 0.114 0.101 0.057 ...
 $ FoldEnrichment: num  2.44 4.54 2.46 4.59 2.59 ...
 $ zScore        : num  4.59 4.82 4.07 5.1 4.18 ...
 $ pvalue        : num  6.28e-05 2.86e-04 3.65e-04 1.42e-04 3.05e-04 ...
 $ p.adjust      : num  0.0198 0.0383 0.0383 0.0358 0.0358 ...
 $ qvalue        : num  0.0184 0.0357 0.0357 0.0328 0.0328 ...
 $ geneID        : chr  "LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1" "ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C" "CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4" "THBS1/HSPG2/COL9A3/ITGB7/CHAD/ITGA7/LAMA4/ITGB8/ITGB5" ...
 $ Count         : int  23 8 18 9 17 19 17 10 22 19 ...
#.. number of enriched terms found for each gene cluster:
#..   X1: 0 
#..   X2: 3 
#..   X3: 3 
#..   X4: 21 
#..   X5: 9 
#..   X6: 1 
#..   X7: 18 
#..   X8: 21 
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, 
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. 
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. 
The Innovation. 2021, 2(3):100141 

> as.data.frame(ck.readable)[1:3,]
  Cluster           category               subcategory       ID
1      X2     Human Diseases Infectious disease: viral hsa05169
2      X2     Human Diseases            Immune disease hsa05340
3      X2 Cellular Processes     Cell growth and death hsa04110
                   Description GeneRatio  BgRatio RichFactor FoldEnrichment
1 Epstein-Barr virus infection    23/410 203/8842  0.1133005       2.443422
2     Primary immunodeficiency     8/410  38/8842  0.2105263       4.540180
3                   Cell cycle    18/410 158/8842  0.1139241       2.456869
    zScore       pvalue   p.adjust     qvalue
1 4.587614 6.281651e-05 0.01978720 0.01844822
2 4.822302 2.860603e-04 0.03828932 0.03569832
3 4.074427 3.646602e-04 0.03828932 0.03569832
                                                                                                                         geneID
1 LYN/ICAM1/TNFAIP3/E2F1/CCNA2/E2F3/BAK1/RUNX3/BID/IKBKE/TAP2/FAS/CCNE2/RELB/CD3E/ENTPD3/SYK/TRAF3/IKBKB/CD247/NEDD4/CD40/STAT1
2                                                                                     ADA/TAP2/LCK/CD79A/CD3E/CD8A/CD40/DCLRE1C
3                          CDC20/E2F1/CCNA2/E2F3/BUB1B/CDC6/DBF4/PKMYT1/CDC7/ESPL1/CCNE2/CDKN2A/SFN/BUB1/CHEK2/ORC6/CDC14B/MCM4
  Count
1    23
2     8
3    18
> 



@kersng
Copy link
Author

kersng commented Jul 1, 2024

Hi, thank you so much for the detail demonstration :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants