Introduction

In RNA-seq analysis, it is common that we need to convert gene IDs in Gene Symbol, ENSEMBL Gene ID, Entrez Gene ID and normalize the count matrix with different methods for subsequent analysis. DEbPeak provides IDConversion to perform gene ID conversion, and provides five different normalization methods (CPM, TMM, DESeq2’s median of ratios, RPKM and TPM) to normalize the count matrix.


Example data

We will used the results output by “Quality Control” vignette.

# library
suppressWarnings(suppressMessages(library(DESeq2)))
suppressWarnings(suppressMessages(library(DEbPeak)))
# load data
load(file = "/home/songyabing/R/learn/tmp/DEbPeak/example.RData")

Differential Expression Analysis

We will use DESeq2 as example:

# set control level
dds$condition <- relevel(dds$condition, ref = "WT")
# conduct differential expressed genes analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# extract results
dds.results <- results(dds,contrast=c("condition",'KO','WT'))
dds.results.ordered <- dds.results[order(dds.results$log2FoldChange,decreasing = TRUE),]
head(dds.results.ordered)
## log2 fold change (MLE): condition KO vs WT 
## Wald test p-value: condition KO vs WT 
## DataFrame with 6 rows and 6 columns
##                        baseMean log2FoldChange     lfcSE      stat      pvalue
##                       <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSMUSG00000025934.15   2.01093        4.46030   3.01742   1.47818          NA
## ENSMUSG00000105597.1    1.64376        4.16931   2.98872   1.39502 0.163011113
## ENSMUSG00000109866.1    2.32064        4.14573   1.22504   3.38415 0.000713987
## ENSMUSG00000085819.7    2.23441        4.07496   1.65939   2.45569 0.014061354
## ENSMUSG00000102474.1    1.52867        4.06781   1.80781   2.25013 0.024440396
## ENSMUSG00000042677.7    1.49781        4.05394   1.71350   2.36588 0.017987495
##                            padj
##                       <numeric>
## ENSMUSG00000025934.15        NA
## ENSMUSG00000105597.1    0.99973
## ENSMUSG00000109866.1    0.99973
## ENSMUSG00000085819.7    0.99973
## ENSMUSG00000102474.1    0.99973
## ENSMUSG00000042677.7    0.99973

Gene ID conversion

The raw matrix:

head(count.matrix)
##                        KO1  KO2  KO3  KO4  KO5  KO6  WT1  WT2  WT3  WT4  WT5
## ENSMUSG00000000001.4  4556 4218 3835 3718 5741 3875 4115 5074 2931 4374 5118
## ENSMUSG00000000003.15    0    0    0    0    0    0    0    0    0    0    0
## ENSMUSG00000000028.14  350  579  435  316  432  317  245  621  419  506  545
## ENSMUSG00000000031.16  268  804   66  207   46   66  336  112   60   69  137
## ENSMUSG00000000037.16  262  157  184  162  301  233  311  176   94  139  229
## ENSMUSG00000000049.11    0    0    0    0    0    0    1    0    2    0    1
##                        WT6
## ENSMUSG00000000001.4  3625
## ENSMUSG00000000003.15    0
## ENSMUSG00000000028.14  371
## ENSMUSG00000000031.16  193
## ENSMUSG00000000037.16  186
## ENSMUSG00000000049.11    0

After gene IDs conversion:

count.matrix <- IDConversion(deres = count.matrix, gene.type = "ENSEMBL",sort.key=NULL)
## 'select()' returned 1:many mapping between keys and columns
head(count.matrix)
##                        KO1  KO2  KO3  KO4  KO5  KO6  WT1  WT2  WT3  WT4  WT5
## ENSMUSG00000000001.4  4556 4218 3835 3718 5741 3875 4115 5074 2931 4374 5118
## ENSMUSG00000000003.15    0    0    0    0    0    0    0    0    0    0    0
## ENSMUSG00000000028.14  350  579  435  316  432  317  245  621  419  506  545
## ENSMUSG00000000031.16  268  804   66  207   46   66  336  112   60   69  137
## ENSMUSG00000000037.16  262  157  184  162  301  233  311  176   94  139  229
## ENSMUSG00000000049.11    0    0    0    0    0    0    1    0    2    0    1
##                        WT6 ENTREZID SYMBOL
## ENSMUSG00000000001.4  3625    14679  Gnai3
## ENSMUSG00000000003.15    0    54192   Pbsn
## ENSMUSG00000000028.14  371    12544  Cdc45
## ENSMUSG00000000031.16  193    14955    H19
## ENSMUSG00000000037.16  186   107815  Scml2
## ENSMUSG00000000049.11    0    11818   Apoh

Get gene length

The gene length is useful when perfrom count normalization (RPKM and TPM). Here, we provide GetGeneLength to calculate gene length:

# this will get DEbPeak_geneLength.txt file in working directory, you can also provide out.file
GetGeneLength(gtf.file = '/path/to/gtf', save = TRUE)

Count normalization

Here, we provide five different normalization methods: CPM, TMM, DESeq2’s median of ratios, RPKM and TPM, of which RPKM and TPM need to provide gtf files to calculate gene length.

CPM

Counts Per Million reads mapped (CPM) takes into account the sequencing depth. For each feature i, CPM is the count of sequenced fragments mapping to the feature scaled by the total number of reads times one million (to bring it up to a more convenient number).

\[ \mathrm{CPM}=\frac{\text { Number of reads mapped to gene } \times 10^{6}}{\text { Total number of mapped reads }} \]

cpm.matrix=NormalizedCount(dds, norm.type="CPM")
head(cpm.matrix)
##                             KO1       KO2        KO3       KO4        KO5
## ENSMUSG00000000001.4  405.92819 400.64944 389.764529 377.62444 380.625026
## ENSMUSG00000000003.15   0.00000   0.00000   0.000000   0.00000   0.000000
## ENSMUSG00000000028.14  31.18412  54.99669  44.210579  32.09503  28.641354
## ENSMUSG00000000031.16  23.87813  76.36846   6.707812  21.02428   3.049774
## ENSMUSG00000000037.16  23.34354  14.91275  18.700567  16.45378  19.956128
## ENSMUSG00000000049.11   0.00000   0.00000   0.000000   0.00000   0.000000
##                              KO6         WT1        WT2         WT3        WT4
## ENSMUSG00000000001.4  362.974809 424.0865495 343.046563 389.7566993 402.307594
## ENSMUSG00000000003.15   0.000000   0.0000000   0.000000   0.0000000   0.000000
## ENSMUSG00000000028.14  29.693681  25.2493814  41.985005  55.7175220  46.540385
## ENSMUSG00000000031.16   6.182281  34.6277231   7.572175   7.9786428   6.346416
## ENSMUSG00000000037.16  21.825324  32.0512556  11.899132  12.4998737  12.784809
## ENSMUSG00000000049.11   0.000000   0.1030587   0.000000   0.2659548   0.000000
##                                WT5       WT6
## ENSMUSG00000000001.4  383.64870911 355.77287
## ENSMUSG00000000003.15   0.00000000   0.00000
## ENSMUSG00000000028.14  40.85356516  36.41151
## ENSMUSG00000000031.16  10.26961179  18.94184
## ENSMUSG00000000037.16  17.16599343  18.25483
## ENSMUSG00000000049.11   0.07496067   0.00000

TMM

edgeR’s Trimmed Mean of M values (TMM) takes into account the sequencing depth, RNA composition, and gene length. For detailed information, please refer to A scaling normalization method for differential expression analysis of RNA-seq data.

tmm.matrix=NormalizedCount(dds, norm.type="TMM")
head(tmm.matrix)
##                             KO1       KO2        KO3       KO4        KO5
## ENSMUSG00000000001.4  421.83331 371.93907 371.189992 382.69198 413.745541
## ENSMUSG00000000003.15   0.00000   0.00000   0.000000   0.00000   0.000000
## ENSMUSG00000000028.14  32.40598  51.05565  42.103689  32.52573  31.133613
## ENSMUSG00000000031.16  24.81372  70.89592   6.388146  21.30641   3.315153
## ENSMUSG00000000037.16  24.25819  13.84410  17.809376  16.67458  21.692633
## ENSMUSG00000000049.11   0.00000   0.00000   0.000000   0.00000   0.000000
##                              KO6         WT1        WT2         WT3       WT4
## ENSMUSG00000000001.4  373.041177 486.1604573 334.563934 361.3055744 394.46383
## ENSMUSG00000000003.15   0.000000   0.0000000   0.000000   0.0000000   0.00000
## ENSMUSG00000000028.14  30.517175  28.9451548  40.946828  51.6503022  45.63299
## ENSMUSG00000000031.16   6.353734  39.6962123   7.384935   7.3962247   6.22268
## ENSMUSG00000000037.16  22.430605  36.7426251  11.604898  11.5874186  12.53554
## ENSMUSG00000000049.11   0.000000   0.1181435   0.000000   0.2465408   0.00000
##                                WT5       WT6
## ENSMUSG00000000001.4  371.61142274 347.48647
## ENSMUSG00000000003.15   0.00000000   0.00000
## ENSMUSG00000000028.14  39.57175174  35.56344
## ENSMUSG00000000031.16   9.94739447  18.50066
## ENSMUSG00000000037.16  16.62739660  17.82965
## ENSMUSG00000000049.11   0.07260872   0.00000

DESeq2’s median of ratios

DESeq2’s median of ratios takes into account the sequencing depth and RNA composition. For detailed information, please refer to Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.

deseq2.matrix=NormalizedCount(dds, norm.type="DESeq2")
head(deseq2.matrix)
##                             KO1       KO2        KO3       KO4        KO5
## ENSMUSG00000000001.4  4606.4055 4014.0575 4037.53736 4145.2746 4476.42605
## ENSMUSG00000000003.15    0.0000    0.0000    0.00000    0.0000    0.00000
## ENSMUSG00000000028.14  353.8722  551.0050  457.97360  352.3149  336.84307
## ENSMUSG00000000031.16  270.9650  765.1262   69.48565  230.7886   35.86755
## ENSMUSG00000000037.16  264.8986  149.4090  193.71757  180.6171  234.69853
## ENSMUSG00000000049.11    0.0000    0.0000    0.00000    0.0000    0.00000
##                              KO6         WT1        WT2         WT3       WT4
## ENSMUSG00000000001.4  4069.51479 5285.859380 3591.55792 3889.896093 4266.4884
## ENSMUSG00000000003.15    0.00000    0.000000    0.00000    0.000000    0.0000
## ENSMUSG00000000028.14  332.91257  314.710947  439.56592  556.078630  493.5627
## ENSMUSG00000000031.16   69.31303  431.603585   79.27759   79.629398   67.3040
## ENSMUSG00000000037.16  244.69599  399.490223  124.57907  124.752724  135.5834
## ENSMUSG00000000049.11    0.00000    1.284534    0.00000    2.654313    0.0000
##                                WT5       WT6
## ENSMUSG00000000001.4  3981.8764585 3774.6246
## ENSMUSG00000000003.15    0.0000000    0.0000
## ENSMUSG00000000028.14  424.0177159  386.3133
## ENSMUSG00000000031.16  106.5879396  200.9662
## ENSMUSG00000000037.16  178.1652421  193.6773
## ENSMUSG00000000049.11    0.7780142    0.0000

RPKM

Reads Per Kilobase of exon per Million reads mapped (RPKM) takes into account the sequencing depth and gene length. For each feature i as the count scaled by the feature’s length times one thousand (to kilobase) and further scaled by the total number of reads times one million.

\[ \text { RPKM }=\frac{\text { Number of reads mapped to gene } \times 10^{3} \times 10^{6}}{\text { Total number of mapped reads } \times \text { gene length }} \]

# you can either provide gtf.file or gene.length.file
rpkm.matrix=NormalizedCount(dds,gtf.file = '/path/to/gtf', norm.type="RPKM")

TPM

Transcripts Per kilobase Million (TPM) takes into account the sequencing depth and gene length.

\[ \mathrm{TPM}=\frac{\frac{\text { total reads mapped to gene } \times 10^{3}}{\text { gene length }}}{\sum(\frac{\text { total reads mapped to gene } \times 10^{3}}{\text { gene length }})} \times 10^{6} \]

TPM is proportional to RPKM:

\[ \mathrm{TPM}=\frac{R P K M}{\sum(R P K M)} \times 10^{6} \]

# you can either provide gtf.file or gene.length.file
tpm.matrix=NormalizedCount(dds,gtf.file = '/path/to/gtf', norm.type="TPM")

Session info

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /home/softwares/anaconda3/envs/r4.0/lib/libopenblasp-r0.3.12.so
## 
## locale:
##  [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=zh_CN.UTF-8        LC_COLLATE=zh_CN.UTF-8    
##  [5] LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
##  [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Mm.eg.db_3.12.0         AnnotationDbi_1.52.0       
##  [3] DEbPeak_1.4.0               DESeq2_1.30.1              
##  [5] SummarizedExperiment_1.20.0 Biobase_2.50.0             
##  [7] MatrixGenerics_1.2.1        matrixStats_0.58.0         
##  [9] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
## [11] IRanges_2.24.1              S4Vectors_0.28.1           
## [13] BiocGenerics_0.42.0        
## 
## loaded via a namespace (and not attached):
##   [1] rsvd_1.0.3                             
##   [2] ggvenn_0.1.9                           
##   [3] apeglm_1.12.0                          
##   [4] Rsamtools_2.6.0                        
##   [5] rsvg_2.1                               
##   [6] foreach_1.5.1                          
##   [7] rprojroot_2.0.2                        
##   [8] crayon_1.4.1                           
##   [9] V8_3.4.2                               
##  [10] MASS_7.3-58                            
##  [11] nlme_3.1-152                           
##  [12] backports_1.2.1                        
##  [13] sva_3.38.0                             
##  [14] GOSemSim_2.25.0                        
##  [15] rlang_1.1.0                            
##  [16] XVector_0.30.0                         
##  [17] readxl_1.4.2                           
##  [18] irlba_2.3.5                            
##  [19] limma_3.46.0                           
##  [20] GOstats_2.56.0                         
##  [21] BiocParallel_1.24.1                    
##  [22] rjson_0.2.20                           
##  [23] bit64_4.0.5                            
##  [24] glue_1.6.2                             
##  [25] DiffBind_3.0.15                        
##  [26] mixsqp_0.3-43                          
##  [27] pheatmap_1.0.12                        
##  [28] parallel_4.0.3                         
##  [29] DEFormats_1.18.0                       
##  [30] base64url_1.4                          
##  [31] tcltk_4.0.3                            
##  [32] DOSE_3.23.2                            
##  [33] haven_2.5.2                            
##  [34] tidyselect_1.2.0                       
##  [35] rio_0.5.27                             
##  [36] XML_3.99-0.6                           
##  [37] tidyr_1.3.0                            
##  [38] ggpubr_0.4.0                           
##  [39] GenomicAlignments_1.26.0               
##  [40] xtable_1.8-4                           
##  [41] ggnetwork_0.5.12                       
##  [42] magrittr_2.0.3                         
##  [43] evaluate_0.14                          
##  [44] ggplot2_3.4.2                          
##  [45] cli_3.6.1                              
##  [46] zlibbioc_1.36.0                        
##  [47] hwriter_1.3.2                          
##  [48] rstudioapi_0.14                        
##  [49] bslib_0.3.1                            
##  [50] GreyListChIP_1.22.0                    
##  [51] fastmatch_1.1-3                        
##  [52] BiocSingular_1.6.0                     
##  [53] xfun_0.30                              
##  [54] askpass_1.1                            
##  [55] clue_0.3-59                            
##  [56] gson_0.0.9                             
##  [57] cluster_2.1.1                          
##  [58] caTools_1.18.2                         
##  [59] tidygraph_1.2.0                        
##  [60] tibble_3.2.1                           
##  [61] ggrepel_0.9.1                          
##  [62] Biostrings_2.58.0                      
##  [63] png_0.1-7                              
##  [64] withr_2.5.0                            
##  [65] bitops_1.0-6                           
##  [66] ggforce_0.3.3                          
##  [67] RBGL_1.66.0                            
##  [68] plyr_1.8.6                             
##  [69] cellranger_1.1.0                       
##  [70] GSEABase_1.52.1                        
##  [71] pcaPP_2.0-1                            
##  [72] dqrng_0.2.1                            
##  [73] coda_0.19-4                            
##  [74] pillar_1.9.0                           
##  [75] gplots_3.1.1                           
##  [76] GlobalOptions_0.1.2                    
##  [77] cachem_1.0.4                           
##  [78] GenomicFeatures_1.42.2                 
##  [79] fs_1.5.0                               
##  [80] GetoptLong_1.0.5                       
##  [81] clusterProfiler_4.7.1                  
##  [82] DelayedMatrixStats_1.12.3              
##  [83] vctrs_0.6.2                            
##  [84] generics_0.1.0                         
##  [85] plot3D_1.4                             
##  [86] tools_4.0.3                            
##  [87] foreign_0.8-81                         
##  [88] NOISeq_2.34.0                          
##  [89] munsell_0.5.0                          
##  [90] tweenr_1.0.2                           
##  [91] fgsea_1.16.0                           
##  [92] DelayedArray_0.16.3                    
##  [93] fastmap_1.1.0                          
##  [94] compiler_4.0.3                         
##  [95] abind_1.4-5                            
##  [96] rtracklayer_1.50.0                     
##  [97] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [98] GenomeInfoDbData_1.2.4                 
##  [99] gridExtra_2.3                          
## [100] edgeR_3.32.1                           
## [101] lattice_0.20-45                        
## [102] ggnewscale_0.4.7                       
## [103] AnnotationForge_1.32.0                 
## [104] utf8_1.2.1                             
## [105] dplyr_1.1.2                            
## [106] BiocFileCache_1.14.0                   
## [107] jsonlite_1.8.4                         
## [108] scales_1.2.1                           
## [109] graph_1.68.0                           
## [110] carData_3.0-4                          
## [111] sparseMatrixStats_1.2.1                
## [112] TFEA.ChIP_1.10.0                       
## [113] genefilter_1.72.1                      
## [114] car_3.0-11                             
## [115] doParallel_1.0.16                      
## [116] latticeExtra_0.6-29                    
## [117] R.utils_2.12.0                         
## [118] brew_1.0-6                             
## [119] checkmate_2.0.0                        
## [120] rmarkdown_2.14                         
## [121] openxlsx_4.2.3                         
## [122] pkgdown_1.6.1                          
## [123] cowplot_1.1.1                          
## [124] textshaping_0.3.6                      
## [125] forcats_1.0.0                          
## [126] downloader_0.4                         
## [127] BSgenome_1.58.0                        
## [128] igraph_1.4.99.9024                     
## [129] survival_3.2-10                        
## [130] numDeriv_2016.8-1.1                    
## [131] yaml_2.2.1                             
## [132] plotrix_3.8-2                          
## [133] systemfonts_1.0.4                      
## [134] ashr_2.2-47                            
## [135] SQUAREM_2021.1                         
## [136] htmltools_0.5.2                        
## [137] memoise_2.0.0                          
## [138] VariantAnnotation_1.36.0               
## [139] locfit_1.5-9.4                         
## [140] graphlayouts_0.7.1                     
## [141] batchtools_0.9.15                      
## [142] PCAtools_2.2.0                         
## [143] viridisLite_0.4.0                      
## [144] rrcov_1.7-0                            
## [145] digest_0.6.27                          
## [146] assertthat_0.2.1                       
## [147] rappdirs_0.3.3                         
## [148] emdbook_1.3.12                         
## [149] RSQLite_2.2.5                          
## [150] amap_0.8-18                            
## [151] yulab.utils_0.0.4                      
## [152] debugme_1.1.0                          
## [153] misc3d_0.9-1                           
## [154] data.table_1.14.2                      
## [155] blob_1.2.1                             
## [156] R.oo_1.24.0                            
## [157] ragg_0.4.0                             
## [158] splines_4.0.3                          
## [159] ggupset_0.3.0                          
## [160] RCurl_1.98-1.3                         
## [161] broom_1.0.4                            
## [162] hms_1.1.3                              
## [163] colorspace_2.0-0                       
## [164] BiocManager_1.30.16                    
## [165] shape_1.4.6                            
## [166] sass_0.4.1                             
## [167] GEOquery_2.58.0                        
## [168] Rcpp_1.0.9                             
## [169] mvtnorm_1.1-2                          
## [170] circlize_0.4.15                        
## [171] enrichplot_1.10.2                      
## [172] fansi_0.4.2                            
## [173] tzdb_0.3.0                             
## [174] truncnorm_1.0-8                        
## [175] ChIPseeker_1.33.0.900                  
## [176] R6_2.5.0                               
## [177] grid_4.0.3                             
## [178] lifecycle_1.0.3                        
## [179] ShortRead_1.48.0                       
## [180] zip_2.1.1                              
## [181] curl_4.3                               
## [182] ggsignif_0.6.3                         
## [183] jquerylib_0.1.3                        
## [184] robustbase_0.95-0                      
## [185] DO.db_2.9                              
## [186] Matrix_1.5-4                           
## [187] qvalue_2.22.0                          
## [188] desc_1.3.0                             
## [189] org.Hs.eg.db_3.12.0                    
## [190] RColorBrewer_1.1-2                     
## [191] iterators_1.0.13                       
## [192] stringr_1.5.0                          
## [193] DOT_0.1                                
## [194] ggpie_0.2.5                            
## [195] beachmat_2.6.4                         
## [196] polyclip_1.10-0                        
## [197] biomaRt_2.46.3                         
## [198] purrr_1.0.1                            
## [199] shadowtext_0.0.9                       
## [200] gridGraphics_0.5-1                     
## [201] mgcv_1.8-34                            
## [202] ComplexHeatmap_2.13.1                  
## [203] openssl_1.4.3                          
## [204] patchwork_1.0.0                        
## [205] bdsmatrix_1.3-4                        
## [206] codetools_0.2-18                       
## [207] invgamma_1.1                           
## [208] GO.db_3.12.1                           
## [209] gtools_3.8.2                           
## [210] prettyunits_1.1.1                      
## [211] dbplyr_2.3.2                           
## [212] R.methodsS3_1.8.1                      
## [213] gtable_0.3.0                           
## [214] DBI_1.1.1                              
## [215] ggfun_0.0.6                            
## [216] httr_1.4.5                             
## [217] KernSmooth_2.23-18                     
## [218] stringi_1.5.3                          
## [219] progress_1.2.2                         
## [220] reshape2_1.4.4                         
## [221] farver_2.1.0                           
## [222] annotate_1.68.0                        
## [223] viridis_0.6.1                          
## [224] Rgraphviz_2.34.0                       
## [225] xml2_1.3.4                             
## [226] bbmle_1.0.24                           
## [227] systemPipeR_1.24.3                     
## [228] boot_1.3-28                            
## [229] readr_2.1.4                            
## [230] geneplotter_1.68.0                     
## [231] ggplotify_0.1.0                        
## [232] Category_2.56.0                        
## [233] DEoptimR_1.0-11                        
## [234] bit_4.0.4                              
## [235] scatterpie_0.1.7                       
## [236] jpeg_0.1-8.1                           
## [237] ggraph_2.0.5                           
## [238] pkgconfig_2.0.3                        
## [239] rstatix_0.7.0                          
## [240] knitr_1.37