Utils.Rmd
In RNA-seq analysis, it is common that we need to convert gene IDs in Gene Symbol, ENSEMBL Gene ID, Entrez Gene ID and normalize the count matrix with different methods for subsequent analysis. DEbPeak
provides IDConversion
to perform gene ID conversion, and provides five different normalization methods (CPM, TMM, DESeq2’s median of ratios, RPKM and TPM) to normalize the count matrix.
We will used the results output by “Quality Control” vignette.
# library
suppressWarnings(suppressMessages(library(DESeq2)))
suppressWarnings(suppressMessages(library(DEbPeak)))
# load data
load(file = "/home/songyabing/R/learn/tmp/DEbPeak/example.RData")
We will use DESeq2
as example:
# set control level
dds$condition <- relevel(dds$condition, ref = "WT")
# conduct differential expressed genes analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# extract results
dds.results <- results(dds,contrast=c("condition",'KO','WT'))
dds.results.ordered <- dds.results[order(dds.results$log2FoldChange,decreasing = TRUE),]
head(dds.results.ordered)
## log2 fold change (MLE): condition KO vs WT
## Wald test p-value: condition KO vs WT
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000025934.15 2.01093 4.46030 3.01742 1.47818 NA
## ENSMUSG00000105597.1 1.64376 4.16931 2.98872 1.39502 0.163011113
## ENSMUSG00000109866.1 2.32064 4.14573 1.22504 3.38415 0.000713987
## ENSMUSG00000085819.7 2.23441 4.07496 1.65939 2.45569 0.014061354
## ENSMUSG00000102474.1 1.52867 4.06781 1.80781 2.25013 0.024440396
## ENSMUSG00000042677.7 1.49781 4.05394 1.71350 2.36588 0.017987495
## padj
## <numeric>
## ENSMUSG00000025934.15 NA
## ENSMUSG00000105597.1 0.99973
## ENSMUSG00000109866.1 0.99973
## ENSMUSG00000085819.7 0.99973
## ENSMUSG00000102474.1 0.99973
## ENSMUSG00000042677.7 0.99973
The raw matrix:
head(count.matrix)
## KO1 KO2 KO3 KO4 KO5 KO6 WT1 WT2 WT3 WT4 WT5
## ENSMUSG00000000001.4 4556 4218 3835 3718 5741 3875 4115 5074 2931 4374 5118
## ENSMUSG00000000003.15 0 0 0 0 0 0 0 0 0 0 0
## ENSMUSG00000000028.14 350 579 435 316 432 317 245 621 419 506 545
## ENSMUSG00000000031.16 268 804 66 207 46 66 336 112 60 69 137
## ENSMUSG00000000037.16 262 157 184 162 301 233 311 176 94 139 229
## ENSMUSG00000000049.11 0 0 0 0 0 0 1 0 2 0 1
## WT6
## ENSMUSG00000000001.4 3625
## ENSMUSG00000000003.15 0
## ENSMUSG00000000028.14 371
## ENSMUSG00000000031.16 193
## ENSMUSG00000000037.16 186
## ENSMUSG00000000049.11 0
After gene IDs conversion:
count.matrix <- IDConversion(deres = count.matrix, gene.type = "ENSEMBL",sort.key=NULL)
## 'select()' returned 1:many mapping between keys and columns
head(count.matrix)
## KO1 KO2 KO3 KO4 KO5 KO6 WT1 WT2 WT3 WT4 WT5
## ENSMUSG00000000001.4 4556 4218 3835 3718 5741 3875 4115 5074 2931 4374 5118
## ENSMUSG00000000003.15 0 0 0 0 0 0 0 0 0 0 0
## ENSMUSG00000000028.14 350 579 435 316 432 317 245 621 419 506 545
## ENSMUSG00000000031.16 268 804 66 207 46 66 336 112 60 69 137
## ENSMUSG00000000037.16 262 157 184 162 301 233 311 176 94 139 229
## ENSMUSG00000000049.11 0 0 0 0 0 0 1 0 2 0 1
## WT6 ENTREZID SYMBOL
## ENSMUSG00000000001.4 3625 14679 Gnai3
## ENSMUSG00000000003.15 0 54192 Pbsn
## ENSMUSG00000000028.14 371 12544 Cdc45
## ENSMUSG00000000031.16 193 14955 H19
## ENSMUSG00000000037.16 186 107815 Scml2
## ENSMUSG00000000049.11 0 11818 Apoh
The gene length is useful when perfrom count normalization (RPKM and TPM). Here, we provide GetGeneLength
to calculate gene length:
# this will get DEbPeak_geneLength.txt file in working directory, you can also provide out.file
GetGeneLength(gtf.file = '/path/to/gtf', save = TRUE)
Here, we provide five different normalization methods: CPM, TMM, DESeq2’s median of ratios, RPKM and TPM, of which RPKM and TPM need to provide gtf files to calculate gene length.
Counts Per Million reads mapped (CPM) takes into account the sequencing depth. For each feature i
, CPM is the count of sequenced fragments mapping to the feature scaled by the total number of reads times one million (to bring it up to a more convenient number).
\[ \mathrm{CPM}=\frac{\text { Number of reads mapped to gene } \times 10^{6}}{\text { Total number of mapped reads }} \]
cpm.matrix=NormalizedCount(dds, norm.type="CPM")
head(cpm.matrix)
## KO1 KO2 KO3 KO4 KO5
## ENSMUSG00000000001.4 405.92819 400.64944 389.764529 377.62444 380.625026
## ENSMUSG00000000003.15 0.00000 0.00000 0.000000 0.00000 0.000000
## ENSMUSG00000000028.14 31.18412 54.99669 44.210579 32.09503 28.641354
## ENSMUSG00000000031.16 23.87813 76.36846 6.707812 21.02428 3.049774
## ENSMUSG00000000037.16 23.34354 14.91275 18.700567 16.45378 19.956128
## ENSMUSG00000000049.11 0.00000 0.00000 0.000000 0.00000 0.000000
## KO6 WT1 WT2 WT3 WT4
## ENSMUSG00000000001.4 362.974809 424.0865495 343.046563 389.7566993 402.307594
## ENSMUSG00000000003.15 0.000000 0.0000000 0.000000 0.0000000 0.000000
## ENSMUSG00000000028.14 29.693681 25.2493814 41.985005 55.7175220 46.540385
## ENSMUSG00000000031.16 6.182281 34.6277231 7.572175 7.9786428 6.346416
## ENSMUSG00000000037.16 21.825324 32.0512556 11.899132 12.4998737 12.784809
## ENSMUSG00000000049.11 0.000000 0.1030587 0.000000 0.2659548 0.000000
## WT5 WT6
## ENSMUSG00000000001.4 383.64870911 355.77287
## ENSMUSG00000000003.15 0.00000000 0.00000
## ENSMUSG00000000028.14 40.85356516 36.41151
## ENSMUSG00000000031.16 10.26961179 18.94184
## ENSMUSG00000000037.16 17.16599343 18.25483
## ENSMUSG00000000049.11 0.07496067 0.00000
edgeR’s Trimmed Mean of M values (TMM) takes into account the sequencing depth, RNA composition, and gene length. For detailed information, please refer to A scaling normalization method for differential expression analysis of RNA-seq data.
tmm.matrix=NormalizedCount(dds, norm.type="TMM")
head(tmm.matrix)
## KO1 KO2 KO3 KO4 KO5
## ENSMUSG00000000001.4 421.83331 371.93907 371.189992 382.69198 413.745541
## ENSMUSG00000000003.15 0.00000 0.00000 0.000000 0.00000 0.000000
## ENSMUSG00000000028.14 32.40598 51.05565 42.103689 32.52573 31.133613
## ENSMUSG00000000031.16 24.81372 70.89592 6.388146 21.30641 3.315153
## ENSMUSG00000000037.16 24.25819 13.84410 17.809376 16.67458 21.692633
## ENSMUSG00000000049.11 0.00000 0.00000 0.000000 0.00000 0.000000
## KO6 WT1 WT2 WT3 WT4
## ENSMUSG00000000001.4 373.041177 486.1604573 334.563934 361.3055744 394.46383
## ENSMUSG00000000003.15 0.000000 0.0000000 0.000000 0.0000000 0.00000
## ENSMUSG00000000028.14 30.517175 28.9451548 40.946828 51.6503022 45.63299
## ENSMUSG00000000031.16 6.353734 39.6962123 7.384935 7.3962247 6.22268
## ENSMUSG00000000037.16 22.430605 36.7426251 11.604898 11.5874186 12.53554
## ENSMUSG00000000049.11 0.000000 0.1181435 0.000000 0.2465408 0.00000
## WT5 WT6
## ENSMUSG00000000001.4 371.61142274 347.48647
## ENSMUSG00000000003.15 0.00000000 0.00000
## ENSMUSG00000000028.14 39.57175174 35.56344
## ENSMUSG00000000031.16 9.94739447 18.50066
## ENSMUSG00000000037.16 16.62739660 17.82965
## ENSMUSG00000000049.11 0.07260872 0.00000
DESeq2’s median of ratios takes into account the sequencing depth and RNA composition. For detailed information, please refer to Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
deseq2.matrix=NormalizedCount(dds, norm.type="DESeq2")
head(deseq2.matrix)
## KO1 KO2 KO3 KO4 KO5
## ENSMUSG00000000001.4 4606.4055 4014.0575 4037.53736 4145.2746 4476.42605
## ENSMUSG00000000003.15 0.0000 0.0000 0.00000 0.0000 0.00000
## ENSMUSG00000000028.14 353.8722 551.0050 457.97360 352.3149 336.84307
## ENSMUSG00000000031.16 270.9650 765.1262 69.48565 230.7886 35.86755
## ENSMUSG00000000037.16 264.8986 149.4090 193.71757 180.6171 234.69853
## ENSMUSG00000000049.11 0.0000 0.0000 0.00000 0.0000 0.00000
## KO6 WT1 WT2 WT3 WT4
## ENSMUSG00000000001.4 4069.51479 5285.859380 3591.55792 3889.896093 4266.4884
## ENSMUSG00000000003.15 0.00000 0.000000 0.00000 0.000000 0.0000
## ENSMUSG00000000028.14 332.91257 314.710947 439.56592 556.078630 493.5627
## ENSMUSG00000000031.16 69.31303 431.603585 79.27759 79.629398 67.3040
## ENSMUSG00000000037.16 244.69599 399.490223 124.57907 124.752724 135.5834
## ENSMUSG00000000049.11 0.00000 1.284534 0.00000 2.654313 0.0000
## WT5 WT6
## ENSMUSG00000000001.4 3981.8764585 3774.6246
## ENSMUSG00000000003.15 0.0000000 0.0000
## ENSMUSG00000000028.14 424.0177159 386.3133
## ENSMUSG00000000031.16 106.5879396 200.9662
## ENSMUSG00000000037.16 178.1652421 193.6773
## ENSMUSG00000000049.11 0.7780142 0.0000
Reads Per Kilobase of exon per Million reads mapped (RPKM) takes into account the sequencing depth and gene length. For each feature i
as the count scaled by the feature’s length times one thousand (to kilobase) and further scaled by the total number of reads times one million.
\[ \text { RPKM }=\frac{\text { Number of reads mapped to gene } \times 10^{3} \times 10^{6}}{\text { Total number of mapped reads } \times \text { gene length }} \]
# you can either provide gtf.file or gene.length.file
rpkm.matrix=NormalizedCount(dds,gtf.file = '/path/to/gtf', norm.type="RPKM")
Transcripts Per kilobase Million (TPM) takes into account the sequencing depth and gene length.
\[ \mathrm{TPM}=\frac{\frac{\text { total reads mapped to gene } \times 10^{3}}{\text { gene length }}}{\sum(\frac{\text { total reads mapped to gene } \times 10^{3}}{\text { gene length }})} \times 10^{6} \]
TPM is proportional to RPKM:
\[ \mathrm{TPM}=\frac{R P K M}{\sum(R P K M)} \times 10^{6} \]
# you can either provide gtf.file or gene.length.file
tpm.matrix=NormalizedCount(dds,gtf.file = '/path/to/gtf', norm.type="TPM")
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /home/softwares/anaconda3/envs/r4.0/lib/libopenblasp-r0.3.12.so
##
## locale:
## [1] LC_CTYPE=zh_CN.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=zh_CN.UTF-8 LC_COLLATE=zh_CN.UTF-8
## [5] LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES=zh_CN.UTF-8
## [7] LC_PAPER=zh_CN.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Mm.eg.db_3.12.0 AnnotationDbi_1.52.0
## [3] DEbPeak_1.4.0 DESeq2_1.30.1
## [5] SummarizedExperiment_1.20.0 Biobase_2.50.0
## [7] MatrixGenerics_1.2.1 matrixStats_0.58.0
## [9] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
## [11] IRanges_2.24.1 S4Vectors_0.28.1
## [13] BiocGenerics_0.42.0
##
## loaded via a namespace (and not attached):
## [1] rsvd_1.0.3
## [2] ggvenn_0.1.9
## [3] apeglm_1.12.0
## [4] Rsamtools_2.6.0
## [5] rsvg_2.1
## [6] foreach_1.5.1
## [7] rprojroot_2.0.2
## [8] crayon_1.4.1
## [9] V8_3.4.2
## [10] MASS_7.3-58
## [11] nlme_3.1-152
## [12] backports_1.2.1
## [13] sva_3.38.0
## [14] GOSemSim_2.25.0
## [15] rlang_1.1.0
## [16] XVector_0.30.0
## [17] readxl_1.4.2
## [18] irlba_2.3.5
## [19] limma_3.46.0
## [20] GOstats_2.56.0
## [21] BiocParallel_1.24.1
## [22] rjson_0.2.20
## [23] bit64_4.0.5
## [24] glue_1.6.2
## [25] DiffBind_3.0.15
## [26] mixsqp_0.3-43
## [27] pheatmap_1.0.12
## [28] parallel_4.0.3
## [29] DEFormats_1.18.0
## [30] base64url_1.4
## [31] tcltk_4.0.3
## [32] DOSE_3.23.2
## [33] haven_2.5.2
## [34] tidyselect_1.2.0
## [35] rio_0.5.27
## [36] XML_3.99-0.6
## [37] tidyr_1.3.0
## [38] ggpubr_0.4.0
## [39] GenomicAlignments_1.26.0
## [40] xtable_1.8-4
## [41] ggnetwork_0.5.12
## [42] magrittr_2.0.3
## [43] evaluate_0.14
## [44] ggplot2_3.4.2
## [45] cli_3.6.1
## [46] zlibbioc_1.36.0
## [47] hwriter_1.3.2
## [48] rstudioapi_0.14
## [49] bslib_0.3.1
## [50] GreyListChIP_1.22.0
## [51] fastmatch_1.1-3
## [52] BiocSingular_1.6.0
## [53] xfun_0.30
## [54] askpass_1.1
## [55] clue_0.3-59
## [56] gson_0.0.9
## [57] cluster_2.1.1
## [58] caTools_1.18.2
## [59] tidygraph_1.2.0
## [60] tibble_3.2.1
## [61] ggrepel_0.9.1
## [62] Biostrings_2.58.0
## [63] png_0.1-7
## [64] withr_2.5.0
## [65] bitops_1.0-6
## [66] ggforce_0.3.3
## [67] RBGL_1.66.0
## [68] plyr_1.8.6
## [69] cellranger_1.1.0
## [70] GSEABase_1.52.1
## [71] pcaPP_2.0-1
## [72] dqrng_0.2.1
## [73] coda_0.19-4
## [74] pillar_1.9.0
## [75] gplots_3.1.1
## [76] GlobalOptions_0.1.2
## [77] cachem_1.0.4
## [78] GenomicFeatures_1.42.2
## [79] fs_1.5.0
## [80] GetoptLong_1.0.5
## [81] clusterProfiler_4.7.1
## [82] DelayedMatrixStats_1.12.3
## [83] vctrs_0.6.2
## [84] generics_0.1.0
## [85] plot3D_1.4
## [86] tools_4.0.3
## [87] foreign_0.8-81
## [88] NOISeq_2.34.0
## [89] munsell_0.5.0
## [90] tweenr_1.0.2
## [91] fgsea_1.16.0
## [92] DelayedArray_0.16.3
## [93] fastmap_1.1.0
## [94] compiler_4.0.3
## [95] abind_1.4-5
## [96] rtracklayer_1.50.0
## [97] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [98] GenomeInfoDbData_1.2.4
## [99] gridExtra_2.3
## [100] edgeR_3.32.1
## [101] lattice_0.20-45
## [102] ggnewscale_0.4.7
## [103] AnnotationForge_1.32.0
## [104] utf8_1.2.1
## [105] dplyr_1.1.2
## [106] BiocFileCache_1.14.0
## [107] jsonlite_1.8.4
## [108] scales_1.2.1
## [109] graph_1.68.0
## [110] carData_3.0-4
## [111] sparseMatrixStats_1.2.1
## [112] TFEA.ChIP_1.10.0
## [113] genefilter_1.72.1
## [114] car_3.0-11
## [115] doParallel_1.0.16
## [116] latticeExtra_0.6-29
## [117] R.utils_2.12.0
## [118] brew_1.0-6
## [119] checkmate_2.0.0
## [120] rmarkdown_2.14
## [121] openxlsx_4.2.3
## [122] pkgdown_1.6.1
## [123] cowplot_1.1.1
## [124] textshaping_0.3.6
## [125] forcats_1.0.0
## [126] downloader_0.4
## [127] BSgenome_1.58.0
## [128] igraph_1.4.99.9024
## [129] survival_3.2-10
## [130] numDeriv_2016.8-1.1
## [131] yaml_2.2.1
## [132] plotrix_3.8-2
## [133] systemfonts_1.0.4
## [134] ashr_2.2-47
## [135] SQUAREM_2021.1
## [136] htmltools_0.5.2
## [137] memoise_2.0.0
## [138] VariantAnnotation_1.36.0
## [139] locfit_1.5-9.4
## [140] graphlayouts_0.7.1
## [141] batchtools_0.9.15
## [142] PCAtools_2.2.0
## [143] viridisLite_0.4.0
## [144] rrcov_1.7-0
## [145] digest_0.6.27
## [146] assertthat_0.2.1
## [147] rappdirs_0.3.3
## [148] emdbook_1.3.12
## [149] RSQLite_2.2.5
## [150] amap_0.8-18
## [151] yulab.utils_0.0.4
## [152] debugme_1.1.0
## [153] misc3d_0.9-1
## [154] data.table_1.14.2
## [155] blob_1.2.1
## [156] R.oo_1.24.0
## [157] ragg_0.4.0
## [158] splines_4.0.3
## [159] ggupset_0.3.0
## [160] RCurl_1.98-1.3
## [161] broom_1.0.4
## [162] hms_1.1.3
## [163] colorspace_2.0-0
## [164] BiocManager_1.30.16
## [165] shape_1.4.6
## [166] sass_0.4.1
## [167] GEOquery_2.58.0
## [168] Rcpp_1.0.9
## [169] mvtnorm_1.1-2
## [170] circlize_0.4.15
## [171] enrichplot_1.10.2
## [172] fansi_0.4.2
## [173] tzdb_0.3.0
## [174] truncnorm_1.0-8
## [175] ChIPseeker_1.33.0.900
## [176] R6_2.5.0
## [177] grid_4.0.3
## [178] lifecycle_1.0.3
## [179] ShortRead_1.48.0
## [180] zip_2.1.1
## [181] curl_4.3
## [182] ggsignif_0.6.3
## [183] jquerylib_0.1.3
## [184] robustbase_0.95-0
## [185] DO.db_2.9
## [186] Matrix_1.5-4
## [187] qvalue_2.22.0
## [188] desc_1.3.0
## [189] org.Hs.eg.db_3.12.0
## [190] RColorBrewer_1.1-2
## [191] iterators_1.0.13
## [192] stringr_1.5.0
## [193] DOT_0.1
## [194] ggpie_0.2.5
## [195] beachmat_2.6.4
## [196] polyclip_1.10-0
## [197] biomaRt_2.46.3
## [198] purrr_1.0.1
## [199] shadowtext_0.0.9
## [200] gridGraphics_0.5-1
## [201] mgcv_1.8-34
## [202] ComplexHeatmap_2.13.1
## [203] openssl_1.4.3
## [204] patchwork_1.0.0
## [205] bdsmatrix_1.3-4
## [206] codetools_0.2-18
## [207] invgamma_1.1
## [208] GO.db_3.12.1
## [209] gtools_3.8.2
## [210] prettyunits_1.1.1
## [211] dbplyr_2.3.2
## [212] R.methodsS3_1.8.1
## [213] gtable_0.3.0
## [214] DBI_1.1.1
## [215] ggfun_0.0.6
## [216] httr_1.4.5
## [217] KernSmooth_2.23-18
## [218] stringi_1.5.3
## [219] progress_1.2.2
## [220] reshape2_1.4.4
## [221] farver_2.1.0
## [222] annotate_1.68.0
## [223] viridis_0.6.1
## [224] Rgraphviz_2.34.0
## [225] xml2_1.3.4
## [226] bbmle_1.0.24
## [227] systemPipeR_1.24.3
## [228] boot_1.3-28
## [229] readr_2.1.4
## [230] geneplotter_1.68.0
## [231] ggplotify_0.1.0
## [232] Category_2.56.0
## [233] DEoptimR_1.0-11
## [234] bit_4.0.4
## [235] scatterpie_0.1.7
## [236] jpeg_0.1-8.1
## [237] ggraph_2.0.5
## [238] pkgconfig_2.0.3
## [239] rstatix_0.7.0
## [240] knitr_1.37