Introduction

DEbPeak provides two aspects of Quality Control (QC):

  • QC on count matrix:
    • Proportion of genes detected in different samples under different CPM thresholds
    • The saturation of the number of genes detected.
  • QC on samples:
    • Euclidean distance and pearson correlation coefficient of samples across different conditions
    • sample similarity on selected principal components (check batch information and conduct batch correction)
    • outlier detection with robust PCA

Example data

The data used here are RNA-seq data of the external granule layer in the cerebellum of control and conditional SnoN knockout mice, the raw data are stored in GSE120279, they are used in Robust principal component analysis for accurate outlier sample detection in RNA-seq data.

First, we create DESeqDataSet object with above dataset:

suppressWarnings(suppressMessages(library(DESeq2)))
suppressWarnings(suppressMessages(library(DEbPeak)))
count.file <- system.file("extdata", "snon_count.txt", package = "DEbPeak")
meta.file <- system.file("extdata", "snon_meta.txt", package = "DEbPeak")
count.matrix <- read.table(file = count.file, header = T, sep = "\t")
head(count.matrix)
##                        KO1  KO2  KO3  KO4  KO5  KO6  WT1  WT2  WT3  WT4  WT5
## ENSMUSG00000000001.4  4556 4218 3835 3718 5741 3875 4115 5074 2931 4374 5118
## ENSMUSG00000000003.15    0    0    0    0    0    0    0    0    0    0    0
## ENSMUSG00000000028.14  350  579  435  316  432  317  245  621  419  506  545
## ENSMUSG00000000031.16  268  804   66  207   46   66  336  112   60   69  137
## ENSMUSG00000000037.16  262  157  184  162  301  233  311  176   94  139  229
## ENSMUSG00000000049.11    0    0    0    0    0    0    1    0    2    0    1
##                        WT6
## ENSMUSG00000000001.4  3625
## ENSMUSG00000000003.15    0
## ENSMUSG00000000028.14  371
## ENSMUSG00000000031.16  193
## ENSMUSG00000000037.16  186
## ENSMUSG00000000049.11    0
meta.info <- read.table(file = meta.file, header = T)
head(meta.info)
##     condition
## KO1        KO
## KO2        KO
## KO3        KO
## KO4        KO
## KO5        KO
## KO6        KO
dds <- DESeq2::DESeqDataSetFromMatrix(countData = count.matrix, colData = meta.info, design = ~condition)
dds
## class: DESeqDataSet 
## dim: 53379 12 
## metadata(1): version
## assays(1): counts
## rownames(53379): ENSMUSG00000000001.4 ENSMUSG00000000003.15 ...
##   ENSMUSG00000115849.1 ENSMUSG00000115850.1
## rowData names(0):
## colnames(12): KO1 KO2 ... WT5 WT6
## colData names(1): condition

QC on count matrix

CPM threshold

This plot shows proportion of genes detected in different samples at different CPM thresholds. Theoretically, all samples should be similarly distributed, so as to avoid false positive results obtained from detection problems.

CountQC(deobj = dds, group.key = "condition", type = "cpm")
## Differential expression analysis with DESeq2!
## [1] "Warning: 25096 features with 0 counts in all samples are to be removed for this analysis."
## [1] "Count distributions are to be computed for:"
##  [1] "KO1" "KO2" "KO3" "KO4" "KO5" "KO6" "WT1" "WT2" "WT3" "WT4" "WT5" "WT6"


sequencing saturation

This plot shows the proportion of genes detected at different sequencing depths. If the proportion is too low, it indicates that many genes are not detected and therefore some important information may be missing, and increasing the sequencing volume can effectively solve this problem.

CountQC(deobj = dds, group.key = "condition", type = "saturation")
## Differential expression analysis with DESeq2!


QC on samples

QC on samples including four aspects:

  • Euclidean distance and pearson correlation coefficient: sample similarity on the entire transcriptome.
  • PCA: sample similarity on selected principal components.
  • outlier detection with robust PCA
  • batch correction

Euclidean distance and pearson correlation coefficient

SampleRelation(deobj = dds,transform.method = "rlog",anno.key = "condition")
## Differential expression analysis with DESeq2!


PCA

qc.pca.res=PCA(deobj = dds,transform.method = "rlog")
## Differential expression analysis with DESeq2!
## Use all genes for PCA!
PCAtools::biplot(qc.pca.res,x="PC1",y="PC2",colby="condition",legendPosition="bottom")


outlier detection with robust PCA

outlier.res=OutlierDetection(deobj = dds,var.genes = NULL,transform.method = "rlog")
## Differential expression analysis with DESeq2!
## Use all genes for PCA!
## Detecting 2 outlier(s): KO3,WT1
outlier.res$outlier
## [1] "KO3" "WT1"
outlier.res$plot

For batch correction and outlier detection, you can use the following codes:

# batch effect correction
# this is a demo scripts
batch.res=QCPCA(deobj = dds,var.genes = NULL,remove.sample=NULL,transform.method = "rlog",batch = "cell",
                colby = "dex", outlier.detection = F)
batch.res$plot
# outlier detection
outlier.res=QCPCA(deobj = dds,var.genes = NULL,remove.sample=NULL,transform.method = "rlog",
                  outlier.detection = T,rpca.method = "PcaGrid")
outlier.res$plot
# the final dds
dds=outlier.res$deobj # or outlier.res$plot

Save results

Save results for downstream analysis:

save.image(file = "/home/songyabing/R/learn/tmp/DEbPeak/example.RData")

Session info

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /home/softwares/anaconda3/envs/r4.0/lib/libopenblasp-r0.3.12.so
## 
## locale:
##  [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=zh_CN.UTF-8        LC_COLLATE=zh_CN.UTF-8    
##  [5] LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
##  [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DEbPeak_1.4.0               DESeq2_1.30.1              
##  [3] SummarizedExperiment_1.20.0 Biobase_2.50.0             
##  [5] MatrixGenerics_1.2.1        matrixStats_0.58.0         
##  [7] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
##  [9] IRanges_2.24.1              S4Vectors_0.28.1           
## [11] BiocGenerics_0.42.0        
## 
## loaded via a namespace (and not attached):
##   [1] rsvd_1.0.3                             
##   [2] ggvenn_0.1.9                           
##   [3] apeglm_1.12.0                          
##   [4] Rsamtools_2.6.0                        
##   [5] rsvg_2.1                               
##   [6] foreach_1.5.1                          
##   [7] rprojroot_2.0.2                        
##   [8] crayon_1.4.1                           
##   [9] V8_3.4.2                               
##  [10] MASS_7.3-58                            
##  [11] nlme_3.1-152                           
##  [12] backports_1.2.1                        
##  [13] sva_3.38.0                             
##  [14] GOSemSim_2.25.0                        
##  [15] rlang_1.1.0                            
##  [16] XVector_0.30.0                         
##  [17] readxl_1.4.2                           
##  [18] irlba_2.3.5                            
##  [19] limma_3.46.0                           
##  [20] GOstats_2.56.0                         
##  [21] BiocParallel_1.24.1                    
##  [22] rjson_0.2.20                           
##  [23] bit64_4.0.5                            
##  [24] glue_1.6.2                             
##  [25] DiffBind_3.0.15                        
##  [26] mixsqp_0.3-43                          
##  [27] pheatmap_1.0.12                        
##  [28] parallel_4.0.3                         
##  [29] AnnotationDbi_1.52.0                   
##  [30] DEFormats_1.18.0                       
##  [31] base64url_1.4                          
##  [32] tcltk_4.0.3                            
##  [33] DOSE_3.23.2                            
##  [34] haven_2.5.2                            
##  [35] tidyselect_1.2.0                       
##  [36] rio_0.5.27                             
##  [37] XML_3.99-0.6                           
##  [38] tidyr_1.3.0                            
##  [39] ggpubr_0.4.0                           
##  [40] GenomicAlignments_1.26.0               
##  [41] xtable_1.8-4                           
##  [42] ggnetwork_0.5.12                       
##  [43] magrittr_2.0.3                         
##  [44] evaluate_0.14                          
##  [45] ggplot2_3.4.2                          
##  [46] cli_3.6.1                              
##  [47] zlibbioc_1.36.0                        
##  [48] hwriter_1.3.2                          
##  [49] rstudioapi_0.14                        
##  [50] bslib_0.3.1                            
##  [51] GreyListChIP_1.22.0                    
##  [52] fastmatch_1.1-3                        
##  [53] BiocSingular_1.6.0                     
##  [54] xfun_0.30                              
##  [55] askpass_1.1                            
##  [56] clue_0.3-59                            
##  [57] gson_0.0.9                             
##  [58] cluster_2.1.1                          
##  [59] caTools_1.18.2                         
##  [60] tidygraph_1.2.0                        
##  [61] tibble_3.2.1                           
##  [62] ggrepel_0.9.1                          
##  [63] Biostrings_2.58.0                      
##  [64] png_0.1-7                              
##  [65] withr_2.5.0                            
##  [66] bitops_1.0-6                           
##  [67] ggforce_0.3.3                          
##  [68] RBGL_1.66.0                            
##  [69] plyr_1.8.6                             
##  [70] cellranger_1.1.0                       
##  [71] GSEABase_1.52.1                        
##  [72] pcaPP_2.0-1                            
##  [73] dqrng_0.2.1                            
##  [74] coda_0.19-4                            
##  [75] pillar_1.9.0                           
##  [76] gplots_3.1.1                           
##  [77] GlobalOptions_0.1.2                    
##  [78] cachem_1.0.4                           
##  [79] GenomicFeatures_1.42.2                 
##  [80] fs_1.5.0                               
##  [81] GetoptLong_1.0.5                       
##  [82] clusterProfiler_4.7.1                  
##  [83] DelayedMatrixStats_1.12.3              
##  [84] vctrs_0.6.2                            
##  [85] generics_0.1.0                         
##  [86] plot3D_1.4                             
##  [87] tools_4.0.3                            
##  [88] foreign_0.8-81                         
##  [89] NOISeq_2.34.0                          
##  [90] munsell_0.5.0                          
##  [91] tweenr_1.0.2                           
##  [92] fgsea_1.16.0                           
##  [93] DelayedArray_0.16.3                    
##  [94] fastmap_1.1.0                          
##  [95] compiler_4.0.3                         
##  [96] abind_1.4-5                            
##  [97] rtracklayer_1.50.0                     
##  [98] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [99] GenomeInfoDbData_1.2.4                 
## [100] gridExtra_2.3                          
## [101] edgeR_3.32.1                           
## [102] lattice_0.20-45                        
## [103] ggnewscale_0.4.7                       
## [104] AnnotationForge_1.32.0                 
## [105] utf8_1.2.1                             
## [106] dplyr_1.1.2                            
## [107] BiocFileCache_1.14.0                   
## [108] jsonlite_1.8.4                         
## [109] scales_1.2.1                           
## [110] graph_1.68.0                           
## [111] carData_3.0-4                          
## [112] sparseMatrixStats_1.2.1                
## [113] TFEA.ChIP_1.10.0                       
## [114] genefilter_1.72.1                      
## [115] car_3.0-11                             
## [116] doParallel_1.0.16                      
## [117] latticeExtra_0.6-29                    
## [118] R.utils_2.12.0                         
## [119] brew_1.0-6                             
## [120] checkmate_2.0.0                        
## [121] rmarkdown_2.14                         
## [122] openxlsx_4.2.3                         
## [123] pkgdown_1.6.1                          
## [124] cowplot_1.1.1                          
## [125] textshaping_0.3.6                      
## [126] forcats_1.0.0                          
## [127] downloader_0.4                         
## [128] BSgenome_1.58.0                        
## [129] igraph_1.4.99.9024                     
## [130] survival_3.2-10                        
## [131] numDeriv_2016.8-1.1                    
## [132] yaml_2.2.1                             
## [133] plotrix_3.8-2                          
## [134] systemfonts_1.0.4                      
## [135] ashr_2.2-47                            
## [136] SQUAREM_2021.1                         
## [137] htmltools_0.5.2                        
## [138] memoise_2.0.0                          
## [139] VariantAnnotation_1.36.0               
## [140] locfit_1.5-9.4                         
## [141] graphlayouts_0.7.1                     
## [142] batchtools_0.9.15                      
## [143] PCAtools_2.2.0                         
## [144] viridisLite_0.4.0                      
## [145] rrcov_1.7-0                            
## [146] digest_0.6.27                          
## [147] assertthat_0.2.1                       
## [148] rappdirs_0.3.3                         
## [149] emdbook_1.3.12                         
## [150] RSQLite_2.2.5                          
## [151] amap_0.8-18                            
## [152] yulab.utils_0.0.4                      
## [153] debugme_1.1.0                          
## [154] misc3d_0.9-1                           
## [155] data.table_1.14.2                      
## [156] blob_1.2.1                             
## [157] R.oo_1.24.0                            
## [158] ragg_0.4.0                             
## [159] labeling_0.4.2                         
## [160] splines_4.0.3                          
## [161] Cairo_1.5-12.2                         
## [162] ggupset_0.3.0                          
## [163] RCurl_1.98-1.3                         
## [164] broom_1.0.4                            
## [165] hms_1.1.3                              
## [166] colorspace_2.0-0                       
## [167] BiocManager_1.30.16                    
## [168] shape_1.4.6                            
## [169] sass_0.4.1                             
## [170] GEOquery_2.58.0                        
## [171] Rcpp_1.0.9                             
## [172] mvtnorm_1.1-2                          
## [173] circlize_0.4.15                        
## [174] enrichplot_1.10.2                      
## [175] fansi_0.4.2                            
## [176] tzdb_0.3.0                             
## [177] truncnorm_1.0-8                        
## [178] ChIPseeker_1.33.0.900                  
## [179] R6_2.5.0                               
## [180] grid_4.0.3                             
## [181] lifecycle_1.0.3                        
## [182] ShortRead_1.48.0                       
## [183] zip_2.1.1                              
## [184] curl_4.3                               
## [185] ggsignif_0.6.3                         
## [186] jquerylib_0.1.3                        
## [187] robustbase_0.95-0                      
## [188] DO.db_2.9                              
## [189] Matrix_1.5-4                           
## [190] qvalue_2.22.0                          
## [191] desc_1.3.0                             
## [192] org.Hs.eg.db_3.12.0                    
## [193] RColorBrewer_1.1-2                     
## [194] iterators_1.0.13                       
## [195] stringr_1.5.0                          
## [196] DOT_0.1                                
## [197] ggpie_0.2.5                            
## [198] beachmat_2.6.4                         
## [199] polyclip_1.10-0                        
## [200] biomaRt_2.46.3                         
## [201] purrr_1.0.1                            
## [202] shadowtext_0.0.9                       
## [203] gridGraphics_0.5-1                     
## [204] mgcv_1.8-34                            
## [205] ComplexHeatmap_2.13.1                  
## [206] openssl_1.4.3                          
## [207] patchwork_1.0.0                        
## [208] bdsmatrix_1.3-4                        
## [209] codetools_0.2-18                       
## [210] invgamma_1.1                           
## [211] GO.db_3.12.1                           
## [212] gtools_3.8.2                           
## [213] prettyunits_1.1.1                      
## [214] dbplyr_2.3.2                           
## [215] R.methodsS3_1.8.1                      
## [216] gtable_0.3.0                           
## [217] DBI_1.1.1                              
## [218] highr_0.8                              
## [219] ggfun_0.0.6                            
## [220] httr_1.4.5                             
## [221] KernSmooth_2.23-18                     
## [222] stringi_1.5.3                          
## [223] progress_1.2.2                         
## [224] reshape2_1.4.4                         
## [225] farver_2.1.0                           
## [226] annotate_1.68.0                        
## [227] viridis_0.6.1                          
## [228] Rgraphviz_2.34.0                       
## [229] xml2_1.3.4                             
## [230] bbmle_1.0.24                           
## [231] systemPipeR_1.24.3                     
## [232] boot_1.3-28                            
## [233] readr_2.1.4                            
## [234] geneplotter_1.68.0                     
## [235] ggplotify_0.1.0                        
## [236] Category_2.56.0                        
## [237] DEoptimR_1.0-11                        
## [238] bit_4.0.4                              
## [239] scatterpie_0.1.7                       
## [240] jpeg_0.1-8.1                           
## [241] ggraph_2.0.5                           
## [242] pkgconfig_2.0.3                        
## [243] rstatix_0.7.0                          
## [244] knitr_1.37