Introduction
GEfetch2R provides functions for downloading
processed single-cell RNA-seq data from GEO, Zenodo, CELLxGENE and Human Cell Atlas, including
files in rds, RData, h5ad,
h5 and loom formats.
The public resources currently supported and the values returned for each are:
| Resources | URL | Download Type | Returned values |
|---|---|---|---|
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | rds, RData, h5ad, loom | SeuratObject(rds) or failed datasets |
| Zenodo | https://zenodo.org/ | count matrix, rds, RData, h5ad, etc. | SeuratObject(rds) or failed datasets |
| CELLxGENE | https://cellxgene.cziscience.com/ | rds, h5ad | SeuratObject(rds) or failed datasets |
| Human Cell Atlas | https://www.humancellatlas.org/ | rds, RData, h5, h5ad, loom | SeuratObject(rds) or failed projects |
Check API
Check the availability of APIs used:
CheckAPI(database = c("GEO", "Zenodo", "CELLxGENE", "Human Cell Atlas"))
# start checking APIs to access GEO!
# The API to access the GEO object is OK!
# trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE302nnn/GSE302912/suppl//GSE302912_counts.csv.gz?tool=geoquery'
# Content type 'application/x-gzip' length 332706 bytes (324 KB)
# ==================================================
# downloaded 324 KB
#
# The API to access supplementary files is OK!
# start checking APIs to access Zenodo!
# The API to access detailed information of a given doi is OK!
# The API to access available files is OK!
# start checking APIs to access CELLxGENE!
# The API to access all available collections is OK!
# The API to access detailed information of a given collection is OK!
# The API to access available files is OK!
# The API to access detailed information of a given dataset is OK!
# start checking APIs to access Human Cell Atlas!
# The API to access all available catalogs is OK!
# The API to access all available projects is OK!
# The API to access available files is OK!
# The API to access detailed information of a given project is OK!
GEO
GEO is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. It provides a convenient way for users to explore and select scRNA-seq datasets of interest. In addition to the count matrix, GEO now includes processed objects uploaded by users as supplementary files.
Extract metadata (optional)
ExtractGEOMeta provides two ways to extract sample
metadata:
- user-provided sample metadata supplied when uploading to GEO (applicable to all GEO accessions), including sample title, source name/tissue, description, cell type, treatment, paper title, paper abstract, organism, protocol, data processing methods, etc.:
# library
library(tidyverse)
library(GEfetch2R)
# set VROOM_CONNECTION_SIZE to avoid error: Error: The size of the connection buffer (786432) was not large enough
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 60)
# extract metadata
GSE285723.meta <- ExtractGEOMeta(acce = "GSE285723")
GSE285723.meta[1:3, c("title", "geo_accession", "source_name_ch1", "description", "cell type")]
# title geo_accession source_name_ch1 description cell type
# 1 DRA-F GSM8707297 Lung Library name: DRA-F Homogenate
# 2 DRA-M GSM8707298 Lung Library name: DRA-M Homogenate
# 3 DRAO3-F GSM8707299 Lung Library name: DRAO3-F Homogenate
- metadata in supplementary file:
# example
GSE297431.meta.supp <- ExtractGEOMeta(
acce = "GSE297431", down.supp = TRUE,
supp.idx = 2 # specify the index of used supplementary file
)
head(GSE297431.meta.supp)
# Sample_ID Batch Plate Type Growths Class
# 1 Plate1_mut_A1_S70 batch2 Plate1 mutant 3 M3
# 2 Plate1_mut_A11_S135 batch2 Plate1 mutant 3 M3
# 3 Plate1_mut_A12_S141 batch2 Plate1 mutant 1 M1
# 4 Plate1_mut_A2_S77 batch2 Plate1 mutant 1 M1
# 5 Plate1_mut_A3_S84 batch2 Plate1 mutant 2 M2
# 6 Plate1_mut_A4_S91 batch2 Plate1 mutant 2 M2
Download object and load to R
After downloading the metadata, users can download the
specified objects with ParseGEOProcessed. The
format of downloaded objects is controlled by file.ext
(choose from "rds", "rdata",
"h5ad" and "loom"), and the provided
object formats should be in lower case.
The processed objects in the supplementary files come in two forms:
- a single file (rds(.gz)/rdata(.gz)/h5ad(.gz)/loom(.gz)) containing the processed object, e.g. GSE285723-RDS.gz
- a (gzip) archive of rds(.gz)/rdata(.gz)/h5ad(.gz)/loom(.gz) files that hold the processed objects, e.g. GSE298041-RDS in tar.gz
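The two forms can be told apart from the file name alone. The sketch below is a hypothetical helper (`classify_supp_file` is not part of GEfetch2R, and the file names are made up); it assumes archives end in `.tar` or `.tar.gz` and that extensions are matched case-insensitively.

```r
# Hypothetical helper (not a GEfetch2R function): classify supplementary
# file names into the two forms described above.
classify_supp_file <- function(filenames) {
  # archives: assumed to end in .tar or .tar.gz
  is_archive <- grepl("\\.tar(\\.gz)?$", filenames, ignore.case = TRUE)
  # single processed objects, optionally gzip-compressed
  is_single <- grepl("\\.(rds|rdata|h5ad|loom)(\\.gz)?$", filenames, ignore.case = TRUE)
  ifelse(is_archive, "archive", ifelse(is_single, "single file", "other"))
}

# Made-up file names for illustration
classify_supp_file(c("example_object.RDS.gz", "example_objects.tar.gz", "example_counts.csv.gz"))
# [1] "single file" "archive"     "other"
```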
# return SeuratObject
GSE285723.seu <- ParseGEOProcessed(
acce = "GSE285723", supp.idx = 1,
file.ext = c("rdata", "rds"), return.seu = TRUE, timeout = 36000000,
out.folder = "~/gefetch2r/doc/download_geoObj"
)
GSE285723.seu
# An object of class Seurat
# 56055 features across 49806 samples within 2 assays
# Active assay: SCT (23770 features, 3000 variable features)
# 1 other assay present: RNA
# 4 dimensional reductions calculated: pca, harmony, umap, tsne
# download h5ad objects
GSE311813.h5ad.log <- ParseGEOProcessed(
acce = "GSE311813", supp.idx = 1,
file.ext = c("h5ad"),
out.folder = "~/gefetch2r/doc/download_geoObj"
)
# # The structure of downloaded files
# tree ~/gefetch2r/doc/download_geoObj
# ~/gefetch2r/doc/download_geoObj
# ├── GSE285723
# │ └── GSE285723_Final_Ballinger.RDS
# └── GSE311813
# └── GSM9332642_merged_raw.ARS.vivo.clean.labelled.h5ad
# 2 directories, 2 files
Zenodo
Zenodo contains various types of
processed objects, such as SeuratObject which has been
clustered and annotated, and AnnData which contains processed
results generated by scanpy.
Extract metadata (optional)
GEfetch2R provides ExtractZenodoMeta to
extract dataset metadata, including dataset title, description,
available files and corresponding md5. Please note that when access to the dataset
is restricted, the returned dataframe will be empty.
# single doi
zebrafish.df <- ExtractZenodoMeta(doi = "10.5281/zenodo.7243603")
zebrafish.df
# title
# 1 zebrafish scRNA data set objects
# 2 zebrafish scRNA data set objects
# description
# 1 <p>Combined and converted scRNA data from http://tome.gs.washington.edu/ (Qiu et al. 2022), see a detailed description of the study here: https://www.nature.com/articles/s41588-022-01018-x</p>\n\n<p>Data were downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy package.</p>\n\n<p>If you use this data, please cite Farrel et al. 2018, Wagner et al. 2018 and Qiu et al. 2022.</p>
# 2 <p>Combined and converted scRNA data from http://tome.gs.washington.edu/ (Qiu et al. 2022), see a detailed description of the study here: https://www.nature.com/articles/s41588-022-01018-x</p>\n\n<p>Data were downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy package.</p>\n\n<p>If you use this data, please cite Farrel et al. 2018, Wagner et al. 2018 and Qiu et al. 2022.</p>
# url filename
# 1 https://zenodo.org/api/records/7243603/files/zebrafish_data.h5ad/content zebrafish_data.h5ad
# 2 https://zenodo.org/api/records/7243603/files/zebrafish_data.RData/content zebrafish_data.RData
# md5 license
# 1 124f2229128918b411a7dc7931558f97 cc-by-4.0
# 2 a08c3ebd285b370fcf34cf2f8f9bdb59 cc-by-4.0
# vector dois
multi.dois <- ExtractZenodoMeta(doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"))
multi.dois
# title
# 1 zebrafish scRNA data set objects
# 2 zebrafish scRNA data set objects
# 3 frog scRNA data set objects
# 4 frog scRNA data set objects
# description
# 1 <p>Combined and converted scRNA data from http://tome.gs.washington.edu/ (Qiu et al. 2022), see a detailed description of the study here: https://www.nature.com/articles/s41588-022-01018-x</p>\n\n<p>Data were downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy package.</p>\n\n<p>If you use this data, please cite Farrel et al. 2018, Wagner et al. 2018 and Qiu et al. 2022.</p>
# 2 <p>Combined and converted scRNA data from http://tome.gs.washington.edu/ (Qiu et al. 2022), see a detailed description of the study here: https://www.nature.com/articles/s41588-022-01018-x</p>\n\n<p>Data were downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy package.</p>\n\n<p>If you use this data, please cite Farrel et al. 2018, Wagner et al. 2018 and Qiu et al. 2022.</p>
# 3 <p>Combined and converted scRNA data from http://tome.gs.washington.edu/ (Qiu et al. 2022), see a detailed description of the study here: https://www.nature.com/articles/s41588-022-01018-x</p>\n\n<p>Data were downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy package.</p>\n\n<p>If you use this data, please cite Briggs et al. 2018 and Qiu et al. 2022.</p>
# 4 <p>Combined and converted scRNA data from http://tome.gs.washington.edu/ (Qiu et al. 2022), see a detailed description of the study here: https://www.nature.com/articles/s41588-022-01018-x</p>\n\n<p>Data were downloaded from http://tome.gs.washington.edu/ as R rds files, combined into a single Seurat object and converted into loom and AnnData (h5ad) files to be able to analyse with e.g. python scanpy package.</p>\n\n<p>If you use this data, please cite Briggs et al. 2018 and Qiu et al. 2022.</p>
# url filename
# 1 https://zenodo.org/api/records/7243603/files/zebrafish_data.h5ad/content zebrafish_data.h5ad
# 2 https://zenodo.org/api/records/7243603/files/zebrafish_data.RData/content zebrafish_data.RData
# 3 https://zenodo.org/api/records/7244441/files/frog_data.h5ad/content frog_data.h5ad
# 4 https://zenodo.org/api/records/7244441/files/frog_data.RData/content frog_data.RData
# md5 license
# 1 124f2229128918b411a7dc7931558f97 cc-by-4.0
# 2 a08c3ebd285b370fcf34cf2f8f9bdb59 cc-by-4.0
# 3 7be7d6ff024ab2c8579b4d0edb2428e3 cc-by-4.0
# 4 c80f46320c0cff9e341bed195f12c3b1 cc-by-4.0
Download object and load to R
After manually checking the extracted metadata, users can
download the specified objects with
ParseZenodo. The format of downloaded objects is
controlled by file.ext, and the provided object
formats should be in lower case.
The returned value is a dataframe containing failed objects or a
SeuratObject (if file.ext is rds
and return.seu = TRUE). If a dataframe is returned, users can re-run
ParseZenodo by setting doi.df to the returned
value.
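Once a file has been downloaded, the md5 value reported in the Zenodo metadata can be used to confirm that it arrived intact. The following is a minimal sketch using `tools::md5sum` from base R; `verify_md5` is a hypothetical helper, not a GEfetch2R function, and the demonstration uses a temporary file rather than a real download.

```r
# Hypothetical helper: compare a local file's md5 with the expected value.
verify_md5 <- function(path, expected) {
  observed <- unname(tools::md5sum(path))
  # md5sum returns NA for files that cannot be read
  !is.na(observed) && identical(tolower(observed), tolower(expected))
}

# Self-contained demonstration with a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines("example", tmp)
expected <- unname(tools::md5sum(tmp))
verify_md5(tmp, expected)            # TRUE
verify_md5(tmp, strrep("0", 32))     # FALSE
```

In practice, `expected` would come from the md5 column of the metadata dataframe and `path` from the output folder plus the filename column.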
# download objects
multi.dois.parse <- ParseZenodo(
doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"),
file.ext = c("rdata"), timeout = 36000000,
out.folder = "~/gefetch2r/doc/download_zenodo"
)
# return SeuratObject
single.doi.parse.seu <- ParseZenodo(
doi = "10.5281/zenodo.8011282",
file.ext = c("rds"), return.seu = TRUE, timeout = 36000000,
out.folder = "~/gefetch2r/doc/download_zenodo"
)
single.doi.parse.seu
# An object of class Seurat
# 19594 features across 9219 samples within 2 assays
# Active assay: RNA (17594 features, 0 variable features)
# 1 other assay present: integrated
# 2 dimensional reductions calculated: pca, umap
# # The structure of downloaded files
# tree ~/gefetch2r/doc/download_zenodo
# ~/gefetch2r/doc/download_zenodo
# ├── frog_data.RData
# ├── PyMTM_immune_scRNA.rds
# └── zebrafish_data.RData
CELLxGENE
CELLxGENE is a
web server that hosts 2043 single-cell datasets; users
can explore and download them, and upload their own datasets. Datasets
downloaded from CELLxGENE
come in two formats: h5ad (AnnData v0.8) and
rds (Seurat v4).
CELLxGENE does not
support downloading SeuratObject in versions
after 2025. Fortunately, we downloaded all the CELLxGENE datasets in
May 2024 and stored them in all.cellxgene.datasets.rds.
The all.cellxgene.datasets.rds file contains the
SeuratObject for downloading. However, this does not apply
to the case of a given dataset.
CELLxGENE provides an
R package (cellxgene.census)
to access the data, but it is not always up to date.
GEfetch2R also lets users access CELLxGENE via cellxgene.census
(use.census = TRUE).
Given dataset
With the collection or dataset link(s), users can download
the h5ad objects with
ParseCELLxGENE.
The returned value is NULL or a dataframe containing
failed objects. If a dataframe is returned, users can re-run
ParseCELLxGENE by setting meta to the returned
value.
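The re-run pattern described here can be automated with a small loop. The sketch below is generic and hypothetical: `retry_failed` is not a GEfetch2R function, and `download_fun` stands for any function that takes a dataframe and returns NULL on success or a dataframe of the entries that failed. A mock downloader is used so the example runs without network access.

```r
# Hypothetical helper: re-run a download function on the entries it reports as failed.
retry_failed <- function(download_fun, meta, max.tries = 3) {
  failed <- meta
  for (i in seq_len(max.tries)) {
    failed <- download_fun(failed)
    # stop as soon as nothing is left to retry
    if (is.null(failed) || nrow(failed) == 0) {
      return(NULL)
    }
  }
  failed # entries that still failed after max.tries attempts
}

# Mock downloader for illustration: entry "b" fails on the first call only.
attempts <- 0
mock_download <- function(meta) {
  attempts <<- attempts + 1
  if (attempts == 1) meta[meta$name == "b", , drop = FALSE] else NULL
}

meta <- data.frame(name = c("a", "b", "c"))
remaining <- retry_failed(mock_download, meta)
is.null(remaining) # TRUE: everything succeeded on the second attempt
```

With GEfetch2R, `download_fun` could be a thin wrapper that passes its argument to the `meta` parameter of ParseCELLxGENE, as described above; that wrapper is left out here because it needs network access.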
cellxgene.given.h5ad <- ParseCELLxGENE(
link = c(
"https://cellxgene.cziscience.com/collections/77f9d7e9-5675-49c3-abed-ce02f39eef1b", # collection
"https://cellxgene.cziscience.com/e/e12eb8a9-5e8b-4b59-90c8-77d29a811c00.cxg/" # dataset
),
timeout = 36000000,
out.folder = "~/gefetch2r/doc/download_cellxgene"
)
# The structure of downloaded files
# tree ~/gefetch2r/doc/download_cellxgene
# ~/gefetch2r/doc/download_cellxgene
# ├── Human.Immune.Health.Atlas.B.and.Plasma.cells.h5ad
# ├── Human.Immune.Health.Atlas.CD4.T.cells.h5ad
# ├── Human.Immune.Health.Atlas.CD8.T.cells.h5ad
# ├── Human.Immune.Health.Atlas.DCs.h5ad
# ├── Human.Immune.Health.Atlas.h5ad
# ├── Human.Immune.Health.Atlas.Monocytes.h5ad
# ├── Human.Immune.Health.Atlas.NK.cells.and.ILCs.h5ad
# └── Human.Immune.Health.Atlas.Other.cells.h5ad
#
# 0 directories, 8 files
Filter samples based on metadata
Show available datasets
GEfetch2R provides ShowCELLxGENEDatasets to
extract dataset metadata, including dataset title, description, contact,
organism, ethnicity, sex, tissue, disease, assay, suspension type, cell
type, etc.
# all available datasets
all.cellxgene.datasets <- ShowCELLxGENEDatasets()
nrow(all.cellxgene.datasets)
# [1] 2043
Load datasets with SeuratObject:
# the datasets with SeuratObject
# wget https://github.com/showteeth/GEfetch2R/raw/ff2f19f3b557f90fce5f8bf2f8662cebdfd04298/man/benchmark/all.cellxgene.datasets.rds
all.cellxgene.datasets <- readRDS("all.cellxgene.datasets.rds")
nrow(all.cellxgene.datasets)
# [1] 1320
Summary attributes
GEfetch2R provides StatDBAttribute to
summarize the attributes of CELLxGENE:
StatDBAttribute(
df = all.cellxgene.datasets, filter = c("organism", "sex", "disease"),
database = "CELLxGENE", combine = TRUE
)
# # A tibble: 280 × 4
# # Groups: organism, sex [18]
# organism sex disease Num
# <chr> <chr> <chr> <int>
# 1 homo sapiens male normal 687
# 2 homo sapiens female normal 535
# 3 mus musculus male normal 178
# 4 mus musculus female normal 125
# 5 mus musculus unknown normal 100
# 6 homo sapiens unknown normal 79
# 7 homo sapiens female covid-19 50
# 8 homo sapiens female dementia 50
# 9 homo sapiens male dementia 50
# 10 homo sapiens male covid-19 38
# # ℹ 270 more rows
# # ℹ Use `print(n = ...)` to see more rows
# # use cellxgene.census
# StatDBAttribute(filter = c("disease", "tissue", "cell_type"), database = "CELLxGENE",
# use.census = TRUE, organism = "homo_sapiens")
Filter metadata
GEfetch2R provides ExtractCELLxGENEMeta to
filter dataset metadata. The available values of each attribute, except cell
number, can be obtained with StatDBAttribute:
# human 10x v2 and v3 datasets
human.10x.cellxgene.meta <- ExtractCELLxGENEMeta(
all.samples.df = all.cellxgene.datasets,
assay = c("10x 3' v2", "10x 3' v3"), organism = "Homo sapiens"
)
nrow(human.10x.cellxgene.meta)
# [1] 627
# subset
cellxgene.down.meta <- human.10x.cellxgene.meta[human.10x.cellxgene.meta$cell_type == "oligodendrocyte" &
human.10x.cellxgene.meta$tissue == "entorhinal cortex", ]
nrow(cellxgene.down.meta)
# [1] 1
Download object and load to R
After manually checking the extracted metadata, users can
download the specified objects with
ParseCELLxGENE. The format of downloaded objects is controlled by
file.ext (choose from "rds" and
"h5ad"), and the provided object formats should be
in lower case.
The returned value is a dataframe containing failed objects or a
SeuratObject (if file.ext is rds
and return.seu = TRUE). If a dataframe is returned, users can re-run
ParseCELLxGENE by setting meta to the returned
value.
When using cellxgene.census, users can subset
the metadata and genes.
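In the calls in this section, the subset is expressed through obs.value.filter as a single string of conditions joined by `&`. As a convenience, such a string can be assembled from a named list; `build_obs_filter` below is a hypothetical helper, not part of GEfetch2R, and it assumes the values contain no single quotes.

```r
# Hypothetical helper: turn a named list of attribute values into one filter string.
build_obs_filter <- function(conditions) {
  # one "key == 'value'" clause per list element
  parts <- sprintf("%s == '%s'", names(conditions), unlist(conditions))
  paste(parts, collapse = " & ")
}

build_obs_filter(list(cell_type = "oligodendrocyte", disease = "Alzheimer disease"))
# [1] "cell_type == 'oligodendrocyte' & disease == 'Alzheimer disease'"
```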
# download objects
cellxgene.down <- ParseCELLxGENE(
meta = cellxgene.down.meta, file.ext = "rds", timeout = 36000000,
out.folder = "~/gefetch2r/doc/download_cellxgene"
)
cellxgene.down
# NULL
# return SeuratObject
cellxgene.down.seu <- ParseCELLxGENE(
meta = cellxgene.down.meta, file.ext = "rds", return.seu = TRUE, timeout = 36000000,
obs.value.filter = "cell_type == 'oligodendrocyte' & disease == 'Alzheimer disease'",
obs.keys = c("cell_type", "disease", "sex", "suspension_type", "development_stage"),
out.folder = "~/gefetch2r/doc/download_cellxgene"
)
cellxgene.down.seu
# An object of class Seurat
# 32743 features across 6873 samples within 1 assay
# Active assay: RNA (32743 features, 0 variable features)
# 3 dimensional reductions calculated: cca, cca.aligned, tsne
# # The structure of downloaded files
# tree ~/gefetch2r/doc/download_cellxgene
# ~/gefetch2r/doc/download_cellxgene
# ├── all.cellxgene.datasets.rds
# ├── Human.Immune.Health.Atlas.B.and.Plasma.cells.h5ad
# ├── Human.Immune.Health.Atlas.CD4.T.cells.h5ad
# ├── Human.Immune.Health.Atlas.CD8.T.cells.h5ad
# ├── Human.Immune.Health.Atlas.DCs.h5ad
# ├── Human.Immune.Health.Atlas.h5ad
# ├── Human.Immune.Health.Atlas.Monocytes.h5ad
# ├── Human.Immune.Health.Atlas.NK.cells.and.ILCs.h5ad
# ├── Human.Immune.Health.Atlas.Other.cells.h5ad
# └── Molecular.characterization.of.selectively.vulnerable.neurons.in.Alzheimer.s.Disease..EC.oligodendrocyte.rds
# 0 directories, 10 files
# # use cellxgene.census (support subset, but update is not timely)
# cellxgene.down.census <- ParseCELLxGENE(
# use.census = TRUE, organism = "Homo sapiens",
# obs.value.filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'",
# obs.keys = c("cell_type", "tissue_general", "disease", "sex"),
# include.genes = c("ENSG00000161798", "ENSG00000188229")
# )
Human Cell Atlas
The Human Cell Atlas
aims to map every cell type in the human body. It contains 546
projects, most of which are from Homo sapiens
(it also includes projects from Mus musculus,
Macaca mulatta and
Canis lupus familiaris).
Given dataset
With the dataset link(s), users can download the processed
objects with ParseHCA. The format of downloaded
objects is controlled by file.ext (choose from
"tsv", "rds", "rdata",
"h5", "h5ad" and "loom"), and
the provided object formats should be in lower
case.
The returned value is a dataframe containing failed objects or a
SeuratObject (if file.ext is rds
and return.seu = TRUE). If a dataframe is returned, users can re-run
ParseHCA by setting meta to the returned
value.
# download objects
hca.given.download <- ParseHCA(
link = c(
"https://explore.data.humancellatlas.org/projects/902dc043-7091-445c-9442-d72e163b9879",
"https://explore.data.humancellatlas.org/projects/cdabcf0b-7602-4abf-9afb-3b410e545703"
), timeout = 36000000,
out.folder = "~/gefetch2r/doc/download_hca"
)
# # The structure of downloaded files
# tree ~/gefetch2r/doc/download_hca
# ~/gefetch2r/doc/download_hca
# ├── COMBAT2022.h5ad
# └── seurat_object_hca_as_harmonized_AS_SP_nuc_refined_cells.rds.gz
#
# 0 directories, 2 files
# return SeuratObject
hypertrophic.heart.seu <- ParseHCA(
link = c(
"https://explore.data.humancellatlas.org/projects/902dc043-7091-445c-9442-d72e163b9879"
), timeout = 36000000, return.seu = TRUE,
out.folder = "~/gefetch2r/doc/download_hca"
)
Filter samples based on metadata
Show available datasets
GEfetch2R provides ShowHCAProjects to
extract detailed project metadata, including project title, description,
organism, sex, organ/organPart, disease, assay, preservation method,
sample type, suspension type, cell type, development stage, etc.
There are 546 unique projects:
all.hca.projects <- ShowHCAProjects()
nrow(all.hca.projects)
# [1] 546
Summary attributes
GEfetch2R provides StatDBAttribute to
summarize the attributes of Human
Cell Atlas:
StatDBAttribute(df = all.hca.projects, filter = c("organism", "sex"), database = "HCA")
# $organism
# Value Num Key
# 1 homo sapiens 520 organism
# 2 mus musculus 58 organism
# 3 canis lupus familiaris 1 organism
# 4 macaca mulatta 1 organism
#
# $sex
# Value Num Key
# 1 female 405 sex
# 2 male 392 sex
# 3 unknown 164 sex
# 4 mixed 6 sex
Filter metadata
GEfetch2R provides ExtractHCAMeta to filter
project metadata. The available values of each attribute, except cell
number, can be obtained with StatDBAttribute:
# human 10x v2 and v3 datasets
hca.human.10x.projects <- ExtractHCAMeta(
all.projects.df = all.hca.projects, organism = "Homo sapiens",
protocol = c("10x 3' v2", "10x 3' v3")
)
nrow(hca.human.10x.projects)
# [1] 251
Download object and load to R
After manually checking the extracted metadata, users can
download the specified objects with
ParseHCA. The format of downloaded objects is controlled
by file.ext (choose from "tsv",
"rds", "rdata", "h5",
"h5ad" and "loom"), and the provided
object formats should be in lower case.
The returned value is a dataframe containing failed objects or a
SeuratObject (if file.ext is rds
and return.seu = TRUE). If a dataframe is returned, users can re-run
ParseHCA by setting meta to the returned
value.
# download objects
hca.human.10x.down <- ParseHCA(
meta = hca.human.10x.projects[1:3, ],
out.folder = "~/gefetch2r/doc/download_hca",
file.ext = c("h5ad", "rds"), timeout = 36000000
)
# file downloaded
# 4f30b962-d49b-4624-a233-64f048cf8632_b61a921b-7fa3-4b42-b455-aaaf32447920.h5ad
Process RData files
As shown above, a downloaded rds file
containing a SeuratObject is loaded into
R automatically. For RData files, GEfetch2R provides
LoadRData to dissect them and extract their
contents.
LoadRData loads the RData file to a
separate environment, distinguishes the class of each object available,
and processes the objects according to the following logic:
- if widely used scRNA-seq objects (SeuratObject (Seurat v3/v4/v5 package), seuratobject (Seurat v2 package), SingleCellExperiment (SingleCellExperiment package), cell_data_set (Monocle v3 package), CellDataSet (Monocle package)) or bulk RNA-seq objects (DESeqDataSet (DESeq2 package), DGEList (edgeR package)) exist, LoadRData automatically distinguishes the object class and extracts the raw count matrix, normalized count matrix, scaled count matrix, and metadata:
| Object | raw count matrix | normalized count matrix | scaled count matrix | metadata | notes |
|---|---|---|---|---|---|
| SeuratObject (v3, v4) | slot counts | slot data | slot scale.data | obj@meta.data | support multiple assays (e.g.: RNA, integrated) |
| SeuratObject (v5) | layer counts | layer data | layer scale.data | obj@meta.data | support multiple assays (e.g.: RNA, integrated) |
| seuratobject | obj@raw.data | obj@data | obj@scale.data | obj@meta.data | |
| SingleCellExperiment | assay counts | assay logcounts | assay scaledata/scale.data | colData(obj) | support multiple experiments |
| cell_data_set | assay counts | assay logcounts | assay scaledata/scale.data | colData(obj) | support multiple experiments |
| CellDataSet | empty or exprs(obj) | empty or exprs(obj) | empty or exprs(obj) | pData(obj) | |
| DESeqDataSet | counts(obj, normalized = FALSE) | obj <- estimateSizeFactors(obj); counts(obj, normalized = TRUE) | empty | obj@colData | |
| DGEList | obj$counts | obj <- calcNormFactors(obj); cpm(obj) | empty | obj$samples | |
- else if non-standard objects (matrix, data.frame, dgCMatrix, dgRMatrix, dgTMatrix) exist:
  - if matrix/data.frame:
    - if the number of columns is greater than or equal to three:
      - if all column values are in the same class and the class is numeric/integer, treat this object as matrix
      - else if starting from the second column, all column values are numeric/integer, treat this object as matrix
      - else if column names contain pattern "sample|name|cell|id|library|well|barcode|index|type|condition|treat|group", treat this object as metadata/annotation
      - else display the structure of the object and load the object to R
    - else if column names contain pattern "sample|name|cell|id|library|well|barcode|index|type|condition|treat|group", treat this object as metadata/annotation
    - else display the structure of the object and load the object to R
  - else if dgCMatrix/dgRMatrix/dgTMatrix, treat this object as matrix
- else print the first six (or less) elements of the objects
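The matrix/data.frame branch of the logic above can be written down as a small decision function. This is a hypothetical re-implementation for illustration only, not LoadRData's source code; `classify_table` and `meta_pattern` are made-up names, and matching column names case-insensitively is an assumption.

```r
# Hypothetical sketch of the matrix/data.frame rules listed above.
meta_pattern <- "sample|name|cell|id|library|well|barcode|index|type|condition|treat|group"

classify_table <- function(x) {
  x <- as.data.frame(x)
  col_class <- vapply(x, function(col) class(col)[1], character(1))
  is_num <- col_class %in% c("numeric", "integer")
  same_class <- length(unique(col_class)) == 1
  # assumption: column names are matched case-insensitively
  has_meta_names <- any(grepl(meta_pattern, colnames(x), ignore.case = TRUE))
  if (ncol(x) >= 3) {
    if (same_class && all(is_num)) return("matrix")
    if (all(is_num[-1])) return("matrix") # first column may hold gene names
    if (has_meta_names) return("metadata")
    return("other")
  }
  if (has_meta_names) "metadata" else "other"
}

# Toy inputs
expr <- data.frame(gene = c("A", "B"), s1 = c(1, 2), s2 = c(3, 4))
anno <- data.frame(barcode = c("x", "y"), condition = c("ctl", "trt"))
classify_table(expr)               # "matrix"
classify_table(anno)               # "metadata"
classify_table(matrix(1:6, nrow = 2)) # "matrix"
```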
To demonstrate the usability of the LoadRData
function and its cross-repository support, we tested it on
RData files downloaded from GEO (the primary source; a few
representative RData files were selected),
Zenodo and Human Cell Atlas, and on a simulated
RData file containing non-standard objects.
GEO examples
GSE244572:
SeuratObject with multiple assays
# download the RData file
ParseGEOProcessed(acce = "GSE244572", timeout = 360000, supp.idx = 1, file.ext = c("rdata", "rds", "h5ad", "loom"))
# process the RData file
GSE244572.list <- LoadRData(
rdata = "GSE244572/GSE244572_RPE_CITESeq.RData",
accept.fmt = c("Seurat", "seurat", "SingleCellExperiment", "cell_data_set", "CellDataSet", "DESeqDataSet", "DGEList"),
slot = "counts", return.obj = TRUE
)
# # message:
# The object classes stored in RData: Seurat.
# Class
# obj Seurat
# Detect 1 object(s) in given class(s): Seurat, seurat, SingleCellExperiment, cell_data_set, CellDataSet, DESeqDataSet, DGEList.
# Load object: obj (Seurat) to global environment!
# Extract count matrix and metadata (if available) from: obj (Seurat).
# Detect Seurat version: 4.0.4, with assay(s): RNA, ADT, nADT, SCT, integrated, IADT.
# returned value(s)
ls() # list object(s) in the global environment
# [1] "GSE244572.list" "obj"
names(GSE244572.list) # one valid object
# [1] "obj"
head(GSE244572.list$obj$meta.data)
# orig.ident nCount_RNA nFeature_RNA sample percent.mt nCount_ADT nFeature_ADT donor time rpe nCount_nADT nFeature_nADT nCount_SCT nFeature_SCT integrated_snn_res.0.5 seurat_clusters
# CCTAGATTAAT_1 SeuratProject 487751 3922 318_2W 13.34062 387973 121 318 2W adult 22 14 214660 3175 5 0
# GTTGAGCTGCG_1 SeuratProject 124013 1477 318_2W 13.27925 312671 110 318 2W adult 21 14 215337 1475 2 4
# GTTGATCTTCT_1 SeuratProject 95909 1344 318_2W 15.60333 44974 100 318 2W adult 35 27 214940 1359 0 2
# TAACCTTGAGT_1 SeuratProject 328530 2361 319_2W 17.52930 279285 128 319 2W adult 23 18 215249 2349 2 0
# CCTCCGCCTGC_1 SeuratProject 702493 4766 319_2W 22.13887 1579539 143 319 2W adult 32 17 215407 4027 1 6
# CCATATACGAC_1 SeuratProject 497485 3694 319_2W 14.13791 873609 117 319 2W adult 23 12 214400 2913 1 4
# integrated.weight wsnn_res.0.5 wsnn_res.0.2
# CCTAGATTAAT_1 0.7954453 3 0
# GTTGAGCTGCG_1 0.6762115 5 4
# GTTGATCTTCT_1 0.9341439 0 2
# TAACCTTGAGT_1 0.7959367 3 0
# CCTCCGCCTGC_1 0.5048201 10 6
# CCATATACGAC_1 0.7556721 5 4
names(GSE244572.list$obj$count.mat) # list of six assay(s)
# [1] "RNA" "ADT" "nADT" "SCT" "integrated" "IADT"
names(GSE244572.list$obj$count.mat$RNA) # list of slot(s)
# [1] "counts"
GSE244572.list$obj$count.mat$RNA$counts[1:5, 1:5] # count matrix
# 5 x 5 sparse Matrix of class "dgCMatrix"
# CCTAGATTAAT_1 GTTGAGCTGCG_1 GTTGATCTTCT_1 TAACCTTGAGT_1 CCTCCGCCTGC_1
# WASH7P . . . . .
# CICP27 . . . . .
# AL627309.6 . . . . .
# AL627309.5 . . . . .
# FO538757.1 . . . . .
Key parameters:
- accept.fmt: vector, the class of objects for loading.
- slot: vector, the type of count matrix to pull. counts: raw, un-normalized counts; data: normalized data; scale.data: z-scored/variance-stabilized data.
- return.obj: logical value, whether to load the available objects in accept.fmt to global environment.
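Judging from the accessors used in the example above, the returned list is nested by object, then `count.mat`, then assay, then slot, with `meta.data` alongside `count.mat`. The sketch below walks a mock list of that shape (toy values, not real LoadRData output) and tabulates the dimensions of every matrix it finds.

```r
# Mock list mimicking the nesting seen in the example output (toy values).
mock.list <- list(
  obj = list(
    meta.data = data.frame(sample = c("s1", "s2"), row.names = c("cell1", "cell2")),
    count.mat = list(
      RNA = list(counts = matrix(0, nrow = 3, ncol = 2)),
      ADT = list(counts = matrix(0, nrow = 1, ncol = 2))
    )
  )
)

# Collect the dimensions of every object/assay/slot combination.
mat.dims <- do.call(rbind, lapply(names(mock.list), function(obj.name) {
  assays <- mock.list[[obj.name]]$count.mat
  do.call(rbind, lapply(names(assays), function(assay.name) {
    slots <- assays[[assay.name]]
    data.frame(
      object = obj.name, assay = assay.name, slot = names(slots),
      n.features = vapply(slots, nrow, integer(1)),
      n.cells = vapply(slots, ncol, integer(1)),
      row.names = NULL
    )
  }))
}))
mat.dims
```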
GSE249307:
multiple SeuratObject objects
# download the RData file
ParseGEOProcessed(acce = "GSE249307", timeout = 360000, supp.idx = 1, file.ext = c("rdata", "rds", "h5ad", "loom"))
# process the RData file
GSE249307.list <- LoadRData(
rdata = "GSE249307/GSE249307_scRNA_seurat_data.RData",
accept.fmt = c("Seurat", "seurat", "SingleCellExperiment", "cell_data_set", "CellDataSet", "DESeqDataSet", "DGEList"),
slot = "counts"
)
# # message:
# The object classes stored in RData: Seurat.
# Class
# processed_seurat_object Seurat
# raw_seurat_object Seurat
# Detect 2 object(s) in given class(s): Seurat, seurat, SingleCellExperiment, cell_data_set, CellDataSet, DESeqDataSet, DGEList.
# Load object: processed_seurat_object (Seurat) to global environment!
# Extract count matrix and metadata (if available) from: processed_seurat_object (Seurat).
# Detect Seurat version: 3.0.0.9000, with assay(s): RNA.
# Load object: raw_seurat_object (Seurat) to global environment!
# Extract count matrix and metadata (if available) from: raw_seurat_object (Seurat).
# Detect Seurat version: 3.0.0.9000, with assay(s): RNA.
# returned value(s)
ls() # list object(s) in the global environment
# [1] "GSE249307.list" "processed_seurat_object" "raw_seurat_object"
names(GSE249307.list) # two valid objects
# [1] "processed_seurat_object" "raw_seurat_object"
head(GSE249307.list$raw_seurat_object$meta.data)
# orig.ident nCount_RNA nFeature_RNA sample species umi_count_50dup rnd1_well rnd2_well rnd3_well valid library cell_barcode cells_unique treatment1 treatment2 cytokine_timepoint
# lib1_AGATCGCACATCAAGT_42 lib1_2_analysis 356571 13917 samp43 hg38 689652.3 42 7 24 1 lib1 AGATCGCACATCAAGT_42 unique IL17+TNFa Virus_6hpi late
# lib1_ACTATGCAATAGCGAC_22 lib1_2_analysis 198322 11831 samp23 hg38 383579.2 22 79 22 1 lib1 ACTATGCAATAGCGAC_22 unique Vehicle Virus_6hpi baseline
# lib1_CGACTGGAGACAGTGC_23 lib1_2_analysis 187640 11865 samp24 hg38 362918.9 23 92 40 1 lib1 CGACTGGAGACAGTGC_23 unique IL13 Virus_6hpi early
# lib1_AACGCTTAGTACGCAA_24 lib1_2_analysis 177905 11530 samp25 hg38 344090.2 24 55 14 1 lib1 AACGCTTAGTACGCAA_24 unique IL17+TNFa Virus_6hpi early
# lib1_ACACGACCAGTCACTA_11 lib1_2_analysis 156939 10976 samp12 hg38 303539.4 11 26 73 1 lib1 ACACGACCAGTCACTA_11 unique Vehicle Virus_72hpi baseline
# lib1_TCTTCACAATAGCGAC_42 lib1_2_analysis 142185 11049 samp43 hg38 275003.3 42 79 62 1 lib1 TCTTCACAATAGCGAC_42 unique IL17+TNFa Virus_6hpi late
# donor
# lib1_AGATCGCACATCAAGT_42 C
# lib1_ACTATGCAATAGCGAC_22 B
# lib1_CGACTGGAGACAGTGC_23 B
# lib1_AACGCTTAGTACGCAA_24 B
# lib1_ACACGACCAGTCACTA_11 A
# lib1_TCTTCACAATAGCGAC_42 C
names(GSE249307.list$raw_seurat_object$count.mat) # list of available assay(s)
# [1] "RNA"
names(GSE249307.list$raw_seurat_object$count.mat$RNA) # list of slot(s)
# [1] "counts"
GSE249307.list$raw_seurat_object$count.mat$RNA$counts[1:5, 1:5] # count matrix
# 5 x 5 sparse Matrix of class "dgCMatrix"
# lib1_AGATCGCACATCAAGT_42 lib1_ACTATGCAATAGCGAC_22 lib1_CGACTGGAGACAGTGC_23 lib1_AACGCTTAGTACGCAA_24 lib1_ACACGACCAGTCACTA_11
# A1BG . . . . .
# A1BG-AS1 1 . 1 . .
# A1CF . . . . .
# A2M . . . . .
# A2ML1 . . . . .
GSE282783:
SeuratObject and non-standard objects
# download the RData file
ParseGEOProcessed(acce = "GSE282783", timeout = 360000, supp.idx = 1, file.ext = c("rdata", "rds", "h5ad", "loom"))
# process the RData file
GSE282783.list <- LoadRData(
rdata = "GSE282783/GSE282783_E16_FT_E17_hGFAP-Cre_mek12dcko.Rdata",
accept.fmt = c("Seurat", "seurat", "SingleCellExperiment", "cell_data_set", "CellDataSet", "DESeqDataSet", "DGEList"),
slot = "counts"
)
# # message:
# The object classes stored in RData: character, data.frame, IntegrationAnchorSet, Seurat, function.
# Class
# all.genes character
# Cluster data.frame
# g2m.genes data.frame
# immune.anchors IntegrationAnchorSet
# immune.combined Seurat
# Mapkdcko_CTR Seurat
# Mapkdcko_EXP Seurat
# MarkerPlot function
# s.genes data.frame
# Detect 3 object(s) in given class(s): Seurat, seurat, SingleCellExperiment, cell_data_set, CellDataSet, DESeqDataSet, DGEList.
# Load object: immune.combined (Seurat) to global environment!
# Extract count matrix and metadata (if available) from: immune.combined (Seurat).
# Detect Seurat version: 4.0.4, with assay(s): RNA, integrated.
# Load object: Mapkdcko_CTR (Seurat) to global environment!
# Extract count matrix and metadata (if available) from: Mapkdcko_CTR (Seurat).
# Detect Seurat version: 4.0.4, with assay(s): RNA.
# Load object: Mapkdcko_EXP (Seurat) to global environment!
# Extract count matrix and metadata (if available) from: Mapkdcko_EXP (Seurat).
# Detect Seurat version: 4.0.4, with assay(s): RNA.
# returned value(s)
ls() # list object(s) in the global environment
# [1] "GSE282783.list" "immune.combined" "Mapkdcko_CTR" "Mapkdcko_EXP"
names(GSE282783.list) # three valid objects
# [1] "immune.combined" "Mapkdcko_CTR" "Mapkdcko_EXP"
head(GSE282783.list$immune.combined$meta.data)
# orig.ident nCount_RNA nFeature_RNA Sample S.Score G2M.Score Phase old.ident integrated_snn_res.0.6 seurat_clusters
# AAACCCAAGCACGTCC-1_1 Mapkdcko_CTR 3584 1586 Mapkdcko_CTR -0.010667752 -0.10155295 G1 Mapkdcko_CTR 0 0
# AAACCCAAGTGCAGGT-1_1 Mapkdcko_CTR 11187 3232 Mapkdcko_CTR 0.544009405 0.92782765 G2M Mapkdcko_CTR 9 9
# AAACCCACAACGTAAA-1_1 Mapkdcko_CTR 6498 2217 Mapkdcko_CTR -0.138904079 -0.09983576 G1 Mapkdcko_CTR 2 2
# AAACCCACAACGTTAC-1_1 Mapkdcko_CTR 4531 1873 Mapkdcko_CTR -0.008647976 -0.16412585 G1 Mapkdcko_CTR 0 0
# AAACCCACACTGCGAC-1_1 Mapkdcko_CTR 6588 2568 Mapkdcko_CTR -0.175856885 -0.19708486 G1 Mapkdcko_CTR 5 5
# AAACCCAGTACGATCT-1_1 Mapkdcko_CTR 3790 1616 Mapkdcko_CTR -0.101067444 -0.06548076 G1 Mapkdcko_CTR 0 0
names(GSE282783.list$immune.combined$count.mat) # list of available assay(s)
# [1] "RNA" "integrated"
names(GSE282783.list$immune.combined$count.mat$RNA) # list of slot(s)
# [1] "counts"
GSE282783.list$immune.combined$count.mat$RNA$counts[1:5, 1:5] # count matrix
# 5 x 5 sparse Matrix of class "dgCMatrix"
# AAACCCAAGCACGTCC-1_1 AAACCCAAGTGCAGGT-1_1 AAACCCACAACGTAAA-1_1 AAACCCACAACGTTAC-1_1 AAACCCACACTGCGAC-1_1
# Xkr4 . . . . .
# Gm1992 . . . . .
# Gm37381 . . . . .
# Rp1 . . . . .
# Rp1.1 . . . . .
Zenodo examples
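The `doi` vector used in the download call below mixes "1111" with two identifiers of the usual Zenodo form `10.5281/zenodo.<record id>`, and only two files appear in the downloaded tree. A minimal sketch of screening such a vector before a download; the regular expression and the variable names are assumptions for illustration and do not describe how `ParseZenodo` validates its input.

```r
dois <- c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441")

# Zenodo DOIs use the prefix 10.5281 followed by "zenodo." and a numeric record id
is.zenodo.doi <- grepl("^10\\.5281/zenodo\\.[0-9]+$", dois)

valid.dois <- dois[is.zenodo.doi]
invalid.dois <- dois[!is.zenodo.doi]

if (length(invalid.dois) > 0) {
  message("Not Zenodo-style DOIs: ", paste(invalid.dois, collapse = ", "))
}
```

Passing only `valid.dois` on would keep malformed entries out of the request.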
Download the RData files:
multi.dois.parse <- ParseZenodo(
doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"),
file.ext = c("rdata"), timeout = 36000000,
out.folder = "~/gefetch2r/doc/download_zenodo"
)
# # The structure of downloaded files
# tree ~/gefetch2r/doc/download_zenodo
# ~/gefetch2r/doc/download_zenodo
# ├── frog_data.RData
# └── zebrafish_data.RData
Dissect and extract the RData file (frog_data.RData):
zenodo.frog.list <- LoadRData(
rdata = "~/gefetch2r/doc/download_zenodo/frog_data.RData",
accept.fmt = c("Seurat", "seurat", "SingleCellExperiment", "cell_data_set", "CellDataSet", "DESeqDataSet", "DGEList"),
return.obj = FALSE
)
# # message:
# The object classes stored in RData: Seurat.
# Class
# frog_data Seurat
# Detect 1 object(s) in given class(s): Seurat, seurat, SingleCellExperiment, cell_data_set, CellDataSet, DESeqDataSet, DGEList.
# Extract count matrix and metadata (if available) from: frog_data (Seurat).
# Detect Seurat version: 4.0.4, with assay(s): RNA.
# returned value(s)
ls() # list object(s) in the global environment
# [1] "zenodo.frog.list"
names(zenodo.frog.list) # one valid object
# [1] "frog_data"
head(zenodo.frog.list$frog_data$meta.data)
# orig.ident nCount_RNA nFeature_RNA sample stage group cell_state cell_type
# S8_cell_1 cell 20658 5506 cell_1 S8 Clutch_1 S8:blastula blastula
# S8_cell_2 cell 17002 5209 cell_2 S8 Clutch_1 S8:blastula blastula
# S8_cell_3 cell 16190 4880 cell_3 S8 Clutch_1 S8:blastula blastula
# S8_cell_4 cell 15652 4930 cell_4 S8 Clutch_1 S8:blastula blastula
# S8_cell_5 cell 14325 4598 cell_5 S8 Clutch_1 S8:blastula blastula
# S8_cell_6 cell 12658 4242 cell_6 S8 Clutch_1 S8:blastula blastula
names(zenodo.frog.list$frog_data$count.mat) # list of available assay(s)
# [1] "RNA"
names(zenodo.frog.list$frog_data$count.mat$RNA) # list of slot(s)
# [1] "counts" "data" "scale.data"
zenodo.frog.list$frog_data$count.mat$RNA$counts[1:5, 1:5] # count matrix
# 5 x 5 sparse Matrix of class "dgCMatrix"
# S8_cell_1 S8_cell_2 S8_cell_3 S8_cell_4 S8_cell_5
# 42Sp43 4 1 1 8 3
# 42Sp50 1 . 7 . 1
# 6330408a02rik-like.1 . . . . .
# 6330408a02rik-like.2 . . . . .
# AK6 2 . . . 1
Dissect and extract the RData file (zebrafish_data.RData):
zenodo.zebrafish.list <- LoadRData(
rdata = "~/gefetch2r/doc/download_zenodo/zebrafish_data.RData",
accept.fmt = c("Seurat", "seurat", "SingleCellExperiment", "cell_data_set", "CellDataSet", "DESeqDataSet", "DGEList"),
return.obj = FALSE
)
# # message:
# The object classes stored in RData: Seurat.
# Class
# zebrafish_data Seurat
# Detect 1 object(s) in given class(s): Seurat, seurat, SingleCellExperiment, cell_data_set, CellDataSet, DESeqDataSet, DGEList.
# Extract count matrix and metadata (if available) from: zebrafish_data (Seurat).
# Detect Seurat version: 4.0.4, with assay(s): RNA.
# returned value(s)
ls() # list object(s) in the global environment
# [1] "zenodo.zebrafish.list"
names(zenodo.zebrafish.list) # one valid object
# [1] "zebrafish_data"
head(zenodo.zebrafish.list$zebrafish_data$meta.data)
# orig.ident nCount_RNA nFeature_RNA sample stage group cell_state cell_type
# hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC ZFHIGH 5773 2570 ZFHIGH_WT_DS5_AAAAGTTGCCTC hpf3.3 F_3.3 hpf3.3:blastomere blastomere
# hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT ZFHIGH 2312 1451 ZFHIGH_WT_DS5_AAACAAGTGTAT hpf3.3 F_3.3 hpf3.3:blastomere blastomere
# hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC ZFHIGH 4180 2166 ZFHIGH_WT_DS5_AAACACCTCGTC hpf3.3 F_3.3 hpf3.3:blastomere blastomere
# hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN ZFHIGH 6686 2845 ZFHIGH_WT_DS5_AAATGAGGTTTN hpf3.3 F_3.3 hpf3.3:blastomere blastomere
# hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT ZFHIGH 20095 4993 ZFHIGH_WT_DS5_AACCCTCTCGAT hpf3.3 F_3.3 hpf3.3:blastomere blastomere
# hpf3.3_ZFHIGH_WT_DS5_AACGAAAGGTAA ZFHIGH 1443 1019 ZFHIGH_WT_DS5_AACGAAAGGTAA hpf3.3 F_3.3 hpf3.3:blastomere blastomere
names(zenodo.zebrafish.list$zebrafish_data$count.mat) # list of available assay(s)
# [1] "RNA"
names(zenodo.zebrafish.list$zebrafish_data$count.mat$RNA) # list of slot(s)
# [1] "counts" "data" "scale.data"
zenodo.zebrafish.list$zebrafish_data$count.mat$RNA$counts[1:5, 1:5] # count matrix
# 5 x 5 sparse Matrix of class "dgCMatrix"
# hpf3.3_ZFHIGH_WT_DS5_AAAAGTTGCCTC hpf3.3_ZFHIGH_WT_DS5_AAACAAGTGTAT hpf3.3_ZFHIGH_WT_DS5_AAACACCTCGTC hpf3.3_ZFHIGH_WT_DS5_AAATGAGGTTTN hpf3.3_ZFHIGH_WT_DS5_AACCCTCTCGAT
# ENSDARG00000002968 . . . . .
# ENSDARG00000056314 . . . . .
# ENSDARG00000102274 . . . . .
# ENSDARG00000012468 . . . . .
# ENSDARG00000063621 . . . . .
Human Cell Atlas
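The files downloaded in this section end in `.RData.gz` or `.Rdata.gz`, and the log reports that a compressed RData file is decompressed before loading. The base-R sketch below reproduces that step on a toy file: it wraps a saved RData file in an extra gzip layer, strips the layer again, and loads the result into a private environment. `decompress_gz()` and the toy objects are hypothetical and are not GEfetch2R's implementation.

```r
# Build a toy RData file and wrap it in an additional gzip layer
tmp.dir <- tempfile("rdata_gz_demo")
dir.create(tmp.dir)
rdata.path <- file.path(tmp.dir, "toy.RData")
toy.counts <- matrix(1:6, nrow = 2, dimnames = list(c("g1", "g2"), c("c1", "c2", "c3")))
save(toy.counts, file = rdata.path)

gz.path <- paste0(rdata.path, ".gz")
raw.bytes <- readBin(rdata.path, what = "raw", n = file.size(rdata.path))
gz.con <- gzfile(gz.path, open = "wb")
writeBin(raw.bytes, gz.con)
close(gz.con)

# Hypothetical helper: copy the gzip-decoded byte stream to a new file in chunks
decompress_gz <- function(gz.file, out.file) {
  in.con <- gzfile(gz.file, open = "rb")
  on.exit(close(in.con))
  out.con <- file(out.file, open = "wb")
  on.exit(close(out.con), add = TRUE)
  repeat {
    chunk <- readBin(in.con, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    writeBin(chunk, out.con)
  }
  invisible(out.file)
}

plain.path <- file.path(tmp.dir, "toy_unwrapped.RData")
decompress_gz(gz.path, plain.path)

# Load into a private environment so the global environment stays untouched
loaded.env <- new.env()
loaded.names <- load(plain.path, envir = loaded.env)
```

Reading in chunks keeps memory use bounded, which matters more for real single-cell objects than for this toy matrix.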
Download the RData files:
hca.given.download <- ParseHCA(
link = c(
"https://explore.data.humancellatlas.org/projects/c302fe54-d22d-451f-a130-e24df3d6afca",
"https://explore.data.humancellatlas.org/projects/34c9a62c-a610-4e31-b343-8fb7be676f8c"
), timeout = 360000000000000, file.ext = "rdata", parallel = FALSE,
out.folder = "./RData"
)
# # The structure of downloaded files
# tree RData/
# RData/
# ├── GSE130560_matrix.RData.gz
# └── GSE134174_Processed_invivo_seurat.Rdata.gz
#
# 0 directories, 2 files
Dissect and extract the RData file (GSE130560_matrix.RData.gz):
hca.GSE130560.list <- LoadRData(
rdata = "RData/GSE130560_matrix.RData.gz",
accept.fmt = c("Seurat", "seurat", "SingleCellExperiment", "cell_data_set", "CellDataSet", "DESeqDataSet", "DGEList"),
return.obj = FALSE
)
# # message:
# Detect RData file in compressed format, decompressing now!
# The object classes stored in RData: dgCMatrix.
# Class
# matrix dgCMatrix
# No valid object in given class(s): Seurat, seurat, SingleCellExperiment, cell_data_set, CellDataSet, DESeqDataSet, DGEList. Now we will guess the type!
# The slot parameter does not work here!
# matrix is a sparse matrix. Most likely a count matrix!
# returned value(s)
ls() # list object(s) in the global environment
# [1] "hca.GSE130560.list"
names(hca.GSE130560.list) # two elements: count matrix and metadata
# [1] "count" "meta"
names(hca.GSE130560.list$meta) # list of available metadata (no metadata)
# NULL
names(hca.GSE130560.list$count) # list of available count matrices
# [1] "matrix"
hca.GSE130560.list$count$matrix[1:5, 1:5] # count matrix
# 5 x 5 sparse Matrix of class "dgCMatrix"
# AAACCTGGTCTAACGT_1 AACACGTGTATATGAG_1 AACTGGTAGTTAGGTA_1 AACTTTCTCATCGCTC_1 AAGACCTAGCTAGCCC_1
# FO538757.2 . . . . .
# AP006222.2 1 . . 1 .
# RP11-206L10.9 . . . . .
# LINC00115 . . . . .
# FAM41C . 1 . . .
Dissect and extract the RData file (GSE134174_Processed_invivo_seurat.Rdata.gz):
hca.GSE134174.list <- LoadRData(
rdata = "RData/GSE134174_Processed_invivo_seurat.Rdata.gz",
accept.fmt = c("Seurat", "seurat", "SingleCellExperiment", "cell_data_set", "CellDataSet", "DESeqDataSet", "DGEList"),
return.obj = FALSE
)
# # message:
# Detect RData file in compressed format, decompressing now!
# The object classes stored in RData: Seurat.
# Class
# T15_int Seurat
# Detect 1 object(s) in given class(s): Seurat, seurat, SingleCellExperiment, cell_data_set, CellDataSet, DESeqDataSet, DGEList.
# Extract count matrix and metadata (if available) from: T15_int (Seurat).
# Loading required package: Seurat
# Attaching SeuratObject
# Detect Seurat version: 3.0.3.9015, with assay(s): RNA, SCT, integrated.
# returned value(s)
ls() # list object(s) in the global environment
# [1] "hca.GSE134174.list"
names(hca.GSE134174.list) # one valid object
# [1] "T15_int"
head(hca.GSE134174.list$T15_int$meta.data)
# orig.ident nCount_RNA nFeature_RNA propMT donor smoke smoke_noT89 Smoke_status pack_years age
# AAACCCAAGGCGACAT_1 T15 13929 4785 0.13461624 T101 heavy heavy heavy 25 55
# AAACCCAGTACTCGAT_1 T15 8738 3489 0.14035088 T101 heavy heavy heavy 25 55
# AAACCCAGTATGTGTC_1 T15 3108 1671 0.01377682 T101 heavy heavy heavy 25 55
# AAACCCAGTTAGCGGA_1 T15 30747 6538 0.14076632 T101 heavy heavy heavy 25 55
# AAACCCAGTTGCCGAC_1 T15 55390 8317 0.14026411 T101 heavy heavy heavy 25 55
# AAACCCATCTTTCTAG_1 T15 27888 5982 0.13987778 T101 heavy heavy heavy 25 55
# sex clusters_10 cluster_ident clusters10_smoke clusters_16a subcluster_ident
# AAACCCAAGGCGACAT_1 M c2 Differentiating.basal c2_heavy c2 Differentiating.basal
# AAACCCAGTACTCGAT_1 M c3 SMG.basal c3_heavy c3a SMG.basal.A
# AAACCCAGTATGTGTC_1 M c4 KRT8.high c4_heavy c4 KRT8.high
# AAACCCAGTTAGCGGA_1 M c4 KRT8.high c4_heavy c4 KRT8.high
# AAACCCAGTTGCCGAC_1 M c1 Proliferating.basal c1_heavy c1 Proliferating.basal
# AAACCCATCTTTCTAG_1 M c4 KRT8.high c4_heavy c4 KRT8.high
# clusters16a_smoke
# AAACCCAAGGCGACAT_1 c2_heavy
# AAACCCAGTACTCGAT_1 c3a_heavy
# AAACCCAGTATGTGTC_1 c4_heavy
# AAACCCAGTTAGCGGA_1 c4_heavy
# AAACCCAGTTGCCGAC_1 c1_heavy
# AAACCCATCTTTCTAG_1 c4_heavy
names(hca.GSE134174.list$T15_int$count.mat) # list of available assay(s)
# [1] "RNA" "SCT" "integrated"
names(hca.GSE134174.list$T15_int$count.mat$RNA) # list of slot(s)
# [1] "counts" "data" "scale.data"
hca.GSE134174.list$T15_int$count.mat$RNA$counts[1:5, 1:5] # count matrix
# 5 x 5 sparse Matrix of class "dgCMatrix"
# AAACCCAAGGCGACAT_1 AAACCCAGTACTCGAT_1 AAACCCAGTATGTGTC_1 AAACCCAGTTAGCGGA_1 AAACCCAGTTGCCGAC_1
# AL627309.1 . . . . .
# AL669831.5 . . . 2 .
# LINC00115 . . . . .
# FAM41C . . . . .
# AL645608.3 . . . . .
Simulated non-standard objects
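When an RData file may hold a mixture of objects, it can help to list their classes without loading anything into the global environment. The base-R sketch below loads a toy file into a private environment, tabulates the classes, and flags data frames whose columns are all numeric. `list_rdata_classes()` and `all_numeric_df()` are hypothetical helpers, and the all-numeric rule is only a guess at the kind of heuristic behind a "Most likely a count matrix!" message, not GEfetch2R's actual logic.

```r
# Toy RData file with a count-like data frame, an annotation data frame and a vector
tmp.rdata <- tempfile(fileext = ".RData")
toy.cnt <- data.frame(s1 = c(382, 0), s2 = c(201, 0), row.names = c("Gnai3", "Pbsn"))
toy.genes <- data.frame(gene = c("MCM5", "PCNA"))
toy.vec <- c("HMGB2", "CDK1")
save(toy.cnt, toy.genes, toy.vec, file = tmp.rdata)

# Hypothetical helper: load into a private environment and report each object's class
list_rdata_classes <- function(rdata) {
  env <- new.env()
  obj.names <- sort(load(rdata, envir = env))
  classes <- vapply(
    obj.names,
    function(x) paste(class(env[[x]]), collapse = ", "),
    character(1)
  )
  list(env = env, classes = classes)
}

# Hypothetical heuristic: a data frame with only numeric columns may be a count matrix
all_numeric_df <- function(x) {
  is.data.frame(x) && ncol(x) > 0 && all(vapply(x, is.numeric, logical(1)))
}

res <- list_rdata_classes(tmp.rdata)
likely.count <- vapply(
  names(res$classes),
  function(x) all_numeric_df(res$env[[x]]),
  logical(1)
)
```

Here `res$classes` plays the role of the class table printed in the log, and `likely.count` marks which entries would deserve a closer look as count matrices.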
Generate an RData file containing a mixture of non-standard objects:
# dgCMatrix
sparse.mat <- SeuratObject::GetAssayData(SeuratObject::pbmc_small, assay = "RNA", slot = "counts")
# count matrix and metadata from GSE297431
GSE297431.meta.supp <- ExtractGEOMeta(acce = "GSE297431", down.supp = TRUE, supp.idx = 2)
GSE297431.cnt <- ParseGEO(acce = "GSE297431", down.supp = TRUE, supp.idx = 1, supp.type = "count", load2R = FALSE)
# move rownames into the first column of the data frame (`%>%` requires magrittr or dplyr to be attached)
GSE297431.cnt.row2col <- GSE297431.cnt %>%
tibble::rownames_to_column(var = "Gene") %>%
dplyr::relocate(Gene)
# dataframe to matrix
GSE297431.cnt.mat <- GSE297431.cnt %>% as.matrix()
# list (noise)
cc.genes <- Seurat::cc.genes
# dataframe (noise)
s.genes <- data.frame(gene = cc.genes$s.genes)
# vector (noise)
g2m.genes <- cc.genes$g2m.genes
# save
save(sparse.mat, GSE297431.meta.supp, GSE297431.cnt, GSE297431.cnt.row2col, GSE297431.cnt.mat, cc.genes, s.genes, g2m.genes,
file = "simulated_non_standard_objects.RData"
)
Dissect and extract the generated RData file:
# process the object
non.standard.list <- LoadRData(
rdata = "simulated_non_standard_objects.RData",
accept.fmt = c("Seurat", "seurat", "SingleCellExperiment", "cell_data_set", "CellDataSet", "DESeqDataSet", "DGEList"),
return.obj = FALSE
)
# # message:
# The object classes stored in RData: list, character, data.frame, matrix, array, dgCMatrix.
# Class
# cc.genes list
# g2m.genes character
# GSE297431.cnt data.frame
# GSE297431.cnt.mat matrix, array
# GSE297431.cnt.row2col data.frame
# GSE297431.meta.supp data.frame
# s.genes data.frame
# sparse.mat dgCMatrix
# No valid object in given class(s): Seurat, seurat, SingleCellExperiment, cell_data_set, CellDataSet, DESeqDataSet, DGEList. Now we will guess the type!
# The slot parameter does not work here!
# cc.genes is list.
# $s.genes
# [1] "MCM5" "PCNA" "TYMS" "FEN1" "MCM2" "MCM4" "RRM1" "UNG" "GINS2" "MCM6" "CDCA7" "DTL" "PRIM1" "UHRF1" "MLF1IP" "HELLS" "RFC2" "RPA2" "NASP"
# [20] "RAD51AP1" "GMNN" "WDR76" "SLBP" "CCNE2" "UBR7" "POLD3" "MSH2" "ATAD2" "RAD51" "RRM2" "CDC45" "CDC6" "EXO1" "TIPIN" "DSCC1" "BLM" "CASP8AP2" "USP1"
# [39] "CLSPN" "POLA1" "CHAF1B" "BRIP1" "E2F8"
#
# $g2m.genes
# [1] "HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5" "TPX2" "TOP2A" "NDC80" "CKS2" "NUF2" "CKS1B" "MKI67" "TMPO" "CENPF" "TACC3" "FAM64A" "SMC4" "CCNB2" "CKAP2L" "CKAP2" "AURKB"
# [22] "BUB1" "KIF11" "ANP32E" "TUBB4B" "GTSE1" "KIF20B" "HJURP" "CDCA3" "HN1" "CDC20" "TTK" "CDC25C" "KIF2C" "RANGAP1" "NCAPD2" "DLGAP5" "CDCA2" "CDCA8" "ECT2" "KIF23" "HMMR"
# [43] "AURKA" "PSRC1" "ANLN" "LBR" "CKAP5" "CENPE" "CTCF" "NEK2" "G2E3" "GAS2L3" "CBX5" "CENPA"
#
# g2m.genes is character.
# [1] "HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5" "TPX2"
# GSE297431.cnt has 107 columns and each column is numerical! Most likely a count matrix!
# GSE297431.cnt.mat has 107 columns and each column is numerical! Most likely a count matrix!
# GSE297431.cnt.row2col has 108 columns, all of which are numerical except for the first column! Maybe a count matrix!
# Detect possible sample metadata keys: Sample_ID, Type in GSE297431.meta.supp. Maybe metadata/annotation!
# Can not determine if s.genes is metadata/annotation. Load to the global environment, please manually check!
# 'data.frame': 43 obs. of 1 variable:
# $ gene: chr "MCM5" "PCNA" "TYMS" "FEN1" ...
# NULL
# sparse.mat is a sparse matrix. Most likely a count matrix!
# returned value(s)
ls() # list object(s) in the global environment
# [1] "non.standard.list" "s.genes"
names(non.standard.list) # two elements: count matrix and metadata
# [1] "count" "meta"
names(non.standard.list$meta) # list of available metadata
# [1] "GSE297431.meta.supp"
head(non.standard.list$meta$GSE297431.meta.supp)
# Sample_ID Batch Plate Type Growths Class
# 1 Plate1_mut_A1_S70 batch2 Plate1 mutant 3 M3
# 2 Plate1_mut_A11_S135 batch2 Plate1 mutant 3 M3
# 3 Plate1_mut_A12_S141 batch2 Plate1 mutant 1 M1
# 4 Plate1_mut_A2_S77 batch2 Plate1 mutant 1 M1
# 5 Plate1_mut_A3_S84 batch2 Plate1 mutant 2 M2
# 6 Plate1_mut_A4_S91 batch2 Plate1 mutant 2 M2
names(non.standard.list$count) # list of available count matrices
# [1] "GSE297431.cnt" "GSE297431.cnt.mat" "GSE297431.cnt.row2col" "sparse.mat"
non.standard.list$count$GSE297431.cnt[1:5, 1:5] # count matrix
# Plate1_mut_A1_S70 Plate1_mut_A11_S135 Plate1_mut_A12_S141 Plate1_mut_A2_S77 Plate1_mut_A3_S84
# Gnai3 382 201 279 261 8
# Pbsn 0 0 0 0 0
# Cdc45 117 86 56 230 7
# Scml2 268 116 204 105 31
# Apoh 0 0 0 0 0