Introduction

GEfetch2R provides functions for users to download count matrices and annotations (e.g. cell type annotation and composition) from GEO and some single-cell databases (e.g. PanglaoDB and UCSC Cell Browser). GEfetch2R also supports loading the downloaded data to Seurat.

The public resources currently supported and the values returned for each:

Resources          URL                                Download Type                Returned values
GEO                https://www.ncbi.nlm.nih.gov/geo/  count matrix                 SeuratObject (scRNA-seq) or DESeqDataSet (bulk RNA-seq)
PanglaoDB          https://panglaodb.se/index.html    count matrix and annotation  SeuratObject
UCSC Cell Browser  https://cells.ucsc.edu/            count matrix and annotation  SeuratObject

Check API

Check the availability of APIs used:

CheckAPI(database = c("GEO", "PanglaoDB", "UCSC Cell Browser"))
# start checking APIs to access GEO!
# The API to access the GEO object is OK!
# trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE302nnn/GSE302912/suppl//GSE302912_counts.csv.gz?tool=geoquery'
# Content type 'application/x-gzip' length 332706 bytes (324 KB)
# ==================================================
# downloaded 324 KB
#
# The API to access supplementary files is OK!
# start checking APIs to access PanglaoDB!
# The API to access all available samples is OK!
# Processing given dataset(s): SRA429320, SRS1467249
# The API to access cell type composition is OK!
#   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 00s
# start checking APIs to access UCSC Cell Browser!
# The API to access all available projects is OK!
# The API to access detailed information of a given dataset is OK!
# The API to access the available data of a given dataset is OK!
# The API to access available files is OK!

GEO

GEO is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. It provides a convenient way for users to explore and select scRNA-seq datasets of interest.

Extract metadata (optional)

ExtractGEOMeta provides two ways to extract sample metadata:

  • user-provided sample metadata submitted when uploading to GEO (applicable to all GEO accessions), including sample title, source name/tissue, description, cell type, treatment, paper title, paper abstract, organism, protocol, data processing methods, etc.:
# library
library(tidyverse)
library(GEfetch2R)

# set VROOM_CONNECTION_SIZE to avoid error: Error: The size of the connection buffer (786432) was not large enough
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 60)
# extract metadata
GSE297431.meta <- ExtractGEOMeta(acce = "GSE297431")
GSE297431.meta[1:3, c("title", "geo_accession", "source_name_ch1", "description", "cell type")]
#                        title geo_accession source_name_ch1                       description
# 1  MOS_mutant_M3_replicate_4    GSM8991251          Oocyte   Library name: Plate1_mut_A1_S70
# 2  MOS_mutant_M3_replicate_5    GSM8991252          Oocyte Library name: Plate1_mut_A11_S135
# 3 MOS_mutant_M1_replicate_10    GSM8991253          Oocyte Library name: Plate1_mut_A12_S141
#   cell type
# 1    Oocyte
# 2    Oocyte
# 3    Oocyte
  • metadata in supplementary file:
GSE297431.meta.supp <- ExtractGEOMeta(
  acce = "GSE297431", down.supp = TRUE,
  supp.idx = 2 # specify the index of used supplementary file
)
head(GSE297431.meta.supp)
#             Sample_ID  Batch  Plate   Type Growths Class
# 1   Plate1_mut_A1_S70 batch2 Plate1 mutant       3    M3
# 2 Plate1_mut_A11_S135 batch2 Plate1 mutant       3    M3
# 3 Plate1_mut_A12_S141 batch2 Plate1 mutant       1    M1
# 4   Plate1_mut_A2_S77 batch2 Plate1 mutant       1    M1
# 5   Plate1_mut_A3_S84 batch2 Plate1 mutant       2    M2
# 6   Plate1_mut_A4_S91 batch2 Plate1 mutant       2    M2
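
The two metadata tables above can be linked. In this dataset the description column of the GEO sample metadata appears to carry the library name that the supplementary file uses as Sample_ID. The following is a minimal sketch on toy data frames copied from the printed output above; the prefix-stripping step and the object names geo_meta, supp_meta and linked are assumptions for illustration and are not part of GEfetch2R:

```r
# Toy copies of the first rows printed above (nothing is downloaded from GEO)
geo_meta <- data.frame(
  title = c(
    "MOS_mutant_M3_replicate_4", "MOS_mutant_M3_replicate_5",
    "MOS_mutant_M1_replicate_10"
  ),
  geo_accession = c("GSM8991251", "GSM8991252", "GSM8991253"),
  description = c(
    "Library name: Plate1_mut_A1_S70",
    "Library name: Plate1_mut_A11_S135",
    "Library name: Plate1_mut_A12_S141"
  ),
  stringsAsFactors = FALSE
)
supp_meta <- data.frame(
  Sample_ID = c("Plate1_mut_A1_S70", "Plate1_mut_A11_S135", "Plate1_mut_A12_S141"),
  Batch = "batch2",
  Class = c("M3", "M3", "M1"),
  stringsAsFactors = FALSE
)

# Assumption: the library name is whatever follows the "Library name: " prefix
geo_meta$Sample_ID <- sub("^Library name: ", "", geo_meta$description)

# Join the two tables on the shared library name
linked <- merge(geo_meta, supp_meta, by = "Sample_ID")
linked[, c("geo_accession", "Sample_ID", "Class")]
```

Other GEO series may encode the link between samples and supplementary metadata differently, so the key column should be checked by eye before merging.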

Download matrix and load to Seurat/DESeq2

After manually checking the extracted metadata, users can download the count matrix and load it to Seurat/DESeq2 with ParseGEO.

ParseGEO can obtain the count matrix either by downloading supplementary files or by extracting it from the ExpressionSet. Users can set the source explicitly with down.supp or let ParseGEO detect it automatically: ParseGEO first extracts the count matrix from the ExpressionSet and, if that matrix is NULL or contains non-integer values, downloads the supplementary files instead.
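
The automatic detection rule described above (fall back to supplementary files when the matrix is NULL or contains non-integer values) can be illustrated with a small base R check. This is only a sketch of the idea, not the code ParseGEO runs; the helper name looks_like_counts is hypothetical:

```r
# Hypothetical helper: TRUE when `mat` is non-empty and holds only whole numbers
looks_like_counts <- function(mat) {
  if (is.null(mat) || length(mat) == 0) {
    return(FALSE)
  }
  vals <- as.numeric(as.matrix(mat))
  vals <- vals[!is.na(vals)]
  length(vals) > 0 && all(vals == round(vals))
}

raw <- matrix(c(0, 3, 12, 7), nrow = 2)         # integer-valued counts
norm <- matrix(c(0.5, 1.2, 3.3, 0), nrow = 2)   # normalized values

looks_like_counts(raw)
looks_like_counts(norm)
looks_like_counts(NULL)
```

Under this rule only the first matrix would be accepted as a count matrix; the other two cases correspond to the situations in which supplementary files are downloaded.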

Smart-seq2 scRNA-seq or Bulk RNA-seq

For Smart-seq2 scRNA-seq or bulk RNA-seq, the supplementary files have two formats:

  • a single file (csv(.gz)/tsv(.gz)/txt(.gz)/tab(.gz)/xlsx(.gz)/xls(.gz)) containing the count matrix of all samples, e.g. GSE297431-Smart-seq2
  • a (gzip) archive of csv(.gz)/tsv(.gz)/txt(.gz)/tab(.gz)/xlsx(.gz)/xls(.gz) files containing the per-sample count matrices, e.g. GSE241004
# single file (Smart-seq2)
GSE297431.seu <- ParseGEO(
  acce = "GSE297431",
  supp.idx = 1, # specify the index of used supplementary file
  down.supp = TRUE, supp.type = "count"
)
GSE297431.seu
# An object of class Seurat
# 21564 features across 107 samples within 1 assay
# Active assay: RNA (21564 features, 0 variable features)

# gzip archive of files (Smart-seq2)
GSE241004.seu <- ParseGEO(
  acce = "GSE241004",
  supp.idx = 1, # specify the index of used supplementary file
  down.supp = TRUE, supp.type = "count"
)
GSE241004.seu
# An object of class Seurat
# 61911 features across 30 samples within 1 assay
# Active assay: RNA (61911 features, 0 variable features)

# gzip archive of files (bulk RNA-seq)
GSE223226.dds <- ParseGEO(
  acce = "GSE223226",
  supp.idx = 1, # specify the index of used supplementary file
  down.supp = TRUE, supp.type = "count",
  data.type = "bulk" # set data.type = "bulk", return DESeqDataSet
)
GSE223226.dds
# class: DESeqDataSet
# dim: 32525 20
# metadata(1): version
# assays(1): counts
# rownames(32525): __alignment_not_unique __ambiguous ... ENSDARG00000117826
#   ENSDARG00000117827
# rowData names(0):
# colnames(20): GSM6943634.x GSM6943634.y ... GSM6943643.x GSM6943643.y
# colData names(1): condition

scRNA-seq

For 10x Genomics (or similar) scRNA-seq, the supplementary files have two formats:

  • h5(.gz) file(s) or barcodes.tsv(.gz)/genes.tsv(.gz), matrix.mtx(.gz), features.tsv(.gz) files containing the count matrix, e.g. GSE278892
  • an archive (tar(.gz)) whose files can be in zip, tar(.gz), or h5(.gz) format, or separate files (barcodes.tsv(.gz)/genes.tsv(.gz), matrix.mtx(.gz), features.tsv(.gz)), e.g. GSE271836-h5, GSE292908-separate files, GSE253859-zip, GSE274058-tar.gz

With the count matrix, ParseGEO will load the matrix to Seurat automatically. If multiple samples are available, users can choose to merge the SeuratObjects with merge.

# separate files
GSE278892.seu <- ParseGEO(
  acce = "GSE278892", down.supp = TRUE,
  supp.type = "10xSingle", timeout = 36000,
  out.folder = "~/gefetch2r/doc/download_geo"
)
GSE278892.seu
# An object of class Seurat
# 22548 features across 6079 samples within 1 assay
# Active assay: RNA (22548 features, 0 variable features)

# archive of files
GSE292908.seu <- ParseGEO(
  acce = "GSE292908", down.supp = TRUE,
  supp.type = "10x", timeout = 36000,
  supp.idx = 1, # specify the index of used supplementary file
  out.folder = "~/gefetch2r/doc/download_geo"
)
GSE292908.seu
# An object of class Seurat
# 32285 features across 16062 samples within 1 assay
# Active assay: RNA (32285 features, 0 variable features)

# # The structure of downloaded count matrix for 10x
# tree ~/gefetch2r/doc/download_geo
# ~/gefetch2r/doc/download_geo
# ├── GSE278892
# │   └── GSE278892_BFPRFP
# │       ├── barcodes.tsv.gz
# │       ├── features.tsv.gz
# │       └── matrix.mtx.gz
# └── GSE292908
#     ├── GSM8869153_2151R_CTX
#     │   ├── barcodes.tsv.gz
#     │   ├── features.tsv.gz
#     │   └── matrix.mtx.gz
#     ├── GSM8869154_2151R_CTX_SNDX_ms6352
#     │   ├── barcodes.tsv.gz
#     │   ├── features.tsv.gz
#     │   └── matrix.mtx.gz
#     ├── GSM8869155_2151R_IgG
#     │   ├── barcodes.tsv.gz
#     │   ├── features.tsv.gz
#     │   └── matrix.mtx.gz
#     ├── GSM8869156_2151R_SNDX_ms6352
#     │   ├── barcodes.tsv.gz
#     │   ├── features.tsv.gz
#     │   └── matrix.mtx.gz
#     ├── GSM8869157_T12_CTX
#     │   ├── barcodes.tsv.gz
#     │   ├── features.tsv.gz
#     │   └── matrix.mtx.gz
#     ├── GSM8869158_T12_CTX_SNDX_ms6352
#     │   ├── barcodes.tsv.gz
#     │   ├── features.tsv.gz
#     │   └── matrix.mtx.gz
#     ├── GSM8869159_T12_IgG
#     │   ├── barcodes.tsv.gz
#     │   ├── features.tsv.gz
#     │   └── matrix.mtx.gz
#     └── GSM8869160_T12_SNDX_ms6352
#         ├── barcodes.tsv.gz
#         ├── features.tsv.gz
#         └── matrix.mtx.gz
#
# 11 directories, 27 files
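
Given the folder layout shown above (one sub-folder per sample, each holding barcodes, features and matrix files), it can be useful to confirm that every sample folder is complete before further analysis. The sketch below builds a mock layout in a temporary directory and checks it with base R. The accession GSE000000, the sample names and the helper check_10x_dirs are hypothetical and only serve the illustration:

```r
# Build a mock download folder in a temporary location (no real data involved)
root <- file.path(tempdir(), "download_geo_demo")
samples <- c("GSM0000001_sampleA", "GSM0000002_sampleB")
expected <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
for (s in samples) {
  dir.create(file.path(root, "GSE000000", s), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, "GSE000000", s, expected))
}
# Simulate an incomplete download for the second sample
file.remove(file.path(root, "GSE000000", samples[2], "matrix.mtx.gz"))

# Hypothetical helper: report, per sample folder, whether all expected files exist
check_10x_dirs <- function(acce_dir, files = expected) {
  sample_dirs <- list.dirs(acce_dir, recursive = FALSE)
  complete <- vapply(
    sample_dirs,
    function(d) all(file.exists(file.path(d, files))),
    logical(1)
  )
  data.frame(sample = basename(sample_dirs), complete = unname(complete))
}

status <- check_10x_dirs(file.path(root, "GSE000000"))
status
```

Here the first sample is reported as complete and the second as incomplete, which is the kind of signal that would prompt a re-download.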

PanglaoDB

PanglaoDB is a database of scRNA-seq datasets from mouse and human. It contains 5,586,348 cells from 1368 datasets (1063 from Mus musculus and 305 from Homo sapiens). It has well-organized metadata for every dataset, including tissue, protocol, species, number of cells and cell type annotation (computationally identified). Daniel Osorio developed rPanglaoDB to access PanglaoDB data, and the GEfetch2R functions described here are based on rPanglaoDB.

Since PanglaoDB is no longer maintained, GEfetch2R caches all metadata and cell type composition and uses the cached data by default to speed up access. Users can access the cached data with PanglaoDBMeta (all metadata) and PanglaoDBComposition (all cell type composition).

Given dataset

With an SRA or SRS accession, users can access the cell type composition and count matrix:

# cell type composition
lung.composition <- ExtractPanglaoDBComposition(sra = "SRA570744")
head(lung.composition)
#           SRA        SRS          Tissue     Protocol      Species Cluster Cells
# 2.1 SRA570744 SRS2253536 Lung mesenchyme 10x chromium Mus musculus       0  1606
# 2.2 SRA570744 SRS2253536 Lung mesenchyme 10x chromium Mus musculus       1   874
# 2.3 SRA570744 SRS2253536 Lung mesenchyme 10x chromium Mus musculus       2   762
# 2.4 SRA570744 SRS2253536 Lung mesenchyme 10x chromium Mus musculus       3   627
# 2.5 SRA570744 SRS2253536 Lung mesenchyme 10x chromium Mus musculus       4   400
# 2.6 SRA570744 SRS2253536 Lung mesenchyme 10x chromium Mus musculus       5    74
#               Cell Type
# 2.1         Fibroblasts
# 2.2         Fibroblasts
# 2.3 Smooth muscle cells
# 2.4         Fibroblasts
# 2.5 Smooth muscle cells
# 2.6   Mesothelial cells

# count matrix
lung.seu <- ParsePanglaoDB(sra = "SRA570744", srs = "SRS2253536")
lung.seu
# $SRS2253536
# An object of class Seurat
# 23411 features across 4374 samples within 1 assay
# Active assay: RNA (23411 features, 0 variable features)
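
The composition table lists one row per cluster, so several rows can share a cell type. A short base R sketch of collapsing such a table to one row per cell type follows. It uses a toy copy of only the six rows printed by head() above, so the fractions are relative to those rows rather than to the whole sample; the column name CellType (without a space) is chosen here for convenience:

```r
# Toy copy of the six composition rows printed above
comp <- data.frame(
  Cluster = 0:5,
  Cells = c(1606, 874, 762, 627, 400, 74),
  CellType = c(
    "Fibroblasts", "Fibroblasts", "Smooth muscle cells",
    "Fibroblasts", "Smooth muscle cells", "Mesothelial cells"
  ),
  stringsAsFactors = FALSE
)

# Sum cells per cell type and sort from most to least abundant
by_type <- aggregate(Cells ~ CellType, data = comp, FUN = sum)
by_type <- by_type[order(-by_type$Cells), ]
by_type$Fraction <- round(by_type$Cells / sum(by_type$Cells), 3)
by_type
```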

Filter samples based on metadata

Summary attributes

GEfetch2R provides StatDBAttribute to summarize attributes of PanglaoDB:

# use cached metadata
StatDBAttribute(df = PanglaoDBMeta, filter = c("species", "protocol"), database = "PanglaoDB")
# $species
#          Value  Num     Key
# 1 Mus musculus 1063 species
# 2 Homo sapiens  305 species
#
# $protocol
#           Value  Num      Key
# 1  10x chromium 1046 protocol
# 2      drop-seq  204 protocol
# 3 microwell-seq   74 protocol
# 4    Smart-seq2   26 protocol
# 5   C1 Fluidigm   16 protocol
# 6       CEL-seq    1 protocol
# 7       inDrops    1 protocol
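
To make the shape of this summary concrete, the sketch below reproduces a Value/Num/Key table for each requested column of a small toy metadata frame. It illustrates the kind of tabulation shown above and is not the implementation of StatDBAttribute; the helper count_attribute and the toy data are hypothetical:

```r
# Hypothetical helper: count the values of one column, most frequent first
count_attribute <- function(df, key) {
  tab <- sort(table(df[[key]]), decreasing = TRUE)
  data.frame(
    Value = names(tab), Num = as.integer(tab), Key = key,
    stringsAsFactors = FALSE
  )
}

toy_meta <- data.frame(
  species = c("Mus musculus", "Mus musculus", "Homo sapiens"),
  protocol = c("10x chromium", "drop-seq", "10x chromium"),
  stringsAsFactors = FALSE
)

# One summary table per attribute, returned as a named list
res <- lapply(c(species = "species", protocol = "protocol"), count_attribute, df = toy_meta)
res
```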

Filter metadata

GEfetch2R provides ExtractPanglaoDBMeta to select datasets of interest by species, protocol, tissue and cell number (the available values of these attributes can be obtained with StatDBAttribute). Users can also choose whether to add cell type annotation to every dataset with show.cell.type.

GEfetch2R uses cached metadata and cell type composition by default; users can change this by setting local.data = FALSE.

hsa.meta <- ExtractPanglaoDBMeta(
  species = "Homo sapiens", protocol = c("Smart-seq2", "10x chromium"),
  show.cell.type = TRUE, cell.num = c(1000, 2000)
)
head(hsa.meta)
#         SRA        SRS                             Tissue     Protocol      Species Cells
# 1 SRA550660 SRS2089635 Peripheral blood mononuclear cells 10x chromium Homo sapiens  1860
# 2 SRA550660 SRS2089636 Peripheral blood mononuclear cells 10x chromium Homo sapiens  1580
# 3 SRA550660 SRS2089638 Peripheral blood mononuclear cells 10x chromium Homo sapiens  1818
# 4 SRA605365 SRS2492922            Nasal airway epithelium 10x chromium Homo sapiens  1932
# 5 SRA608611 SRS2517316                   Lung progenitors 10x chromium Homo sapiens  1077
# 6 SRA608353 SRS2517519           Hepatocellular carcinoma 10x chromium Homo sapiens  1230
#                                                                      CellType CellNum
# 1                                                           Unknown, NK cells    1860
# 2                              Unknown, T cells, Plasmacytoid dendritic cells    1580
# 3 Unknown, Gamma delta T cells, Dendritic cells, Plasmacytoid dendritic cells    1818
# 4       Luminal epithelial cells, Basal cells, Keratinocytes, Ependymal cells    1932
# 5                                           Unknown, Hepatocytes, Basal cells    1077
# 6                                        Unknown, Hepatocytes, Foveolar cells    1230

Extract cell type composition

GEfetch2R provides ExtractPanglaoDBComposition to extract cell type annotation and composition. It uses cached data by default to speed up access; users can change this by setting local.data = FALSE.

hsa.composition <- ExtractPanglaoDBComposition(
  meta = hsa.meta
)
head(hsa.composition)
#            SRA        SRS           Tissue     Protocol      Species Cluster Cells   Cell Type
# 13.1 SRA608611 SRS2517316 Lung progenitors 10x chromium Homo sapiens       0   214     Unknown
# 13.2 SRA608611 SRS2517316 Lung progenitors 10x chromium Homo sapiens       1   210 Hepatocytes
# 13.3 SRA608611 SRS2517316 Lung progenitors 10x chromium Homo sapiens       2   175 Basal cells
# 13.4 SRA608611 SRS2517316 Lung progenitors 10x chromium Homo sapiens       3   121 Basal cells
# 13.5 SRA608611 SRS2517316 Lung progenitors 10x chromium Homo sapiens       4    81     Unknown
# 13.6 SRA608611 SRS2517316 Lung progenitors 10x chromium Homo sapiens       5    80     Unknown

Download matrix and load to Seurat

After manually checking the extracted metadata, users can call ParsePanglaoDB to download the count matrix and load it to Seurat. When cell type annotation is available, users can filter out datasets that lack a specified cell type with cell.type. Users can also include/exclude cells expressing specified genes with include.gene/exclude.gene.

With the count matrix, ParsePanglaoDB will load the matrix to Seurat automatically. If multiple datasets are available, users can choose to merge the SeuratObjects with merge.

hsa.seu <- ParsePanglaoDB(hsa.meta[1:3, ], merge = TRUE)
hsa.seu
# An object of class Seurat
# 25917 features across 4996 samples within 1 assay
# Active assay: RNA (25917 features, 0 variable features)

UCSC Cell Browser

The UCSC Cell Browser is a web-based tool that allows scientists to interactively visualize scRNA-seq datasets. It contains 1427 single-cell datasets from 37 different species. It is organized hierarchically, which helps users quickly locate the datasets they are interested in.

Given dataset

With the collection or dataset link(s), users can access the cell type composition and count matrix:

# extract cell type composition
ut.sample.ct <- ExtractCBComposition(link = c(
  "https://cells.ucsc.edu/?ds=adult-ureter", # collection
  "https://cells.ucsc.edu/?ds=adult-testis" # dataset
))
ut.sample.ct[1:5, c("title", "CellType", "Num")]
#                                               title                    CellType  Num
# 1 The adult human testis transcriptional cell atlas                       Sperm 1885
# 2 The adult human testis transcriptional cell atlas        Elongated Spermatids  775
# 3 The adult human testis transcriptional cell atlas                Leydig cells  605
# 4 The adult human testis transcriptional cell atlas Early Primary Spermatocytes  557
# 5 The adult human testis transcriptional cell atlas            Round Spermatids  444

# extract count matrix and load to Seurat
ut.seu <- ParseCBDatasets(link = c(
  "https://cells.ucsc.edu/?ds=adult-ureter", # collection
  "https://cells.ucsc.edu/?ds=adult-testis" # dataset
), merge = TRUE)
ut.seu
# An object of class Seurat
# 43370 features across 79053 samples within 1 assay
# Active assay: RNA (43370 features, 0 variable features)

Filter samples based on metadata

Show available datasets

GEfetch2R provides ShowCBDatasets to show all available datasets. Because the number of datasets is large, reading the dataset json files online on every call is time-consuming, so ShowCBDatasets can lazily load them from a local copy instead. To use lazy loading, provide json.folder to store the json files and set lazy = TRUE: on the first run, ShowCBDatasets downloads the current json files to json.folder; on later runs with lazy = TRUE, it loads the downloaded json files from json.folder. ShowCBDatasets also supports updating the local json files with update = TRUE.

# first time run, the json files are stored under json.folder
# ucsc.cb.samples = ShowCBDatasets(lazy = TRUE, json.folder = "~/gefetch2r/doc/cell_browser/json", update = TRUE)

# second time run, load the downloaded json files
ucsc.cb.samples <- ShowCBDatasets(lazy = TRUE, json.folder = "~/gefetch2r/doc/cell_browser/json", update = FALSE)

# always read online
# ucsc.cb.samples = ShowCBDatasets(lazy = FALSE)
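
The lazy-load behaviour described above follows a common cache-or-download pattern: fetch once, store locally, reuse the local copy, and refresh only on request. The following base R sketch shows that pattern in isolation. It is not how ShowCBDatasets is implemented; load_cached and fake_fetch are hypothetical names, and an RDS file stands in for the json files:

```r
# Hypothetical helper illustrating the cache-or-download logic
load_cached <- function(cache_file, fetch, update = FALSE) {
  if (update || !file.exists(cache_file)) {
    dir.create(dirname(cache_file), recursive = TRUE, showWarnings = FALSE)
    saveRDS(fetch(), cache_file) # first run or forced refresh
  }
  readRDS(cache_file) # later runs read the local copy
}

# Stand-in for a slow online request; counts how often it is called
n_calls <- 0
fake_fetch <- function() {
  n_calls <<- n_calls + 1
  list(name = "demo-dataset", sampleCount = 100)
}

cache_file <- file.path(tempfile("cb_json_"), "demo.rds")
first <- load_cached(cache_file, fake_fetch)                # fetches and stores
second <- load_cached(cache_file, fake_fetch)               # reads the stored copy
third <- load_cached(cache_file, fake_fetch, update = TRUE) # refreshes the copy
n_calls
```

The fetch function runs twice here: once on the first call and once for the forced update. The second call is served entirely from disk.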

The number of datasets and all available species:

# the number of datasets
nrow(ucsc.cb.samples)
# 1427

# available species
unlist(sapply(unique(gsub(pattern = "\\|parent", replacement = "", x = ucsc.cb.samples$organisms)), function(x) {
  unlist(strsplit(x = x, split = ", "))
})) %>%
  tolower() %>%
  unique()
#  [1] "human (h. sapiens)"                      "mouse (m. musculus)"
#  [3] "rhesus macaque (m. mulatta)"             "chimp (p. troglodytes)"
#  [5] "brine shrimp (a. franciscana)"           "canis lupus familiaris"
#  [7] "dog (c. familiaris)"                     "canis familiaris"
#  [9] "sea squirt (c. intestinalis)"            "sea squirt (c. robusta)"
# [11] "macaque (m. fascicularis)"               "rabbit (o. cuniculus)"
# [13] "rat (r. norvegicus)"                     "pig (s. scrofa)"
# [15] "ferret (m. putorius furo)"               "opossum (m. domestica)"
# [17] "sugar glider (p. breviceps)"             "zebrafish (d. rerio)"
# [19] "fruit fly (d. melanogaster)"             "horse (e. caballus)"
# [21] "freshwater hydra (h. vulgaris)"          "capitellid worm (c. teleta)"
# [23] "freshwater sponge (s. lacustris)"        "colonial hydroid (h. symbiolongicarpus)"
# [25] "western clawed frog (x. tropicalis)"     "marmoset (c. jacchus)"
# [27] "mosquito (a. aegypti)"                   "pacific oyster (crassostrea gigas)"
# [29] "freshwater worm (p. leidyi)"             "bonobo (p. paniscus)"
# [31] "blood fluke (s. mansoni)"                "starlet sea anemone (n. vectensis)"
# [33] "nematostella vectensis"                  "sea urchin (s. purpuratus)"
# [35] "mouse lemur (t. microcebus)"             "human-mouse xenograft"
# [37] "african clawed frog (x. laevis)"
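
The species extraction above chains several string operations. The step-by-step sketch below applies the same operations to three toy values shaped like the organisms column, using only base R; the toy values are made up for illustration:

```r
# Toy values in the same shape as the `organisms` column
organisms <- c(
  "Human (H. sapiens)|parent",
  "Human (H. sapiens), Mouse (M. musculus)",
  "mouse (m. musculus)"
)

# 1. drop the "|parent" suffix
cleaned <- gsub(pattern = "\\|parent", replacement = "", x = organisms)
# 2. split comma-separated entries into one element per organism
pieces <- unlist(strsplit(cleaned, split = ", "))
# 3. lower-case and de-duplicate
species <- unique(tolower(pieces))
species
```

This de-duplication is purely textual, so differently worded labels for the same organism, such as "dog (c. familiaris)" and "canis familiaris" in the output above, remain separate entries.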

Summary attributes

GEfetch2R provides StatDBAttribute to summarize attributes of UCSC Cell Browser:

StatDBAttribute(
  df = ucsc.cb.samples, filter = c("organism", "organ"),
  database = "UCSC", combine = TRUE
) %>% head()
# # A tibble: 6 × 3
# # Groups:   organisms [2]
#   organisms           body_parts   Num
#   <chr>               <chr>      <int>
# 1 human (h. sapiens)  eye          350
# 2 human (h. sapiens)  retina       348
# 3 human (h. sapiens)  brain        198
# 4 mouse (m. musculus) brain        107
# 5 human (h. sapiens)  lung          72
# 6 human (h. sapiens)  cortex        50

Filter metadata

GEfetch2R provides ExtractCBDatasets to filter metadata by collection, sub-collection, organ, disease status, organism, project and cell number (the available values of these attributes, except cell number, can be obtained with StatDBAttribute). All attributes except cell number support fuzzy matching with fuzzy.match, which is useful when selecting datasets.

hbb.sample.df <- ExtractCBDatasets(
  all.samples.df = ucsc.cb.samples, organ = c("skeletal muscle"),
  organism = "Human (H. sapiens)", cell.num = c(1000, 2000)
)
hbb.sample.df[1:5, c("title", "body_parts", "diseases", "organisms", "sampleCount")]
#                                title                     body_parts       diseases
# 1 Embryonic Week 7-8 Myogenic Subset muscle, skeletal muscle|parent Healthy|parent
# 2   Fetal Week 12-14 Hindlimb Muscle muscle, skeletal muscle|parent Healthy|parent
# 3         HX Protocol Week 4 Culture muscle, skeletal muscle|parent Healthy|parent
# 4 HX Protocol Week 6 Myogenic Subset muscle, skeletal muscle|parent Healthy|parent
# 5 HX Protocol Week 8 Myogenic Subset muscle, skeletal muscle|parent Healthy|parent
#                   organisms sampleCount
# 1 Human (H. sapiens)|parent        1448
# 2 Human (H. sapiens)|parent        1545
# 3 Human (H. sapiens)|parent        1562
# 4 Human (H. sapiens)|parent        1598
# 5 Human (H. sapiens)|parent        1350
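
In the output above, the query organ = "skeletal muscle" matched the value "muscle, skeletal muscle|parent", and all returned datasets have between 1000 and 2000 cells. One way to picture such a filter is a case-insensitive substring match combined with an inclusive range test. The sketch below works on toy data under exactly those two assumptions; the actual matching rules of ExtractCBDatasets are not documented here, and filter_toy is a hypothetical helper:

```r
toy <- data.frame(
  title = c("Dataset A", "Dataset B", "Dataset C"),
  body_parts = c("muscle, skeletal muscle|parent", "brain", "Skeletal Muscle"),
  sampleCount = c(1448, 1500, 5200),
  stringsAsFactors = FALSE
)

# Hypothetical helper. Assumptions: substring matching ignores case,
# and cell.num is an inclusive c(min, max) range
filter_toy <- function(df, organ, cell.num) {
  organ_hit <- grepl(tolower(organ), tolower(df$body_parts), fixed = TRUE)
  size_hit <- df$sampleCount >= cell.num[1] & df$sampleCount <= cell.num[2]
  df[organ_hit & size_hit, , drop = FALSE]
}

hits <- filter_toy(toy, organ = "skeletal muscle", cell.num = c(1000, 2000))
hits$title
```

Only Dataset A passes both conditions: Dataset B fails the organ match and Dataset C exceeds the cell-number range.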

Extract cell type composition

GEfetch2R provides ExtractCBComposition to extract cell type annotation and composition.

hbb.sample.ct <- ExtractCBComposition(
  json.folder = "~/gefetch2r/doc/cell_browser/json",
  meta = hbb.sample.df
)
hbb.sample.ct[1:5, c("title", "CellType", "Num")]
#                                title  CellType Num
# 1 Embryonic Week 7-8 Myogenic Subset        MP 785
# 2 Embryonic Week 7-8 Myogenic Subset        MB 303
# 3 Embryonic Week 7-8 Myogenic Subset SkM.Mesen 264
# 4 Embryonic Week 7-8 Myogenic Subset        MC  96
# 5   Fetal Week 12-14 Hindlimb Muscle       MSC 822
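
Cell type counts are often easier to compare as percentages of each dataset. The sketch below computes them with base R on a toy copy of the four rows printed above for the first dataset. Their counts sum to 1448, which matches the sampleCount reported for that dataset in the filtered metadata:

```r
# Toy copy of the rows printed above for one dataset
ct <- data.frame(
  title = rep("Embryonic Week 7-8 Myogenic Subset", 4),
  CellType = c("MP", "MB", "SkM.Mesen", "MC"),
  Num = c(785, 303, 264, 96),
  stringsAsFactors = FALSE
)

# Total cells per dataset, repeated on every row of that dataset
ct$Total <- ave(ct$Num, ct$title, FUN = sum)
# Share of each cell type within its dataset, in percent
ct$Percent <- round(100 * ct$Num / ct$Total, 1)
ct[, c("CellType", "Num", "Total", "Percent")]
```

Because ave groups by title, the same two lines work unchanged on a table that contains several datasets.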

Load the online datasets to Seurat

After manually checking the extracted metadata, users can call ParseCBDatasets to load the online count matrix to Seurat. All the attributes available in ExtractCBDatasets can also be used here. Note that ParseCBDatasets reads the count matrix directly from the online source instead of downloading it to a local file. If multiple datasets are available, users can choose to merge the SeuratObjects with merge.

ParseCBDatasets supports extracting subset with metadata and gene:

# parse the whole datasets
hbb.sample.seu <- ParseCBDatasets(meta = hbb.sample.df)
# subset metadata and gene
hbb.sample.seu <- ParseCBDatasets(
  meta = hbb.sample.df, obs.value.filter = "Cell.Type == 'MP' & Phase == 'G2M'",
  include.genes = c(
    "PAX7", "MYF5", "C1QTNF3", "MYOD1", "MYOG", "RASSF4", "MYH3", "MYL4",
    "TNNT3", "PDGFRA", "OGN", "COL3A1"
  )
)
hbb.sample.seu
# An object of class Seurat
# 14 features across 5684 samples within 1 assay
# Active assay: RNA (14 features, 0 variable features)
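
The obs.value.filter argument above is a string containing a logical expression over cell-level metadata columns. How ParseCBDatasets evaluates it internally is not shown in this document; the sketch below only illustrates one way such a string can be interpreted in base R, on a toy metadata table whose column names mirror those in the filter:

```r
# Toy per-cell metadata; column names mirror those used in the filter above
obs <- data.frame(
  cell = paste0("cell", 1:5),
  Cell.Type = c("MP", "MP", "MB", "MP", "MC"),
  Phase = c("G2M", "G1", "G2M", "G2M", "S"),
  stringsAsFactors = FALSE
)
filter_string <- "Cell.Type == 'MP' & Phase == 'G2M'"

# Parse the string into an R expression and evaluate it with the columns in scope
keep <- eval(parse(text = filter_string), envir = obs)
obs$cell[keep]
```

Only cells that satisfy both conditions are kept. Since eval(parse()) runs arbitrary code, a filter string of this kind should come from a trusted source.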