DownloadMatrices
2023-07-23
Introduction
`scfetch` provides functions for users to download count matrices and annotations (e.g. cell type annotation and composition) from GEO and some single-cell databases (e.g. PanglaoDB and UCSC Cell Browser). `scfetch` also supports loading the downloaded data into Seurat.
The public resources currently supported, and the results returned for each, are:
| Resource | URL | Download type | Returned results |
|---|---|---|---|
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | count matrix | SeuratObject or count matrix for bulk RNA-seq |
| PanglaoDB | https://panglaodb.se/index.html | count matrix | SeuratObject |
| UCSC Cell Browser | https://cells.ucsc.edu/ | count matrix | SeuratObject |
GEO
GEO is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. It provides a convenient way for users to explore and select scRNA-seq datasets of interest.
Extract metadata
`scfetch` provides `ExtractGEOMeta` to extract sample metadata, including sample title, source name/tissue, description, cell type, treatment, paper title, paper abstract, organism, protocol, data processing methods, etc.
# load scfetch
library(scfetch)
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
## Registered S3 method overwritten by 'SeuratDisk':
## method from
## as.sparse.H5Group Seurat
# extract metadata of specified platform
GSE200257.meta <- ExtractGEOMeta(acce = "GSE200257", platform = "GPL24676")
## Found 1 file(s)
## GSE200257_series_matrix.txt.gz
## Rows: 0 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (9): ID_REF, GSM6025648, GSM6025649, GSM6025650, GSM6025651, GSM6025652,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## File stored at:
##
## /var/folders/_4/k4qmvf7s2gx_6789px8n_sxh0000gn/T//RtmpeDiEtS/GPL24676.soft
# set VROOM_CONNECTION_SIZE to avoid error: Error: The size of the connection buffer (786432) was not large enough
Sys.setenv("VROOM_CONNECTION_SIZE"=131072*60)
# extract metadata of all platforms
GSE94820.meta <- ExtractGEOMeta(acce = "GSE94820", platform = NULL)
## Found 2 file(s)
## GSE94820-GPL15520_series_matrix.txt.gz
## Rows: 0 Columns: 651
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (651): ID_REF, GSM2485115, GSM2485116, GSM2485117, GSM2485118, GSM248511...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## File stored at:
## /var/folders/_4/k4qmvf7s2gx_6789px8n_sxh0000gn/T//RtmpeDiEtS/GPL15520.soft
## GSE94820-GPL16791_series_matrix.txt.gz
## Rows: 0 Columns: 1735
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1735): ID_REF, GSM2483594, GSM2483595, GSM2483596, GSM2483597, GSM24835...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## File stored at:
## /var/folders/_4/k4qmvf7s2gx_6789px8n_sxh0000gn/T//RtmpeDiEtS/GPL16791.soft
head(GSE94820.meta)
## Platform title geo_accession source_name_ch1
## 1 GPL15520 AXLSIGLEC6_Bin1_S1_S5 GSM2485115 PBMC
## 2 GPL15520 AXLSIGLEC6_Bin1_S10_S2 GSM2485116 PBMC
## 3 GPL15520 AXLSIGLEC6_Bin1_S11_S3 GSM2485117 PBMC
## 4 GPL15520 AXLSIGLEC6_Bin1_S12_S4 GSM2485118 PBMC
## 5 GPL15520 AXLSIGLEC6_Bin1_S13_S5 GSM2485119 PBMC
## 6 GPL15520 AXLSIGLEC6_Bin1_S14_S6 GSM2485120 PBMC
## description sorted_gate_identity
## 1 paired-end RNA-seq data AXLSIGLEC6_Bin1
## 2 paired-end RNA-seq data AXLSIGLEC6_Bin1
## 3 paired-end RNA-seq data AXLSIGLEC6_Bin1
## 4 paired-end RNA-seq data AXLSIGLEC6_Bin1
## 5 paired-end RNA-seq data AXLSIGLEC6_Bin1
## 6 paired-end RNA-seq data AXLSIGLEC6_Bin1
## Title
## 1 Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes and progenitors
## 2 Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes and progenitors
## 3 Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes and progenitors
## 4 Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes and progenitors
## 5 Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes and progenitors
## 6 Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes and progenitors
## Type Organism
## 1 Expression profiling by high throughput sequencing Homo sapiens
## 2 Expression profiling by high throughput sequencing Homo sapiens
## 3 Expression profiling by high throughput sequencing Homo sapiens
## 4 Expression profiling by high throughput sequencing Homo sapiens
## 5 Expression profiling by high throughput sequencing Homo sapiens
## 6 Expression profiling by high throughput sequencing Homo sapiens
## Abstract
## 1 Peripheral blood mononuclear cells (PBMCs) were isolated from fresh blood using Ficoll-Paque density gradient centrifugation. Single-cell suspensions were stained with different antibody cocktails designed to enrich for particular immune cell populations, which were then single cell sorted in 96-well plates. Single cell RNA-sequencing libraries were subsequently generated for 2422 single cells using Smart-Seq2 (Picelli et al., Nature Methods, 2014). Cells were sequenced at a depth of 1-2M reads/cell.
## 2 Peripheral blood mononuclear cells (PBMCs) were isolated from fresh blood using Ficoll-Paque density gradient centrifugation. Single-cell suspensions were stained with different antibody cocktails designed to enrich for particular immune cell populations, which were then single cell sorted in 96-well plates. Single cell RNA-sequencing libraries were subsequently generated for 2422 single cells using Smart-Seq2 (Picelli et al., Nature Methods, 2014). Cells were sequenced at a depth of 1-2M reads/cell.
## 3 Peripheral blood mononuclear cells (PBMCs) were isolated from fresh blood using Ficoll-Paque density gradient centrifugation. Single-cell suspensions were stained with different antibody cocktails designed to enrich for particular immune cell populations, which were then single cell sorted in 96-well plates. Single cell RNA-sequencing libraries were subsequently generated for 2422 single cells using Smart-Seq2 (Picelli et al., Nature Methods, 2014). Cells were sequenced at a depth of 1-2M reads/cell.
## 4 Peripheral blood mononuclear cells (PBMCs) were isolated from fresh blood using Ficoll-Paque density gradient centrifugation. Single-cell suspensions were stained with different antibody cocktails designed to enrich for particular immune cell populations, which were then single cell sorted in 96-well plates. Single cell RNA-sequencing libraries were subsequently generated for 2422 single cells using Smart-Seq2 (Picelli et al., Nature Methods, 2014). Cells were sequenced at a depth of 1-2M reads/cell.
## 5 Peripheral blood mononuclear cells (PBMCs) were isolated from fresh blood using Ficoll-Paque density gradient centrifugation. Single-cell suspensions were stained with different antibody cocktails designed to enrich for particular immune cell populations, which were then single cell sorted in 96-well plates. Single cell RNA-sequencing libraries were subsequently generated for 2422 single cells using Smart-Seq2 (Picelli et al., Nature Methods, 2014). Cells were sequenced at a depth of 1-2M reads/cell.
## 6 Peripheral blood mononuclear cells (PBMCs) were isolated from fresh blood using Ficoll-Paque density gradient centrifugation. Single-cell suspensions were stained with different antibody cocktails designed to enrich for particular immune cell populations, which were then single cell sorted in 96-well plates. Single cell RNA-sequencing libraries were subsequently generated for 2422 single cells using Smart-Seq2 (Picelli et al., Nature Methods, 2014). Cells were sequenced at a depth of 1-2M reads/cell.
## Design
## 1 The study was divided into two stages: (1) an exploratory phase, where 1140 single human blood dendritic cells and monocytes were profiled and 12 population samples; (2) a deep characterization phase, where an additional 1261 single cells and 9 population samples were profiled as part of follow-up studies. A total of 2422 single cell and population samples were processed using Smart-Seq2 protocol (Picelli et al., Nature Methods, 2014), which allows for the generation of full-length single cell cDNA, and sequencing libraries were generated using Illumina Nextera XT DNA library preparation kit.\n\nPlease note that [1] the raw data have been submitted to dbGaP (which has controlled access mechanisms at: http://www.ncbi.nlm.nih.gov/gap; phs001294.v1.p1) due to potential privacy concerns. Please contact the submitter or dbGaP to request access to controlled-access datasets.\n\n[2] a few samples (38) were profiled but excluded from the processed data file since they were either a bulk sample (21) or excluded due to QC (17). Therefore, there are 1,140 and 1,244 data columns in two processed data files respectively, corresponding to total 2384 samples described in the records.
## 2 The study was divided into two stages: (1) an exploratory phase, where 1140 single human blood dendritic cells and monocytes were profiled and 12 population samples; (2) a deep characterization phase, where an additional 1261 single cells and 9 population samples were profiled as part of follow-up studies. A total of 2422 single cell and population samples were processed using Smart-Seq2 protocol (Picelli et al., Nature Methods, 2014), which allows for the generation of full-length single cell cDNA, and sequencing libraries were generated using Illumina Nextera XT DNA library preparation kit.\n\nPlease note that [1] the raw data have been submitted to dbGaP (which has controlled access mechanisms at: http://www.ncbi.nlm.nih.gov/gap; phs001294.v1.p1) due to potential privacy concerns. Please contact the submitter or dbGaP to request access to controlled-access datasets.\n\n[2] a few samples (38) were profiled but excluded from the processed data file since they were either a bulk sample (21) or excluded due to QC (17). Therefore, there are 1,140 and 1,244 data columns in two processed data files respectively, corresponding to total 2384 samples described in the records.
## 3 The study was divided into two stages: (1) an exploratory phase, where 1140 single human blood dendritic cells and monocytes were profiled and 12 population samples; (2) a deep characterization phase, where an additional 1261 single cells and 9 population samples were profiled as part of follow-up studies. A total of 2422 single cell and population samples were processed using Smart-Seq2 protocol (Picelli et al., Nature Methods, 2014), which allows for the generation of full-length single cell cDNA, and sequencing libraries were generated using Illumina Nextera XT DNA library preparation kit.\n\nPlease note that [1] the raw data have been submitted to dbGaP (which has controlled access mechanisms at: http://www.ncbi.nlm.nih.gov/gap; phs001294.v1.p1) due to potential privacy concerns. Please contact the submitter or dbGaP to request access to controlled-access datasets.\n\n[2] a few samples (38) were profiled but excluded from the processed data file since they were either a bulk sample (21) or excluded due to QC (17). Therefore, there are 1,140 and 1,244 data columns in two processed data files respectively, corresponding to total 2384 samples described in the records.
## 4 The study was divided into two stages: (1) an exploratory phase, where 1140 single human blood dendritic cells and monocytes were profiled and 12 population samples; (2) a deep characterization phase, where an additional 1261 single cells and 9 population samples were profiled as part of follow-up studies. A total of 2422 single cell and population samples were processed using Smart-Seq2 protocol (Picelli et al., Nature Methods, 2014), which allows for the generation of full-length single cell cDNA, and sequencing libraries were generated using Illumina Nextera XT DNA library preparation kit.\n\nPlease note that [1] the raw data have been submitted to dbGaP (which has controlled access mechanisms at: http://www.ncbi.nlm.nih.gov/gap; phs001294.v1.p1) due to potential privacy concerns. Please contact the submitter or dbGaP to request access to controlled-access datasets.\n\n[2] a few samples (38) were profiled but excluded from the processed data file since they were either a bulk sample (21) or excluded due to QC (17). Therefore, there are 1,140 and 1,244 data columns in two processed data files respectively, corresponding to total 2384 samples described in the records.
## 5 The study was divided into two stages: (1) an exploratory phase, where 1140 single human blood dendritic cells and monocytes were profiled and 12 population samples; (2) a deep characterization phase, where an additional 1261 single cells and 9 population samples were profiled as part of follow-up studies. A total of 2422 single cell and population samples were processed using Smart-Seq2 protocol (Picelli et al., Nature Methods, 2014), which allows for the generation of full-length single cell cDNA, and sequencing libraries were generated using Illumina Nextera XT DNA library preparation kit.\n\nPlease note that [1] the raw data have been submitted to dbGaP (which has controlled access mechanisms at: http://www.ncbi.nlm.nih.gov/gap; phs001294.v1.p1) due to potential privacy concerns. Please contact the submitter or dbGaP to request access to controlled-access datasets.\n\n[2] a few samples (38) were profiled but excluded from the processed data file since they were either a bulk sample (21) or excluded due to QC (17). Therefore, there are 1,140 and 1,244 data columns in two processed data files respectively, corresponding to total 2384 samples described in the records.
## 6 The study was divided into two stages: (1) an exploratory phase, where 1140 single human blood dendritic cells and monocytes were profiled and 12 population samples; (2) a deep characterization phase, where an additional 1261 single cells and 9 population samples were profiled as part of follow-up studies. A total of 2422 single cell and population samples were processed using Smart-Seq2 protocol (Picelli et al., Nature Methods, 2014), which allows for the generation of full-length single cell cDNA, and sequencing libraries were generated using Illumina Nextera XT DNA library preparation kit.\n\nPlease note that [1] the raw data have been submitted to dbGaP (which has controlled access mechanisms at: http://www.ncbi.nlm.nih.gov/gap; phs001294.v1.p1) due to potential privacy concerns. Please contact the submitter or dbGaP to request access to controlled-access datasets.\n\n[2] a few samples (38) were profiled but excluded from the processed data file since they were either a bulk sample (21) or excluded due to QC (17). Therefore, there are 1,140 and 1,244 data columns in two processed data files respectively, corresponding to total 2384 samples described in the records.
## SampleCount Molecule
## 1 650 polyA RNA
## 2 650 polyA RNA
## 3 650 polyA RNA
## 4 650 polyA RNA
## 5 650 polyA RNA
## 6 650 polyA RNA
## ExtractProtocol
## 1 Enriched immune cell fractions isolated from healthy blood PBMCs were FACS sorted in 96-well plates (single cell sorted) containing lysis buffer (TCL together with 1% of 2-Mercaptoethanol).. Smart-seq2 (Picelli et al., Nature Methods, 2014). Full-length RNA-sequencing
## 2 Enriched immune cell fractions isolated from healthy blood PBMCs were FACS sorted in 96-well plates (single cell sorted) containing lysis buffer (TCL together with 1% of 2-Mercaptoethanol).. Smart-seq2 (Picelli et al., Nature Methods, 2014). Full-length RNA-sequencing
## 3 Enriched immune cell fractions isolated from healthy blood PBMCs were FACS sorted in 96-well plates (single cell sorted) containing lysis buffer (TCL together with 1% of 2-Mercaptoethanol).. Smart-seq2 (Picelli et al., Nature Methods, 2014). Full-length RNA-sequencing
## 4 Enriched immune cell fractions isolated from healthy blood PBMCs were FACS sorted in 96-well plates (single cell sorted) containing lysis buffer (TCL together with 1% of 2-Mercaptoethanol).. Smart-seq2 (Picelli et al., Nature Methods, 2014). Full-length RNA-sequencing
## 5 Enriched immune cell fractions isolated from healthy blood PBMCs were FACS sorted in 96-well plates (single cell sorted) containing lysis buffer (TCL together with 1% of 2-Mercaptoethanol).. Smart-seq2 (Picelli et al., Nature Methods, 2014). Full-length RNA-sequencing
## 6 Enriched immune cell fractions isolated from healthy blood PBMCs were FACS sorted in 96-well plates (single cell sorted) containing lysis buffer (TCL together with 1% of 2-Mercaptoethanol).. Smart-seq2 (Picelli et al., Nature Methods, 2014). Full-length RNA-sequencing
## LibraryStrategy
## 1 RNA-Seq
## 2 RNA-Seq
## 3 RNA-Seq
## 4 RNA-Seq
## 5 RNA-Seq
## 6 RNA-Seq
## DataProcessing
## 1 Read were aligned to the UCSC hg19 transcriptomee using Bowtie v0.12.7. Expression levels were quantified using RSEM v1.2.1 (TPM values). UCSC genome table browser was used to map UCSC gene ID (kgID) and the gene name (geneSymbol) for all genes in hg19. If multiple UCSC gene IDs are assigned to the same geneSymbol, the TPM values of all kgIDs that share the same geneSymbol are summed. Genome_build: hg19. Supplementary_files_format_and_content: Text file with tab delimiters
## 2 Read were aligned to the UCSC hg19 transcriptomee using Bowtie v0.12.7. Expression levels were quantified using RSEM v1.2.1 (TPM values). UCSC genome table browser was used to map UCSC gene ID (kgID) and the gene name (geneSymbol) for all genes in hg19. If multiple UCSC gene IDs are assigned to the same geneSymbol, the TPM values of all kgIDs that share the same geneSymbol are summed. Genome_build: hg19. Supplementary_files_format_and_content: Text file with tab delimiters
## 3 Read were aligned to the UCSC hg19 transcriptomee using Bowtie v0.12.7. Expression levels were quantified using RSEM v1.2.1 (TPM values). UCSC genome table browser was used to map UCSC gene ID (kgID) and the gene name (geneSymbol) for all genes in hg19. If multiple UCSC gene IDs are assigned to the same geneSymbol, the TPM values of all kgIDs that share the same geneSymbol are summed. Genome_build: hg19. Supplementary_files_format_and_content: Text file with tab delimiters
## 4 Read were aligned to the UCSC hg19 transcriptomee using Bowtie v0.12.7. Expression levels were quantified using RSEM v1.2.1 (TPM values). UCSC genome table browser was used to map UCSC gene ID (kgID) and the gene name (geneSymbol) for all genes in hg19. If multiple UCSC gene IDs are assigned to the same geneSymbol, the TPM values of all kgIDs that share the same geneSymbol are summed. Genome_build: hg19. Supplementary_files_format_and_content: Text file with tab delimiters
## 5 Read were aligned to the UCSC hg19 transcriptomee using Bowtie v0.12.7. Expression levels were quantified using RSEM v1.2.1 (TPM values). UCSC genome table browser was used to map UCSC gene ID (kgID) and the gene name (geneSymbol) for all genes in hg19. If multiple UCSC gene IDs are assigned to the same geneSymbol, the TPM values of all kgIDs that share the same geneSymbol are summed. Genome_build: hg19. Supplementary_files_format_and_content: Text file with tab delimiters
## 6 Read were aligned to the UCSC hg19 transcriptomee using Bowtie v0.12.7. Expression levels were quantified using RSEM v1.2.1 (TPM values). UCSC genome table browser was used to map UCSC gene ID (kgID) and the gene name (geneSymbol) for all genes in hg19. If multiple UCSC gene IDs are assigned to the same geneSymbol, the TPM values of all kgIDs that share the same geneSymbol are summed. Genome_build: hg19. Supplementary_files_format_and_content: Text file with tab delimiters
## SupplementaryFile
## 1 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_DCnMono.discovery.set.submission.txt.gz, ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_deeper.characterization.set.submission.txt.gz
## 2 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_DCnMono.discovery.set.submission.txt.gz, ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_deeper.characterization.set.submission.txt.gz
## 3 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_DCnMono.discovery.set.submission.txt.gz, ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_deeper.characterization.set.submission.txt.gz
## 4 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_DCnMono.discovery.set.submission.txt.gz, ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_deeper.characterization.set.submission.txt.gz
## 5 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_DCnMono.discovery.set.submission.txt.gz, ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_deeper.characterization.set.submission.txt.gz
## 6 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_DCnMono.discovery.set.submission.txt.gz, ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94820/suppl/GSE94820_raw.expMatrix_deeper.characterization.set.submission.txt.gz
## Contact PMID
## 1 Nir,,Hacohen; 28428369, 30405621
## 2 Nir,,Hacohen; 28428369, 30405621
## 3 Nir,,Hacohen; 28428369, 30405621
## 4 Nir,,Hacohen; 28428369, 30405621
## 5 Nir,,Hacohen; 28428369, 30405621
## 6 Nir,,Hacohen; 28428369, 30405621
Download matrix and load to Seurat
After manually checking the extracted metadata, users can download the count matrix and load it into Seurat with `ParseGEO`.
For the count matrix, `ParseGEO` supports both downloading the matrix from supplementary files and extracting it from the `ExpressionSet`. Users can control the source by specifying `down.supp`, or leave the choice to `ParseGEO`: it first extracts the count matrix from the `ExpressionSet`, and if that matrix is NULL or contains non-integer values, it downloads the supplementary files instead. Supplementary files come in two main types: a single count matrix file containing all cells, and CellRanger-style outputs (barcode, matrix, feature/gene). Users are required to choose the type of supplementary files with `supp.type`.
With the count matrix, `ParseGEO` loads the matrix into Seurat automatically. If multiple samples are available, users can choose to merge the SeuratObjects with `merge`.
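The fallback rule described above (use the `ExpressionSet` matrix unless it is NULL or holds non-integer values) can be sketched in base R. This is only an illustration of the rule, not the code `ParseGEO` runs; the helper name `needs_supp_download` is hypothetical.

```r
# Hypothetical helper (not part of scfetch): decide whether a matrix taken
# from an ExpressionSet is unusable as a raw count matrix, so that the
# supplementary files would be needed instead.
needs_supp_download <- function(mat) {
  # No matrix at all, or an empty one: fall back to supplementary files
  if (is.null(mat) || length(mat) == 0) {
    return(TRUE)
  }
  vals <- as.numeric(mat)
  vals <- vals[!is.na(vals)]
  # Raw counts are whole numbers; any fractional value suggests
  # normalized data (e.g. TPM) rather than counts
  any(vals != round(vals))
}

no.matrix <- needs_supp_download(NULL)
raw.counts <- needs_supp_download(matrix(c(0, 3, 12, 7), nrow = 2))
normalized <- needs_supp_download(matrix(c(0.5, 3, 12.2, 7), nrow = 2))
```

Here `no.matrix` and `normalized` are `TRUE` (supplementary files would be needed), while `raw.counts` is `FALSE`.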
# for cellranger output
GSE200257.seu <- ParseGEO(
  acce = "GSE200257", platform = NULL, supp.idx = 1, down.supp = TRUE, supp.type = "10x",
  out.folder = "/Users/soyabean/Desktop/tmp/scdown/dwonload_geo"
)
# for count matrix, no need to specify out.folder, download count matrix to tmp folder
GSE94820.seu <- ParseGEO(acce = "GSE94820", platform = NULL, supp.idx = 1, down.supp = TRUE, supp.type = "count")
The structure of the downloaded 10x matrices:
tree /Users/soyabean/Desktop/tmp/scdown/dwonload_geo
## /Users/soyabean/Desktop/tmp/scdown/dwonload_geo
## ├── GSM6025652_1
## │   ├── barcodes.tsv.gz
## │   ├── features.tsv.gz
## │   └── matrix.mtx.gz
## ├── GSM6025653_2
## │   ├── barcodes.tsv.gz
## │   ├── features.tsv.gz
## │   └── matrix.mtx.gz
## ├── GSM6025654_3
## │   ├── barcodes.tsv.gz
## │   ├── features.tsv.gz
## │   └── matrix.mtx.gz
## └── GSM6025655_4
##     ├── barcodes.tsv.gz
##     ├── features.tsv.gz
##     └── matrix.mtx.gz
##
## 5 directories, 12 files
For bulk RNA-seq, set `data.type = "bulk"` in `ParseGEO`; this returns the count matrix.
PanglaoDB
PanglaoDB is a database that contains scRNA-seq datasets from mouse and human. To date, it contains 5,586,348 cells from 1368 datasets (1063 from Mus musculus and 305 from Homo sapiens). It has well-organized metadata for every dataset, including tissue, protocol, species, number of cells and cell type annotation (computationally identified). Daniel Osorio has developed rPanglaoDB to access PanglaoDB data; the functions of `scfetch` described here are based on rPanglaoDB.
Since PanglaoDB is no longer maintained, `scfetch` has cached all metadata and cell type composition and uses the cached data by default for speed. Users can access the cached data with `PanglaoDBMeta` (all metadata) and `PanglaoDBComposition` (all cell type composition).
Summary attributes
`scfetch` provides `StatDBAttribute` to summarize attributes of PanglaoDB:
# use cached metadata
StatDBAttribute(df = PanglaoDBMeta, filter = c("species", "protocol"), database = "PanglaoDB")
## $species
## Value Num Key
## 1 Mus musculus 1063 species
## 2 Homo sapiens 305 species
##
## $protocol
## Value Num Key
## 1 10x chromium 1046 protocol
## 2 drop-seq 204 protocol
## 3 microwell-seq 74 protocol
## 4 Smart-seq2 26 protocol
## 5 C1 Fluidigm 16 protocol
## 6 CEL-seq 1 protocol
## 7 inDrops 1 protocol
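The summary above is a per-attribute tally of values. A minimal base R sketch of that kind of tally follows, under stated assumptions: `toy.meta` holds made-up rows standing in for `PanglaoDBMeta`, and `summarise_attributes` is a hypothetical helper, not the implementation of `StatDBAttribute`.

```r
# Toy metadata standing in for PanglaoDBMeta (values are made up)
toy.meta <- data.frame(
  species = c("Mus musculus", "Mus musculus", "Homo sapiens"),
  protocol = c("10x chromium", "drop-seq", "10x chromium"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each chosen column, count how often each value
# occurs and return one data frame per column, most frequent value first
summarise_attributes <- function(df, columns) {
  lapply(stats::setNames(columns, columns), function(col) {
    counts <- sort(table(df[[col]]), decreasing = TRUE)
    data.frame(
      Value = names(counts),
      Num = as.integer(counts),
      Key = col,
      stringsAsFactors = FALSE
    )
  })
}

toy.summary <- summarise_attributes(toy.meta, c("species", "protocol"))
toy.summary$species
```

The result is a named list with one element per requested attribute, mirroring the `$species` and `$protocol` layout of the output above.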
Extract metadata
`scfetch` provides `ExtractPanglaoDBMeta` to select datasets of interest by species, protocol, tissue and cell number (the available values of these attributes can be obtained with `StatDBAttribute`). Users can also choose whether to add cell type annotation to every dataset with `show.cell.type`.
`scfetch` uses cached metadata and cell type composition by default; users can change this by setting `local.data = FALSE`.
hsa.meta <- ExtractPanglaoDBMeta(species = "Homo sapiens", protocol = c("Smart-seq2", "10x chromium"), show.cell.type = TRUE, cell.num = c(1000, 2000))
head(hsa.meta)
## SRA SRS Tissue Protocol
## 1 SRA550660 SRS2089635 Peripheral blood mononuclear cells 10x chromium
## 2 SRA550660 SRS2089636 Peripheral blood mononuclear cells 10x chromium
## 3 SRA550660 SRS2089638 Peripheral blood mononuclear cells 10x chromium
## 4 SRA605365 SRS2492922 Nasal airway epithelium 10x chromium
## 5 SRA608611 SRS2517316 Lung progenitors 10x chromium
## 6 SRA608353 SRS2517519 Hepatocellular carcinoma 10x chromium
## Species Cells
## 1 Homo sapiens 1860
## 2 Homo sapiens 1580
## 3 Homo sapiens 1818
## 4 Homo sapiens 1932
## 5 Homo sapiens 1077
## 6 Homo sapiens 1230
## CellType
## 1 Unknown, NK cells
## 2 Unknown, T cells, Plasmacytoid dendritic cells
## 3 Unknown, Gamma delta T cells, Dendritic cells, Plasmacytoid dendritic cells
## 4 Luminal epithelial cells, Basal cells, Keratinocytes, Ependymal cells
## 5 Unknown, Hepatocytes, Basal cells
## 6 Unknown, Hepatocytes, Foveolar cells
## CellNum
## 1 1860
## 2 1580
## 3 1818
## 4 1932
## 5 1077
## 6 1230
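The selection above amounts to a row filter on the metadata table. The sketch below reproduces that idea on made-up data. It assumes `cell.num` is read as a lower and upper bound, which is consistent with the `Cells` values in the output above but is not confirmed by the text; `filter_meta` and `toy.meta` are hypothetical names.

```r
# Made-up metadata rows; the column names mirror the output above
toy.meta <- data.frame(
  SRS = c("S1", "S2", "S3", "S4"),
  Species = c("Homo sapiens", "Homo sapiens", "Mus musculus", "Homo sapiens"),
  Protocol = c("10x chromium", "drop-seq", "10x chromium", "Smart-seq2"),
  Cells = c(1500, 1200, 1800, 5000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows matching species and protocol whose cell
# number lies within [cell.num[1], cell.num[2]] (assumed interpretation)
filter_meta <- function(df, species, protocol, cell.num) {
  keep <- df$Species %in% species &
    df$Protocol %in% protocol &
    df$Cells >= cell.num[1] & df$Cells <= cell.num[2]
  df[keep, , drop = FALSE]
}

toy.hits <- filter_meta(
  toy.meta,
  species = "Homo sapiens",
  protocol = c("Smart-seq2", "10x chromium"),
  cell.num = c(1000, 2000)
)
toy.hits$SRS
```

Only `S1` passes all three conditions: `S2` uses a different protocol, `S3` is mouse, and `S4` has too many cells.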
Extract cell type composition
`scfetch` provides `ExtractPanglaoDBComposition` to extract cell type annotation and composition (cached data are used by default for speed; users can change this by setting `local.data = FALSE`).
hsa.composition <- ExtractPanglaoDBComposition(species = "Homo sapiens", protocol = c("Smart-seq2", "10x chromium"))
head(hsa.composition)
## SRA SRS Tissue Protocol
## 1.1 SRA553822 SRS2119548 Cultured embryonic stem cells 10x chromium
## 1.2 SRA553822 SRS2119548 Cultured embryonic stem cells 10x chromium
## 1.3 SRA553822 SRS2119548 Cultured embryonic stem cells 10x chromium
## 1.4 SRA553822 SRS2119548 Cultured embryonic stem cells 10x chromium
## 1.5 SRA553822 SRS2119548 Cultured embryonic stem cells 10x chromium
## 1.6 SRA553822 SRS2119548 Cultured embryonic stem cells 10x chromium
## Species Cluster Cells Cell Type
## 1.1 Homo sapiens 0 1572 Unknown
## 1.2 Homo sapiens 1 563 Unknown
## 1.3 Homo sapiens 2 280 Unknown
## 1.4 Homo sapiens 3 270 Unknown
## 1.5 Homo sapiens 4 220 Unknown
## 1.6 Homo sapiens 5 192 Unknown
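The composition table is in long format, with one row per cluster. As a possible follow-up step (not something the text says `scfetch` does), it can be collapsed to cell type fractions per dataset in base R. The data below are made up, and the column is named `CellType` here for convenience, whereas the real output uses `Cell Type`.

```r
# Made-up composition rows: one row per cluster, as in the output above
toy.composition <- data.frame(
  SRS = c("S1", "S1", "S1", "S2", "S2"),
  Cluster = c(0, 1, 2, 0, 1),
  Cells = c(600, 300, 100, 50, 150),
  CellType = c("T cells", "Unknown", "T cells", "NK cells", "Unknown"),
  stringsAsFactors = FALSE
)

# Sum cells per dataset and cell type (clusters sharing a label are merged)
per.type <- aggregate(Cells ~ SRS + CellType, data = toy.composition, FUN = sum)

# Divide by the dataset total to get the fraction of each cell type
per.type$Fraction <- per.type$Cells / ave(per.type$Cells, per.type$SRS, FUN = sum)

# Order by dataset, largest cell type first
per.type <- per.type[order(per.type$SRS, -per.type$Cells), ]
per.type
```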
Download matrix and load to Seurat
After manually checking the extracted metadata, users can download the count matrix and load it into Seurat with `ParsePanglaoDB`.
With cell type annotation available, users can filter out datasets that lack a specified cell type with `cell.type`. Users can also include/exclude cells expressing specified genes with `include.gene`/`exclude.gene`.
With the count matrix, `ParsePanglaoDB` loads the matrix into Seurat automatically. If multiple datasets are available, users can choose to merge the SeuratObjects with `merge`.
hsa.seu <- ParsePanglaoDB(hsa.meta, merge = TRUE)
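To make the gene-based cell filtering concrete, here is a sketch on a toy count matrix. It assumes that "expressing" means a count above zero and that a cell is kept if it expresses any of the include genes and none of the exclude genes; `ParsePanglaoDB` may define this differently. `filter_cells` is a hypothetical helper that only borrows the argument names `include.gene` and `exclude.gene`.

```r
# Toy count matrix: genes in rows, cells in columns (values are made up)
toy.counts <- matrix(
  c(3, 0, 1,
    0, 0, 2,
    5, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("CD3E", "MS4A1", "NKG7"), c("cell1", "cell2", "cell3"))
)

# Hypothetical helper: keep cells with a non-zero count for at least one
# include gene and a zero count for every exclude gene
filter_cells <- function(counts, include.gene = NULL, exclude.gene = NULL) {
  keep <- rep(TRUE, ncol(counts))
  if (!is.null(include.gene)) {
    keep <- keep & colSums(counts[include.gene, , drop = FALSE] > 0) > 0
  }
  if (!is.null(exclude.gene)) {
    keep <- keep & colSums(counts[exclude.gene, , drop = FALSE] > 0) == 0
  }
  counts[, keep, drop = FALSE]
}

toy.kept <- filter_cells(toy.counts, include.gene = "CD3E", exclude.gene = "MS4A1")
colnames(toy.kept)
```

`cell1` is the only cell kept: `cell2` does not express CD3E, and `cell3` expresses MS4A1.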
UCSC Cell Browser
The UCSC Cell Browser is a web-based tool that allows scientists to interactively visualize scRNA-seq datasets. It contains 1040 single cell datasets from 17 different species. It is organized hierarchically, which helps users quickly locate the datasets they are interested in.
Show available datasets
`scfetch` provides `ShowCBDatasets` to show all available datasets. Because the number of datasets is large, `ShowCBDatasets` lets users lazily load dataset JSON files from disk instead of downloading them online on every call, which is time-consuming. Lazy loading requires users to provide `json.folder` to save the JSON files and to set `lazy = TRUE`: on the first run, `ShowCBDatasets` downloads the current JSON files to `json.folder`; on later runs with `lazy = TRUE`, it loads the downloaded JSON files from `json.folder`. `ShowCBDatasets` also supports updating the local datasets with `update = TRUE`.
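The lazy-load behaviour described above is a download-once, read-locally cache. The sketch below shows that general pattern in base R; it is not the code behind `ShowCBDatasets`. `load_cached` and `fake_fetch` are hypothetical, and `fake_fetch` returns a made-up string in place of a real download.

```r
# Hypothetical helper illustrating the caching pattern; `fetch` stands in for
# whatever performs the real download and returns the file content as text
load_cached <- function(name, cache.dir, fetch, update = FALSE) {
  dir.create(cache.dir, recursive = TRUE, showWarnings = FALSE)
  cache.file <- file.path(cache.dir, paste0(name, ".json"))
  if (update || !file.exists(cache.file)) {
    # First run (or forced refresh): fetch and store a local copy
    writeLines(fetch(name), cache.file)
  }
  # Every run ends by reading the local copy
  paste(readLines(cache.file, warn = FALSE), collapse = "\n")
}

# A fake fetcher that counts how often it is called
n.fetch <- 0
fake_fetch <- function(name) {
  n.fetch <<- n.fetch + 1
  sprintf('{"name": "%s"}', name)
}

cache.dir <- file.path(tempdir(), "cb_json_demo")
unlink(cache.dir, recursive = TRUE)  # start from an empty cache

first <- load_cached("adult-testis", cache.dir, fake_fetch)   # downloads
second <- load_cached("adult-testis", cache.dir, fake_fetch)  # reads cache
n.fetch
```

After both calls the fetcher has run only once; passing `update = TRUE` would force it to run again and overwrite the cached file.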
# first time run, the json files are stored under json.folder
# ucsc.cb.samples = ShowCBDatasets(lazy = TRUE, json.folder = "/Users/soyabean/Desktop/tmp/scdown/cell_browser/json", update = TRUE)
# second time run, load the downloaded json files
ucsc.cb.samples <- ShowCBDatasets(lazy = TRUE, json.folder = "/Users/soyabean/Desktop/tmp/scdown/cell_browser/json", update = FALSE)
## Lazy mode is on, load downloaded json from /Users/soyabean/Desktop/tmp/scdown/cell_browser/json
head(ucsc.cb.samples)
## name shortLabel
## 1 adult-brain-vasc/am-endothelial Vasculature of Adult Brain
## 2 adult-brain-vasc/am-immune Vasculature of Adult Brain
## 3 adult-brain-vasc/am-perivascular Vasculature of Adult Brain
## 4 adult-brain-vasc/endothelial Vasculature of Adult Brain
## 5 adult-brain-vasc/perivascular Vasculature of Adult Brain
## 6 adult-testis Adult Testis
## subLabel tags body_parts
## 1 Endothelial - Arteriovenous Malformation & Control brain
## 2 Immune - Arteriovenous Malformation & Control brain
## 3 Perivascular - Arteriovenous Malformation & Control brain
## 4 Adult Brain Endothelial brain
## 5 Adult Brain Perivascular brain
## 6 <NA> testis
## diseases organisms projects life_stages domains sources
## 1 Healthy|parent Human (H. sapiens)|parent NA NA
## 2 Healthy|parent Human (H. sapiens)|parent NA NA
## 3 Healthy|parent Human (H. sapiens)|parent NA NA
## 4 Healthy|parent Human (H. sapiens)|parent NA NA
## 5 Healthy|parent Human (H. sapiens)|parent NA NA
## 6 Healthy Human (H. sapiens)
## sampleCount assays matrix barcode feature
## 1 9541 exprMatrix.tsv.gz
## 2 55255 matrix.mtx.gz barcodes.tsv.gz features.tsv.gz
## 3 101317 matrix.mtx.gz barcodes.tsv.gz features.tsv.gz
## 4 5018 exprMatrix.tsv.gz
## 5 49553 exprMatrix.tsv.gz
## 6 6199 exprMatrix.tsv.gz
## matrixType title
## 1 matrix Arteriovenous malformation and control endothelial
## 2 10x Arteriovenous malformation and control immune
## 3 10x Arteriovenous malformation and control perivascular
## 4 matrix Adult Brain Endothelial Cell Types
## 5 matrix Adult Brain Perivascular Cell Types
## 6 matrix The adult human testis transcriptional cell atlas
## paper
## 1
## 2
## 3
## 4
## 5
## 6 https://www.nature.com/articles/s41422-018-0099-2 Guo et al. 2018. Cell Res.
## abstract
## 1 \nArteriovenous malformation and control endothelial cell types coembedded with\ntheir respective cell types: nidus, arterial, venous, and venules.\n
## 2 \nArteriovenous malformation and control immune cell types coembedding of myeloid\nand lymphoid cells.\n
## 3 \nArteriovenous malformation and control perivascular cell types coembedded with\ntheir respective cell types: smooth muscle cells, pericytes, fibroblasts, and\nfibromyocytes.\n
## 4 \nAdult brain endothelial cell types broken down into four broad cell types:\ncapillary, arterial, venous, and venules.\n
## 5 \nAdult brain perivascular cell types broken down into four broad cell\ntypes: smooth muscle cells, pericytes, fibroblasts, and fibromyocytes.\n
## 6 \n<p>\nFrom <a href="https://www.nature.com/articles/s41422-018-0099-2"\ntarget="_blank">Guo et al</a>:\n</p>\n\n<p>\nHuman adult spermatogenesis balances spermatogonial stem cell (SSC)\nself-renewal and differentiation, alongside complex germ cell-niche\ninteractions, to ensure long-term fertility and faithful genome propagation.\nHere, we performed single-cell RNA sequencing of ~6500 testicular cells from\nyoung adults. We found five niche/somatic cell types (Leydig, myoid, Sertoli,\nendothelial, macrophage), and observed germline-niche interactions and key\nhuman-mouse differences. Spermatogenesis, including meiosis, was reconstructed\ncomputationally, revealing sequential coding, non-coding, and repeat-element\ntranscriptional signatures. Interestingly, we identified five discrete\ntranscriptional/developmental spermatogonial states, including a novel early\nSSC state, termed State 0. Epigenetic features and nascent transcription\nanalyses suggested developmental plasticity within spermatogonial States. To\nunderstand the origin of State 0, we profiled testicular cells from infants,\nand identified distinct similarities between adult State 0 and infant SSCs.\nOverall, our datasets describe key transcriptional and epigenetic signatures of\nthe normal adult human testis, and provide new insights into germ cell\ndevelopmental transitions and plasticity.\n</p>\n
## unit coords
## 1 Seurat_umap.coords.tsv.gz
## 2 Seurat_umap.coords.tsv.gz
## 3 Seurat_umap.coords.tsv.gz
## 4 Seurat_umap.coords.tsv.gz
## 5 UMAP.coords.tsv.gz
## 6 umap_hm.coords.tsv.gz
## methods
## 1
## 2
## 3
## 4
## 5
## 6 \n<p>\nDataset was imported from the h5ad file available on the <a href="https://www.covid19cellatlas.org/"\ntarget="_blank"> COVID-19 Cell Atlas website</a> using the UCSC Cell Browser utility\n<code>cbImportScanpy</code>\n</p>\n\n<section>Single cell RNA-seq performance, library preparation and sequencing</section>\n<p>\nscRNA-Seq was performed using the 10× Genomics system. Briefly, each experiment\ncaptured ~1500 single cells, in order to obtain ~0.8% multiplex rate. Cells\nwere diluted following manufacturer recommendations, and mixed with 33.8 µL of\ntotal mixed buffer before being loaded into 10× Chromium Controller using\nChromium Single Cell 3’ v2 reagents. Each sequencing library was prepared\nfollowing the manufacturer’s instructions, with 13 cycles used for cDNA\namplification. Then ~100 ng of cDNA were used for library amplification by 12\ncycles. The resulting libraries were then sequenced on a 26 × 100 cycle\npaired-end run on an Illumina HiSeq 2500 instrument.\n</p>\n\n<section>Process of single cell RNA-seq data</section>\n<p>\nRaw sequencing data were demultiplexed using the mkfastq application (Cell\nRanger v1.2.1). Three types of fastq files were generated: I1 contains 8 bp\nsample index; R1 contains 26 bp (10 bp cell-BC + 16 bp UMI) index and R2\ncontains 100 bp cDNA sequence. Fastq files were then run with the cellranger\ncount application (Cell Ranger v1.2.1) using default settings, to perform\nalignment (using STAR v2.5.4a), filtering and cellular barcode and UMI\ncounting. The UMI count tables of each cellular barcode were used for further\nanalysis.\n</p>\n\n<section>Cell type identification and clustering analysis using Seurat program</section>\n<p>\nThe Seurat program (http://satijalab.org/seurat/, R package, v.2.0.0) was\nfirstly applied for analysis of RNA-Sequencing data. To start with, UMI count\ntables from each replicates and donors were loaded into R using Read10X\nfunction, and Seurat objects were built from each experiment. 
Each experiment\nwas filtered and normalized with default settings. Specifically, cells were\nretained only when they had greater than 500 genes expressed, and less than 20%\nreads mapped to mitochondrial genome. We first ran t-SNE and the clustering\nanalysis for each replicate, which resulted in similar t-SNE map. Next, to\nminimize variation between technical replicates, we normalized and combined\ntechnical replicates from the same donor using the 10× Genomics built-in\napplication from Cell Ranger “cellrange aggr”. Data matrices from different\ndonors were then loaded into R using Seurat. Next, cells were normalized to the\ntotal UMI read count as well as mitochondrial read percentage, as instructed in\nthe manufacturer’s manual (http://satijalab.org/seurat/). Seurat objects\n(matrices from different donors) were then combined using RunCCA function.\nt-SNE and clustering analyses were then performed on the combined dataset using\nthe top 5000 highly variable genes and PCs 1–15, which showed most significant\np-values. Given the low number of Sertoli cells (underrepresented due to size\nfiltering), the initial clustering analysis did not identify them as a separate\ncluster. We performed deeper clustering of somatic cells, identified the\nSertoli cell cluster, and projected it back to the overall clusters, which\nresulted in 13 discrete cell clusters. Correlation of different replicates was\ncalculated based on average expression (normalized UMIs by Seurat) in each\nexperiment.\n</p>\n\n<p>\nSee the source paper <a href="https://www.nature.com/articles/s41422-018-0099-2"\ntarget="_blank">Guo et al. 2018. Cell Res.</a> for more details.\n</p>\n
## geo
## 1
## 2
## 3
## 4
## 5
## 6 GSE120508
# always read online
# ucsc.cb.samples = ShowCBDatasets(lazy = FALSE)
The number of datasets and all available species:
# the number of datasets
nrow(ucsc.cb.samples)
## [1] 1040
# available species
organisms.clean <- gsub(pattern = "\\|parent", replacement = "", x = ucsc.cb.samples$organisms)
unique(unlist(strsplit(x = unique(organisms.clean), split = ", ")))
## [1] "Human (H. sapiens)" "Mouse (M. musculus)"
## [3] "Rhesus macaque (M. mulatta)" "Dog (C. familiaris)"
## [5] "Human (H. Sapiens)" "C. intestinalis"
## [7] "C. robusta" "Zebrafish (D. rerio)"
## [9] "Fruit fly (D. melanogaster)" "Hydra vulgaris"
## [11] "Capitella teleta" "Spongilla lacustris"
## [13] "X. tropicalis" "Chimp (P. troglodytes)"
## [15] "Bonobo (P. paniscus)" "S. mansoni"
## [17] "Sea urchin (S. purpuratus)" "Human-Mouse Xenograft"
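The list above contains both "Human (H. sapiens)" and "Human (H. Sapiens)", which differ only in capitalization. The following is a minimal base R sketch, using a made-up organisms vector rather than the real ucsc.cb.samples table, of how lowercasing before de-duplication collapses such variants:

```r
# Toy stand-in for ucsc.cb.samples$organisms (values are made up for illustration)
organisms <- c(
  "Human (H. sapiens)|parent",
  "Human (H. Sapiens)",
  "Human (H. sapiens), Mouse (M. musculus)|parent"
)

# Drop the "|parent" suffix, then split comma-separated entries
cleaned <- gsub(pattern = "\\|parent", replacement = "", x = organisms)
species <- unlist(strsplit(x = cleaned, split = ", "))

# Case-sensitive unique() keeps both spellings of human
unique(species)

# Lowercasing first collapses variants that differ only in case
species.ci <- unique(tolower(species))
species.ci
```

Whether to normalize case depends on how the values are used afterwards; exact matching against the original strings needs the original capitalization.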
Summary attributes
scfetch
provides StatDBAttribute
to summarize the
attributes of the UCSC Cell
Browser datasets:
StatDBAttribute(df = ucsc.cb.samples, filter = c("organism", "organ"), database = "UCSC")
## $organism
## Value Num Key
## 1 human (h. sapiens) 525 organism
## 2 mouse (m. musculus) 196 organism
## 3 fruit fly (d. melanogaster) 32 organism
## 4 rhesus macaque (m. mulatta) 25 organism
## 5 capitella teleta 18 organism
## 6 hydra vulgaris 18 organism
## 7 spongilla lacustris 18 organism
## 8 zebrafish (d. rerio) 10 organism
## 9 c. intestinalis 9 organism
## 10 chimp (p. troglodytes) 8 organism
## 11 dog (c. familiaris) 6 organism
## 12 c. robusta 5 organism
## 13 bonobo (p. paniscus) 3 organism
## 14 sea urchin (s. purpuratus) 3 organism
## 15 human-mouse xenograft 1 organism
## 16 s. mansoni 1 organism
## 17 x. tropicalis 1 organism
##
## $organ
## Value Num Key
## 1 brain 175 organ
## 2 eye 136 organ
## 3 retina 133 organ
## 4 lung 72 organ
## 5 muscle 44 organ
## 6 blood 42 organ
## 7 heart 36 organ
## 8 skeletal muscle 35 organ
## 9 pancreas 24 organ
## 10 thymus 23 organ
## 11 kidney 22 organ
## 12 immune 21 organ
## 13 bone marrow 20 organ
## 14 skin 20 organ
## 15 gut 19 organ
## 16 liver 19 organ
## 17 whole organism 18 organ
## 18 embryo 17 organ
## 19 ovary 16 organ
## 20 spleen 15 organ
## 21 peripheral blood 13 organ
## 22 all 12 organ
## 23 colon 11 organ
## 24 nasal 11 organ
## 25 tumor 11 organ
## 26 testis 10 organ
## 27 organoid 9 organ
## 28 cortex 8 organ
## 29 fetal 8 organ
## 30 hippocampus 7 organ
## 31 intestine 7 organ
## 32 large intestine 7 organ
## 33 lymph node 7 organ
## 34 small intestine 6 organ
## 35 airway 5 organ
## 36 breast 5 organ
## 37 leptomeningeal metastasis 5 organ
## 38 placenta 5 organ
## 39 respiratory system 5 organ
## 40 spinal cord 5 organ
## 41 stomach 5 organ
## 42 trachea 5 organ
## 43 ureter 5 organ
## 44 balf 4 organ
## 45 bladder 4 organ
## 46 epithelium 4 organ
## 47 esophagus 4 organ
## 48 mammary gland 4 organ
## 49 striatum 4 organ
## 50 tongue 4 organ
## 51 adrenal 3 organ
## 52 cerebellum 3 organ
## 53 early develop. 3 organ
## 54 oral cavity 3 organ
## 55 prostate 3 organ
## 56 salivary gland 3 organ
## 57 cell line 2 organ
## 58 cerebrum 2 organ
## 59 cord blood 2 organ
## 60 decidua 2 organ
## 61 ectoderm 2 organ
## 62 endothelial 2 organ
## 63 epithelial 2 organ
## 64 fat 2 organ
## 65 fetal liver 2 organ
## 66 forebrain 2 organ
## 67 gingiva 2 organ
## 68 ileum 2 organ
## 69 limb 2 organ
## 70 nasopharynx 2 organ
## 71 nose 2 organ
## 72 oesophagus 2 organ
## 73 placenta/decidua 2 organ
## 74 rectum 2 organ
## 75 teeth 2 organ
## 76 antenna 1 organ
## 77 body 1 organ
## 78 body wall 1 organ
## 79 bone 1 organ
## 80 bone marroe 1 organ
## 81 brown adipose tissue 1 organ
## 82 cell culture 1 organ
## 83 cornea 1 organ
## 84 cotex 1 organ
## 85 diaphragm 1 organ
## 86 dorsolateral prefrontal cortex 1 organ
## 87 endometrium 1 organ
## 88 enteric nervous system 1 organ
## 89 epidermis 1 organ
## 90 fatbody 1 organ
## 91 fovea 1 organ
## 92 gallbladder 1 organ
## 93 gonadal adipose tissue 1 organ
## 94 haltere 1 organ
## 95 head 1 organ
## 96 leg 1 organ
## 97 male reproductive glands 1 organ
## 98 malpighian-tubule 1 organ
## 99 mesechymal adipost tissue 1 organ
## 100 nasal mucosa 1 organ
## 101 neocortex 1 organ
## 102 neural crest 1 organ
## 103 oenocyte 1 organ
## 104 peritoneal cavity 1 organ
## 105 proboscis and maxillary palps 1 organ
## 106 stromal 1 organ
## 107 subcutaneous adipose tissue 1 organ
## 108 uterus 1 organ
## 109 vasculature 1 organ
## 110 whole embryo 1 organ
## 111 wing 1 organ
## 112 yolk sac 1 organ
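To make the shape of this summary easier to follow, here is a rough base R sketch of how a Value/Num/Key table could be built from one attribute column. CountAttribute is a hypothetical helper written for this illustration and is not the scfetch implementation; the input vector is made up.

```r
# Hypothetical helper (not part of scfetch): count the values of one attribute column
CountAttribute <- function(x, key) {
  # remove the "|parent" suffix and split comma-separated values
  x <- gsub(pattern = "\\|parent", replacement = "", x = x)
  values <- tolower(trimws(unlist(strsplit(x = x, split = ","))))
  # drop missing and empty entries
  values <- values[!is.na(values) & values != ""]
  counts <- table(values)
  res <- data.frame(
    Value = names(counts),
    Num = as.integer(counts),
    Key = key,
    stringsAsFactors = FALSE
  )
  # most frequent first, ties broken alphabetically
  res <- res[order(-res$Num, res$Value), ]
  rownames(res) <- NULL
  res
}

# Made-up values that mimic the body_parts column
toy.body.parts <- c("brain|parent", "brain, spinal cord", "eye", "Brain", NA)
organ.stat <- CountAttribute(x = toy.body.parts, key = "organ")
organ.stat
```

Because the sketch only lowercases and trims the values, misspelled entries in the source metadata would still be counted as separate values.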
Extract metadata
scfetch
provides ExtractCBDatasets
to
filter the metadata by collection,
sub-collection, organ, disease
status, organism, project and
cell number. The available values of all these attributes
except cell number can be obtained with StatDBAttribute
. All attributes except cell number support fuzzy matching
via fuzzy.match
, which is useful when selecting
datasets.
hbb.sample.df <- ExtractCBDatasets(all.samples.df = ucsc.cb.samples, organ = c("brain", "blood"), organism = "Human (H. sapiens)", cell.num = c(1000, 2000))
## Use all shortLabel as input!
## Use all subLabel as input!
## Use all diseases as input!
## Use all projects as input!
head(hbb.sample.df)
## name
## 1 allen-celltypes/comparative-thalmus/human-lgn
## 2 allen-celltypes/comparative-thalmus/macaque-lgn
## 3 allen-celltypes/comparative-thalmus/mouse-lgd
## 4 lepto-metastasis/patient-d
## shortLabel
## 1 Allen Brain Map: Cell Types Database
## 2 Allen Brain Map: Cell Types Database
## 3 Allen Brain Map: Cell Types Database
## 4 Single-cell atlas of human leptomeningeal metastasis
## subLabel tags
## 1 Human LGN
## 2 Macaque LGN
## 3 Mouse LGd
## 4 Patient D with Lung Primary Tumor
## body_parts
## 1 brain|parent
## 2 brain|parent
## 3 brain|parent
## 4 brain, spinal cord, tumor, leptomeningeal metastasis
## diseases
## 1 Healthy|parent
## 2 Healthy|parent
## 3 Healthy|parent
## 4 Leptomeningeal Melanoma|parent
## organisms
## 1 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)|parent
## 2 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)|parent
## 3 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)|parent
## 4 Human (H. sapiens)|parent
## projects life_stages domains sources sampleCount assays
## 1 Allen Brain Atlas|parent NA NA 1576
## 2 Allen Brain Atlas|parent NA NA 1092
## 3 Allen Brain Atlas|parent NA NA 1996
## 4 NA NA 1682
## matrix barcode feature matrixType
## 1 exprMatrix.tsv.gz matrix
## 2 exprMatrix.tsv.gz matrix
## 3 exprMatrix.tsv.gz matrix
## 4 exprMatrix.tsv.gz matrix
## title paper
## 1 Human Lateral Geniculate Nucleus (LGN)
## 2 Macaque Lateral Geniculate Nucleus (LGN)
## 3 Comparative Thalamus - Mouse Dorsolateral Geniculate Complex (LGd)
## 4 Patient D with Lung Primary Tumor
## abstract
## 1 This dataset covers 1,576 nuclei from human samples.
## 2 This dataset covers 1,092 nuclei from macaque samples.
## 3 This dataset covers 1,996 cells from mouse samples.
## 4 CSF cell fraction isolated from Patient D (primary tumor in lung) with newly diagnosed leptomeningeal metastasis
## unit coords
## 1 UMAP.coords.tsv.gz, ForceAtlas2.coords.tsv.gz, tSNE.coords.tsv.gz
## 2 UMAP.coords.tsv.gz, ForceAtlas2.coords.tsv.gz, tSNE.coords.tsv.gz
## 3 UMAP.coords.tsv.gz, ForceAtlas2.coords.tsv.gz, tSNE.coords.tsv.gz
## 4 UMAP.coords.tsv.gz
## methods
## 1 See the Allen Brain Atlas\n<a href="http://help.brain-map.org/display/celltypes/Documentation" traget="_blank"\n>Documentation</a> for details about how the different aspects of this project\nwere carried out.\n
## 2 See the Allen Brain Atlas\n<a href="http://help.brain-map.org/display/celltypes/Documentation" traget="_blank"\n>Documentation</a> for details about how the different aspects of this project\nwere carried out.\n
## 3 See the Allen Brain Atlas\n<a href="http://help.brain-map.org/display/celltypes/Documentation" traget="_blank"\n>Documentation</a> for details about how the different aspects of this project\nwere carried out.\n
## 4 <section>Human CSF Single-Cell Transcriptomic Analysis</section>\n<p>Single-cell and bulk RNA-sequencing data have been deposited to NCBI GEO as\nGSE150681 SuperSeries. CSF, collected with informed consent from patients under\nprotocol IRB 13-039, was processed to isolate the sample into cell-free CSF and the cellular contents of the CSF: The whole CSF sample was centrifuged at 600 x g for 5 minutes without brake at 4 ºC to pellet the cells, and the supernatant was saved as cellfree CSF. The pellet was resuspended, washed with PBS supplemented with 0.4% BSA twice and processed immediately. The cells were manually counted with a hematocytometer. scRNA-Seq was performed with 10X genomics system using Chromium Single Cell 3' Library and Gel Bead Kit V2 (catalog no. 120234). Briefly,8,700 cells (viability 70-80%) were processed per sample, targeting recovery of ~5,000\ncells with 3.9% multiplet rate. In cases, where cell count was too low to target 5,000\ncells, maximum volume (34 µl) was loaded in the microfluidic droplet generation device.\nAfter reverse transcription reaction emulsions were broken, barcoded cDNA was purified\nwith DynaBeads, followed by 12-cycles of PCR amplification. The resulting amplified\nbarcoded-cDNA library was fragmented to ~400-600 bp, ligated to sequencing adapter\nand PCR-amplified to obtain sufficient amount of material for next-generation\nsequencing. The final libraries were sequenced on an Illumina NovaSeq 6000 system\n(Read 1 - 28 cycles, Index Read 8 cycles, and Read 2 - 96 cycles).\n</p> \n<section>Annotation of cell types</section>\n<p>Cell quality control steps described earlier are quite permissive and sometimes fail to\neliminate all low-quality cells. These cells typically do not express any particular\ngenes and frequently have low library size. 
In our dataset, after careful consideration,\ncluster 10 was identified as low-quality, dying cells and ultimately eliminated.\nMoreover, cluster 5 and 17 exhibited both T cell and myeloid lineage phenotypic\nprofiles and were eliminated from the dataset. To facilitate annotation of remaining 15\nclusters identified by PhenoGraph, we have examined MAGIC imputed gene expression\nof known marker genes (Figure S6A) and MAST derived differentially expressed genes\nbetween PhenoGraph clusters. Markers used to identify major cell types included MS4A1\nand CD79A (B cells), IL3RA and CLEC4C (plasmacytoid DC), EPCAM and KRT18\n(Cancer cells), CD3D and CD8A (CD8 T cells), CD3D, IL7R and CD4 (CD4 T cells),\nGNLY, NKG7, KLRB1 and NCAM1 (NK cells), CD14, CD68 and CST3\n(Macrophages), FCGR3A, LYZ and CST3 (Monocytes), CST3 and FCER1A (conventional DC).</p>\n\n\n
## geo
## 1
## 2
## 3
## 4
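In the result above, the macaque and mouse datasets are returned for organism = "Human (H. sapiens)" because their organisms field lists every species of the parent collection. The sketch below imitates this kind of filtering on a made-up table with base R. MatchAny is a hypothetical helper, and both the case-insensitive substring matching and the inclusive cell number range are assumptions made for illustration, not a description of how ExtractCBDatasets works internally.

```r
# Toy metadata table (made-up rows that mimic the columns shown above)
toy.df <- data.frame(
  name = c("ds-a", "ds-b", "ds-c", "ds-d"),
  body_parts = c("brain|parent", "blood", "eye", "brain, spinal cord"),
  organisms = c(
    "Human (H. sapiens), Mouse (M. musculus)|parent",
    "Human (H. sapiens)",
    "Human (H. sapiens)",
    "Mouse (M. musculus)"
  ),
  sampleCount = c(1576, 5018, 1200, 1682),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where any pattern occurs in x.
# fixed = TRUE treats "(" and "." literally; tolower() makes the match case-insensitive.
MatchAny <- function(x, patterns) {
  Reduce(`|`, lapply(patterns, function(p) {
    grepl(pattern = tolower(p), x = tolower(x), fixed = TRUE)
  }))
}

# Assumed semantics: all conditions must hold, cell number range is inclusive
cell.num <- c(1000, 2000)
keep <- MatchAny(toy.df$body_parts, c("brain", "blood")) &
  MatchAny(toy.df$organisms, "Human (H. sapiens)") &
  toy.df$sampleCount >= cell.num[1] &
  toy.df$sampleCount <= cell.num[2]

toy.df[keep, ]
```

Only ds-a passes all three conditions: ds-b has too many cells, ds-c is from a different organ, and ds-d does not list human among its organisms.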
Extract cell type composition
scfetch
provides ExtractCBComposition
to
extract cell type annotation and composition.
hbb.sample.ct <- ExtractCBComposition(json.folder = "/Users/soyabean/Desktop/tmp/scdown/cell_browser/json", sample.df = hbb.sample.df)
head(hbb.sample.ct)
## shortLabel subLabel CellType Num tags
## 1 Allen Brain Map: Cell Types Database Human LGN LGN Exc BTNL9 908
## 2 Allen Brain Map: Cell Types Database Human LGN Oligo MAG 188
## 3 Allen Brain Map: Cell Types Database Human LGN LGN Exc PRKCG BCHE 102
## 4 Allen Brain Map: Cell Types Database Human LGN LGN Inh CTXN3 96
## 5 Allen Brain Map: Cell Types Database Human LGN LGN Inh LAMP5 72
## 6 Allen Brain Map: Cell Types Database Human LGN LGN Inh NTRK1 42
## body_parts diseases
## 1 brain|parent Healthy|parent
## 2 brain|parent Healthy|parent
## 3 brain|parent Healthy|parent
## 4 brain|parent Healthy|parent
## 5 brain|parent Healthy|parent
## 6 brain|parent Healthy|parent
## organisms
## 1 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)|parent
## 2 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)|parent
## 3 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)|parent
## 4 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)|parent
## 5 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)|parent
## 6 Human (H. sapiens), Mouse (M. musculus), Rhesus macaque (M. mulatta)|parent
## projects life_stages domains sources sampleCount assays
## 1 Allen Brain Atlas|parent NA NA 1576
## 2 Allen Brain Atlas|parent NA NA 1576
## 3 Allen Brain Atlas|parent NA NA 1576
## 4 Allen Brain Atlas|parent NA NA 1576
## 5 Allen Brain Atlas|parent NA NA 1576
## 6 Allen Brain Atlas|parent NA NA 1576
## title paper
## 1 Human Lateral Geniculate Nucleus (LGN)
## 2 Human Lateral Geniculate Nucleus (LGN)
## 3 Human Lateral Geniculate Nucleus (LGN)
## 4 Human Lateral Geniculate Nucleus (LGN)
## 5 Human Lateral Geniculate Nucleus (LGN)
## 6 Human Lateral Geniculate Nucleus (LGN)
## abstract
## 1 This dataset covers 1,576 nuclei from human samples.
## 2 This dataset covers 1,576 nuclei from human samples.
## 3 This dataset covers 1,576 nuclei from human samples.
## 4 This dataset covers 1,576 nuclei from human samples.
## 5 This dataset covers 1,576 nuclei from human samples.
## 6 This dataset covers 1,576 nuclei from human samples.
## methods
## 1 See the Allen Brain Atlas\n<a href="http://help.brain-map.org/display/celltypes/Documentation" traget="_blank"\n>Documentation</a> for details about how the different aspects of this project\nwere carried out.\n
## 2 See the Allen Brain Atlas\n<a href="http://help.brain-map.org/display/celltypes/Documentation" traget="_blank"\n>Documentation</a> for details about how the different aspects of this project\nwere carried out.\n
## 3 See the Allen Brain Atlas\n<a href="http://help.brain-map.org/display/celltypes/Documentation" traget="_blank"\n>Documentation</a> for details about how the different aspects of this project\nwere carried out.\n
## 4 See the Allen Brain Atlas\n<a href="http://help.brain-map.org/display/celltypes/Documentation" traget="_blank"\n>Documentation</a> for details about how the different aspects of this project\nwere carried out.\n
## 5 See the Allen Brain Atlas\n<a href="http://help.brain-map.org/display/celltypes/Documentation" traget="_blank"\n>Documentation</a> for details about how the different aspects of this project\nwere carried out.\n
## 6 See the Allen Brain Atlas\n<a href="http://help.brain-map.org/display/celltypes/Documentation" traget="_blank"\n>Documentation</a> for details about how the different aspects of this project\nwere carried out.\n
## geo
## 1
## 2
## 3
## 4
## 5
## 6
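The CellType and Num columns above amount to a frequency table of per-cell annotations. As a minimal sketch, assuming a made-up vector of cell type labels (the real annotations come from the UCSC Cell Browser, not from this code), such a composition table can be computed in base R as follows:

```r
# Made-up per-cell annotation for a single dataset
cell.types <- c("Exc", "Exc", "Oligo", "Inh", "Exc", "Oligo")

# Count cells per type, most abundant first
counts <- sort(table(cell.types), decreasing = TRUE)

composition <- data.frame(
  CellType = names(counts),
  Num = as.integer(counts),
  # Fraction is an extra column added here for illustration
  Fraction = as.numeric(counts) / length(cell.types),
  stringsAsFactors = FALSE
)
composition
```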
Load the online datasets to Seurat
After manually checking the extracted metadata, users can call
ParseCBDatasets
to load the online count
matrices into Seurat. All the attributes available in
ExtractCBDatasets
are also available here. Please note that
ParseCBDatasets
reads the
count matrix directly from the online source instead of downloading it to local disk. If multiple
datasets are available, users can choose to merge the SeuratObjects with
merge
.
hbb.sample.seu <- ParseCBDatasets(sample.df = hbb.sample.df)
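The merging step can be sketched as follows. This is an illustration on toy objects built from random counts, not on the actual output of ParseCBDatasets; it assumes the individual SeuratObjects are held in a named list, and MakeToySeurat is a hypothetical helper. The block is skipped when Seurat or Matrix is not installed.

```r
if (requireNamespace("Seurat", quietly = TRUE) && requireNamespace("Matrix", quietly = TRUE)) {
  set.seed(1)

  # Hypothetical helper: build a small SeuratObject from random counts
  MakeToySeurat <- function(n.cells, project) {
    counts <- matrix(
      rpois(n = 5 * n.cells, lambda = 2),
      nrow = 5,
      dimnames = list(paste0("gene", 1:5), paste0("cell", seq_len(n.cells)))
    )
    Seurat::CreateSeuratObject(counts = Matrix::Matrix(counts, sparse = TRUE), project = project)
  }

  # Assumption: one SeuratObject per dataset, stored in a named list
  seu.list <- list(a = MakeToySeurat(3, "a"), b = MakeToySeurat(4, "b"))

  # merge() takes the first object as x and the remaining objects as y;
  # add.cell.ids prefixes the cell names so that they stay unique
  merged.seu <- merge(x = seu.list[[1]], y = seu.list[-1], add.cell.ids = names(seu.list))
  ncol(merged.seu)
}
```

Prefixing the cell names with add.cell.ids matters here because both toy objects use the same barcodes (cell1, cell2, ...), which is also common across real datasets.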