Introduction
A common situation is that a single software version (e.g. CellRanger) must be used to generate the count matrices so that multiple datasets can be integrated and compared. Here, we use GEfetch2R to download raw data (sra/fastq/bam). For bam files, GEfetch2R also provides a function to convert bam files to fastq files.
GEfetch2R supports downloading raw data (sra/fastq/bam) from SRA and ENA with GEO accessions. In general, downloading raw data from ENA is much faster than from SRA because of ascp and parallel support.
Download sra
Extract all samples (runs)
For fastq files stored in SRA/ENA, GEfetch2R can extract sample information and run numbers from GEO accessions; alternatively, users can provide a dataframe containing the run numbers of the samples of interest.
Extract all samples under GSE130636 that were sequenced on platform GPL20301 (use platform = NULL for all platforms):
# library
library(GEfetch2R)
GSE130636.runs <- ExtractRun(acce = "GSE130636", platform = "GPL20301")
# a small test
GSE130636.runs <- GSE130636.runs[GSE130636.runs$run %in% c("SRR9004346", "SRR9004351"), ]
Show the sample information:
head(GSE130636.runs)
#> run experiment gsm_name title geo_accession
#> SRR9004346 SRR9004346 SRX5783052 GSM3745993 Fovea donor 2 GSM3745993
#> SRR9004351 SRR9004351 SRX5783052 GSM3745993 Fovea donor 2 GSM3745993
#> status submission_date last_update_date type
#> SRR9004346 Public on May 17 2019 May 02 2019 Dec 20 2019 SRA
#> SRR9004351 Public on May 17 2019 May 02 2019 Dec 20 2019 SRA
#> channel_count source_name_ch1 organism_ch1 taxid_ch1
#> SRR9004346 1 Retina Homo sapiens 9606
#> SRR9004351 1 Retina Homo sapiens 9606
#> characteristics_ch1 characteristics_ch1.1 molecule_ch1
#> SRR9004346 location: Fovea donor: Donor 2 total RNA
#> SRR9004351 location: Fovea donor: Donor 2 total RNA
#> extract_protocol_ch1
#> SRR9004346 A 2-mm foveal centered and a 4-mm peripheral punch from the inferotemporal region were acquired from three clinically normal human donors. Tissue was dissociated using the Papain Dissociation System (Worthington Biochemical Corporation, Lakewood NJ). Dissociated cells were resuspended in DMSO-based Recovery Cell Culture Freezing Media (Life Technologies Corporation, Grand Island NY). Suspensions were placed in a Cryo-Safe cooler (CryoSafe, Summerville SC) to cool at 1°C/minute in a -80°C freezer for 3-8 hours before storage in liquid nitrogen.
#> SRR9004351 A 2-mm foveal centered and a 4-mm peripheral punch from the inferotemporal region were acquired from three clinically normal human donors. Tissue was dissociated using the Papain Dissociation System (Worthington Biochemical Corporation, Lakewood NJ). Dissociated cells were resuspended in DMSO-based Recovery Cell Culture Freezing Media (Life Technologies Corporation, Grand Island NY). Suspensions were placed in a Cryo-Safe cooler (CryoSafe, Summerville SC) to cool at 1°C/minute in a -80°C freezer for 3-8 hours before storage in liquid nitrogen.
#> extract_protocol_ch1.1
#> SRR9004346 Single-cell RNA libraries were prepared for sequencing using standard 10X genomics protocols. Briefly, cryopreserved samples were thawed, and single cells were captured and barcoded using the Chromium System with the v3 single cell-reagent kit (10x Genomics, Pleasanton CA). Sequencing was performed on pooled libraries using the Illumina HiSeq 4000 platform (San Diego, CA), generating 150 base pair paired-end reads.
#> SRR9004351 Single-cell RNA libraries were prepared for sequencing using standard 10X genomics protocols. Briefly, cryopreserved samples were thawed, and single cells were captured and barcoded using the Chromium System with the v3 single cell-reagent kit (10x Genomics, Pleasanton CA). Sequencing was performed on pooled libraries using the Illumina HiSeq 4000 platform (San Diego, CA), generating 150 base pair paired-end reads.
#> data_processing
#> SRR9004346 FASTQ files were generated from the raw BCL files using Illumina’s bcl2fastq conversion program.
#> SRR9004351 FASTQ files were generated from the raw BCL files using Illumina’s bcl2fastq conversion program.
#> data_processing.1
#> SRR9004346 Sequenced reads were mapped to the CellRanger human genome build hg19 (v3.0.0) with CellRanger (v3.0.1) using the CellRanger default human GTF and the following parameter: --expect-cells=8000.
#> SRR9004351 Sequenced reads were mapped to the CellRanger human genome build hg19 (v3.0.0) with CellRanger (v3.0.1) using the CellRanger default human GTF and the following parameter: --expect-cells=8000.
#> data_processing.2
#> SRR9004346 The six samples were collectively aggregated with the cellranger aggr function with the following parameter: --normalized=mapped.
#> SRR9004351 The six samples were collectively aggregated with the cellranger aggr function with the following parameter: --normalized=mapped.
#> data_processing = Cells were filtered with Seurat (v2.3.4) FilterCells function. Cells with nUMIs less than 200 (to remove cells with poor read quality) or greater than 2500 (to remove cells likely to be doublets) were removed. Cells with greater tha ...
#> SRR9004346 TRUE).
#> SRR9004351 TRUE).
#> data_processing = Aggregated reads were normalized with Seurat (v2.3.4) with the following command: NormalizeData(object = seurat_object, normalization.method = "LogNormalize", scale.factor = 10000). Variable genes were identified from downstream no ...
#> SRR9004346 c("nUMI", "percent.mito")).
#> SRR9004351 c("nUMI", "percent.mito")).
#> data_processing.3
#> SRR9004346 Clustering was performed with Seurat (v2.3.4) FindClusters. In order to generate the shared nearest neighbor (SNN) graph, the principal component analysis reduction technique was utilized for the first 10 principal components. A granularity resolution value of 0.6 was used to discriminate clusters.
#> SRR9004351 Clustering was performed with Seurat (v2.3.4) FindClusters. In order to generate the shared nearest neighbor (SNN) graph, the principal component analysis reduction technique was utilized for the first 10 principal components. A granularity resolution value of 0.6 was used to discriminate clusters.
#> data_processing.4
#> SRR9004346 Genome_build: hg19
#> SRR9004351 Genome_build: hg19
#> data_processing.5
#> SRR9004346 Supplementary_files_format_and_content: Processed expression data matrix files are provided in tab-delimited format. Log-normalized expression values (from the seurat_object@data slot) were appended to relevant metadata (barcode and cluster label from the manuscript). Each row represents a unique cell, and columns correspond to metadata and log normalized gene expression values.
#> SRR9004351 Supplementary_files_format_and_content: Processed expression data matrix files are provided in tab-delimited format. Log-normalized expression values (from the seurat_object@data slot) were appended to relevant metadata (barcode and cluster label from the manuscript). Each row represents a unique cell, and columns correspond to metadata and log normalized gene expression values.
#> platform_id contact_name contact_email contact_institute
#> SRR9004346 GPL20301 Todd,,Scheetz todd-scheetz@uiowa.edu UNIVERSITY OF IOWA
#> SRR9004351 GPL20301 Todd,,Scheetz todd-scheetz@uiowa.edu UNIVERSITY OF IOWA
#> contact_address contact_city contact_state contact_zip/postal_code
#> SRR9004346 3181B MERF Iowa City IA 52242
#> SRR9004351 3181B MERF Iowa City IA 52242
#> contact_country instrument_model library_selection library_source
#> SRR9004346 USA Illumina HiSeq 4000 cDNA TRANSCRIPTOMIC
#> SRR9004351 USA Illumina HiSeq 4000 cDNA TRANSCRIPTOMIC
#> library_strategy relation
#> SRR9004346 RNA-Seq Reanalyzed by: GSE142449
#> SRR9004351 RNA-Seq Reanalyzed by: GSE142449
#> relation.1
#> SRR9004346 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN11566805
#> SRR9004351 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN11566805
#> relation.2
#> SRR9004346 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX5783052
#> SRR9004351 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX5783052
#> supplementary_file_1
#> SRR9004346 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3745nnn/GSM3745993/suppl/GSM3745993_fovea_donor_2_expression.tsv.gz
#> SRR9004351 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3745nnn/GSM3745993/suppl/GSM3745993_fovea_donor_2_expression.tsv.gz
#> series_id data_row_count library_layout taxon_id
#> SRR9004346 GSE130636 0 PAIRED 9606
#> SRR9004351 GSE130636 0 PAIRED 9606
#> ebi_dir ncbi_dir
#> SRR9004346 SRR900/006/SRR9004346 SRR900/006/SRR9004346
#> SRR9004351 SRR900/001/SRR9004351 SRR900/001/SRR9004351
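Because one GEO sample can map to several runs (both runs above belong to GSM3745993), it can help to group the run table by sample before downloading. The following is a minimal sketch in base R that assumes only the run and gsm_name columns shown in the output above; the small dataframe is a stand-in for the real ExtractRun result.

```r
# Stand-in for the ExtractRun() result; only the two columns used below
runs <- data.frame(
  run = c("SRR9004346", "SRR9004351"),
  gsm_name = c("GSM3745993", "GSM3745993"),
  stringsAsFactors = FALSE
)

# Run accessions belonging to each GEO sample
runs.by.gsm <- split(runs$run, runs$gsm_name)

# Number of runs per GEO sample
runs.per.gsm <- lengths(runs.by.gsm)

print(runs.by.gsm)
print(runs.per.gsm)
```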
Download sra
Given a dataframe containing gsm and run numbers, GEfetch2R uses prefetch to download sra files from SRA, or ascp/download.file to download sra files from ENA. The returned value is a dataframe containing the failed runs. If it is not NULL, users can re-run DownloadSRA with gsm.df set to the returned value.
Download from SRA:
# download
GSE130636.down <- DownloadSRA(
gsm.df = GSE130636.runs,
prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
out.folder = "/Volumes/soyabean/GEfetch2R/download_fastq"
)
# GSE130636.down is NULL or a dataframe containing the failed runs
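The re-run pattern described above (pass the returned failed runs back in as gsm.df) can be wrapped in a small loop. This is a sketch only: RetryFailed is a hypothetical helper, not part of GEfetch2R, and it assumes the download function accepts gsm.df and returns NULL or a dataframe of failed runs, as stated above.

```r
# Hypothetical helper (not part of GEfetch2R): call a download function
# repeatedly, feeding the returned failed runs back in as gsm.df.
RetryFailed <- function(download.fun, gsm.df, max.tries = 3, ...) {
  failed <- gsm.df
  for (i in seq_len(max.tries)) {
    failed <- download.fun(gsm.df = failed, ...)
    # NULL (or an empty dataframe) means nothing is left to retry
    if (is.null(failed) || nrow(failed) == 0) {
      return(NULL)
    }
    message("Try ", i, ": ", nrow(failed), " run(s) failed")
  }
  # Runs that still failed after max.tries attempts
  failed
}

# Usage sketch (paths are placeholders):
# GSE130636.down <- RetryFailed(
#   DownloadSRA, GSE130636.runs,
#   prefetch.path = "/path/to/prefetch",
#   out.folder = "/path/to/download_fastq"
# )
```

The helper returns NULL once everything has succeeded, so its result can be checked in the same way as the value returned by DownloadSRA.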
Download from ENA (parallel):
out.folder <- tempdir()
# download from ENA using download.file
GSE130636.down <- DownloadSRA(
gsm.df = GSE130636.runs, download.method = "download.file",
timeout = 3600, out.folder = "/path/to/download_fastq",
parallel = TRUE, use.cores = 2
)
# download from ENA using ascp
GSE130636.down <- DownloadSRA(
gsm.df = GSE130636.runs, download.method = "ascp",
ascp.path = "~/.aspera/connect/bin/ascp", max.rate = "300m",
rename = TRUE, out.folder = "/path/to/download_fastq",
parallel = TRUE, use.cores = 2
)
# GSE130636.down is NULL or a dataframe containing the failed runs
The out.folder structure will be: gsm_number/run_number.
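Given this layout, the expected location of each sra file can be derived from the run table, which is handy for checking what is already on disk. The sketch below is an assumption-laden illustration: ExpectedSraPaths is a hypothetical helper, and it assumes each file is named run_number.sra inside out.folder/gsm_number/run_number.

```r
# Hypothetical helper: expected .sra path for each run, assuming the
# out.folder/gsm_number/run_number/run_number.sra layout
ExpectedSraPaths <- function(gsm.df, out.folder) {
  file.path(
    out.folder, gsm.df$gsm_name, gsm.df$run,
    paste0(gsm.df$run, ".sra")
  )
}

# Stand-in for the run table; only run and gsm_name are used
runs <- data.frame(
  run = c("SRR9004346", "SRR9004351"),
  gsm_name = "GSM3745993",
  stringsAsFactors = FALSE
)

sra.paths <- ExpectedSraPaths(runs, "/path/to/download_fastq")

# Runs whose sra file is not on disk yet
missing.runs <- runs[!file.exists(sra.paths), ]
print(sra.paths)
```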
Download fastq
Split sra to generate fastq
After obtaining the sra files, GEfetch2R provides the function SplitSRA to split sra files into fastq files using parallel-fastq-dump (parallel, fastest, gzip output), fasterq-dump (parallel, fast, but unzipped output) or fastq-dump (slowest, gzip output).
For fastq files generated with 10x Genomics, SplitSRA can identify the read1, read2 and index files and rename read1 and read2 to the format required by 10x (sample1_S1_L001_R1_001.fastq.gz and sample1_S1_L001_R2_001.fastq.gz). In detail, the file with read length 26 or 28 is considered read1, the files with read length 8 or 10 are considered index files, and the remaining file is considered read2. The read length rules are from Sequencing Requirements for Single Cell 3’ and Sequencing Requirements for Single Cell V(D)J.
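The read length rule above can be written down in a few lines of base R. This is only an illustration of the rule as stated, not the code SplitSRA uses; Classify10xReads and TenxName are hypothetical helpers, and the file names and read lengths are made-up values.

```r
# Hypothetical helper: assign 10x roles from per-file read lengths
# (26/28 -> read1, 8/10 -> index, anything else -> read2)
Classify10xReads <- function(read.lengths) {
  roles <- ifelse(read.lengths %in% c(26, 28), "R1",
    ifelse(read.lengths %in% c(8, 10), "index", "R2")
  )
  names(roles) <- names(read.lengths)
  roles
}

# Hypothetical helper: file name in the format required by 10x
TenxName <- function(sample, role) {
  sprintf("%s_S1_L001_%s_001.fastq.gz", sample, role)
}

# Read length of each fastq file of one run (illustrative values)
lens <- c(
  "run_1.fastq.gz" = 8,
  "run_2.fastq.gz" = 26,
  "run_3.fastq.gz" = 98
)
roles <- Classify10xReads(lens)
print(roles)
print(TenxName("SRR9004346", "R1"))
```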
The returned value is a vector of failed sra files. If it is not NULL, users can re-run SplitSRA with sra.path set to the returned value.
# parallel-fastq-dump requires sratools.path
GSE130636.split <- SplitSRA(
sra.folder = "/Volumes/soyabean/GEfetch2R/download_fastq",
fastq.type = "10x",
split.cmd.path = "/Applications/anaconda3/bin/parallel-fastq-dump",
sratools.path = "/usr/local/bin", split.cmd.paras = "--gzip",
split.cmd.threads = 4
)
The final out.folder structure will be:
tree /Volumes/soyabean/GEfetch2R/download_fastq
#> /Volumes/soyabean/GEfetch2R/download_fastq
#> └── GSM3745993
#>     ├── SRR9004346
#>     │   ├── SRR9004346.sra
#>     │   ├── SRR9004346_1.fastq.gz
#>     │   ├── SRR9004346_2.fastq.gz
#>     │   ├── SRR9004346_S1_L001_R1_001.fastq.gz
#>     │   └── SRR9004346_S1_L001_R2_001.fastq.gz
#>     └── SRR9004351
#>         ├── SRR9004351.sra
#>         ├── SRR9004351_1.fastq.gz
#>         ├── SRR9004351_2.fastq.gz
#>         ├── SRR9004351_S1_L001_R1_001.fastq.gz
#>         └── SRR9004351_S1_L001_R2_001.fastq.gz
#>
#> 4 directories, 10 files
Download fastq directly from ENA
Alternatively, GEfetch2R provides the function DownloadFastq to download fastq files directly from ENA (parallel, faster than the above method). The returned value is a dataframe containing the failed runs. If it is not NULL, users can re-run DownloadFastq with gsm.df set to the returned value.
# use download.file
GSE130636.down.fastq <- DownloadFastq(
gsm.df = GSE130636.runs, out.folder = "/path/to/download_fastq",
download.method = "download.file",
parallel = TRUE, use.cores = 2
)
# use ascp
GSE130636.down.fastq <- DownloadFastq(
gsm.df = GSE130636.runs, out.folder = "/path/to/download_fastq",
download.method = "ascp", ascp.path = "~/.aspera/connect/bin/ascp", max.rate = "300m",
parallel = TRUE, use.cores = 2
)
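The ebi_dir column of the run table above (for example SRR900/006/SRR9004346) follows a regular pattern that can be rebuilt from the run accession alone. The sketch below is an assumption based on that column and on the commonly described ENA directory layout; EnaDir is a hypothetical helper, not a GEfetch2R function.

```r
# Hypothetical helper: rebuild the ebi_dir value from a run accession.
# Assumed layout: <first 6 characters>/<zero-padded extra digits>/<run>,
# where the extra digits are everything after the 9th character.
EnaDir <- function(run) {
  prefix <- substr(run, 1, 6)
  extra <- substring(run, 10) # "" for 9-character accessions
  ifelse(nchar(extra) == 0,
    file.path(prefix, run),
    file.path(prefix, sprintf("%03d", as.integer(extra)), run)
  )
}

ena.dirs <- EnaDir(c("SRR9004346", "SRR9004351"))
print(ena.dirs)
```

For the two runs used here, the result agrees with the ebi_dir values printed by head(GSE130636.runs).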
Download bam
Extract all samples (runs)
GEfetch2R can extract sample information and run numbers from GEO accessions; alternatively, users can provide a dataframe containing the run numbers of the samples of interest.
GSE138266.runs <- ExtractRun(acce = "GSE138266", platform = "GPL18573")
Show the sample information:
head(GSE138266.runs)
#> run experiment gsm_name title geo_accession
#> SRR10211566 SRR10211566 SRX6931254 GSM4104137 MS60249_PBMC GSM4104137
#> status submission_date last_update_date type
#> SRR10211566 Public on Dec 10 2019 Oct 01 2019 Dec 10 2019 SRA
#> channel_count source_name_ch1 organism_ch1
#> SRR10211566 1 peripheral blood mononuclear cell Homo sapiens
#> taxid_ch1 characteristics_ch1 molecule_ch1
#> SRR10211566 9606 disease condition: Multiple Sclerosis total RNA
#> extract_protocol_ch1
#> SRR10211566 Patient Inclusion Criterior: 1) treatment naive patients with a first episode suggestive of MS (i.e. clinically isolated syndrome (CIS)) or with relapsing-remitting (RR)MS diagnosed based on MAGNIMS criteria86, 2) patients receiving LP for diagnostic purposes and consenting to participate.
#> extract_protocol_ch1.1
#> SRR10211566 Patient Exclusion Criterior: 1) questionable diagnosis of MS by clinical signs or magnetic resonance imaging (MRI) findings, 2) secondary chronic progressive MS or primary progressive MS. IIH patients were included, if they gave informed consent. Exclusion criteria for all patients were: 1) immunologically relevant co-morbidities (e.g. rheumatologic diseases), 2) severe concomitant infectious diseases (e.g. HIV, meningitis, encephalitis), 3) pregnancy or breastfeeding, 4) younger than 18 years, 5) mental illness impairing the ability to give informed consent, 6) artificial blood contamination during the lumbar puncture resulting in >200 red blood cells / μl.
#> extract_protocol_ch1.2
#> SRR10211566 Chromium Single Cell Controller using the Chromium Single Cell 3' Library & Gel Bead Kit v2
#> extract_protocol_ch1.3
#> SRR10211566 AMPure beads (Beckman Coulter)
#> extract_protocol_ch1.4
#> SRR10211566 Illumina Nextseq 500 using the High-Out 75 cycle kit with a 26-8-0-57 read setup
#> description
#> SRR10211566 MS60249_PBMC
#> data_processing
#> SRR10211566 Raw bcl files were de-multiplexed using cellranger v2.0.2 mkfastq
#> data_processing.1
#> SRR10211566 Subsequent read alignments and transcript counting was done individually for each sample using cellranger count with standard parameters.
#> data_processing.2
#> SRR10211566 Cellranger aggr was employed, to ensure that all samples had the same number of confidently mapped reads per cell.
#> data_processing.3
#> SRR10211566 Genome_build: GRCh38
#> data_processing.4
#> SRR10211566 Supplementary_files_format_and_content: Standard cellranger output format
#> platform_id contact_name contact_email
#> SRR10211566 GPL18573 Chenling,,Xu chenlingantelope@berkeley.edu
#> contact_laboratory contact_department contact_institute
#> SRR10211566 Yosef Lab Computational Biology UC Berkeley
#> contact_address contact_city contact_state contact_zip/postal_code
#> SRR10211566 Stanley Hall Berkeley CA 94704
#> contact_country instrument_model library_selection
#> SRR10211566 USA Illumina NextSeq 500 cDNA
#> library_source library_strategy
#> SRR10211566 TRANSCRIPTOMIC RNA-Seq
#> relation
#> SRR10211566 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN12880976
#> relation.1
#> SRR10211566 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX6931254
#> supplementary_file_1
#> SRR10211566 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4104nnn/GSM4104137/suppl/GSM4104137_MS60249_PBMCs_GRCh38_barcodes.tsv.gz
#> supplementary_file_2
#> SRR10211566 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4104nnn/GSM4104137/suppl/GSM4104137_MS60249_PBMCs_GRCh38_genes.tsv.gz
#> supplementary_file_3
#> SRR10211566 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4104nnn/GSM4104137/suppl/GSM4104137_MS60249_PBMCs_GRCh38_matrix.mtx.gz
#> series_id data_row_count library_layout taxon_id
#> SRR10211566 GSE138266 0 PAIRED 9606
#> ebi_dir ncbi_dir
#> SRR10211566 SRR102/066/SRR10211566 SRR102/066/SRR10211566
Download bam from SRA
Given a dataframe containing gsm and run numbers, GEfetch2R provides DownloadBam to download bam files using prefetch. It supports 10x generated bam files and normal bam files.
- 10x generated bam: bam files generated by 10x software (e.g. CellRanger) contain custom tags that are not kept when using the default parameters of prefetch, so GEfetch2R adds --type TenX to make sure the downloaded bam files contain these tags.
- normal bam: For normal bam files, DownloadBam downloads the sra files first and then converts them to bam files with sam-dump. In our test of prefetch + sam-dump versus sam-dump alone, the former was much faster than the latter (52G sra and 72G bam files):
# # use prefetch to download sra file
# prefetch -X 60G SRR1976036
# # real 117m26.334s
# # user 16m42.062s
# # sys 3m28.295s
# # use sam-dump to convert sra to bam
# time (sam-dump SRR1976036.sra | samtools view -bS - -o SRR1976036.bam)
# # real 536m2.721s
# # user 749m41.421s
# # sys 20m49.069s
# # use sam-dump to download bam directly
# time (sam-dump SRR1976036 | samtools view -bS - -o SRR1976036.bam)
# # more than 36hrs only get ~3G bam files, too slow
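The timings above can be turned into a rough throughput comparison. The sketch below only does arithmetic on the reported numbers (the "real" wall-clock times, the 72G bam size, and the ~3G obtained after 36 hours); TimeToSec is a hypothetical helper, and since the direct sam-dump run took more than 36 hours, its rate is an upper bound.

```r
# Hypothetical helper: convert a "real" time such as "117m26.334s" to seconds
TimeToSec <- function(x) {
  parts <- regmatches(x, regexec("^([0-9]+)m([0-9.]+)s$", x))[[1]]
  as.numeric(parts[2]) * 60 + as.numeric(parts[3])
}

# prefetch + sam-dump: total wall time in hours for the 72G bam file
two.step.hours <- (TimeToSec("117m26.334s") + TimeToSec("536m2.721s")) / 3600
two.step.rate <- 72 / two.step.hours # G of bam per hour

# sam-dump alone: ~3G of bam after 36 hours
direct.rate <- 3 / 36 # G of bam per hour (upper bound)

print(round(c(
  two.step.hours = two.step.hours,
  two.step.rate = two.step.rate,
  direct.rate = direct.rate
), 2))
```

Under these numbers the two-step route finishes in roughly 11 hours, while the direct route had produced only a small fraction of the file after 36 hours.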
The returned value is a dataframe containing the failed runs (for normal bam: runs that failed to download sra files or failed to convert to bam files; for 10x generated bam: runs that failed to download bam files). If it is not NULL, users can re-run DownloadBam with gsm.df set to the returned value. The following is an example of downloading a 10x generated bam file:
# a small test
GSE138266.runs <- GSE138266.runs[GSE138266.runs$run %in% c("SRR10211566"), ]
# download
GSE138266.down <- DownloadBam(
gsm.df = GSE138266.runs,
prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
out.folder = "/Volumes/soyabean/GEfetch2R/download_bam"
)
# GSE138266.down is NULL or a dataframe containing the failed runs
The out.folder structure will be: gsm_number/run_number.
Download bam from ENA
The returned value is a dataframe containing the failed runs. If it is not NULL, users can re-run DownloadBam with gsm.df set to the returned value. The following is an example of downloading a 10x generated bam file from ENA:
# download.file
GSE138266.down <- DownloadBam(
gsm.df = GSE138266.runs, download.method = "download.file",
timeout = 3600, out.folder = "/path/to/download_bam",
parallel = TRUE, use.cores = 2
)
# ascp
GSE138266.down <- DownloadBam(
gsm.df = GSE138266.runs, download.method = "ascp",
ascp.path = "~/.aspera/connect/bin/ascp", max.rate = "300m",
rename = TRUE, out.folder = "/path/to/download_bam",
parallel = TRUE, use.cores = 2
)
Convert bam to fastq
With the downloaded bam files, GEfetch2R provides the function Bam2Fastq to convert bam files to fastq files. For bam files generated by 10x software, Bam2Fastq uses the bamtofastq tool developed by 10x Genomics; otherwise, samtools is used.
The returned value is a vector of bam files that failed to convert to fastq files. If it is not NULL, users can re-run Bam2Fastq with bam.path set to the returned value.
GSE138266.convert <- Bam2Fastq(
bam.folder = "/Volumes/soyabean/GEfetch2R/download_bam", bam.type = "10x",
bamtofastq.path = "/Users/soyabean/software/bamtofastq_macos",
bamtofastq.paras = "--nthreads 4"
)
The final out.folder structure will be:
tree /Volumes/soyabean/GEfetch2R/download_bam
#> /Volumes/soyabean/GEfetch2R/download_bam
#> └── GSM4104137
#>     └── SRR10211566
#>         ├── bam2fastq
#>         │   └── MS60249_PBMC_2_0_MissingLibrary_1_H72VGBGX2
#>         │       ├── bamtofastq_S1_L001_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L001_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R2_001.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R2_002.fastq.gz
#>         │       ├── bamtofastq_S1_L002_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L002_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R2_001.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R2_002.fastq.gz
#>         │       ├── bamtofastq_S1_L003_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L003_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R2_001.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R2_002.fastq.gz
#>         │       ├── bamtofastq_S1_L004_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L004_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L004_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L004_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L004_R2_001.fastq.gz
#>         │       └── bamtofastq_S1_L004_R2_002.fastq.gz
#>         └── bamfiles_MS60249_PBMC_possorted_genome_bam.bam
#>
#> 5 directories, 25 files
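The converted files follow the sample_S1_L00X_R1/R2/I1_00Y.fastq.gz naming seen in the listing above, so lane, read type and chunk can be recovered from the file names. The following is a small sketch in base R; ParseFastqName is a hypothetical helper, and the regular expression only covers the naming pattern shown here.

```r
# Hypothetical helper: split fastq file names of the form
# <sample>_S<n>_L<lane>_<read>_<chunk>.fastq.gz into their fields
ParseFastqName <- function(files) {
  base <- basename(files)
  pattern <- "^(.+)_S([0-9]+)_L([0-9]{3})_([IR][0-9])_([0-9]{3})\\.fastq\\.gz$"
  hits <- regmatches(base, regexec(pattern, base))
  # Keep only names that match the pattern (full match + 5 groups)
  hits <- hits[lengths(hits) == 6]
  if (length(hits) == 0) {
    return(NULL)
  }
  fields <- do.call(rbind, hits)
  data.frame(
    file = fields[, 1], sample = fields[, 2], lane = fields[, 4],
    read = fields[, 5], chunk = fields[, 6],
    stringsAsFactors = FALSE
  )
}

files <- c(
  "bamtofastq_S1_L001_I1_001.fastq.gz",
  "bamtofastq_S1_L001_R1_001.fastq.gz",
  "bamtofastq_S1_L002_R2_002.fastq.gz"
)
parsed <- ParseFastqName(files)

# Number of files per lane and read type
print(table(parsed$lane, parsed$read))
```

On a real folder, files could come from list.files() with recursive = TRUE and full.names = TRUE.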
Load fastq to R
With the downloaded/converted fastq files, GEfetch2R provides the function Fastq2R to align them to a reference genome with CellRanger (10x-generated fastq files) or STAR (Smart-seq2 or bulk RNA-seq data), and to load the output into Seurat (10x-generated fastq files) or DESeq2 (Smart-seq2 or bulk RNA-seq data).
Here, we use the downloaded fastq files as an example. There are two runs (SRR9004346 and SRR9004351) corresponding to the sample GSM3745993. When running CellRanger, SRR9004346 and SRR9004351 are processed as a single merged sample by specifying --sample=SRR9004346,SRR9004351:
# run CellRanger (10x Genomics)
seu <- Fastq2R(
sample.dir = "/Volumes/soyabean/GEfetch2R/download_fastq",
ref = "/path/to/10x/ref",
method = "CellRanger",
out.folder = "/path/to/results",
st.path = "/path/to/cellranger",
st.paras = "--chemistry=auto --jobmode=local"
)
# run STAR (Smart-seq2 or bulk RNA-seq)
deobj <- Fastq2R(
sample.dir = "/path/to/fastq",
ref = "/path/to/star/ref",
method = "STAR",
out.folder = "/path/to/results",
st.path = "/path/to/STAR",
st.paras = "--outBAMsortingThreadN 4 --twopassMode None"
)
Since fastq files from Smart-seq2 or bulk RNA-seq data are usually contained in a single run, sample.dir should specify the parent directory of the run (sample.dir = "/Volumes/soyabean/GEfetch2R/download_fastq/GSM3745993" in the above example).
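The merged --sample value used for the CellRanger example (--sample=SRR9004346,SRR9004351) is simply the comma-separated list of runs that belong to one GEO sample. The sketch below shows how such a string can be assembled from the run table; it is an illustration only and does not describe how Fastq2R builds the argument internally. The dataframe is a stand-in that assumes the run and gsm_name columns.

```r
# Stand-in for the run table; only run and gsm_name are used
runs <- data.frame(
  run = c("SRR9004346", "SRR9004351"),
  gsm_name = c("GSM3745993", "GSM3745993"),
  stringsAsFactors = FALSE
)

# One comma-separated --sample value per GEO sample
sample.args <- vapply(
  split(runs$run, runs$gsm_name),
  function(x) paste0("--sample=", paste(x, collapse = ",")),
  character(1)
)

print(sample.args[["GSM3745993"]])
```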