DownloadRaw
DownloadRaw.Rmd
Introduction
A common situation is that we need to use a unified software version
(e.g. CellRanger) to obtain the count matrix, in order to better
integrate and compare multiple datasets. Here, we will use
scfetch
to download fastq and bam files. With bam files,
scfetch
also provides function for user to convert the bam
to fastq.
Download fastq
Prepare run number
For fastq files stored in SRA, scfetch
can extract
sample information and run number with GEO accession number or users can
also provide a dataframe contains the run number of interested
samples.
Extract all samples under GSE130636
and the platform is
GPL20301
(use platform = NULL
for all
platforms):
# library
library(scfetch)
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)
#> Registered S3 method overwritten by 'SeuratDisk':
#> method from
#> as.sparse.H5Group Seurat
GSE130636.runs <- ExtractRun(acce = "GSE130636", platform = "GPL20301")
#> Extract all GSM with acce: GSE130636 and platform: GPL20301
#> Found 1 file(s)
#> GSE130636_series_matrix.txt.gz
#> Rows: 0 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (7): ID_REF, GSM3745992, GSM3745993, GSM3745994, GSM3745995, GSM3745996,...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> File stored at:
#>
#> /var/folders/_4/k4qmvf7s2gx_6789px8n_sxh0000gn/T//RtmpXDGHCE/GPL20301.soft
#> 6 GSMs to process
head(GSE130636.runs)
#> title gsm_name experiment run
#> SRR9004325 Fovea donor 1 GSM3745992 SRX5783051 SRR9004325
#> SRR9004326 Fovea donor 1 GSM3745992 SRX5783051 SRR9004326
#> SRR9004327 Fovea donor 1 GSM3745992 SRX5783051 SRR9004327
#> SRR9004328 Fovea donor 1 GSM3745992 SRX5783051 SRR9004328
#> SRR9004329 Fovea donor 1 GSM3745992 SRX5783051 SRR9004329
#> SRR9004330 Fovea donor 1 GSM3745992 SRX5783051 SRR9004330
Download sra
With the dataframe contains gsm and run number, scfetch
will download sra files using prefetch
. The returned result
is a dataframe contains failed runs. If not NULL
, users can
re-run DownloadSRA
by setting gsm.df
to the
returned result.
# a small test
GSE130636.runs <- GSE130636.runs[GSE130636.runs$run %in% c("SRR9004346", "SRR9004351"), ]
# download
GSE130636.down <- DownloadSRA(
gsm.df = GSE130636.runs,
prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
out.folder = "/Users/soyabean/Desktop/tmp/scdown/download_fastq"
)
# GSE130636.down is null or dataframe contains failed runs
The out.folder
structure will be:
gsm_number/run_number
.
Split fastq
After obtaining the sra files, scfetch
provides function
SplitSRA
to split sra files to fastq files using
parallel-fastq-dump
(parallel, fastest and gzip output),
fasterq-dump
(parallel, fast but unzipped output) and
fastq-dump
(slowest and gzip output).
For fastqs generated with 10x Genomics, SplitSRA
can
identify read1, read2 and index files and format the read1 and read2 to
10x required format (sample1_S1_L001_R1_001.fastq.gz
and
sample1_S1_L001_R2_001.fastq.gz
). In detail, the file with
read length 26 or 28 is considered as read1, the files with read length
8 or 10 are considered as index files and the remain file is considered
as read2. The read length rules is from Sequencing
Requirements for Single Cell 3’ and Sequencing
Requirements for Single Cell V(D)J.
The returned result is a vector of failed sra files. If not
NULL
, users can re-run SplitSRA
by setting
sra.path
to the returned result.
# parallel-fastq-dump requires sratools.path
GSE130636.split <- SplitSRA(
sra.folder = "/Users/soyabean/Desktop/tmp/scdown/download_fastq",
fastq.type = "10x",
split.cmd.path = "/Applications/anaconda3/bin/parallel-fastq-dump",
sratools.path = "/usr/local/bin", split.cmd.threads = 4
)
The final out.folder
structure will be:
tree /Users/soyabean/Desktop/tmp/scdown/download_fastq
#>
[01;34m/Users/soyabean/Desktop/tmp/scdown/download_fastq
[0m
#> └──
[01;34mGSM3745993
[0m
#> ├──
[01;34mSRR9004346
[0m
#> │ ├──
[00mSRR9004346.sra
[0m
#> │ ├──
[01;31mSRR9004346_1.fastq.gz
[0m
#> │ ├──
[01;31mSRR9004346_2.fastq.gz
[0m
#> │ ├──
[01;31mSRR9004346_S1_L001_R1_001.fastq.gz
[0m
#> │ └──
[01;31mSRR9004346_S1_L001_R2_001.fastq.gz
[0m
#> └──
[01;34mSRR9004351
[0m
#> ├──
[00mSRR9004351.sra
[0m
#> ├──
[01;31mSRR9004351_1.fastq.gz
[0m
#> ├──
[01;31mSRR9004351_2.fastq.gz
[0m
#> ├──
[01;31mSRR9004351_S1_L001_R1_001.fastq.gz
[0m
#> └──
[01;31mSRR9004351_S1_L001_R2_001.fastq.gz
[0m
#>
#> 4 directories, 10 files
Download bam
Prepare run number
scfetch
can extract sample information and run number
with GEO accession number or users can also provide a dataframe contains
the run number of interested samples.
GSE138266.runs <- ExtractRun(acce = "GSE138266", platform = "GPL18573")
#> Extract all GSM with acce: GSE138266 and platform: GPL18573
#> Found 1 file(s)
#> GSE138266_series_matrix.txt.gz
#> Rows: 0 Columns: 23
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (23): ID_REF, GSM4104122, GSM4104123, GSM4104124, GSM4104125, GSM4104126...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> File stored at:
#>
#> /var/folders/_4/k4qmvf7s2gx_6789px8n_sxh0000gn/T//RtmpXDGHCE/GPL18573.soft
#> 22 GSMs to process
head(GSE138266.runs)
#> title gsm_name experiment run
#> SRR10211551 MS19270_CSF GSM4104122 SRX6931239 SRR10211551
#> SRR10211552 MS58637_CSF GSM4104123 SRX6931240 SRR10211552
#> SRR10211553 MS71658_CSF GSM4104124 SRX6931241 SRR10211553
#> SRR10211554 MS49131_CSF GSM4104125 SRX6931242 SRR10211554
#> SRR10211555 MS60249_CSF GSM4104126 SRX6931243 SRR10211555
#> SRR10211556 MS74594_CSF GSM4104127 SRX6931244 SRR10211556
Download bam
With the dataframe contains gsm and run number, scfetch
provides DownloadBam
to download bam files using
prefetch
. It suooorts 10x generated bam files and normal
bam files.
- 10x generated bam: While bam files generated from 10x softwares
(e.g. CellRanger) contain custom tags which are not kept when using
default parameters of
prefetch
,scfetch
adds--type TenX
to make sure the downloaded bam files contain these tags. - normal bam: For normal bam files,
DownloadBam
will download sra files first and then convert sra files to bam files withsam-dump
. After testing the efficiency ofprefetch
+sam-dump
andsam-dump
, the former is much faster than the latter (52G sra and 72G bam files):
# # use prefetch to download sra file
# prefetch -X 60G SRR1976036
# # real 117m26.334s
# # user 16m42.062s
# # sys 3m28.295s
# # use sam-dump to convert sra to bam
# time (sam-dump SRR1976036.sra | samtools view -bS - -o SRR1976036.bam)
# # real 536m2.721s
# # user 749m41.421s
# # sys 20m49.069s
# use sam-dump to download bam directly
# time (sam-dump SRR1976036 | samtools view -bS - -o SRR1976036.bam)
# # more than 36hrs only get ~3G bam files, too slow
The returned result is a dataframe containing failed runs (either
failed to download sra files or failed to convert to bam files for
normal bam; failed to download bam files for 10x generated bam). If not
NULL
, users can re-run DownloadBam
by setting
gsm.df
to the returned result. The following is an example
to download 10x generated bam file:
# a small test
GSE138266.runs <- GSE138266.runs[GSE138266.runs$run %in% c("SRR10211566"), ]
# download
GSE138266.down <- DownloadBam(
gsm.df = GSE138266.runs,
prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
out.folder = "/Users/soyabean/Desktop/tmp/scdown/download_bam"
)
# GSE138266.down is null or dataframe contains failed runs
The out.folder
structure will be:
gsm_number/run_number
.
Convert bam to fastq
With downloaded bam files, scfetch
provides function
Bam2Fastq
to convert bam files to fastq files. For bam
files generated from 10x softwares, Bam2Fastq
utilizes
bamtofastq
tool developed by 10x Genomics.
The returned result is a vector of bam files failed to convert to
fastq files. If not NULL
, users can re-run
Bam2Fastq
by setting bam.path
to the returned
result.
GSE138266.convert <- Bam2Fastq(
bam.folder = "/Users/soyabean/Desktop/tmp/scdown/download_bam",
bamtofastq.path = "/Users/soyabean/software/bamtofastq_macos",
bamtofastq.paras = "--nthreads 4"
)
The final out.folder
structure will be:
tree /Users/soyabean/Desktop/tmp/scdown/download_bam
#>
[01;34m/Users/soyabean/Desktop/tmp/scdown/download_bam
[0m
#> └──
[01;34mGSM4104137
[0m
#> └──
[01;34mSRR10211566
[0m
#> ├──
[01;34mbam2fastq
[0m
#> │ └──
[01;34mMS60249_PBMC_2_0_MissingLibrary_1_H72VGBGX2
[0m
#> │ ├──
[01;31mbamtofastq_S1_L001_I1_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L001_I1_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L001_R1_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L001_R1_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L001_R2_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L001_R2_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L002_I1_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L002_I1_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L002_R1_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L002_R1_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L002_R2_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L002_R2_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L003_I1_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L003_I1_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L003_R1_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L003_R1_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L003_R2_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L003_R2_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L004_I1_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L004_I1_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L004_R1_001.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L004_R1_002.fastq.gz
[0m
#> │ ├──
[01;31mbamtofastq_S1_L004_R2_001.fastq.gz
[0m
#> │ └──
[01;31mbamtofastq_S1_L004_R2_002.fastq.gz
[0m
#> └──
[00mbamfiles_MS60249_PBMC_possorted_genome_bam.bam
[0m
#>
#> 5 directories, 25 files