DownloadRaw • scfetch

Introduction

A common situation is that we need to use a unified software version (e.g. CellRanger) to obtain the count matrix, in order to better integrate and compare multiple datasets. Here, we will use scfetch to download fastq and bam files. With bam files, scfetch also provides function for user to convert the bam to fastq.

Download fastq

Prepare run number

For fastq files stored in SRA, scfetch can extract sample information and run number with GEO accession number or users can also provide a dataframe contains the run number of interested samples.

Extract all samples under GSE130636 and the platform is GPL20301 (use platform = NULL for all platforms):

# library
library(scfetch)
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)
#> Registered S3 method overwritten by 'SeuratDisk':
#>   method            from  
#>   as.sparse.H5Group Seurat

GSE130636.runs <- ExtractRun(acce = "GSE130636", platform = "GPL20301")
#> Extract all GSM with acce: GSE130636 and platform: GPL20301
#> Found 1 file(s)
#> GSE130636_series_matrix.txt.gz
#> Rows: 0 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (7): ID_REF, GSM3745992, GSM3745993, GSM3745994, GSM3745995, GSM3745996,...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> File stored at: 
#> 
#> /var/folders/_4/k4qmvf7s2gx_6789px8n_sxh0000gn/T//RtmpXDGHCE/GPL20301.soft
#> 6 GSMs to process
head(GSE130636.runs)
#>                    title   gsm_name experiment        run
#> SRR9004325 Fovea donor 1 GSM3745992 SRX5783051 SRR9004325
#> SRR9004326 Fovea donor 1 GSM3745992 SRX5783051 SRR9004326
#> SRR9004327 Fovea donor 1 GSM3745992 SRX5783051 SRR9004327
#> SRR9004328 Fovea donor 1 GSM3745992 SRX5783051 SRR9004328
#> SRR9004329 Fovea donor 1 GSM3745992 SRX5783051 SRR9004329
#> SRR9004330 Fovea donor 1 GSM3745992 SRX5783051 SRR9004330

Download sra

With the dataframe contains gsm and run number, scfetch will download sra files using prefetch. The returned result is a dataframe contains failed runs. If not NULL, users can re-run DownloadSRA by setting gsm.df to the returned result.

# a small test
GSE130636.runs <- GSE130636.runs[GSE130636.runs$run %in% c("SRR9004346", "SRR9004351"), ]
# download
GSE130636.down <- DownloadSRA(
  gsm.df = GSE130636.runs,
  prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
  out.folder = "/Users/soyabean/Desktop/tmp/scdown/download_fastq"
)
# GSE130636.down is null or dataframe contains failed runs

The out.folder structure will be: gsm_number/run_number.

Split fastq

After obtaining the sra files, scfetch provides function SplitSRA to split sra files to fastq files using parallel-fastq-dump (parallel, fastest and gzip output), fasterq-dump (parallel, fast but unzipped output) and fastq-dump (slowest and gzip output).

For fastqs generated with 10x Genomics, SplitSRA can identify read1, read2 and index files and format the read1 and read2 to 10x required format (sample1_S1_L001_R1_001.fastq.gz and sample1_S1_L001_R2_001.fastq.gz). In detail, the file with read length 26 or 28 is considered as read1, the files with read length 8 or 10 are considered as index files and the remain file is considered as read2. The read length rules is from Sequencing Requirements for Single Cell 3’ and Sequencing Requirements for Single Cell V(D)J.

The returned result is a vector of failed sra files. If not NULL, users can re-run SplitSRA by setting sra.path to the returned result.

# parallel-fastq-dump requires sratools.path
GSE130636.split <- SplitSRA(
  sra.folder = "/Users/soyabean/Desktop/tmp/scdown/download_fastq",
  fastq.type = "10x",
  split.cmd.path = "/Applications/anaconda3/bin/parallel-fastq-dump",
  sratools.path = "/usr/local/bin", split.cmd.threads = 4
)

The final out.folder structure will be:

tree /Users/soyabean/Desktop/tmp/scdown/download_fastq
#>  [01;34m/Users/soyabean/Desktop/tmp/scdown/download_fastq [0m
#> └──  [01;34mGSM3745993 [0m
#>     ├──  [01;34mSRR9004346 [0m
#>     │   ├──  [00mSRR9004346.sra [0m
#>     │   ├──  [01;31mSRR9004346_1.fastq.gz [0m
#>     │   ├──  [01;31mSRR9004346_2.fastq.gz [0m
#>     │   ├──  [01;31mSRR9004346_S1_L001_R1_001.fastq.gz [0m
#>     │   └──  [01;31mSRR9004346_S1_L001_R2_001.fastq.gz [0m
#>     └──  [01;34mSRR9004351 [0m
#>         ├──  [00mSRR9004351.sra [0m
#>         ├──  [01;31mSRR9004351_1.fastq.gz [0m
#>         ├──  [01;31mSRR9004351_2.fastq.gz [0m
#>         ├──  [01;31mSRR9004351_S1_L001_R1_001.fastq.gz [0m
#>         └──  [01;31mSRR9004351_S1_L001_R2_001.fastq.gz [0m
#> 
#> 4 directories, 10 files

Download bam

Prepare run number

scfetch can extract sample information and run number with GEO accession number or users can also provide a dataframe contains the run number of interested samples.

GSE138266.runs <- ExtractRun(acce = "GSE138266", platform = "GPL18573")
#> Extract all GSM with acce: GSE138266 and platform: GPL18573
#> Found 1 file(s)
#> GSE138266_series_matrix.txt.gz
#> Rows: 0 Columns: 23
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (23): ID_REF, GSM4104122, GSM4104123, GSM4104124, GSM4104125, GSM4104126...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> File stored at: 
#> 
#> /var/folders/_4/k4qmvf7s2gx_6789px8n_sxh0000gn/T//RtmpXDGHCE/GPL18573.soft
#> 22 GSMs to process
head(GSE138266.runs)
#>                   title   gsm_name experiment         run
#> SRR10211551 MS19270_CSF GSM4104122 SRX6931239 SRR10211551
#> SRR10211552 MS58637_CSF GSM4104123 SRX6931240 SRR10211552
#> SRR10211553 MS71658_CSF GSM4104124 SRX6931241 SRR10211553
#> SRR10211554 MS49131_CSF GSM4104125 SRX6931242 SRR10211554
#> SRR10211555 MS60249_CSF GSM4104126 SRX6931243 SRR10211555
#> SRR10211556 MS74594_CSF GSM4104127 SRX6931244 SRR10211556

Download bam

With the dataframe contains gsm and run number, scfetch provides DownloadBam to download bam files using prefetch. It suooorts 10x generated bam files and normal bam files.

10x generated bam: While bam files generated from 10x softwares (e.g. CellRanger) contain custom tags which are not kept when using default parameters of prefetch, scfetch adds --type TenX to make sure the downloaded bam files contain these tags.
normal bam: For normal bam files, DownloadBam will download sra files first and then convert sra files to bam files with sam-dump. After testing the efficiency of prefetch + sam-dump and sam-dump, the former is much faster than the latter (52G sra and 72G bam files):

# # use prefetch to download sra file
# prefetch -X 60G SRR1976036
# # real    117m26.334s
# # user    16m42.062s
# # sys 3m28.295s

# # use sam-dump to convert sra to bam
# time (sam-dump SRR1976036.sra | samtools view -bS - -o SRR1976036.bam)
# # real    536m2.721s
# # user    749m41.421s
# # sys 20m49.069s


# use sam-dump to download bam directly
# time (sam-dump SRR1976036 | samtools view -bS - -o SRR1976036.bam)
# # more than 36hrs only get ~3G bam files, too slow

The returned result is a dataframe containing failed runs (either failed to download sra files or failed to convert to bam files for normal bam; failed to download bam files for 10x generated bam). If not NULL, users can re-run DownloadBam by setting gsm.df to the returned result. The following is an example to download 10x generated bam file:

# a small test
GSE138266.runs <- GSE138266.runs[GSE138266.runs$run %in% c("SRR10211566"), ]
# download
GSE138266.down <- DownloadBam(
  gsm.df = GSE138266.runs,
  prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
  out.folder = "/Users/soyabean/Desktop/tmp/scdown/download_bam"
)
# GSE138266.down is null or dataframe contains failed runs

The out.folder structure will be: gsm_number/run_number.

Convert bam to fastq

With downloaded bam files, scfetch provides function Bam2Fastq to convert bam files to fastq files. For bam files generated from 10x softwares, Bam2Fastq utilizes bamtofastq tool developed by 10x Genomics.

The returned result is a vector of bam files failed to convert to fastq files. If not NULL, users can re-run Bam2Fastq by setting bam.path to the returned result.

GSE138266.convert <- Bam2Fastq(
  bam.folder = "/Users/soyabean/Desktop/tmp/scdown/download_bam",
  bamtofastq.path = "/Users/soyabean/software/bamtofastq_macos",
  bamtofastq.paras = "--nthreads 4"
)

The final out.folder structure will be:

tree /Users/soyabean/Desktop/tmp/scdown/download_bam
#>  [01;34m/Users/soyabean/Desktop/tmp/scdown/download_bam [0m
#> └──  [01;34mGSM4104137 [0m
#>     └──  [01;34mSRR10211566 [0m
#>         ├──  [01;34mbam2fastq [0m
#>         │   └──  [01;34mMS60249_PBMC_2_0_MissingLibrary_1_H72VGBGX2 [0m
#>         │       ├──  [01;31mbamtofastq_S1_L001_I1_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L001_I1_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L001_R1_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L001_R1_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L001_R2_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L001_R2_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L002_I1_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L002_I1_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L002_R1_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L002_R1_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L002_R2_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L002_R2_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L003_I1_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L003_I1_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L003_R1_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L003_R1_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L003_R2_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L003_R2_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L004_I1_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L004_I1_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L004_R1_001.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L004_R1_002.fastq.gz [0m
#>         │       ├──  [01;31mbamtofastq_S1_L004_R2_001.fastq.gz [0m
#>         │       └──  [01;31mbamtofastq_S1_L004_R2_002.fastq.gz [0m
#>         └──  [00mbamfiles_MS60249_PBMC_possorted_genome_bam.bam [0m
#> 
#> 5 directories, 25 files