DownloadRaw • GEfetch2R

Introduction

A common need is to reprocess raw data with a single, unified software version (e.g. CellRanger) to obtain count matrices, so that multiple datasets can be integrated and compared more reliably. Here, we use GEfetch2R to download raw data (sra/fastq/bam). For bam files, GEfetch2R also provides a function to convert them to fastq files.

GEfetch2R supports downloading raw data (sra/fastq/bam) from SRA and ENA with GEO accessions. In general, downloading raw data from ENA is much faster than from SRA, because ENA downloads support ascp and parallel transfers.

Download sra

Extract all samples (runs)

For fastq files stored in SRA/ENA, GEfetch2R can extract sample information and run numbers from GEO accessions. Alternatively, users can provide a dataframe containing the run numbers of the samples of interest.

Extract all samples under GSE130636 on platform GPL20301 (use platform = NULL for all platforms):

# library
library(GEfetch2R)
GSE130636.runs <- ExtractRun(acce = "GSE130636", platform = "GPL20301")
# a small test
GSE130636.runs <- GSE130636.runs[GSE130636.runs$run %in% c("SRR9004346", "SRR9004351"), ]

Show the sample information:

head(GSE130636.runs)
#>                   run experiment   gsm_name         title geo_accession
#> SRR9004346 SRR9004346 SRX5783052 GSM3745993 Fovea donor 2    GSM3745993
#> SRR9004351 SRR9004351 SRX5783052 GSM3745993 Fovea donor 2    GSM3745993
#>                           status submission_date last_update_date type
#> SRR9004346 Public on May 17 2019     May 02 2019      Dec 20 2019  SRA
#> SRR9004351 Public on May 17 2019     May 02 2019      Dec 20 2019  SRA
#>            channel_count source_name_ch1 organism_ch1 taxid_ch1
#> SRR9004346             1          Retina Homo sapiens      9606
#> SRR9004351             1          Retina Homo sapiens      9606
#>            characteristics_ch1 characteristics_ch1.1 molecule_ch1
#> SRR9004346     location: Fovea        donor: Donor 2    total RNA
#> SRR9004351     location: Fovea        donor: Donor 2    total RNA
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             extract_protocol_ch1
#> SRR9004346 A 2-mm foveal centered and a 4-mm peripheral punch from the inferotemporal region were acquired from three clinically normal human donors. Tissue was dissociated using the Papain Dissociation System (Worthington Biochemical Corporation, Lakewood NJ). Dissociated cells were resuspended in DMSO-based Recovery Cell Culture Freezing Media (Life Technologies Corporation, Grand Island NY). Suspensions were placed in a Cryo-Safe cooler (CryoSafe, Summerville SC) to cool at 1°C/minute in a -80°C freezer for 3-8 hours before storage in liquid nitrogen.
#> SRR9004351 A 2-mm foveal centered and a 4-mm peripheral punch from the inferotemporal region were acquired from three clinically normal human donors. Tissue was dissociated using the Papain Dissociation System (Worthington Biochemical Corporation, Lakewood NJ). Dissociated cells were resuspended in DMSO-based Recovery Cell Culture Freezing Media (Life Technologies Corporation, Grand Island NY). Suspensions were placed in a Cryo-Safe cooler (CryoSafe, Summerville SC) to cool at 1°C/minute in a -80°C freezer for 3-8 hours before storage in liquid nitrogen.
#>                                                                                                                                                                                                                                                                                                                                                                                                                        extract_protocol_ch1.1
#> SRR9004346 Single-cell RNA libraries were prepared for sequencing using standard 10X genomics protocols. Briefly, cryopreserved samples were thawed, and single cells were captured and barcoded using the Chromium System with the v3 single cell-reagent kit (10x Genomics, Pleasanton CA). Sequencing was performed on pooled libraries using the Illumina HiSeq 4000 platform (San Diego, CA), generating 150 base pair paired-end reads.
#> SRR9004351 Single-cell RNA libraries were prepared for sequencing using standard 10X genomics protocols. Briefly, cryopreserved samples were thawed, and single cells were captured and barcoded using the Chromium System with the v3 single cell-reagent kit (10x Genomics, Pleasanton CA). Sequencing was performed on pooled libraries using the Illumina HiSeq 4000 platform (San Diego, CA), generating 150 base pair paired-end reads.
#>                                                                                             data_processing
#> SRR9004346 FASTQ files were generated from the raw BCL files using Illumina’s bcl2fastq conversion program.
#> SRR9004351 FASTQ files were generated from the raw BCL files using Illumina’s bcl2fastq conversion program.
#>                                                                                                                                                                                           data_processing.1
#> SRR9004346 Sequenced reads were mapped to the CellRanger human genome build hg19 (v3.0.0) with CellRanger (v3.0.1) using the CellRanger default human GTF and the following parameter: --expect-cells=8000.
#> SRR9004351 Sequenced reads were mapped to the CellRanger human genome build hg19 (v3.0.0) with CellRanger (v3.0.1) using the CellRanger default human GTF and the following parameter: --expect-cells=8000.
#>                                                                                                                            data_processing.2
#> SRR9004346 The six samples were collectively aggregated with the cellranger aggr function with the following parameter: --normalized=mapped.
#> SRR9004351 The six samples were collectively aggregated with the cellranger aggr function with the following parameter: --normalized=mapped.
#>            data_processing = Cells were filtered with Seurat (v2.3.4) FilterCells function. Cells with nUMIs less than 200 (to remove cells with poor read quality) or greater than 2500 (to remove cells likely to be doublets) were removed. Cells with greater tha ...
#> SRR9004346                                                                                                                                                                                                                                                         TRUE).
#> SRR9004351                                                                                                                                                                                                                                                         TRUE).
#>            data_processing = Aggregated reads were normalized with Seurat (v2.3.4) with the following command: NormalizeData(object = seurat_object, normalization.method = "LogNormalize",  scale.factor = 10000). Variable genes were identified from downstream no ...
#> SRR9004346                                                                                                                                                                                                                                    c("nUMI", "percent.mito")).
#> SRR9004351                                                                                                                                                                                                                                    c("nUMI", "percent.mito")).
#>                                                                                                                                                                                                                                                                                                      data_processing.3
#> SRR9004346 Clustering was performed with Seurat (v2.3.4) FindClusters. In order to generate the shared nearest neighbor (SNN) graph, the principal component analysis reduction technique was utilized for the first 10 principal components. A granularity resolution value of 0.6 was used to discriminate clusters.
#> SRR9004351 Clustering was performed with Seurat (v2.3.4) FindClusters. In order to generate the shared nearest neighbor (SNN) graph, the principal component analysis reduction technique was utilized for the first 10 principal components. A granularity resolution value of 0.6 was used to discriminate clusters.
#>             data_processing.4
#> SRR9004346 Genome_build: hg19
#> SRR9004351 Genome_build: hg19
#>                                                                                                                                                                                                                                                                                                                                                                                        data_processing.5
#> SRR9004346 Supplementary_files_format_and_content: Processed expression data matrix files are provided in tab-delimited format. Log-normalized expression values (from the seurat_object@data slot) were appended to relevant metadata (barcode and cluster label from the manuscript). Each row represents a unique cell, and columns correspond to metadata and log normalized gene expression values.
#> SRR9004351 Supplementary_files_format_and_content: Processed expression data matrix files are provided in tab-delimited format. Log-normalized expression values (from the seurat_object@data slot) were appended to relevant metadata (barcode and cluster label from the manuscript). Each row represents a unique cell, and columns correspond to metadata and log normalized gene expression values.
#>            platform_id  contact_name          contact_email  contact_institute
#> SRR9004346    GPL20301 Todd,,Scheetz todd-scheetz@uiowa.edu UNIVERSITY OF IOWA
#> SRR9004351    GPL20301 Todd,,Scheetz todd-scheetz@uiowa.edu UNIVERSITY OF IOWA
#>            contact_address contact_city contact_state contact_zip/postal_code
#> SRR9004346      3181B MERF    Iowa City            IA                   52242
#> SRR9004351      3181B MERF    Iowa City            IA                   52242
#>            contact_country    instrument_model library_selection library_source
#> SRR9004346             USA Illumina HiSeq 4000              cDNA TRANSCRIPTOMIC
#> SRR9004351             USA Illumina HiSeq 4000              cDNA TRANSCRIPTOMIC
#>            library_strategy                 relation
#> SRR9004346          RNA-Seq Reanalyzed by: GSE142449
#> SRR9004351          RNA-Seq Reanalyzed by: GSE142449
#>                                                                relation.1
#> SRR9004346 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN11566805
#> SRR9004351 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN11566805
#>                                                       relation.2
#> SRR9004346 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX5783052
#> SRR9004351 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX5783052
#>                                                                                                     supplementary_file_1
#> SRR9004346 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3745nnn/GSM3745993/suppl/GSM3745993_fovea_donor_2_expression.tsv.gz
#> SRR9004351 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3745nnn/GSM3745993/suppl/GSM3745993_fovea_donor_2_expression.tsv.gz
#>            series_id data_row_count library_layout taxon_id
#> SRR9004346 GSE130636              0         PAIRED     9606
#> SRR9004351 GSE130636              0         PAIRED     9606
#>                          ebi_dir              ncbi_dir
#> SRR9004346 SRR900/006/SRR9004346 SRR900/006/SRR9004346
#> SRR9004351 SRR900/001/SRR9004351 SRR900/001/SRR9004351
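
The ebi_dir values in the output above are consistent with the directory layout ENA uses for run accessions: the first six characters, then a zero-padded sub-directory built from the digits after the ninth character, then the accession itself. The sketch below reproduces that pattern in base R. The helper name ena_run_dir is hypothetical and is not part of GEfetch2R; it only illustrates how the column can be read.

```r
# Sketch: rebuild the ENA-style directory fragment for a run accession.
# `ena_run_dir` is a hypothetical helper, not a GEfetch2R function.
ena_run_dir <- function(run) {
  prefix <- substr(run, 1, 6)
  n <- nchar(run)
  # Digits after the ninth character, left-padded with zeros to width 3
  extra <- substr(run, 10, n)
  sub.dir <- paste0(strrep("0", pmax(0, 3 - nchar(extra))), extra)
  # Accessions with nine characters or fewer have no extra sub-directory
  ifelse(n > 9, file.path(prefix, sub.dir, run), file.path(prefix, run))
}

ena_run_dir(c("SRR9004346", "SRR9004351", "SRR10211566"))
```

The three accessions used here are the runs shown in this article, so the result can be compared directly with the ebi_dir column.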

Download sra

Given a dataframe containing gsm and run numbers, GEfetch2R uses prefetch to download sra files from SRA, or ascp/download.file to download sra files from ENA. The returned value is a dataframe containing the failed runs. If it is not NULL, users can re-run DownloadSRA by setting gsm.df to the returned value.

Download from SRA:


# download
GSE130636.down <- DownloadSRA(
  gsm.df = GSE130636.runs,
  prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
  out.folder = "/Volumes/soyabean/GEfetch2R/download_fastq"
)
# GSE130636.down is NULL or a dataframe containing the failed runs
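
Re-running on the returned failures can be wrapped in a small loop. The following is a minimal sketch, not part of GEfetch2R: retry_failed and download_fun are hypothetical names, and a mock downloader stands in for a real call such as function(df) DownloadSRA(gsm.df = df, ...) so that the example runs without network access.

```r
# Sketch: call a download function again on the runs it reports as failed.
# `retry_failed` and `download_fun` are hypothetical names for illustration.
retry_failed <- function(gsm.df, download_fun, max.tries = 3) {
  remaining <- gsm.df
  for (i in seq_len(max.tries)) {
    remaining <- download_fun(remaining)
    # NULL (or an empty dataframe) means nothing is left to retry
    if (is.null(remaining) || nrow(remaining) == 0) {
      return(NULL)
    }
    message("Attempt ", i, ": ", nrow(remaining), " run(s) failed")
  }
  remaining
}

# Mock downloader: SRR9004351 fails on the first call only
calls <- 0
mock_download <- function(df) {
  calls <<- calls + 1
  if (calls == 1) df[df$run == "SRR9004351", , drop = FALSE] else NULL
}

runs <- data.frame(run = c("SRR9004346", "SRR9004351"), stringsAsFactors = FALSE)
failed <- retry_failed(runs, mock_download)
```

Capping the number of attempts avoids looping forever on a run that is genuinely unavailable; whatever is still failing after max.tries is returned for inspection.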

Download from ENA (parallel):

out.folder <- tempdir()
# download from ENA using download.file
GSE130636.down <- DownloadSRA(
  gsm.df = GSE130636.runs, download.method = "download.file",
  timeout = 3600, out.folder = "/path/to/download_fastq",
  parallel = TRUE, use.cores = 2
)

# download from ENA using ascp
GSE130636.down <- DownloadSRA(
  gsm.df = GSE130636.runs, download.method = "ascp",
  ascp.path = "~/.aspera/connect/bin/ascp", max.rate = "300m",
  rename = TRUE, out.folder = "/path/to/download_fastq",
  parallel = TRUE, use.cores = 2
)

# GSE130636.down is NULL or a dataframe containing the failed runs

The out.folder structure will be: gsm_number/run_number.
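
As a quick check after downloading, the expected sra path for each run can be derived from this layout. The sketch below uses a toy dataframe with the run and gsm_name columns shown earlier and a placeholder out.folder; it assumes each sra file is named run_number.sra inside its run folder.

```r
# Sketch: list the sra path expected for each run under out.folder.
# The toy dataframe mimics the `run` and `gsm_name` columns shown earlier.
runs <- data.frame(
  run = c("SRR9004346", "SRR9004351"),
  gsm_name = c("GSM3745993", "GSM3745993"),
  stringsAsFactors = FALSE
)
out.folder <- file.path(tempdir(), "download_fastq") # placeholder location

# Assumed layout: out.folder/gsm_number/run_number/run_number.sra
runs$sra_path <- file.path(
  out.folder, runs$gsm_name, runs$run, paste0(runs$run, ".sra")
)
runs$downloaded <- file.exists(runs$sra_path)
runs[, c("run", "sra_path", "downloaded")]
```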

Download fastq

Split sra to generate fastq

After obtaining the sra files, GEfetch2R provides the function SplitSRA to split them into fastq files using parallel-fastq-dump (parallel, fastest, gzipped output), fasterq-dump (parallel, fast, but uncompressed output) or fastq-dump (slowest, gzipped output).

For fastq files generated with 10x Genomics protocols, SplitSRA can identify the read1, read2 and index files and format read1 and read2 in the layout 10x requires (sample1_S1_L001_R1_001.fastq.gz and sample1_S1_L001_R2_001.fastq.gz). In detail, the file with read length 26 or 28 is treated as read1, files with read length 8 or 10 are treated as index files, and the remaining file is treated as read2. These read length rules come from Sequencing Requirements for Single Cell 3’ and Sequencing Requirements for Single Cell V(D)J.
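
The read-length rule described above can be written as a few lines of base R. This is only an illustrative sketch: classify_10x_reads is a hypothetical helper, not the code SplitSRA uses internally, and the example lengths are made up.

```r
# Sketch of the read-length rule; `classify_10x_reads` is a hypothetical helper.
classify_10x_reads <- function(read.lengths) {
  # Anything that is neither read1 nor an index file is treated as read2
  role <- rep("read2", length(read.lengths))
  role[read.lengths %in% c(26, 28)] <- "read1"
  role[read.lengths %in% c(8, 10)] <- "index"
  names(role) <- names(read.lengths)
  role
}

# Made-up read lengths for the three files of one run
lens <- c(file_1 = 8, file_2 = 26, file_3 = 98)
classify_10x_reads(lens)
```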

The returned value is a vector of failed sra files. If not NULL, users can re-run SplitSRA by setting sra.path to the returned value.

# parallel-fastq-dump requires sratools.path
GSE130636.split <- SplitSRA(
  sra.folder = "/Volumes/soyabean/GEfetch2R/download_fastq",
  fastq.type = "10x",
  split.cmd.path = "/Applications/anaconda3/bin/parallel-fastq-dump",
  sratools.path = "/usr/local/bin", split.cmd.paras = "--gzip",
  split.cmd.threads = 4
)

The final out.folder structure will be:

tree /Volumes/soyabean/GEfetch2R/download_fastq
#> /Volumes/soyabean/GEfetch2R/download_fastq
#> └── GSM3745993
#>     ├── SRR9004346
#>     │   ├── SRR9004346.sra
#>     │   ├── SRR9004346_1.fastq.gz
#>     │   ├── SRR9004346_2.fastq.gz
#>     │   ├── SRR9004346_S1_L001_R1_001.fastq.gz
#>     │   └── SRR9004346_S1_L001_R2_001.fastq.gz
#>     └── SRR9004351
#>         ├── SRR9004351.sra
#>         ├── SRR9004351_1.fastq.gz
#>         ├── SRR9004351_2.fastq.gz
#>         ├── SRR9004351_S1_L001_R1_001.fastq.gz
#>         └── SRR9004351_S1_L001_R2_001.fastq.gz
#>
#> 4 directories, 10 files

Download fastq directly from ENA

Alternatively, GEfetch2R provides the function DownloadFastq to download fastq files directly from ENA (parallel, and faster than the approach above). The returned value is a dataframe containing the failed runs. If it is not NULL, users can re-run DownloadFastq by setting gsm.df to the returned value.

# use download.file
GSE130636.down.fastq <- DownloadFastq(
  gsm.df = GSE130636.runs, out.folder = "/path/to/download_fastq",
  download.method = "download.file",
  parallel = TRUE, use.cores = 2
)

# use ascp
GSE130636.down.fastq <- DownloadFastq(
  gsm.df = GSE130636.runs, out.folder = "/path/to/download_fastq",
  download.method = "ascp", ascp.path = "~/.aspera/connect/bin/ascp", max.rate = "300m",
  parallel = TRUE, use.cores = 2
)

Download bam

Extract all samples (runs)

GEfetch2R can extract sample information and run numbers from GEO accessions. Alternatively, users can provide a dataframe containing the run numbers of the samples of interest.

GSE138266.runs <- ExtractRun(acce = "GSE138266", platform = "GPL18573")

Show the sample information:

head(GSE138266.runs)
#>                     run experiment   gsm_name        title geo_accession
#> SRR10211566 SRR10211566 SRX6931254 GSM4104137 MS60249_PBMC    GSM4104137
#>                            status submission_date last_update_date type
#> SRR10211566 Public on Dec 10 2019     Oct 01 2019      Dec 10 2019  SRA
#>             channel_count                   source_name_ch1 organism_ch1
#> SRR10211566             1 peripheral blood mononuclear cell Homo sapiens
#>             taxid_ch1                   characteristics_ch1 molecule_ch1
#> SRR10211566      9606 disease condition: Multiple Sclerosis    total RNA
#>                                                                                                                                                                                                                                                                                           extract_protocol_ch1
#> SRR10211566 Patient Inclusion Criterior: 1) treatment naive patients with a first episode suggestive of MS (i.e. clinically isolated syndrome (CIS)) or with relapsing-remitting (RR)MS diagnosed based on MAGNIMS criteria86, 2) patients receiving LP for diagnostic purposes and consenting to participate.
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 extract_protocol_ch1.1
#> SRR10211566 Patient Exclusion Criterior: 1) questionable diagnosis of MS by clinical signs or magnetic resonance imaging (MRI) findings, 2) secondary chronic progressive MS or primary progressive MS. IIH patients were included, if they gave informed consent. Exclusion criteria for all patients were: 1) immunologically relevant co-morbidities (e.g. rheumatologic diseases), 2) severe concomitant infectious diseases (e.g. HIV, meningitis, encephalitis), 3) pregnancy or breastfeeding, 4) younger than 18 years, 5) mental illness impairing the ability to give informed consent, 6) artificial blood contamination during the lumbar puncture resulting in >200 red blood cells / μl.
#>                                                                                  extract_protocol_ch1.2
#> SRR10211566 Chromium Single Cell Controller using the Chromium Single Cell 3' Library & Gel Bead Kit v2
#>                     extract_protocol_ch1.3
#> SRR10211566 AMPure beads (Beckman Coulter)
#>                                                                        extract_protocol_ch1.4
#> SRR10211566  Illumina Nextseq 500 using the High-Out 75 cycle kit with a 26-8-0-57 read setup
#>              description
#> SRR10211566 MS60249_PBMC
#>                                                               data_processing
#> SRR10211566 Raw bcl files were de-multiplexed using cellranger v2.0.2 mkfastq
#>                                                                                                                                     data_processing.1
#> SRR10211566 Subsequent read alignments and transcript counting was done individually for each sample using cellranger count with standard parameters.
#>                                                                                                              data_processing.2
#> SRR10211566 Cellranger aggr was employed, to ensure that all samples had the same number of confidently mapped reads per cell.
#>                data_processing.3
#> SRR10211566 Genome_build: GRCh38
#>                                                                     data_processing.4
#> SRR10211566 Supplementary_files_format_and_content: Standard cellranger output format
#>             platform_id contact_name                 contact_email
#> SRR10211566    GPL18573 Chenling,,Xu chenlingantelope@berkeley.edu
#>             contact_laboratory    contact_department contact_institute
#> SRR10211566          Yosef Lab Computational Biology       UC Berkeley
#>             contact_address contact_city contact_state contact_zip/postal_code
#> SRR10211566    Stanley Hall     Berkeley            CA                   94704
#>             contact_country     instrument_model library_selection
#> SRR10211566             USA Illumina NextSeq 500              cDNA
#>             library_source library_strategy
#> SRR10211566 TRANSCRIPTOMIC          RNA-Seq
#>                                                                   relation
#> SRR10211566 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN12880976
#>                                                        relation.1
#> SRR10211566 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX6931254
#>                                                                                                           supplementary_file_1
#> SRR10211566 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4104nnn/GSM4104137/suppl/GSM4104137_MS60249_PBMCs_GRCh38_barcodes.tsv.gz
#>                                                                                                        supplementary_file_2
#> SRR10211566 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4104nnn/GSM4104137/suppl/GSM4104137_MS60249_PBMCs_GRCh38_genes.tsv.gz
#>                                                                                                         supplementary_file_3
#> SRR10211566 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4104nnn/GSM4104137/suppl/GSM4104137_MS60249_PBMCs_GRCh38_matrix.mtx.gz
#>             series_id data_row_count library_layout taxon_id
#> SRR10211566 GSE138266              0         PAIRED     9606
#>                            ebi_dir               ncbi_dir
#> SRR10211566 SRR102/066/SRR10211566 SRR102/066/SRR10211566

Download bam from SRA

Given a dataframe containing gsm and run numbers, GEfetch2R provides DownloadBam to download bam files using prefetch. It supports both 10x-generated bam files and normal bam files.

10x generated bam: bam files generated by 10x software (e.g. CellRanger) contain custom tags that are not kept with the default parameters of prefetch, so GEfetch2R adds --type TenX to make sure the downloaded bam files contain these tags.
normal bam: For normal bam files, DownloadBam downloads the sra files first and then converts them to bam files with sam-dump. In a test comparing prefetch + sam-dump against sam-dump alone, the former was much faster (52G sra and 72G bam files):

# # use prefetch to download sra file
# prefetch -X 60G SRR1976036
# # real    117m26.334s
# # user    16m42.062s
# # sys 3m28.295s

# # use sam-dump to convert sra to bam
# time (sam-dump SRR1976036.sra | samtools view -bS - -o SRR1976036.bam)
# # real    536m2.721s
# # user    749m41.421s
# # sys 20m49.069s


# use sam-dump to download bam directly
# time (sam-dump SRR1976036 | samtools view -bS - -o SRR1976036.bam)
# # more than 36hrs only get ~3G bam files, too slow
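
To put the timings above on one scale, the wall-clock ("real") times of the two-step route can be summed and converted to hours. The sketch below does only that arithmetic on the numbers quoted in the comments; to_seconds is a hypothetical helper name.

```r
# Sketch: sum the "real" times reported above for prefetch + sam-dump.
# `to_seconds` is a hypothetical helper that parses strings such as "117m26.334s".
to_seconds <- function(x) {
  parts <- regmatches(x, regexec("^([0-9]+)m([0-9.]+)s$", x))[[1]]
  as.numeric(parts[2]) * 60 + as.numeric(parts[3])
}

prefetch.real <- to_seconds("117m26.334s") # prefetch download
samdump.real <- to_seconds("536m2.721s")   # sam-dump | samtools conversion
total.hours <- (prefetch.real + samdump.real) / 3600
round(total.hours, 1)
```

The two steps together take roughly 10.9 hours for the complete bam file, whereas the direct sam-dump route had produced only about 3G of bam output after more than 36 hours.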

The returned value is a dataframe containing the failed runs (for normal bam, runs that failed either to download sra files or to convert them to bam files; for 10x-generated bam, runs that failed to download bam files). If it is not NULL, users can re-run DownloadBam by setting gsm.df to the returned value. The following is an example of downloading a 10x-generated bam file:

# a small test
GSE138266.runs <- GSE138266.runs[GSE138266.runs$run %in% c("SRR10211566"), ]
# download
GSE138266.down <- DownloadBam(
  gsm.df = GSE138266.runs,
  prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
  out.folder = "/Volumes/soyabean/GEfetch2R/download_bam"
)
# GSE138266.down is NULL or a dataframe containing the failed runs

The out.folder structure will be: gsm_number/run_number.

Download bam from ENA

The returned value is a dataframe containing the failed runs. If it is not NULL, users can re-run DownloadBam by setting gsm.df to the returned value. The following is an example of downloading a 10x-generated bam file from ENA:

# download.file
GSE138266.down <- DownloadBam(
  gsm.df = GSE138266.runs, download.method = "download.file",
  timeout = 3600, out.folder = "/path/to/download_bam",
  parallel = TRUE, use.cores = 2
)
# ascp
GSE138266.down <- DownloadBam(
  gsm.df = GSE138266.runs, download.method = "ascp",
  ascp.path = "~/.aspera/connect/bin/ascp", max.rate = "300m",
  rename = TRUE, out.folder = "/path/to/download_bam",
  parallel = TRUE, use.cores = 2
)

Convert bam to fastq

With the downloaded bam files, GEfetch2R provides the function Bam2Fastq to convert them to fastq files. For bam files generated by 10x software, Bam2Fastq uses the bamtofastq tool developed by 10x Genomics; otherwise, it uses samtools.

The returned value is a vector of bam files that failed to convert to fastq files. If it is not NULL, users can re-run Bam2Fastq by setting bam.path to the returned value.

GSE138266.convert <- Bam2Fastq(
  bam.folder = "/Volumes/soyabean/GEfetch2R/download_bam", bam.type = "10x",
  bamtofastq.path = "/Users/soyabean/software/bamtofastq_macos",
  bamtofastq.paras = "--nthreads 4"
)

The final out.folder structure will be:

tree /Volumes/soyabean/GEfetch2R/download_bam
#> /Volumes/soyabean/GEfetch2R/download_bam
#> └── GSM4104137
#>     └── SRR10211566
#>         ├── bam2fastq
#>         │   └── MS60249_PBMC_2_0_MissingLibrary_1_H72VGBGX2
#>         │       ├── bamtofastq_S1_L001_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L001_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R2_001.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R2_002.fastq.gz
#>         │       ├── bamtofastq_S1_L002_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L002_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R2_001.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R2_002.fastq.gz
#>         │       ├── bamtofastq_S1_L003_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L003_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R2_001.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R2_002.fastq.gz
#>         │       ├── bamtofastq_S1_L004_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L004_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L004_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L004_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L004_R2_001.fastq.gz
#>         │       └── bamtofastq_S1_L004_R2_002.fastq.gz
#>         └── bamfiles_MS60249_PBMC_possorted_genome_bam.bam
#>
#> 5 directories, 25 files
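
One way to check that a conversion is complete is to tabulate the fastq files by lane and read type. The sketch below parses file names of the form shown in the tree above; the names are regenerated here so the example runs on its own, whereas in practice they would come from something like list.files(..., pattern = "\\.fastq\\.gz$", recursive = TRUE).

```r
# Sketch: summarise bamtofastq output by lane and read type.
# File names are regenerated from the pattern in the tree listing above.
grid <- expand.grid(
  chunk = c("001", "002"),
  read = c("I1", "R1", "R2"),
  lane = sprintf("L%03d", 1:4),
  stringsAsFactors = FALSE
)
fq <- sprintf("bamtofastq_S1_%s_%s_%s.fastq.gz", grid$lane, grid$read, grid$chunk)

# Capture lane, read type and chunk number from each file name
pattern <- "_S[0-9]+_(L[0-9]{3})_([IR][0-9])_([0-9]{3})\\.fastq\\.gz$"
parts <- do.call(rbind, regmatches(fq, regexec(pattern, fq)))
info <- data.frame(
  lane = parts[, 2], read = parts[, 3], chunk = parts[, 4],
  stringsAsFactors = FALSE
)
lane.by.read <- table(info$lane, info$read)
lane.by.read
```

Each lane should show the same number of I1, R1 and R2 files; an uneven table suggests a partially written output folder.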

Load fastq to R

With the downloaded/converted fastq files, GEfetch2R provides the function Fastq2R to align them to a reference genome with CellRanger (10x-generated fastq files) or STAR (Smart-seq2 or bulk RNA-seq data), and to load the output into Seurat (10x-generated fastq files) or DESeq2 (Smart-seq2 or bulk RNA-seq data).

Here, we use the downloaded fastq files as an example. There are two runs (SRR9004346 and SRR9004351) corresponding to sample name GSM3745993. When running CellRanger, we will process SRR9004346 and SRR9004351 as a single merged sample by specifying --sample=SRR9004346,SRR9004351:

# run CellRanger (10x Genomics)
seu <- Fastq2R(
  sample.dir = "/Volumes/soyabean/GEfetch2R/download_fastq",
  ref = "/path/to/10x/ref",
  method = "CellRanger",
  out.folder = "/path/to/results",
  st.path = "/path/to/cellranger",
  st.paras = "--chemistry=auto --jobmode=local"
)
# run STAR (Smart-seq2 or bulk RNA-seq)
deobj <- Fastq2R(
  sample.dir = "/path/to/fastq",
  ref = "/path/to/star/ref",
  method = "STAR",
  out.folder = "/path/to/results",
  st.path = "/path/to/STAR",
  st.paras = "--outBAMsortingThreadN 4 --twopassMode None"
)

Since fastq files from Smart-seq2 or bulk RNA-seq data are usually contained in a single run, sample.dir should point to the parent directory of the run (sample.dir = "/Volumes/soyabean/GEfetch2R/download_fastq/GSM3745993" in the above example).
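
For the CellRanger case, the merged-sample string mentioned above (--sample=SRR9004346,SRR9004351) is simply the runs of one GSM joined by commas. The sketch below shows how such a string can be derived from the run and gsm_name columns; it is an illustration only and makes no claim about how Fastq2R assembles the argument internally.

```r
# Sketch: build one comma-separated run list per GSM for CellRanger's --sample.
# The toy dataframe mimics the `run` and `gsm_name` columns shown earlier.
runs <- data.frame(
  run = c("SRR9004346", "SRR9004351"),
  gsm_name = c("GSM3745993", "GSM3745993"),
  stringsAsFactors = FALSE
)

# Group runs by GSM, then join each group with commas
sample.arg <- vapply(
  split(runs$run, runs$gsm_name), paste, character(1), collapse = ","
)
sample.flag <- paste0("--sample=", sample.arg)
sample.flag
```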