Introduction
A common need is to process several datasets with the same software
version (e.g. CellRanger) to obtain the count matrices, so that the
datasets can be integrated and compared more reliably. Here, we use
GEfetch2R to download raw data (sra/fastq/bam). For bam files,
GEfetch2R also provides a function to convert bam files to fastq files.
GEfetch2R supports downloading raw data (sra/fastq/bam) from SRA and ENA with GEO accessions. In general, downloading raw data from ENA is much faster than from SRA because of ascp and parallel support.
Download sra
Extract all samples (runs)
For fastq files stored in SRA/ENA,
GEfetch2R can extract sample information and run numbers
from GEO accessions. Alternatively, users can provide a
dataframe containing the run numbers of the samples of interest.
Extract all samples under GSE130636 that were sequenced on platform
GPL20301 (use platform = NULL for all
platforms):
# library
library(GEfetch2R)
GSE130636.runs <- ExtractRun(acce = "GSE130636", platform = "GPL20301")
# a small test
GSE130636.runs <- GSE130636.runs[GSE130636.runs$run %in% c("SRR9004346", "SRR9004351"), ]
Show the sample information:
head(GSE130636.runs)
#> run experiment gsm_name title geo_accession
#> SRR9004346 SRR9004346 SRX5783052 GSM3745993 Fovea donor 2 GSM3745993
#> SRR9004351 SRR9004351 SRX5783052 GSM3745993 Fovea donor 2 GSM3745993
#> status submission_date last_update_date type
#> SRR9004346 Public on May 17 2019 May 02 2019 Dec 20 2019 SRA
#> SRR9004351 Public on May 17 2019 May 02 2019 Dec 20 2019 SRA
#> channel_count source_name_ch1 organism_ch1 taxid_ch1
#> SRR9004346 1 Retina Homo sapiens 9606
#> SRR9004351 1 Retina Homo sapiens 9606
#> characteristics_ch1 characteristics_ch1.1 molecule_ch1
#> SRR9004346 location: Fovea donor: Donor 2 total RNA
#> SRR9004351 location: Fovea donor: Donor 2 total RNA
#> extract_protocol_ch1
#> SRR9004346 A 2-mm foveal centered and a 4-mm peripheral punch from the inferotemporal region were acquired from three clinically normal human donors. Tissue was dissociated using the Papain Dissociation System (Worthington Biochemical Corporation, Lakewood NJ). Dissociated cells were resuspended in DMSO-based Recovery Cell Culture Freezing Media (Life Technologies Corporation, Grand Island NY). Suspensions were placed in a Cryo-Safe cooler (CryoSafe, Summerville SC) to cool at 1°C/minute in a -80°C freezer for 3-8 hours before storage in liquid nitrogen.
#> SRR9004351 A 2-mm foveal centered and a 4-mm peripheral punch from the inferotemporal region were acquired from three clinically normal human donors. Tissue was dissociated using the Papain Dissociation System (Worthington Biochemical Corporation, Lakewood NJ). Dissociated cells were resuspended in DMSO-based Recovery Cell Culture Freezing Media (Life Technologies Corporation, Grand Island NY). Suspensions were placed in a Cryo-Safe cooler (CryoSafe, Summerville SC) to cool at 1°C/minute in a -80°C freezer for 3-8 hours before storage in liquid nitrogen.
#> extract_protocol_ch1.1
#> SRR9004346 Single-cell RNA libraries were prepared for sequencing using standard 10X genomics protocols. Briefly, cryopreserved samples were thawed, and single cells were captured and barcoded using the Chromium System with the v3 single cell-reagent kit (10x Genomics, Pleasanton CA). Sequencing was performed on pooled libraries using the Illumina HiSeq 4000 platform (San Diego, CA), generating 150 base pair paired-end reads.
#> SRR9004351 Single-cell RNA libraries were prepared for sequencing using standard 10X genomics protocols. Briefly, cryopreserved samples were thawed, and single cells were captured and barcoded using the Chromium System with the v3 single cell-reagent kit (10x Genomics, Pleasanton CA). Sequencing was performed on pooled libraries using the Illumina HiSeq 4000 platform (San Diego, CA), generating 150 base pair paired-end reads.
#> data_processing
#> SRR9004346 FASTQ files were generated from the raw BCL files using Illumina’s bcl2fastq conversion program.
#> SRR9004351 FASTQ files were generated from the raw BCL files using Illumina’s bcl2fastq conversion program.
#> data_processing.1
#> SRR9004346 Sequenced reads were mapped to the CellRanger human genome build hg19 (v3.0.0) with CellRanger (v3.0.1) using the CellRanger default human GTF and the following parameter: --expect-cells=8000.
#> SRR9004351 Sequenced reads were mapped to the CellRanger human genome build hg19 (v3.0.0) with CellRanger (v3.0.1) using the CellRanger default human GTF and the following parameter: --expect-cells=8000.
#> data_processing.2
#> SRR9004346 The six samples were collectively aggregated with the cellranger aggr function with the following parameter: --normalized=mapped.
#> SRR9004351 The six samples were collectively aggregated with the cellranger aggr function with the following parameter: --normalized=mapped.
#> data_processing = Cells were filtered with Seurat (v2.3.4) FilterCells function. Cells with nUMIs less than 200 (to remove cells with poor read quality) or greater than 2500 (to remove cells likely to be doublets) were removed. Cells with greater tha ...
#> SRR9004346 TRUE).
#> SRR9004351 TRUE).
#> data_processing = Aggregated reads were normalized with Seurat (v2.3.4) with the following command: NormalizeData(object = seurat_object, normalization.method = "LogNormalize", scale.factor = 10000). Variable genes were identified from downstream no ...
#> SRR9004346 c("nUMI", "percent.mito")).
#> SRR9004351 c("nUMI", "percent.mito")).
#> data_processing.3
#> SRR9004346 Clustering was performed with Seurat (v2.3.4) FindClusters. In order to generate the shared nearest neighbor (SNN) graph, the principal component analysis reduction technique was utilized for the first 10 principal components. A granularity resolution value of 0.6 was used to discriminate clusters.
#> SRR9004351 Clustering was performed with Seurat (v2.3.4) FindClusters. In order to generate the shared nearest neighbor (SNN) graph, the principal component analysis reduction technique was utilized for the first 10 principal components. A granularity resolution value of 0.6 was used to discriminate clusters.
#> data_processing.4
#> SRR9004346 Genome_build: hg19
#> SRR9004351 Genome_build: hg19
#> data_processing.5
#> SRR9004346 Supplementary_files_format_and_content: Processed expression data matrix files are provided in tab-delimited format. Log-normalized expression values (from the seurat_object@data slot) were appended to relevant metadata (barcode and cluster label from the manuscript). Each row represents a unique cell, and columns correspond to metadata and log normalized gene expression values.
#> SRR9004351 Supplementary_files_format_and_content: Processed expression data matrix files are provided in tab-delimited format. Log-normalized expression values (from the seurat_object@data slot) were appended to relevant metadata (barcode and cluster label from the manuscript). Each row represents a unique cell, and columns correspond to metadata and log normalized gene expression values.
#> platform_id contact_name contact_email contact_institute
#> SRR9004346 GPL20301 Todd,,Scheetz todd-scheetz@uiowa.edu UNIVERSITY OF IOWA
#> SRR9004351 GPL20301 Todd,,Scheetz todd-scheetz@uiowa.edu UNIVERSITY OF IOWA
#> contact_address contact_city contact_state contact_zip/postal_code
#> SRR9004346 3181B MERF Iowa City IA 52242
#> SRR9004351 3181B MERF Iowa City IA 52242
#> contact_country instrument_model library_selection library_source
#> SRR9004346 USA Illumina HiSeq 4000 cDNA TRANSCRIPTOMIC
#> SRR9004351 USA Illumina HiSeq 4000 cDNA TRANSCRIPTOMIC
#> library_strategy relation
#> SRR9004346 RNA-Seq Reanalyzed by: GSE142449
#> SRR9004351 RNA-Seq Reanalyzed by: GSE142449
#> relation.1
#> SRR9004346 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN11566805
#> SRR9004351 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN11566805
#> relation.2
#> SRR9004346 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX5783052
#> SRR9004351 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX5783052
#> supplementary_file_1
#> SRR9004346 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3745nnn/GSM3745993/suppl/GSM3745993_fovea_donor_2_expression.tsv.gz
#> SRR9004351 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3745nnn/GSM3745993/suppl/GSM3745993_fovea_donor_2_expression.tsv.gz
#> series_id data_row_count library_layout taxon_id
#> SRR9004346 GSE130636 0 PAIRED 9606
#> SRR9004351 GSE130636 0 PAIRED 9606
#> ebi_dir ncbi_dir
#> SRR9004346 SRR900/006/SRR9004346 SRR900/006/SRR9004346
#> SRR9004351 SRR900/001/SRR9004351 SRR900/001/SRR9004351
Download sra
Given a dataframe containing gsm and run numbers,
GEfetch2R uses prefetch to download
sra files from SRA, or
ascp/download.file to download sra files from
ENA. The returned value is a dataframe containing the failed
runs. If it is not NULL, users can re-run
DownloadSRA by setting gsm.df to the returned
value.
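The re-run pattern described above can be wrapped in a small loop. The following is only a sketch: retry_failed and mock_download are hypothetical names that are not part of GEfetch2R, and the mock stands in for DownloadSRA so that the loop can be run without any network access. The only behaviour taken from the text is that the download function accepts gsm.df and returns NULL or a dataframe of failed runs.

```r
# Hypothetical helper (not part of GEfetch2R): re-run a download function on
# the runs it reports as failed, up to `max.tries` times.
# `download_fun` must accept `gsm.df` and return NULL or a data frame of failed runs.
retry_failed <- function(download_fun, gsm.df, max.tries = 3, ...) {
  failed <- gsm.df
  for (i in seq_len(max.tries)) {
    failed <- download_fun(gsm.df = failed, ...)
    # stop as soon as nothing is left to retry
    if (is.null(failed) || nrow(failed) == 0) {
      return(NULL)
    }
    message("Attempt ", i, ": ", nrow(failed), " run(s) failed")
  }
  failed
}

# Stand-in for DownloadSRA, used only to exercise the loop: each call
# "downloads" the first remaining run and reports the rest as failed.
mock_download <- function(gsm.df, ...) {
  if (nrow(gsm.df) <= 1) {
    return(NULL)
  }
  gsm.df[-1, , drop = FALSE]
}

runs <- data.frame(
  gsm_name = "GSM3745993",
  run = c("SRR9004346", "SRR9004351"),
  stringsAsFactors = FALSE
)
still.failed <- retry_failed(mock_download, runs, max.tries = 3)
```

Passing DownloadSRA as download_fun together with its usual arguments (e.g. prefetch.path and out.folder) should work the same way, provided it keeps returning NULL or a dataframe of failed runs as described above.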
Download from SRA:
# download
GSE130636.down <- DownloadSRA(
gsm.df = GSE130636.runs,
prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
out.folder = "/Volumes/soyabean/GEfetch2R/download_fastq"
)
# GSE130636.down is NULL or a dataframe containing the failed runs
Download from ENA (parallel):
out.folder <- tempdir()
# download from ENA using download.file
GSE130636.down <- DownloadSRA(
gsm.df = GSE130636.runs, download.method = "download.file",
timeout = 3600, out.folder = "/path/to/download_fastq",
parallel = TRUE, use.cores = 2
)
# download from ENA using ascp
GSE130636.down <- DownloadSRA(
gsm.df = GSE130636.runs, download.method = "ascp",
ascp.path = "~/.aspera/connect/bin/ascp", max.rate = "300m",
rename = TRUE, out.folder = "/path/to/download_fastq",
parallel = TRUE, use.cores = 2
)
# GSE130636.down is NULL or a dataframe containing the failed runs
The out.folder structure will be:
gsm_number/run_number.
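Given this layout, one way to see which runs still lack an sra file is to build the expected paths from the run table and test them with file.exists. This is a sketch under assumptions: expected_sra is a hypothetical helper, the gsm_name and run columns are those shown in the ExtractRun output above, and the file name run_number.sra is taken from the tree listing later in this article. The demonstration folder is created under tempdir().

```r
# Build the expected location of each sra file from the run table.
# Assumes the `gsm_name` and `run` columns shown in the ExtractRun output.
expected_sra <- function(gsm.df, out.folder) {
  file.path(out.folder, gsm.df$gsm_name, gsm.df$run, paste0(gsm.df$run, ".sra"))
}

runs <- data.frame(
  gsm_name = c("GSM3745993", "GSM3745993"),
  run = c("SRR9004346", "SRR9004351"),
  stringsAsFactors = FALSE
)

# Demonstration in a temporary folder: create one of the two files, then check.
out.folder <- file.path(tempdir(), "download_fastq_demo")
paths <- expected_sra(runs, out.folder)
dir.create(dirname(paths[1]), recursive = TRUE, showWarnings = FALSE)
file.create(paths[1])

# Runs whose sra file is not present yet
missing.runs <- runs$run[!file.exists(paths)]
missing.runs
```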
Download fastq
Split sra to generate fastq
After obtaining the sra files, GEfetch2R
provides the function SplitSRA to split sra files
into fastq files using parallel-fastq-dump (parallel,
fastest, gzip output), fasterq-dump
(parallel, fast, but uncompressed output) or
fastq-dump (slowest, gzip output).
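Since the three tools differ mainly in speed, a script may want to pick the fastest one that is actually installed. The sketch below does this with base R's Sys.which. pick_split_tool and fake_lookup are hypothetical names introduced for illustration, and the lookup argument exists only so the function can be tried without the SRA Toolkit installed.

```r
# Return the first tool (in order of preference) whose executable is found.
# `lookup` defaults to base R's Sys.which(), which returns "" for tools
# that are not on PATH.
pick_split_tool <- function(tools = c("parallel-fastq-dump", "fasterq-dump", "fastq-dump"),
                            lookup = Sys.which) {
  paths <- lookup(tools)
  found <- paths[nzchar(paths)]
  if (length(found) == 0) {
    stop("None of these tools were found on PATH: ", paste(tools, collapse = ", "))
  }
  found[1]
}

# Example with a fake lookup in which only fasterq-dump is "installed"
fake_lookup <- function(tools) {
  setNames(ifelse(tools == "fasterq-dump", "/usr/local/bin/fasterq-dump", ""), tools)
}
tool <- pick_split_tool(lookup = fake_lookup)
tool
```

The resulting path is the kind of value that would be passed to SplitSRA as split.cmd.path.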
For fastq files generated with 10x Genomics, SplitSRA can
identify the read1, read2 and index files and rename read1 and read2 to
the format 10x requires (sample1_S1_L001_R1_001.fastq.gz and
sample1_S1_L001_R2_001.fastq.gz). In detail, the file with
read length 26 or 28 is considered read1, files with read length
8 or 10 are considered index files, and the remaining file is considered
read2. The read length rules come from Sequencing
Requirements for Single Cell 3’ and Sequencing
Requirements for Single Cell V(D)J.
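The read length rule above can be written down in a few lines. This is an illustrative sketch, not the SplitSRA implementation: classify_10x_reads is a hypothetical function, the file names and read lengths are made up, and any length other than 26, 28, 8 or 10 is simply treated as read2, which matches the rule only when exactly one such file remains.

```r
# Classify the fastq files of one 10x run by read length:
# 26 or 28 -> read1, 8 or 10 -> index, anything else -> read2.
classify_10x_reads <- function(read.lengths) {
  role <- ifelse(read.lengths %in% c(26, 28), "R1",
    ifelse(read.lengths %in% c(8, 10), "index", "R2")
  )
  setNames(role, names(read.lengths))
}

# Hypothetical read lengths for the three files of a run
lens <- c(
  "SRR0000001_1.fastq.gz" = 8,
  "SRR0000001_2.fastq.gz" = 28,
  "SRR0000001_3.fastq.gz" = 91
)
roles <- classify_10x_reads(lens)
roles

# Target names in the 10x format for read1 and read2 (index files are left alone)
keep <- roles %in% c("R1", "R2")
new.names <- paste0("SRR0000001", "_S1_L001_", roles[keep], "_001.fastq.gz")
new.names
```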
The returned value is a vector of failed sra files. If it is
not NULL, users can re-run SplitSRA by setting
sra.path to the returned value.
# parallel-fastq-dump requires sratools.path
GSE130636.split <- SplitSRA(
sra.folder = "/Volumes/soyabean/GEfetch2R/download_fastq",
fastq.type = "10x",
split.cmd.path = "/Applications/anaconda3/bin/parallel-fastq-dump",
sratools.path = "/usr/local/bin", split.cmd.paras = "--gzip",
split.cmd.threads = 4
)
The final out.folder structure will be:
tree /Volumes/soyabean/GEfetch2R/download_fastq
#> /Volumes/soyabean/GEfetch2R/download_fastq
#> └── GSM3745993
#>     ├── SRR9004346
#>     │   ├── SRR9004346.sra
#>     │   ├── SRR9004346_1.fastq.gz
#>     │   ├── SRR9004346_2.fastq.gz
#>     │   ├── SRR9004346_S1_L001_R1_001.fastq.gz
#>     │   └── SRR9004346_S1_L001_R2_001.fastq.gz
#>     └── SRR9004351
#>         ├── SRR9004351.sra
#>         ├── SRR9004351_1.fastq.gz
#>         ├── SRR9004351_2.fastq.gz
#>         ├── SRR9004351_S1_L001_R1_001.fastq.gz
#>         └── SRR9004351_S1_L001_R2_001.fastq.gz
#>
#> 4 directories, 10 files
Download fastq directly from ENA
Alternatively, GEfetch2R provides the function
DownloadFastq to download fastq files directly
from ENA (parallel, faster than the above
method). The returned value is a dataframe containing the failed
runs. If it is not NULL, users can re-run
DownloadFastq by setting gsm.df to the
returned value.
# use download.file
GSE130636.down.fastq <- DownloadFastq(
gsm.df = GSE130636.runs, out.folder = "/path/to/download_fastq",
download.method = "download.file",
parallel = TRUE, use.cores = 2
)
# use ascp
GSE130636.down.fastq <- DownloadFastq(
gsm.df = GSE130636.runs, out.folder = "/path/to/download_fastq",
download.method = "ascp", ascp.path = "~/.aspera/connect/bin/ascp", max.rate = "300m",
parallel = TRUE, use.cores = 2
)
Download bam
Extract all samples (runs)
GEfetch2R can extract sample information and run numbers
from GEO accessions. Alternatively, users can provide a
dataframe containing the run numbers of the samples of interest.
GSE138266.runs <- ExtractRun(acce = "GSE138266", platform = "GPL18573")
Show the sample information:
head(GSE138266.runs)
#> run experiment gsm_name title geo_accession
#> SRR10211566 SRR10211566 SRX6931254 GSM4104137 MS60249_PBMC GSM4104137
#> status submission_date last_update_date type
#> SRR10211566 Public on Dec 10 2019 Oct 01 2019 Dec 10 2019 SRA
#> channel_count source_name_ch1 organism_ch1
#> SRR10211566 1 peripheral blood mononuclear cell Homo sapiens
#> taxid_ch1 characteristics_ch1 molecule_ch1
#> SRR10211566 9606 disease condition: Multiple Sclerosis total RNA
#> extract_protocol_ch1
#> SRR10211566 Patient Inclusion Criterior: 1) treatment naive patients with a first episode suggestive of MS (i.e. clinically isolated syndrome (CIS)) or with relapsing-remitting (RR)MS diagnosed based on MAGNIMS criteria86, 2) patients receiving LP for diagnostic purposes and consenting to participate.
#> extract_protocol_ch1.1
#> SRR10211566 Patient Exclusion Criterior: 1) questionable diagnosis of MS by clinical signs or magnetic resonance imaging (MRI) findings, 2) secondary chronic progressive MS or primary progressive MS. IIH patients were included, if they gave informed consent. Exclusion criteria for all patients were: 1) immunologically relevant co-morbidities (e.g. rheumatologic diseases), 2) severe concomitant infectious diseases (e.g. HIV, meningitis, encephalitis), 3) pregnancy or breastfeeding, 4) younger than 18 years, 5) mental illness impairing the ability to give informed consent, 6) artificial blood contamination during the lumbar puncture resulting in >200 red blood cells / μl.
#> extract_protocol_ch1.2
#> SRR10211566 Chromium Single Cell Controller using the Chromium Single Cell 3' Library & Gel Bead Kit v2
#> extract_protocol_ch1.3
#> SRR10211566 AMPure beads (Beckman Coulter)
#> extract_protocol_ch1.4
#> SRR10211566 Illumina Nextseq 500 using the High-Out 75 cycle kit with a 26-8-0-57 read setup
#> description
#> SRR10211566 MS60249_PBMC
#> data_processing
#> SRR10211566 Raw bcl files were de-multiplexed using cellranger v2.0.2 mkfastq
#> data_processing.1
#> SRR10211566 Subsequent read alignments and transcript counting was done individually for each sample using cellranger count with standard parameters.
#> data_processing.2
#> SRR10211566 Cellranger aggr was employed, to ensure that all samples had the same number of confidently mapped reads per cell.
#> data_processing.3
#> SRR10211566 Genome_build: GRCh38
#> data_processing.4
#> SRR10211566 Supplementary_files_format_and_content: Standard cellranger output format
#> platform_id contact_name contact_email
#> SRR10211566 GPL18573 Chenling,,Xu chenlingantelope@berkeley.edu
#> contact_laboratory contact_department contact_institute
#> SRR10211566 Yosef Lab Computational Biology UC Berkeley
#> contact_address contact_city contact_state contact_zip/postal_code
#> SRR10211566 Stanley Hall Berkeley CA 94704
#> contact_country instrument_model library_selection
#> SRR10211566 USA Illumina NextSeq 500 cDNA
#> library_source library_strategy
#> SRR10211566 TRANSCRIPTOMIC RNA-Seq
#> relation
#> SRR10211566 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN12880976
#> relation.1
#> SRR10211566 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX6931254
#> supplementary_file_1
#> SRR10211566 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4104nnn/GSM4104137/suppl/GSM4104137_MS60249_PBMCs_GRCh38_barcodes.tsv.gz
#> supplementary_file_2
#> SRR10211566 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4104nnn/GSM4104137/suppl/GSM4104137_MS60249_PBMCs_GRCh38_genes.tsv.gz
#> supplementary_file_3
#> SRR10211566 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4104nnn/GSM4104137/suppl/GSM4104137_MS60249_PBMCs_GRCh38_matrix.mtx.gz
#> series_id data_row_count library_layout taxon_id
#> SRR10211566 GSE138266 0 PAIRED 9606
#> ebi_dir ncbi_dir
#> SRR10211566 SRR102/066/SRR10211566 SRR102/066/SRR10211566
Download bam from SRA
Given a dataframe containing gsm and run numbers,
GEfetch2R provides DownloadBam to download
bam files using prefetch. It supports 10x
generated bam files and normal bam files.
- 10x generated bam: bam files generated by 10x software (e.g. CellRanger) contain custom tags that are not kept when using the default parameters of prefetch, so GEfetch2R adds --type TenX to make sure the downloaded bam files contain these tags.
- normal bam: For normal bam files, DownloadBam downloads the sra files first and then converts them to bam files with sam-dump. In a test of prefetch + sam-dump against sam-dump alone (52G sra and 72G bam files), the former was much faster than the latter:
# # use prefetch to download sra file
# prefetch -X 60G SRR1976036
# # real 117m26.334s
# # user 16m42.062s
# # sys 3m28.295s
# # use sam-dump to convert sra to bam
# time (sam-dump SRR1976036.sra | samtools view -bS - -o SRR1976036.bam)
# # real 536m2.721s
# # user 749m41.421s
# # sys 20m49.069s
# use sam-dump to download bam directly
# time (sam-dump SRR1976036 | samtools view -bS - -o SRR1976036.bam)
# # more than 36hrs only get ~3G bam files, too slow
The returned value is a dataframe containing failed runs (for normal
bam: runs that failed to download sra files or failed to convert to
bam files; for 10x generated bam: runs that failed to download
bam files). If it is not
NULL, users can re-run DownloadBam by setting
gsm.df to the returned value. The following is an example
of downloading a 10x generated bam file:
# a small test
GSE138266.runs <- GSE138266.runs[GSE138266.runs$run %in% c("SRR10211566"), ]
# download
GSE138266.down <- DownloadBam(
gsm.df = GSE138266.runs,
prefetch.path = "/Users/soyabean/software/sratoolkit.3.0.6-mac64/bin/prefetch",
out.folder = "/Volumes/soyabean/GEfetch2R/download_bam"
)
# GSE138266.down is NULL or a dataframe containing the failed runs
The out.folder structure will be:
gsm_number/run_number.
Download bam from ENA
The returned value is a dataframe containing failed runs. If it is not
NULL, users can re-run DownloadBam by setting
gsm.df to the returned value. The following is an example
of downloading a 10x generated bam file from
ENA:
# download.file
GSE138266.down <- DownloadBam(
gsm.df = GSE138266.runs, download.method = "download.file",
timeout = 3600, out.folder = "/path/to/download_bam",
parallel = TRUE, use.cores = 2
)
# ascp
GSE138266.down <- DownloadBam(
gsm.df = GSE138266.runs, download.method = "ascp",
ascp.path = "~/.aspera/connect/bin/ascp", max.rate = "300m",
rename = TRUE, out.folder = "/path/to/download_bam",
parallel = TRUE, use.cores = 2
)
Convert bam to fastq
With downloaded bam files, GEfetch2R
provides the function Bam2Fastq to convert bam
files to fastq files. For bam files generated
by 10x software, Bam2Fastq uses the
bamtofastq tool developed by 10x Genomics; otherwise,
samtools is used.
The returned value is a vector of bam files that failed to
convert to fastq files. If it is not NULL, users can
re-run Bam2Fastq by setting bam.path to the
returned value.
GSE138266.convert <- Bam2Fastq(
bam.folder = "/Volumes/soyabean/GEfetch2R/download_bam", bam.type = "10x",
bamtofastq.path = "/Users/soyabean/software/bamtofastq_macos",
bamtofastq.paras = "--nthreads 4"
)
The final out.folder structure will be:
tree /Volumes/soyabean/GEfetch2R/download_bam
#> /Volumes/soyabean/GEfetch2R/download_bam
#> └── GSM4104137
#>     └── SRR10211566
#>         ├── bam2fastq
#>         │   └── MS60249_PBMC_2_0_MissingLibrary_1_H72VGBGX2
#>         │       ├── bamtofastq_S1_L001_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L001_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R2_001.fastq.gz
#>         │       ├── bamtofastq_S1_L001_R2_002.fastq.gz
#>         │       ├── bamtofastq_S1_L002_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L002_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R2_001.fastq.gz
#>         │       ├── bamtofastq_S1_L002_R2_002.fastq.gz
#>         │       ├── bamtofastq_S1_L003_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L003_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R2_001.fastq.gz
#>         │       ├── bamtofastq_S1_L003_R2_002.fastq.gz
#>         │       ├── bamtofastq_S1_L004_I1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L004_I1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L004_R1_001.fastq.gz
#>         │       ├── bamtofastq_S1_L004_R1_002.fastq.gz
#>         │       ├── bamtofastq_S1_L004_R2_001.fastq.gz
#>         │       └── bamtofastq_S1_L004_R2_002.fastq.gz
#>         └── bamfiles_MS60249_PBMC_possorted_genome_bam.bam
#>
#> 5 directories, 25 files
Load fastq to R
With downloaded/converted fastq files,
GEfetch2R provides the function Fastq2R to align
them to a reference genome with CellRanger
(10x-generated fastq files) or STAR
(Smart-seq2 or bulk RNA-seq data), and load the output into
Seurat (10x-generated fastq files) or
DESeq2 (Smart-seq2 or bulk RNA-seq data).
Here, we use the downloaded fastq files as an example.
There are two runs (SRR9004346 and SRR9004351)
corresponding to sample GSM3745993. When running
CellRanger, we process SRR9004346 and
SRR9004351 as a single merged sample by specifying
--sample=SRR9004346,SRR9004351:
# run CellRanger (10x Genomics)
seu <- Fastq2R(
sample.dir = "/Volumes/soyabean/GEfetch2R/download_fastq",
ref = "/path/to/10x/ref",
method = "CellRanger",
out.folder = "/path/to/results",
st.path = "/path/to/cellranger",
st.paras = "--chemistry=auto --jobmode=local"
)
# run STAR (Smart-seq2 or bulk RNA-seq)
deobj <- Fastq2R(
sample.dir = "/path/to/fastq",
ref = "/path/to/star/ref",
method = "STAR",
out.folder = "/path/to/results",
st.path = "/path/to/STAR",
st.paras = "--outBAMsortingThreadN 4 --twopassMode None"
)
Since fastq files from Smart-seq2 or bulk RNA-seq data
are usually contained in a single run, sample.dir should
point to the parent directory of the run
(sample.dir = "/Volumes/soyabean/GEfetch2R/download_fastq/GSM3745993"
in the above example).
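For the CellRanger case, where several runs belong to one sample, the value given to --sample is the comma-separated list of run accessions for that GSM. The sketch below shows how such a string can be derived from a run table using base R only. It illustrates the grouping and makes no claim about how Fastq2R assembles the option internally. The gsm_name and run columns follow the ExtractRun output shown earlier.

```r
# Collapse the run accessions of each GSM into a comma-separated string,
# i.e. the value passed to CellRanger's --sample option in the text above.
runs <- data.frame(
  gsm_name = c("GSM3745993", "GSM3745993"),
  run = c("SRR9004346", "SRR9004351"),
  stringsAsFactors = FALSE
)

# One entry per GSM, named by the GSM accession
sample.arg <- tapply(runs$run, runs$gsm_name, paste, collapse = ",")
cellranger.sample <- paste0("--sample=", sample.arg)
names(cellranger.sample) <- names(sample.arg)
cellranger.sample
```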