Upstream

The upstream process for single-cell DNA methylation data analysis starts with the raw sequencing files, followed by a series of steps such as quality control, alignment, and methylation calling. In scMethQ, an end-to-end snakemake workflow is used for data analysis.

Optional Step¶

Reference Genome: You can build the reference genome index for use in subsequent analyses with the following command, or you can choose to manually embed this step into the pipeline (though it is not recommended since this step is not required for every analysis):

Text Only

if [ ! -d $genome_ref/Bisulfite_Genome ]; then
    bismark_genome_preparation --verbose $genome_ref/

RAW data download:

Text Only

prefetch --max-size 100000000 --progress --output-directory rawfastq {params.srrid} > {log} 2>&1

Configuration file preparation¶

Prepare a Configuration file
Text Only
```
scMethq make_yaml -i ./ -o test/ -r ./
```
This file only requires one, which is copied from the template and needs to be manually modified or specified through parameters.
Although it can be specified via the command line, some complex parameters may need to be manually adjusted based on your data, such as the required number of CPUs and the software versions used. Therefore, it is recommended to check the configuration of the YAML file after automatic generation.
Check the folders and files in the conf file, as well as the existence of the software. Update the conf file based on the findings.
Run the upstream process
It is recommended to have a batch-splitting feature at the submission step, allowing for batch-wise submission and execution (especially in job management systems with single-core environments).
Text Only
```
scMethq make_snakefile -yaml test/config.yaml -out test/ -batch 1
```
The batch size can be specified using the batch parameter or by using a regex pattern to match file names.
You can create a command-line package to generate both Snakefile and YAML files directly from the command line.

Pipeline

Manual modification of the configuration file¶

Open the configuration file (my.conf) in a text editor.
Modify the relevant paths for input files, output directories, and software locations based on your environment.
Ensure that all necessary software and file paths are correctly specified to avoid errors during execution.
Save the changes and use the updated configuration file for the next steps.

About mapping method for scBS-seq Library¶

The current mainstream single-cell DNA methylation methods include Fuchou Tang's research group's scRRBS method (which requires default directional alignment) and the scBS-seq method, along with its parallel expression and accessibility derivatives： - scBS-seq： https: //www.ncbi.nlm.nih.gov/pubmed/25042786 - scBS-seq（协议）： https: //www.ncbi.nlm.nih.gov/pubmed/28182018 - scM&T-seq：https: //www.ncbi.nlm.nih.gov/pubmed/26752769 - scNMT-seq： https: //www.nature.com/articles/s41467-018-03149-4

The single-cell methylation library also uses the Post-Bisulfite Adapter Tagging (PBAT) method. The parameter selection is as follows:

The option --pbat aligns reads to only two strands: CTOT and CTOB. The option --non_directional aligns reads to all four possible strands: OT (original top), CTOT (complementary to OT), OB (original bottom), and CTOB (complementary to OB).
For libraries built according to the scBS-Seq protocol, a typical alignment ratio is about 1:2:2:1 (OT:CTOT:CTOB:OB).

We typically align data in single-end mode (after trimming) and then merge methylation calls from R1 and R2 to generate bedGraph and coverage files.

In all of these studies, we did not begin with paired-end alignment, but instead treated Read 1 and Read 2 independently, merging the data after mapping (these steps should be mentioned in the methods of the referenced papers). Therefore, the tool sequence for R1 and R2 is as follows:

Trim with Trim Galore (SE mode)
Align with Bismark (--non_directional, SE mode)
Deduplicate using deduplicate_bismark (SE mode)
Methylation extraction
Merge all CpG calls from R1 and R2, then run bismark2bedGraph
Perform data analysis using coverage files

During merging：

bismark2bedGraph CpG* -o CpG_meth_level_PE

will generate two files: a merged bedGraph file and a merged cov file. These files are:

CpG_meth_level_PE.gz
CpG_meth_level_PE.gz.bismark.cov.gz (6 columns)
Since R1 and R2 contain overlapping regions, the merging process will combine the data based on the genomic positions.

Requirements:¶

R >= 4.1.3, python2 and python3
Trim Galore
Samtools
cutadapt
bowtie, bowtie2
Bismark
BS-Seeker2
Or you can use msPIPE on docker without having to prepare the environment. **Recommended **

tools:¶

fastqc: /p300s/baoym_group/zongwt/zongwt/software/FastQC/fastqc fastq-dump: /p300s/baoym_group/zongwt/zongwt/software/sratoolkit.2.10.9-centos_linux64/bin/fastq-dump trim_galore: /p300s/baoym_group/zongwt/zongwt/software/TrimGalore-0.6.1/trim_galore bismark: /p300s/baoym_group/zongwt/zongwt/software/Bismark_v0.22.3/bismark cutadapt: /p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/bin/cutadapt samtools: /p300s/baoym_group/zongwt/zongwt/software/samtools-1.10/samtools

Output file structure after upstream analysis:¶

input_dir: SRA data or FASTQ data

A The first step is to check whether the input file is SRA data or FASTQ data. If it is SRA data, a conversion step is required. - Output:
Text Only
│ │ │ │ │ │ │ │ │ │ │ │ │ │ ├── │ │ │ │ │ │ ├── │ │ │ │ │ ├── │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ └──        
neno-4-1">├── 0-data └── SRRXXXX ├── fastq │ ├── SRRXXXX_1.fastq.gz │ └── SRRXXXX_2.fastq.gz ├── fastqc │ ├── SRRXXXX_1_fastqc.html │ ├── SRRXXXX_1_fastqc.zip │ ├── SRRXXXX_2_fastqc.html │ └── SRRXXXX_2_fastqc.zip └── trim ├── SRRXXXX_1.fastq.gz_trimming_report.txt ├── SRRXXXX_1_val_1.fq.gz ├── SRRXXXX_2.fastq.gz_trimming_report.txt └── SRRXXXX_2_val_2.fq.gz 1-mapping └── SRRXXXX └── bam ├── SRRXXXX_1_val_1_bismark_bt2.bam ├── SRRXXXX_1_val_1_bismark_bt2_SE_report.txt ├── SRRXXXX_2_val_2_bismark_bt2.bam └── SRRXXXX_2_val_2_bismark_bt2_SE_report.txt 2-deduplicate └── SRRXXXX ├── SRRXXXX_1_val_1_bismark_bt2.deduplicated.bam ├── SRRXXXX_1_val_1_bismark_bt2.deduplication_report.txt ├── SRRXXXX_2_val_2_bismark_bt2.deduplicated.bam └── SRRXXXX_2_val_2_bismark_bt2.deduplication_report.txt 3-bedGraph └── SRRXXXX ├── CpG_context_SRRXXXX_1_val_1_bismark_bt2.deduplicated.txt.gz ├── CpG_context_SRRXXXX_2_val_2_bismark_bt2.deduplicated.txt.gz ├── CpG_meth_level_PE.gz ├── CpG_meth_level_PE.gz.bismark.cov.gz ├── Non_CpG_context_SRRXXXX_1_val_1_bismark_bt2.deduplicated.txt.gz ├── Non_CpG_context_SRRXXXX_2_val_2_bismark_bt2.deduplicated.txt.gz ├── SRRXXXX_1_val_1_bismark_bt2.deduplicated.bedGraph.gz ├── SRRXXXX_1_val_1_bismark_bt2.deduplicated.bismark.cov.gz ├── SRRXXXX_1_val_1_bismark_bt2.deduplicated.CpG_report.txt.gz ├── SRRXXXX_1_val_1_bismark_bt2.deduplicated.M-bias.txt ├── SRRXXXX_1_val_1_bismark_bt2.deduplicated_splitting_report.txt ├── SRRXXXX_2_val_2_bismark_bt2.deduplicated.bedGraph.gz ├── SRRXXXX_2_val_2_bismark_bt2.deduplicated.bismark.cov.gz ├── SRRXXXX_2_val_2_bismark_bt2.deduplicated.CpG_report.txt.gz ├── SRRXXXX_2_val_2_bismark_bt2.deduplicated.M-bias.txt └── SRRXXXX_2_val_2_bismark_bt2.deduplicated_splitting_report.txt logs ├── SRRXXXX_deduplicate.log ├── SRRXXXX_fastqc.log ├── SRRXXXX_fastqdump.log ├── SRRXXXX_mapping.log ├── SRRXXXX_methylation_extract.log ├── SRRXXXX_trim.log └── SRRXXXX_upstream.log