Preprocessing(upstream)
Build methylation matrix and generate scm object¶
After completing the upstream processing for all the required single-cell samples, each sample will generate a methylation file that records the methylation status at specific genomic loci. For instance, in Bismark, these files may be in the form of cov files or bedGraph files.
In downstream analyses, we need to compare the methylation states across all samples to identify distinct methylation patterns or to quantify features such as regulatory elements. Given the large number of genomic loci and the substantial file sizes in single-cell methylation data, constructing a methylation matrix is typically necessary to facilitate subsequent computations. Unlike transcriptomic data, where genes or transcripts serve as natural units for matrix construction, methylation data lacks such predefined structures, requiring an additional computational process to generate the matrix.
- Here, we compare matrix construction using EpiScanpy and scMethQ.
import scMethtools as scm
import scanpy as sc
input_dir = '/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/raw/scanpy/'
out_dir = '/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/scanpy0220/'
tmp_dir = out_dir + "tmp"
smooth = True
For scMethQ, there are three ways to build the methylation features matrix¶
- First, using sliding window method to bulid a average genome window matrix
- Second, using genomic regions file to build a element matrix
- Third, using denovo method to find genomic windows and build a matrix
First of all, we should read and import all single-cell methylation files¶
- Generally , the methylation data file formats obtained by using different upstream analysis software for methylation calls are different. In order to be compatible with a variety of starting input file formats, scMethtools has built-in file format parsing functions in the software to adapt to methylation files generated by different upstream processing software, such as Bismark (Krueger & Andrews, 2011), BSseeker2 (Guo et al., 2018), MethylPy (Schultz et al., 2015), etc. In addition, in order to adapt to any other possible software packages and command line tools to generate input file formats, scMethtools also supports parsing of custom format files in the form of parameters. Custom parameters need to be marked with colon-delimited strings to indicate the field position.
scm.pp.import_cells(data_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/raw/scanpy",output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/", cpu=30, pipeline="1:2:5:6c:4:\t:0")
#after import
├── basic_stats.csv
├── data
│ ├── chr*.npz
├── tmp
│ ├── chr*.coo
#tmp dir can be remove by specifying the parameter keep=False
Then build cell * feature methylation matrix¶
# method 1
windows = scm.pp.sliding_windows(window_size=10000,step_size=10000,ref=scm.ref.hg38)
# method 2: loading feature files
enhancers = scm.pp.load_features(feature_file="/xtdisk/methbank_baoym/zongwt/single/genome/human/GRCh38_vista_enhancers.bed")
promoters = scm.pp.load_features('/xtdisk/methbank_baoym/zongwt/single/genome/human/Human_hg38_prom_1000_final.bed')
bodys = scm.pp.load_features('/xtdisk/methbank_baoym/zongwt/single/genome/human/Human_hg38_gene_body.bed')
#Then generate scm object in anndata format, using different feature method can be Quantificated simultaneously
scm.pp.features_to_scm(features=[windows,enhancers,promoters,bodys],feature_names=["windows","enhancers","promoters","bodys"],meta_df=None,npz_path="/xtdisk/methbank_baoym/zongwt/single/data/GSE97179/out/scm",output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE97179/out/scm",out_file="mu_1000k_enhancer",cpu=30,relative=True,smooth=False,copy=False)
Time Usage¶
import episcanpy as epi
import glob
import os
import anndata as ad
import pandas as pd
import datetime
samples = glob.glob(os.path.join(input_dir, "*.bed"))
cells = [os.path.basename(sample) for sample in samples]
windows = epi.ct.make_windows(100000) #100kb
w_names = epi.ct.name_features(windows)
# promoters = epi.ct.load_features('../methylation_play_data/mouse_epd_promoters.bed') # generate features
# p_names = epi.ct.name_features(promoters)
st = datetime.datetime.now()
print(st)
w_mtx= epi.ct.build_count_mtx(cells,
annotation=[windows],
path=input_dir,
output_file=['test_windows_CG.txt'],
feature_names= [w_names],
meth_context='CG',
threshold=[3])
adata_p = ad.AnnData(w_mtx, obs=pd.DataFrame(index=cells), var=pd.DataFrame(index=w_names))
adata_p.write(out_dir + 'promoter_matrix_test_CG.h5ad')
et = datetime.datetime.now()
time_episcanpy_GSE56879 = et-st
print(time_episcanpy_GSE56879)
2024-04-16 08:30:10.691630 0 GSM1370539.bed 1 GSM1370571.bed 2 GSM1370556.bed 3 GSM1370533.bed 4 GSM1370575.bed 5 GSM1370574.bed 6 GSM1370545.bed 7 GSM1370537.bed 8 GSM1370538.bed 9 GSM1370555.bed 10 GSM1370526.bed 11 GSM1370570.bed 12 GSM1370527.bed 13 GSM1413827.bed 14 GSM1370531.bed 15 GSM1370560.bed 16 GSM1370548.bed 17 GSM1370562.bed 18 GSM1370568.bed 19 GSM1370567.bed 20 GSM1370563.bed 21 GSM1370557.bed 22 GSM1370546.bed 23 GSM1370551.bed 24 GSM1370573.bed 25 GSM1370552.bed 26 GSM1370524.bed 27 GSM1370558.bed 28 GSM1370541.bed 29 GSM1370550.bed 30 GSM1370528.bed 31 GSM1370542.bed 32 GSM1370536.bed 33 GSM1370554.bed 34 GSM1370530.bed 35 GSM1370529.bed 36 GSM1370549.bed 37 GSM1370540.bed 38 GSM1370559.bed 39 GSM1370553.bed 40 GSM1370525.bed 41 GSM1370572.bed 42 GSM1370532.bed 43 GSM1370564.bed 44 GSM1370523.bed 45 GSM1370561.bed 46 GSM1370566.bed 47 GSM1370565.bed 48 GSM1370535.bed 49 GSM1370543.bed 50 GSM1370544.bed 51 GSM1370569.bed 0:07:34.855782
- scMethQ using 1 CPU
#scMethQ
import datetime
start_time = datetime.datetime.now()
print(start_time)
scm.pp.import_cells(data_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/raw/scanpy",output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/", cpu=30, pipeline="1:2:5:6c:4:\t:0")
#scm.pp.matrix_npz(tmp_path="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/tmp",output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/sscm1",smooth=True,keep_tmp=False)
start_time1 = datetime.datetime.now()
print(start_time1)
scm.pp.matrix_npz_all(tmp_path="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/tmp",output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/sscm2",smooth=True,keep_tmp=False,cpu=20)
start_time2 = datetime.datetime.now()
print(start_time2)
## BED column format: Custom # Temp coo data writing to /xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/tmp
2024-04-16 09:56:45.020291
## Basic summary writing to /xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/basic_stats.csv ...
2024-04-16 09:58:08.017304
...smoothing chr23 ...smoothing chr23 end ...smoothing chrY ...smoothing chr19 ...smoothing chrY end ...smoothing chr16 ...smoothing chr18 ...smoothing chrX ...smoothing chr12 ...smoothing chr15 ...smoothing chr17 ...smoothing chr14 ...smoothing chr13 ...smoothing chr1 ...smoothing chr3 ...smoothing chr10 ...smoothing chr5 ...smoothing chr6 ...smoothing chr9 ...smoothing chr8 ...smoothing chr11 ...smoothing chr2 ...smoothing chr4 ...smoothing chr7 ...smoothing chr19 end ...smoothing chrM ...smoothing chrM end ...smoothing chr16 end ...smoothing chr18 end ...smoothing chr15 end ...smoothing chr17 end ...smoothing chr12 end ...smoothing chrX end ...smoothing chr14 end ...smoothing chr13 end ...smoothing chr9 end ...smoothing chr10 end ...smoothing chr8 end ...smoothing chr6 end ...smoothing chr3 end ...smoothing chr11 end ...smoothing chr5 end ...smoothing chr7 end ...smoothing chr4 end ...smoothing chr1 end ...smoothing chr2 end
2024-04-16 10:02:41.578738
time_scm_GSE56879 = start_time2-start_time
print(time_scm_GSE56879)
0:09:09.202474
time_episcanpy_GSE56879 = et-st
print(time_episcanpy_GSE56879)
0:07:34.855782
- scMethQ using 30 CPUs
start_time = datetime.datetime.now()
print(start_time)
scm.pp.import_cells(data_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/raw/scanpy",output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/", cpu=30, pipeline="1:2:5:6c:4:\t:0")
#scm.pp.matrix_npz(tmp_path="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/tmp",output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/sscm1",smooth=True,keep_tmp=False)
start_time1 = datetime.datetime.now()
print(start_time1)
scm.pp.matrix_npz_all(tmp_path="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/tmp",output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/sscm2",smooth=True,keep_tmp=False,cpu=1)
start_time2 = datetime.datetime.now()
print(start_time2)
## BED column format: Custom # Temp coo data writing to /xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/tmp
2024-04-16 08:47:10.393944
## Basic summary writing to /xtdisk/methbank_baoym/zongwt/single/data/GSE56789/100k/basic_stats.csv ...
2024-04-16 08:48:29.081493
...smoothing chr5 ...smoothing chr5 end ...smoothing chr18 ...smoothing chr18 end ...smoothing chr14 ...smoothing chr14 end ...smoothing chr1 ...smoothing chr1 end ...smoothing chr23 ...smoothing chr23 end ...smoothing chr8 ...smoothing chr8 end ...smoothing chr11 ...smoothing chr11 end ...smoothing chr2 ...smoothing chr2 end ...smoothing chr4 ...smoothing chr4 end ...smoothing chrX ...smoothing chrX end ...smoothing chrY ...smoothing chrY end ...smoothing chr13 ...smoothing chr13 end ...smoothing chr16 ...smoothing chr16 end ...smoothing chr6 ...smoothing chr6 end ...smoothing chr15 ...smoothing chr15 end ...smoothing chr10 ...smoothing chr10 end ...smoothing chr9 ...smoothing chr9 end ...smoothing chr19 ...smoothing chr19 end ...smoothing chr17 ...smoothing chr17 end ...smoothing chr12 ...smoothing chr12 end ...smoothing chr3 ...smoothing chr3 end ...smoothing chr7 ...smoothing chr7 end ...smoothing chrM ...smoothing chrM end
2024-04-16 09:45:12.581494