Build Matrix
In [2]:
Copied!
import pandas as pd
import numpy as np
import scanpy as sc
import scMethtools as scm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scanpy as sc
import scMethtools as scm
import matplotlib.pyplot as plt
#### #### # # ##### # # ##### #### #### # #### # # # ## ## # # # # # # # # # # #### # # ## # # ###### # # # # # # #### # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #### #### # # # # # # #### #### ###### #### Version: 1.0.0, Tutorials: https://ngdc.cncb.ac.cn/methbank/scm
In [3]:
Copied!
import os
import os
- After upstream preprocessing, scMethtools constructs a matrix for the generated methylation files. For how to perform upstream construction, please refer to (**************).
- For the analysis of single-cell DNA methylation data, although it is ideal to directly calculate a single base, due to the high dimensionality of site-based data, it is usually necessary to aggregate the methylation information at the single base level to the regional level and construct a feature matrix in different cells.
In [3]:
Copied!
scm.settings.set_figure_params(dpi=150,figsize=[5,5],fontsize=16)
scm.settings.set_figure_params(dpi=150,figsize=[5,5],fontsize=16)
Import files¶
- Here, we use a toy dataset to display how to build the methylation matrix
In [4]:
Copied!
#setting workdir
workdir = '/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/toy_data/'
os.chdir(workdir)
#setting workdir
workdir = '/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/toy_data/'
os.chdir(workdir)
In [8]:
Copied!
scm.pp.import_cells(data_dir="./raw",output_dir="./out",suffix="bed")
#by default, import_cells will only read in CG context cytosines, and the suffix of file should be .bed, you can change by parameters
scm.pp.import_cells(data_dir="./raw",output_dir="./out",suffix="bed")
#by default, import_cells will only read in CG context cytosines, and the suffix of file should be .bed, you can change by parameters
Out[8]:
( cell_id cell_name sites meth n_total global_meth_level 0 0 GSM1370527 5478455 1774128 7718445 0.323837 1 1 GSM1370524 9056780 2967168 14749706 0.327618 2 2 GSM1370523 6389935 2001457 10058585 0.313220, './out/tmp')
Average fixed windows¶
- A commonly used simple method to construct a matrix is to evenly divide the genome into windows of e.g. 100 kb (Kilobase, kb) size (Luo et al., 2017), and calculate methylation level of all cells within each window.
In [5]:
Copied!
windows = scm.pp.sliding_windows(window_size=100000,step_size=100000,ref=scm.ref.mm10)
windows = scm.pp.sliding_windows(window_size=100000,step_size=100000,ref=scm.ref.mm10)
window_size=100000: This specifies the size of each genomic window. In this case, each window will cover 100,000 base pairs of the genome.
step_size=100000: This defines the step between adjacent windows. Here, a step size of 100,000 means that the windows will not overlap, and each new window will start exactly 100,000 base pairs after the previous one.
ref=scm.ref.mm10: This parameter specifies the reference genome used for windowing. In this case, scm.ref.mm10 refers to the mouse genome (mm10) as the reference. The tool will divide this reference genome into windows based on the specified window and step sizes.
In [19]:
Copied!
w100k = scm.pp.feature_to_scm(feature=windows,output_dir="./out",npz_path=None,out_file="toy_100k",cpu=10,smooth=False,relative=True,copy=True)
w100k = scm.pp.feature_to_scm(feature=windows,output_dir="./out",npz_path=None,out_file="toy_100k",cpu=10,smooth=False,relative=True,copy=True)
Saving results in: ./out
In [20]:
Copied!
w100k
w100k
Out[20]:
AnnData object with n_obs × n_vars = 3 × 27268 var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var' uns: 'workdir' layers: 'relative'
Genomic element features¶
- We recommend defining these features through existing pre-annotated contexts (e.g., promoters, enhancers), as these features not only facilitate downstream interpretation but also require less computation. scMethtools supports user-defined genomic feature files for matrix and object construction.
In [6]:
Copied!
promoters = scm.pp.load_features('/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/genome/Mouse_mm10_prom_1000_final.bed')
promoters = scm.pp.load_features('/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/genome/Mouse_mm10_prom_1000_final.bed')
In [ ]:
Copied!
w_promoter = scm.pp.feature_to_scm(feature=promoters,output_dir="./out",npz_path=None,out_file="w_promoter",cpu=10,smooth=False,relative=True,copy=True)
w_promoter = scm.pp.feature_to_scm(feature=promoters,output_dir="./out",npz_path=None,out_file="w_promoter",cpu=10,smooth=False,relative=True,copy=True)
- For GTF annotation files:
- Here’s a more flexible approach where you can use additional parameters to filter the annotation file. For example, you can extract upstream regions of protein-coding genes by specifying
gene_type='protein_coding'
,tss_up=1000
, andtss_down=1000
, which are typically considered potential promoter regions. Similarly, you can exclude unwanted chromosomes from the GTF file by specifyingexclude_chromosomes='chrMT'
.
In [13]:
Copied!
gtf_file='/xtdisk/methbank_baoym/zongwt/single/genome/mouse/Mus_musculus.GRCm39.111.chr.gtf'
annotation_dict = scm.io.read_gtf(file_path=gtf_file,feature_type="gene", gene_type='protein_coding',tss_up=1000,tss_down=1000,exclude_chromosomes='chrMT')
gtf_file='/xtdisk/methbank_baoym/zongwt/single/genome/mouse/Mus_musculus.GRCm39.111.chr.gtf'
annotation_dict = scm.io.read_gtf(file_path=gtf_file,feature_type="gene", gene_type='protein_coding',tss_up=1000,tss_down=1000,exclude_chromosomes='chrMT')
['1', 'havana', 'gene', '108344807', '108347562', '.', '+', '.', 'gene_id "ENSMUSG00000104478"; gene_version "2"; gene_name "Gm38212"; gene_source "havana"; gene_biotype "TEC";'] TEC Gm38212
In [15]:
Copied!
w_promoter = scm.pp.feature_to_scm(feature=annotation_dict,output_dir="./out",npz_path=None,out_file="w_promoter",cpu=10,smooth=False,relative=True,copy=True)
w_promoter = scm.pp.feature_to_scm(feature=annotation_dict,output_dir="./out",npz_path=None,out_file="w_promoter",cpu=10,smooth=False,relative=True,copy=True)
/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/anndata/_core/anndata.py:1908: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`. utils.warn_names_duplicates("var") Saving results in: ./out
In [16]:
Copied!
w_promoter
w_promoter
Out[16]:
AnnData object with n_obs × n_vars = 3 × 21630 var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var' uns: 'workdir' layers: 'relative'
De novo features¶
- Typically, some functional regions in chromosomes have very similar methylation states in all cells, while other regions show differences in methylation in different cells. In the standard approach, each chromosome is divided into non-overlapping, equal-sized intervals and methylation in each interval is quantified, but this strict window division of interval boundaries may not be the best feature extraction method. Therefore, we implemented a de novo function in the scMethtools program to perform de novo identification of methylation features. The specific process is to divide the genomic chromosome into many overlapping windows, which start with multiples of a fixed small step size. The default is a 200 bp window with a step size of 10 bp. The relative methylation level of each cell in each window is quantified by averaging the methylation residuals of all CpG sites in the window. If there is no CpG site in the window, the calculation is skipped directly. Next, the variance between these values in all cells in each window is calculated. The window with the highest variance is selected and marked as HVMR. If there is overlap between windows, they are merged into a larger HVMR.
In [7]:
Copied!
features = scm.pp.denovo(input_dir='./out',cpu=10,exclude_chroms=['chr23'])
#chr23 is lambda reads
features = scm.pp.denovo(input_dir='./out',cpu=10,exclude_chroms=['chr23'])
#chr23 is lambda reads
Finding denovo windows for chromosome chr18 ... Finding denovo windows for chromosome chr12 ... Finding denovo windows for chromosome chr8 ... Finding denovo windows for chromosome chr5 ... Finding denovo windows for chromosome chr10 ... Finding denovo windows for chromosome chr1 ... Finding denovo windows for chromosome chr4 ... Finding denovo windows for chromosome chr13 ... Finding denovo windows for chromosome chr14 ... Finding denovo windows for chromosome chrX ... Finding denovo windows for chromosome chr17 ... Finding denovo windows for chromosome chr16 ... Finding denovo windows for chromosome chr11 ... Finding denovo windows for chromosome chr3 ... Finding denovo windows for chromosome chr15 ... Finding denovo windows for chromosome chr7 ... Finding denovo windows for chromosome chr19 ... Finding denovo windows for chromosome chr9 ... Finding denovo windows for chromosome chr2 ... Finding denovo windows for chromosome chrM ... Finding denovo windows for chromosome chr23 ... Finding denovo windows for chromosome chrY ... Finding denovo windows for chromosome chr6 ...
In [ ]:
Copied!
denovos = scm.pp.feature_to_scm(feature= features,output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/toy_data/out",out_file="test_denovo",cpu=10,relative=True,smooth=False,copy=True)
denovos = scm.pp.feature_to_scm(feature= features,output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/toy_data/out",out_file="test_denovo",cpu=10,relative=True,smooth=False,copy=True)
Multiple features¶
In [9]:
Copied!
toy_100k = scm.pp.features_to_scm(features=[windows,promoters,features],feature_names=["windows","promoters","denovos"],output_dir="./out",out_file="toy_100k",cpu=10,relative=True,copy=True)
#relative=True will compute the relative methylation levels if smooth files exit
toy_100k = scm.pp.features_to_scm(features=[windows,promoters,features],feature_names=["windows","promoters","denovos"],output_dir="./out",out_file="toy_100k",cpu=10,relative=True,copy=True)
#relative=True will compute the relative methylation levels if smooth files exit
Saving results in: ./out Saving results in: ./out Saving results in: ./out
In [10]:
Copied!
toy_100k
toy_100k
Out[10]:
MuData object with n_obs × n_vars = 3 × 183714 var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var' uns: 'description', 'features' 3 modalities mod_windows: 3 x 27268 obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level' var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var' uns: 'workdir' layers: 'relative' mod_promoters: 3 x 55361 obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level' var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var' uns: 'workdir' layers: 'relative' mod_denovos: 3 x 101085 obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level' var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var' uns: 'workdir' layers: 'relative'
In [ ]:
Copied!
# using multiple features will generate a muon object, which is a multi-omics object
# using multiple features will generate a muon object, which is a multi-omics object