Build Matrix

In [2]:

Copied!





import pandas as pd
import numpy as np
import scanpy as sc
import scMethtools as scm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scanpy as sc
import scMethtools as scm
import matplotlib.pyplot as plt

 ####   ####  #    # ##### #    # #####  ####   ####  #       ####  
 #      #    # ##  ##   #   #    #   #   #    # #    # #      #      
  ####  #      # ## #   #   ######   #   #    # #    # #       ####  
      # #      #    #   #   #    #   #   #    # #    # #           # 
 #    # #    # #    #   #   #    #   #   #    # #    # #      #    # 
  ####   ####  #    #   #   #    #   #    ####   ####  ######  ####                                                

Version: 1.0.0, Tutorials: https://ngdc.cncb.ac.cn/methbank/scm

In [3]:

Copied!

import os
import os

After upstream preprocessing, scMethtools constructs a matrix for the generated methylation files. For how to perform upstream construction, please refer to (**************).
For the analysis of single-cell DNA methylation data, although it is ideal to directly calculate a single base, due to the high dimensionality of site-based data, it is usually necessary to aggregate the methylation information at the single base level to the regional level and construct a feature matrix in different cells.

In [3]:

Copied!

scm.settings.set_figure_params(dpi=150,figsize=[5,5],fontsize=16)
scm.settings.set_figure_params(dpi=150,figsize=[5,5],fontsize=16)

Import files¶

Here, we use a toy dataset to display how to build the methylation matrix

In [4]:

Copied!

#setting workdir
workdir = '/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/toy_data/'
os.chdir(workdir)
#setting workdir
workdir = '/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/toy_data/'
os.chdir(workdir)

In [8]:

Copied!

scm.pp.import_cells(data_dir="./raw",output_dir="./out",suffix="bed")
#by default, import_cells will only read in CG context cytosines, and the suffix of file should be .bed, you can change by parameters
scm.pp.import_cells(data_dir="./raw",output_dir="./out",suffix="bed")
#by default, import_cells will only read in CG context cytosines, and the suffix of file should be .bed, you can change by parameters

Out[8]:

(   cell_id   cell_name    sites     meth   n_total  global_meth_level
 0        0  GSM1370527  5478455  1774128   7718445           0.323837
 1        1  GSM1370524  9056780  2967168  14749706           0.327618
 2        2  GSM1370523  6389935  2001457  10058585           0.313220,
 './out/tmp')

Average fixed windows¶

A commonly used simple method to construct a matrix is to evenly divide the genome into windows of e.g. 100 kb (Kilobase, kb) size (Luo et al., 2017), and calculate methylation level of all cells within each window.

In [5]:

Copied!

windows = scm.pp.sliding_windows(window_size=100000,step_size=100000,ref=scm.ref.mm10)
windows = scm.pp.sliding_windows(window_size=100000,step_size=100000,ref=scm.ref.mm10)

window_size=100000: This specifies the size of each genomic window. In this case, each window will cover 100,000 base pairs of the genome.
step_size=100000: This defines the step between adjacent windows. Here, a step size of 100,000 means that the windows will not overlap, and each new window will start exactly 100,000 base pairs after the previous one.
ref=scm.ref.mm10: This parameter specifies the reference genome used for windowing. In this case, scm.ref.mm10 refers to the mouse genome (mm10) as the reference. The tool will divide this reference genome into windows based on the specified window and step sizes.

In [19]:

Copied!

w100k = scm.pp.feature_to_scm(feature=windows,output_dir="./out",npz_path=None,out_file="toy_100k",cpu=10,smooth=False,relative=True,copy=True)
w100k = scm.pp.feature_to_scm(feature=windows,output_dir="./out",npz_path=None,out_file="toy_100k",cpu=10,smooth=False,relative=True,copy=True)

Saving results in: ./out

In [20]:

Copied!

w100k
w100k

Out[20]:

AnnData object with n_obs × n_vars = 3 × 27268
    var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var'
    uns: 'workdir'
    layers: 'relative'

Genomic element features¶

We recommend defining these features through existing pre-annotated contexts (e.g., promoters, enhancers), as these features not only facilitate downstream interpretation but also require less computation. scMethtools supports user-defined genomic feature files for matrix and object construction.

In [6]:

Copied!

promoters = scm.pp.load_features('/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/genome/Mouse_mm10_prom_1000_final.bed')
promoters = scm.pp.load_features('/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/genome/Mouse_mm10_prom_1000_final.bed')

In [ ]:

Copied!

w_promoter = scm.pp.feature_to_scm(feature=promoters,output_dir="./out",npz_path=None,out_file="w_promoter",cpu=10,smooth=False,relative=True,copy=True)
w_promoter = scm.pp.feature_to_scm(feature=promoters,output_dir="./out",npz_path=None,out_file="w_promoter",cpu=10,smooth=False,relative=True,copy=True)

For GTF annotation files:
Here’s a more flexible approach where you can use additional parameters to filter the annotation file. For example, you can extract upstream regions of protein-coding genes by specifying gene_type='protein_coding', tss_up=1000, and tss_down=1000, which are typically considered potential promoter regions. Similarly, you can exclude unwanted chromosomes from the GTF file by specifying exclude_chromosomes='chrMT'.

In [13]:

Copied!

gtf_file='/xtdisk/methbank_baoym/zongwt/single/genome/mouse/Mus_musculus.GRCm39.111.chr.gtf'
annotation_dict = scm.io.read_gtf(file_path=gtf_file,feature_type="gene", gene_type='protein_coding',tss_up=1000,tss_down=1000,exclude_chromosomes='chrMT')
gtf_file='/xtdisk/methbank_baoym/zongwt/single/genome/mouse/Mus_musculus.GRCm39.111.chr.gtf'
annotation_dict = scm.io.read_gtf(file_path=gtf_file,feature_type="gene", gene_type='protein_coding',tss_up=1000,tss_down=1000,exclude_chromosomes='chrMT')

['1', 'havana', 'gene', '108344807', '108347562', '.', '+', '.', 'gene_id "ENSMUSG00000104478"; gene_version "2"; gene_name "Gm38212"; gene_source "havana"; gene_biotype "TEC";']
TEC Gm38212

In [15]:

Copied!

w_promoter = scm.pp.feature_to_scm(feature=annotation_dict,output_dir="./out",npz_path=None,out_file="w_promoter",cpu=10,smooth=False,relative=True,copy=True)
w_promoter = scm.pp.feature_to_scm(feature=annotation_dict,output_dir="./out",npz_path=None,out_file="w_promoter",cpu=10,smooth=False,relative=True,copy=True)

/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/anndata/_core/anndata.py:1908: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
Saving results in: ./out

In [16]:

Copied!

w_promoter
w_promoter

Out[16]:

AnnData object with n_obs × n_vars = 3 × 21630
    var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var'
    uns: 'workdir'
    layers: 'relative'

De novo features¶

Typically, some functional regions in chromosomes have very similar methylation states in all cells, while other regions show differences in methylation in different cells. In the standard approach, each chromosome is divided into non-overlapping, equal-sized intervals and methylation in each interval is quantified, but this strict window division of interval boundaries may not be the best feature extraction method. Therefore, we implemented a de novo function in the scMethtools program to perform de novo identification of methylation features. The specific process is to divide the genomic chromosome into many overlapping windows, which start with multiples of a fixed small step size. The default is a 200 bp window with a step size of 10 bp. The relative methylation level of each cell in each window is quantified by averaging the methylation residuals of all CpG sites in the window. If there is no CpG site in the window, the calculation is skipped directly. Next, the variance between these values in all cells in each window is calculated. The window with the highest variance is selected and marked as HVMR. If there is overlap between windows, they are merged into a larger HVMR.

In [7]:

Copied!

features = scm.pp.denovo(input_dir='./out',cpu=10,exclude_chroms=['chr23'])
#chr23 is lambda reads
features = scm.pp.denovo(input_dir='./out',cpu=10,exclude_chroms=['chr23'])
#chr23 is lambda reads

Finding denovo windows for chromosome chr18 ...
Finding denovo windows for chromosome chr12 ...
Finding denovo windows for chromosome chr8 ...
Finding denovo windows for chromosome chr5 ...
Finding denovo windows for chromosome chr10 ...
Finding denovo windows for chromosome chr1 ...
Finding denovo windows for chromosome chr4 ...
Finding denovo windows for chromosome chr13 ...
Finding denovo windows for chromosome chr14 ...
Finding denovo windows for chromosome chrX ...










Finding denovo windows for chromosome chr17 ...

Finding denovo windows for chromosome chr16 ...

Finding denovo windows for chromosome chr11 ...

Finding denovo windows for chromosome chr3 ...

Finding denovo windows for chromosome chr15 ...

Finding denovo windows for chromosome chr7 ...

Finding denovo windows for chromosome chr19 ...

Finding denovo windows for chromosome chr9 ...

Finding denovo windows for chromosome chr2 ...

Finding denovo windows for chromosome chrM ...

Finding denovo windows for chromosome chr23 ...

Finding denovo windows for chromosome chrY ...

Finding denovo windows for chromosome chr6 ...

In [ ]:

Copied!

denovos = scm.pp.feature_to_scm(feature= features,output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/toy_data/out",out_file="test_denovo",cpu=10,relative=True,smooth=False,copy=True)
denovos = scm.pp.feature_to_scm(feature= features,output_dir="/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/toy_data/out",out_file="test_denovo",cpu=10,relative=True,smooth=False,copy=True)

Multiple features¶

In [9]:

Copied!

toy_100k = scm.pp.features_to_scm(features=[windows,promoters,features],feature_names=["windows","promoters","denovos"],output_dir="./out",out_file="toy_100k",cpu=10,relative=True,copy=True)
#relative=True will compute the relative methylation levels if smooth files exit
toy_100k = scm.pp.features_to_scm(features=[windows,promoters,features],feature_names=["windows","promoters","denovos"],output_dir="./out",out_file="toy_100k",cpu=10,relative=True,copy=True)
#relative=True will compute the relative methylation levels if smooth files exit

Saving results in: ./out
Saving results in: ./out
Saving results in: ./out

In [10]:

Copied!

toy_100k
toy_100k

Out[10]:

MuData object with n_obs × n_vars = 3 × 183714
  var:	'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var'
  uns:	'description', 'features'
  3 modalities
    mod_windows:	3 x 27268
      obs:	'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level'
      var:	'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var'
      uns:	'workdir'
      layers:	'relative'
    mod_promoters:	3 x 55361
      obs:	'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level'
      var:	'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var'
      uns:	'workdir'
      layers:	'relative'
    mod_denovos:	3 x 101085
      obs:	'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level'
      var:	'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var'
      uns:	'workdir'
      layers:	'relative'

In [ ]:

Copied!

# using multiple features will generate a muon object, which is a multi-omics object 
# using multiple features will generate a muon object, which is a multi-omics object