Step-by-Step

In [1]:

Copied!





import pandas as pd
import numpy as np
import scanpy as sc
import scMethtools as scm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scanpy as sc
import scMethtools as scm
import matplotlib.pyplot as plt

 ####   ####  #    # ##### #    # #####  ####   ####  #       ####  
 #      #    # ##  ##   #   #    #   #   #    # #    # #      #      
  ####  #      # ## #   #   ######   #   #    # #    # #       ####  
      # #      #    #   #   #    #   #   #    # #    # #           # 
 #    # #    # #    #   #   #    #   #   #    # #    # #      #    # 
  ####   ####  #    #   #   #    #   #    ####   ####  ######  ####                                                

Version: 1.0.0, Tutorials: https://ngdc.cncb.ac.cn/methbank/scm

After the preliminary matrix object construction, scMethQ will use the anndata format for downstream analysis. Those familiar with Scanpy would be more accustomed to working with data in this format.
The analysis in scMethQ will be built upon the analysis of anndata data, and some functions can be directly executed using the Scanpy package.

Settings¶

In [3]:

Copied!

scm.settings.set_figure_params(dpi=150,figsize=[5,5],fontsize=16)
scm.settings.set_figure_params(dpi=150,figsize=[5,5],fontsize=16)

step by step analysis¶

promoters scm object constructed by scm.pp.generate_scm cmd

Input¶

h5ad file
Cell metadata

Output¶

Cell-by-promoters AnnData (sparse matrix) with embedding coordinates and cluster labels.

0. Import scDNAm object¶

In [4]:

Copied!

adata  = scm.pp.load_scm("/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/scbs/promoter.h5ad")
# If show=True(default) ,will plot all cell stats 
adata  = scm.pp.load_scm("/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/scbs/promoter.h5ad")
# If show=True(default) ,will plot all cell stats

No description has been provided for this image

In [5]:

Copied!

adata
adata

Out[5]:

AnnData object with n_obs × n_vars = 52 × 55361
    obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level'
    var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var'
    uns: 'workdir'
    layers: 'relative'

1. Load meta data¶

In [6]:

Copied!

scm.pp.add_meta(adata,meta_file='/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/meta/GSE56879_Meta.txt',sep='\t',index_col='sample')
scm.pp.add_meta(adata,meta_file='/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/meta/GSE56879_Meta.txt',sep='\t',index_col='sample')

No column 'label' found in metadata, 'unknown' is used as the default cell labels
No column 'label_color' found in metadata, random color is generated for each cell label

In [7]:

Copied!

adata.obs.head()
adata.obs.head()

Out[7]:

	cell_id	cell_name	sites	meth	n_total	global_meth_level	Series	condition	sample	Run	...	TotalCpGs(3X)	CpGsTotalUnique(3X)	CpGsMeanCov(3X)	CpGsMethRatio(3X)	CpGsMethReadRatio(3X)	TotalCpGs(5X)	CpGsTotalUnique(5X)	CpGsMeanCov(5X)	CpGsMethRatio(5X)	CpGsMethReadRatio(5X)
0	0	GSM1370573	4433266	2759082	6109535	0.622359	GSE56879	embryo	GSM1370573	SRR1248495	...	1294454.0	333664.0	3.879514	0.391457	0.356073	382046.0	53766.0	7.105717	0.238794	0.214864
1	1	GSM1370523	6389935	2001457	10058585	0.313220	GSE56879	embryo	GSM1370523	SRR1248445	...	3234898.0	789376.0	4.098045	0.202003	0.179488	1230314.0	180392.0	6.820225	0.125521	0.110421
2	2	GSM1370575	22641666	16616611	53520217	0.733895	GSE56879	embryo	GSM1370575	SRR1248497	...	33178689.0	7956537.0	4.169991	0.693017	0.650359	14312044.0	2356911.0	6.072374	0.542838	0.516766
3	3	GSM1370549	18155	9962	18548	0.548719	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	4	GSM1370557	4728963	2023226	7751272	0.427837	GSE56879	embryo	GSM1370557	SRR1248479	...	2758438.0	664912.0	4.148576	0.326118	0.296297	1113423.0	166714.0	6.678641	0.232836	0.210458

5 rows × 54 columns

In [9]:

Copied!

adata.obs["Label"] =  adata.obs["Cell_type"].fillna('') + " " + adata.obs["Treatment"].fillna('')
adata.obs["Label"] =  adata.obs["Cell_type"].fillna('') + " " + adata.obs["Treatment"].fillna('')

The following process can be executed in a single step using a simple command. However, in practical data analysis, we have found that the choice of parameters and the required operations may vary depending on the dataset. Therefore, we still recommend performing the analysis step-by-step. This allows for better observation of the data and enables the selection of appropriate analysis steps and parameters based on the characteristics of the dataset.

In [ ]:

Copied!

### All in one command
### All in one command

In [ ]:

2. Filter¶

filter cell
filter feature

In [10]:

Copied!

scm.pp.filter_cells(adata,min_n_features = 10000)
scm.pp.filter_features(adata,min_n_cells=30)
scm.pp.filter_cells(adata,min_n_features = 10000)
scm.pp.filter_features(adata,min_n_cells=30)

filter cells based on min_n_features >=  10000
after filtering out low-quality cells: 
45 cells, 55361 regions
Filter regions based on min_n_cells
After filtering out low coveraged regions: 
45 cells, 38831 regions

/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scMethtools/preprocessing/scm.py:207: RuntimeWarning: Mean of empty slice
  feature_mean = np.nanmean(dense_X, axis=0)

3. Select highly variable methylated (HVM) feature¶

In [13]:

Copied!

scm.pp.feature_select(adata,top=5000)
scm.pp.feature_select(adata,top=5000)

(45, 38831)

Out[13]:

AnnData object with n_obs × n_vars = 45 × 38831
    obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level', 'Series', 'condition', 'sample', 'Run', 'Organism', 'Cell_ID', 'Source_name', 'Cell_type', 'total_cell_type', 'subgroup', 'Development_stage', 'Strain', 'Tissue', 'Cell_line', 'Genotype', 'Treatment', 'Age', 'Sex', 'Race', 'other', 'Disease', 'CenterName', 'avgLength', 'LibraryStrategy', 'LibraryLayout', 'Platform', 'Model', 'Total_bases.Gb', 'Total_reads', 'Reads_After_trim', 'Uniq_mapping.reads', 'Mapping_ratio', 'Conversion_ratio', 'TotalCpGs(1X)', 'CpGsTotalUnique(1X)', 'CpGsMeanCov(1X)', 'CpGsMethRatio(1X)', 'CpGsMethReadRatio(1X)', 'TotalCpGs(3X)', 'CpGsTotalUnique(3X)', 'CpGsMeanCov(3X)', 'CpGsMethRatio(3X)', 'CpGsMethReadRatio(3X)', 'TotalCpGs(5X)', 'CpGsTotalUnique(5X)', 'CpGsMeanCov(5X)', 'CpGsMethRatio(5X)', 'CpGsMethReadRatio(5X)', 'Label', 'n_features', 'pct_peaks'
    var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var', 'mean', 'pct_cell', 'feature_select', 'feature_select_var'
    uns: 'workdir', 'label_color'
    layers: 'relative'

4. Dimension Reduction¶

In the single-cell methylation data matrix, due to low sequencing coverage, there may be many NaN values. The impute parameter can be used to fill in these missing values in the data matrix.

In [14]:

Copied!

scm.pp.pca(adata,n_pcs=30,impute='median',plot=True)
scm.pp.pca(adata,n_pcs=30,impute='median',plot=True)

... using all features

Out[14]:

AnnData object with n_obs × n_vars = 45 × 38831
    obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level', 'Series', 'condition', 'sample', 'Run', 'Organism', 'Cell_ID', 'Source_name', 'Cell_type', 'total_cell_type', 'subgroup', 'Development_stage', 'Strain', 'Tissue', 'Cell_line', 'Genotype', 'Treatment', 'Age', 'Sex', 'Race', 'other', 'Disease', 'CenterName', 'avgLength', 'LibraryStrategy', 'LibraryLayout', 'Platform', 'Model', 'Total_bases.Gb', 'Total_reads', 'Reads_After_trim', 'Uniq_mapping.reads', 'Mapping_ratio', 'Conversion_ratio', 'TotalCpGs(1X)', 'CpGsTotalUnique(1X)', 'CpGsMeanCov(1X)', 'CpGsMethRatio(1X)', 'CpGsMethReadRatio(1X)', 'TotalCpGs(3X)', 'CpGsTotalUnique(3X)', 'CpGsMeanCov(3X)', 'CpGsMethRatio(3X)', 'CpGsMethReadRatio(3X)', 'TotalCpGs(5X)', 'CpGsTotalUnique(5X)', 'CpGsMeanCov(5X)', 'CpGsMethRatio(5X)', 'CpGsMethReadRatio(5X)', 'Label', 'n_features', 'pct_peaks'
    var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var', 'mean', 'pct_cell', 'feature_select', 'feature_select_var'
    uns: 'workdir', 'label_color', 'X_pca_var_ratios'
    obsm: 'X_pca'
    layers: 'relative'

5.1 tSNE¶

In [15]:

Copied!

scm.pp.tsne(adata)
scm.pp.tsne(adata)

In [17]:

Copied!

scm.pl.tsne(adata,frameon='small',color=['Cell_type','Treatment'],size=100,wspace=0.3)
scm.pl.tsne(adata,frameon='small',color=['Cell_type','Treatment'],size=100,wspace=0.3)

5.2 UMAP¶

In [18]:

Copied!

sc.pp.neighbors(adata)
scm.pp.umap(adata)
scm.pl.umap(adata,frameon='small',color=['Cell_type','Treatment'],size=100,wspace=0.3)
sc.pp.neighbors(adata)
scm.pp.umap(adata)
scm.pl.umap(adata,frameon='small',color=['Cell_type','Treatment'],size=100,wspace=0.3)

5.3 using HVM features to conduct decompostion¶

In [19]:

Copied!

scm.pp.pca(adata,n_pcs=30,impute='median',use_hvm=True,key='HVM')
scm.pp.pca(adata,n_pcs=30,impute='median',use_hvm=True,key='HVM')

... using top variable features based on feature_select

Out[19]:

AnnData object with n_obs × n_vars = 45 × 38831
    obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level', 'Series', 'condition', 'sample', 'Run', 'Organism', 'Cell_ID', 'Source_name', 'Cell_type', 'total_cell_type', 'subgroup', 'Development_stage', 'Strain', 'Tissue', 'Cell_line', 'Genotype', 'Treatment', 'Age', 'Sex', 'Race', 'other', 'Disease', 'CenterName', 'avgLength', 'LibraryStrategy', 'LibraryLayout', 'Platform', 'Model', 'Total_bases.Gb', 'Total_reads', 'Reads_After_trim', 'Uniq_mapping.reads', 'Mapping_ratio', 'Conversion_ratio', 'TotalCpGs(1X)', 'CpGsTotalUnique(1X)', 'CpGsMeanCov(1X)', 'CpGsMethRatio(1X)', 'CpGsMethReadRatio(1X)', 'TotalCpGs(3X)', 'CpGsTotalUnique(3X)', 'CpGsMeanCov(3X)', 'CpGsMethRatio(3X)', 'CpGsMethReadRatio(3X)', 'TotalCpGs(5X)', 'CpGsTotalUnique(5X)', 'CpGsMeanCov(5X)', 'CpGsMethRatio(5X)', 'CpGsMethReadRatio(5X)', 'Label', 'n_features', 'pct_peaks'
    var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var', 'mean', 'pct_cell', 'feature_select', 'feature_select_var'
    uns: 'workdir', 'label_color', 'X_pca_var_ratios', 'Cell_type_colors', 'Treatment_colors', 'neighbors', 'umap', 'HVM_pca_var_ratios'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'HVM_pca'
    layers: 'relative'
    obsp: 'distances', 'connectivities'

In [20]:

Copied!

scm.pp.tsne(adata,use_rep='HVM_pca')
scm.pl.tsne(adata,frameon='small',color=['Cell_type','Treatment','n_total'],size=100,wspace=0.3)
scm.pp.tsne(adata,use_rep='HVM_pca')
scm.pl.tsne(adata,frameon='small',color=['Cell_type','Treatment','n_total'],size=100,wspace=0.3)

In [22]:

Copied!

sc.pp.neighbors(adata, n_neighbors=10)
scm.pp.umap(adata)
scm.pl.umap(adata, frameon='small',color = ['Treatment','Cell_type'],size=80)
sc.pp.neighbors(adata, n_neighbors=10)
scm.pp.umap(adata)
scm.pl.umap(adata, frameon='small',color = ['Treatment','Cell_type'],size=80)

6. Clustering¶

In [25]:

Copied!

scm.pp.getNClusters(adata,n_cluster=3)
scm.pp.getNClusters(adata,n_cluster=3)

step 0
got 4 at resolution 1.5
step 1
got 4 at resolution 0.75
step 2
got 3 at resolution 0.375

Out[25]:

(0.375,
 AnnData object with n_obs × n_vars = 45 × 38831
     obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level', 'Series', 'condition', 'sample', 'Run', 'Organism', 'Cell_ID', 'Source_name', 'Cell_type', 'total_cell_type', 'subgroup', 'Development_stage', 'Strain', 'Tissue', 'Cell_line', 'Genotype', 'Treatment', 'Age', 'Sex', 'Race', 'other', 'Disease', 'CenterName', 'avgLength', 'LibraryStrategy', 'LibraryLayout', 'Platform', 'Model', 'Total_bases.Gb', 'Total_reads', 'Reads_After_trim', 'Uniq_mapping.reads', 'Mapping_ratio', 'Conversion_ratio', 'TotalCpGs(1X)', 'CpGsTotalUnique(1X)', 'CpGsMeanCov(1X)', 'CpGsMethRatio(1X)', 'CpGsMethReadRatio(1X)', 'TotalCpGs(3X)', 'CpGsTotalUnique(3X)', 'CpGsMeanCov(3X)', 'CpGsMethRatio(3X)', 'CpGsMethReadRatio(3X)', 'TotalCpGs(5X)', 'CpGsTotalUnique(5X)', 'CpGsMeanCov(5X)', 'CpGsMethRatio(5X)', 'CpGsMethReadRatio(5X)', 'Label', 'n_features', 'pct_peaks', 'leiden'
     var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var', 'mean', 'pct_cell', 'feature_select', 'feature_select_var'
     uns: 'workdir', 'label_color', 'X_pca_var_ratios', 'Cell_type_colors', 'Treatment_colors', 'neighbors', 'umap', 'HVM_pca_var_ratios', 'leiden'
     obsm: 'X_pca', 'X_tsne', 'X_umap', 'HVM_pca'
     layers: 'relative'
     obsp: 'distances', 'connectivities')

In [26]:

Copied!

scm.pl.dendrogram(adata,groupby='leiden')
scm.pl.dendrogram(adata,groupby='leiden')

/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/tools/_dendrogram.py:135: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  mean_df = rep_df.groupby(level=0).mean()

Out[26]:

<Axes: >

In [34]:

Copied!

scm.pl.umap(adata, frameon='small',color = ['Label','leiden'],size=100)
scm.pl.umap(adata, frameon='small',color = ['Label','leiden'],size=100)

In [35]:

Copied!

scm.pl.propotion(adata,groupby='leiden',color_by='Label')
scm.pl.propotion(adata,groupby='leiden',color_by='Label')

/tmp/ipykernel_70827/1813127376.py:23: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  b=pd.concat([b,b1])

7. DMR Analysis¶

In [36]:

Copied!

df_other = scm.dm.mdiff(adata,group_by = "leiden", method='wilcoxon',key_added = "leiden", top_n =100 ,matrix=True)
df_other = scm.dm.mdiff(adata,group_by = "leiden", method='wilcoxon',key_added = "leiden", top_n =100 ,matrix=True)

/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: FutureWarning: The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis)
  return reduction(axis=axis, out=out, **passkwargs)

In [37]:

Copied!

df_other.head()
df_other.head()

Out[37]:

	names	scores	logfoldchanges	pvals	pvals_adj
0	chr2:91081390-91083390	5.676036	3.337860	1.378513e-08	1.861232e-07
1	chr9:46241032-46243032	5.676036	3.596854	1.378513e-08	1.861232e-07
2	chr5:24539978-24541978	5.676036	3.794176	1.378513e-08	1.861232e-07
3	chr4:43482734-43484734	5.676036	4.069543	1.378513e-08	1.861232e-07
4	chr11:120360534-120362534	5.676036	3.491505	1.378513e-08	1.861232e-07

In [39]:

Copied!

sc.tl.dendrogram(adata,groupby='leiden')
sc.tl.dendrogram(adata,groupby='leiden')

/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/tools/_dendrogram.py:135: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  mean_df = rep_df.groupby(level=0).mean()

In [40]:

Copied!

sc.pl.rank_genes_groups_heatmap(adata,groupby='leiden', n_genes=50, show_gene_labels=True,cmap='Spectral_r',key='leiden',save=True,swap_axes=True)
sc.pl.rank_genes_groups_heatmap(adata,groupby='leiden', n_genes=50, show_gene_labels=True,cmap='Spectral_r',key='leiden',save=True,swap_axes=True)

/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/get/get.py:69: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
  d = d.stack(level=1).reset_index()
/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/get/get.py:69: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
  d = d.stack(level=1).reset_index()
/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/get/get.py:69: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
  d = d.stack(level=1).reset_index()

WARNING: saving figure to file figures/heatmap.pdf

In [ ]: