Step-by-Step
In [1]:
Copied!
import pandas as pd
import numpy as np
import scanpy as sc
import scMethtools as scm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scanpy as sc
import scMethtools as scm
import matplotlib.pyplot as plt
#### #### # # ##### # # ##### #### #### # #### # # # ## ## # # # # # # # # # # #### # # ## # # ###### # # # # # # #### # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #### #### # # # # # # #### #### ###### #### Version: 1.0.0, Tutorials: https://ngdc.cncb.ac.cn/methbank/scm
- After the preliminary matrix object construction, scMethQ will use the anndata format for downstream analysis. Those familiar with Scanpy would be more accustomed to working with data in this format.
- The analysis in scMethQ will be built upon the analysis of anndata data, and some functions can be directly executed using the Scanpy package.
Settings¶
In [3]:
Copied!
scm.settings.set_figure_params(dpi=150,figsize=[5,5],fontsize=16)
scm.settings.set_figure_params(dpi=150,figsize=[5,5],fontsize=16)
0. Import scDNAm object¶
In [4]:
Copied!
adata = scm.pp.load_scm("/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/scbs/promoter.h5ad")
# If show=True(default) ,will plot all cell stats
adata = scm.pp.load_scm("/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/scbs/promoter.h5ad")
# If show=True(default) ,will plot all cell stats
In [5]:
Copied!
adata
adata
Out[5]:
AnnData object with n_obs × n_vars = 52 × 55361 obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level' var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var' uns: 'workdir' layers: 'relative'
1. Load meta data¶
In [6]:
Copied!
scm.pp.add_meta(adata,meta_file='/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/meta/GSE56879_Meta.txt',sep='\t',index_col='sample')
scm.pp.add_meta(adata,meta_file='/xtdisk/methbank_baoym/zongwt/single/data/GSE56789/meta/GSE56879_Meta.txt',sep='\t',index_col='sample')
No column 'label' found in metadata, 'unknown' is used as the default cell labels No column 'label_color' found in metadata, random color is generated for each cell label
In [7]:
Copied!
adata.obs.head()
adata.obs.head()
Out[7]:
cell_id | cell_name | sites | meth | n_total | global_meth_level | Series | condition | sample | Run | ... | TotalCpGs(3X) | CpGsTotalUnique(3X) | CpGsMeanCov(3X) | CpGsMethRatio(3X) | CpGsMethReadRatio(3X) | TotalCpGs(5X) | CpGsTotalUnique(5X) | CpGsMeanCov(5X) | CpGsMethRatio(5X) | CpGsMethReadRatio(5X) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | GSM1370573 | 4433266 | 2759082 | 6109535 | 0.622359 | GSE56879 | embryo | GSM1370573 | SRR1248495 | ... | 1294454.0 | 333664.0 | 3.879514 | 0.391457 | 0.356073 | 382046.0 | 53766.0 | 7.105717 | 0.238794 | 0.214864 |
1 | 1 | GSM1370523 | 6389935 | 2001457 | 10058585 | 0.313220 | GSE56879 | embryo | GSM1370523 | SRR1248445 | ... | 3234898.0 | 789376.0 | 4.098045 | 0.202003 | 0.179488 | 1230314.0 | 180392.0 | 6.820225 | 0.125521 | 0.110421 |
2 | 2 | GSM1370575 | 22641666 | 16616611 | 53520217 | 0.733895 | GSE56879 | embryo | GSM1370575 | SRR1248497 | ... | 33178689.0 | 7956537.0 | 4.169991 | 0.693017 | 0.650359 | 14312044.0 | 2356911.0 | 6.072374 | 0.542838 | 0.516766 |
3 | 3 | GSM1370549 | 18155 | 9962 | 18548 | 0.548719 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 4 | GSM1370557 | 4728963 | 2023226 | 7751272 | 0.427837 | GSE56879 | embryo | GSM1370557 | SRR1248479 | ... | 2758438.0 | 664912.0 | 4.148576 | 0.326118 | 0.296297 | 1113423.0 | 166714.0 | 6.678641 | 0.232836 | 0.210458 |
5 rows × 54 columns
In [9]:
Copied!
adata.obs["Label"] = adata.obs["Cell_type"].fillna('') + " " + adata.obs["Treatment"].fillna('')
adata.obs["Label"] = adata.obs["Cell_type"].fillna('') + " " + adata.obs["Treatment"].fillna('')
- The following process can be executed in a single step using a simple command. However, in practical data analysis, we have found that the choice of parameters and the required operations may vary depending on the dataset. Therefore, we still recommend performing the analysis step-by-step. This allows for better observation of the data and enables the selection of appropriate analysis steps and parameters based on the characteristics of the dataset.
In [ ]:
Copied!
### All in one command
### All in one command
In [ ]:
Copied!
2. Filter¶
- filter cell
- filter feature
In [10]:
Copied!
scm.pp.filter_cells(adata,min_n_features = 10000)
scm.pp.filter_features(adata,min_n_cells=30)
scm.pp.filter_cells(adata,min_n_features = 10000)
scm.pp.filter_features(adata,min_n_cells=30)
filter cells based on min_n_features >= 10000 after filtering out low-quality cells: 45 cells, 55361 regions Filter regions based on min_n_cells After filtering out low coveraged regions: 45 cells, 38831 regions
/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scMethtools/preprocessing/scm.py:207: RuntimeWarning: Mean of empty slice feature_mean = np.nanmean(dense_X, axis=0)
3. Select highly variable methylated (HVM) feature¶
In [13]:
Copied!
scm.pp.feature_select(adata,top=5000)
scm.pp.feature_select(adata,top=5000)
(45, 38831)
Out[13]:
AnnData object with n_obs × n_vars = 45 × 38831 obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level', 'Series', 'condition', 'sample', 'Run', 'Organism', 'Cell_ID', 'Source_name', 'Cell_type', 'total_cell_type', 'subgroup', 'Development_stage', 'Strain', 'Tissue', 'Cell_line', 'Genotype', 'Treatment', 'Age', 'Sex', 'Race', 'other', 'Disease', 'CenterName', 'avgLength', 'LibraryStrategy', 'LibraryLayout', 'Platform', 'Model', 'Total_bases.Gb', 'Total_reads', 'Reads_After_trim', 'Uniq_mapping.reads', 'Mapping_ratio', 'Conversion_ratio', 'TotalCpGs(1X)', 'CpGsTotalUnique(1X)', 'CpGsMeanCov(1X)', 'CpGsMethRatio(1X)', 'CpGsMethReadRatio(1X)', 'TotalCpGs(3X)', 'CpGsTotalUnique(3X)', 'CpGsMeanCov(3X)', 'CpGsMethRatio(3X)', 'CpGsMethReadRatio(3X)', 'TotalCpGs(5X)', 'CpGsTotalUnique(5X)', 'CpGsMeanCov(5X)', 'CpGsMethRatio(5X)', 'CpGsMethReadRatio(5X)', 'Label', 'n_features', 'pct_peaks' var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var', 'mean', 'pct_cell', 'feature_select', 'feature_select_var' uns: 'workdir', 'label_color' layers: 'relative'
4. Dimension Reduction¶
- In the single-cell methylation data matrix, due to low sequencing coverage, there may be many NaN values. The
impute
parameter can be used to fill in these missing values in the data matrix.
In [14]:
Copied!
scm.pp.pca(adata,n_pcs=30,impute='median',plot=True)
scm.pp.pca(adata,n_pcs=30,impute='median',plot=True)
... using all features
Out[14]:
AnnData object with n_obs × n_vars = 45 × 38831 obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level', 'Series', 'condition', 'sample', 'Run', 'Organism', 'Cell_ID', 'Source_name', 'Cell_type', 'total_cell_type', 'subgroup', 'Development_stage', 'Strain', 'Tissue', 'Cell_line', 'Genotype', 'Treatment', 'Age', 'Sex', 'Race', 'other', 'Disease', 'CenterName', 'avgLength', 'LibraryStrategy', 'LibraryLayout', 'Platform', 'Model', 'Total_bases.Gb', 'Total_reads', 'Reads_After_trim', 'Uniq_mapping.reads', 'Mapping_ratio', 'Conversion_ratio', 'TotalCpGs(1X)', 'CpGsTotalUnique(1X)', 'CpGsMeanCov(1X)', 'CpGsMethRatio(1X)', 'CpGsMethReadRatio(1X)', 'TotalCpGs(3X)', 'CpGsTotalUnique(3X)', 'CpGsMeanCov(3X)', 'CpGsMethRatio(3X)', 'CpGsMethReadRatio(3X)', 'TotalCpGs(5X)', 'CpGsTotalUnique(5X)', 'CpGsMeanCov(5X)', 'CpGsMethRatio(5X)', 'CpGsMethReadRatio(5X)', 'Label', 'n_features', 'pct_peaks' var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var', 'mean', 'pct_cell', 'feature_select', 'feature_select_var' uns: 'workdir', 'label_color', 'X_pca_var_ratios' obsm: 'X_pca' layers: 'relative'
5.1 tSNE¶
In [15]:
Copied!
scm.pp.tsne(adata)
scm.pp.tsne(adata)
In [17]:
Copied!
scm.pl.tsne(adata,frameon='small',color=['Cell_type','Treatment'],size=100,wspace=0.3)
scm.pl.tsne(adata,frameon='small',color=['Cell_type','Treatment'],size=100,wspace=0.3)
5.2 UMAP¶
In [18]:
Copied!
sc.pp.neighbors(adata)
scm.pp.umap(adata)
scm.pl.umap(adata,frameon='small',color=['Cell_type','Treatment'],size=100,wspace=0.3)
sc.pp.neighbors(adata)
scm.pp.umap(adata)
scm.pl.umap(adata,frameon='small',color=['Cell_type','Treatment'],size=100,wspace=0.3)
5.3 using HVM features to conduct decompostion¶
In [19]:
Copied!
scm.pp.pca(adata,n_pcs=30,impute='median',use_hvm=True,key='HVM')
scm.pp.pca(adata,n_pcs=30,impute='median',use_hvm=True,key='HVM')
... using top variable features based on feature_select
Out[19]:
AnnData object with n_obs × n_vars = 45 × 38831 obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level', 'Series', 'condition', 'sample', 'Run', 'Organism', 'Cell_ID', 'Source_name', 'Cell_type', 'total_cell_type', 'subgroup', 'Development_stage', 'Strain', 'Tissue', 'Cell_line', 'Genotype', 'Treatment', 'Age', 'Sex', 'Race', 'other', 'Disease', 'CenterName', 'avgLength', 'LibraryStrategy', 'LibraryLayout', 'Platform', 'Model', 'Total_bases.Gb', 'Total_reads', 'Reads_After_trim', 'Uniq_mapping.reads', 'Mapping_ratio', 'Conversion_ratio', 'TotalCpGs(1X)', 'CpGsTotalUnique(1X)', 'CpGsMeanCov(1X)', 'CpGsMethRatio(1X)', 'CpGsMethReadRatio(1X)', 'TotalCpGs(3X)', 'CpGsTotalUnique(3X)', 'CpGsMeanCov(3X)', 'CpGsMethRatio(3X)', 'CpGsMethReadRatio(3X)', 'TotalCpGs(5X)', 'CpGsTotalUnique(5X)', 'CpGsMeanCov(5X)', 'CpGsMethRatio(5X)', 'CpGsMethReadRatio(5X)', 'Label', 'n_features', 'pct_peaks' var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var', 'mean', 'pct_cell', 'feature_select', 'feature_select_var' uns: 'workdir', 'label_color', 'X_pca_var_ratios', 'Cell_type_colors', 'Treatment_colors', 'neighbors', 'umap', 'HVM_pca_var_ratios' obsm: 'X_pca', 'X_tsne', 'X_umap', 'HVM_pca' layers: 'relative' obsp: 'distances', 'connectivities'
In [20]:
Copied!
scm.pp.tsne(adata,use_rep='HVM_pca')
scm.pl.tsne(adata,frameon='small',color=['Cell_type','Treatment','n_total'],size=100,wspace=0.3)
scm.pp.tsne(adata,use_rep='HVM_pca')
scm.pl.tsne(adata,frameon='small',color=['Cell_type','Treatment','n_total'],size=100,wspace=0.3)
In [22]:
Copied!
sc.pp.neighbors(adata, n_neighbors=10)
scm.pp.umap(adata)
scm.pl.umap(adata, frameon='small',color = ['Treatment','Cell_type'],size=80)
sc.pp.neighbors(adata, n_neighbors=10)
scm.pp.umap(adata)
scm.pl.umap(adata, frameon='small',color = ['Treatment','Cell_type'],size=80)
6. Clustering¶
In [25]:
Copied!
scm.pp.getNClusters(adata,n_cluster=3)
scm.pp.getNClusters(adata,n_cluster=3)
step 0 got 4 at resolution 1.5 step 1 got 4 at resolution 0.75 step 2 got 3 at resolution 0.375
Out[25]:
(0.375, AnnData object with n_obs × n_vars = 45 × 38831 obs: 'cell_id', 'cell_name', 'sites', 'meth', 'n_total', 'global_meth_level', 'Series', 'condition', 'sample', 'Run', 'Organism', 'Cell_ID', 'Source_name', 'Cell_type', 'total_cell_type', 'subgroup', 'Development_stage', 'Strain', 'Tissue', 'Cell_line', 'Genotype', 'Treatment', 'Age', 'Sex', 'Race', 'other', 'Disease', 'CenterName', 'avgLength', 'LibraryStrategy', 'LibraryLayout', 'Platform', 'Model', 'Total_bases.Gb', 'Total_reads', 'Reads_After_trim', 'Uniq_mapping.reads', 'Mapping_ratio', 'Conversion_ratio', 'TotalCpGs(1X)', 'CpGsTotalUnique(1X)', 'CpGsMeanCov(1X)', 'CpGsMethRatio(1X)', 'CpGsMethReadRatio(1X)', 'TotalCpGs(3X)', 'CpGsTotalUnique(3X)', 'CpGsMeanCov(3X)', 'CpGsMethRatio(3X)', 'CpGsMethReadRatio(3X)', 'TotalCpGs(5X)', 'CpGsTotalUnique(5X)', 'CpGsMeanCov(5X)', 'CpGsMethRatio(5X)', 'CpGsMethReadRatio(5X)', 'Label', 'n_features', 'pct_peaks', 'leiden' var: 'chromosome', 'start', 'end', 'covered_cell', 'var', 'sr_var', 'mean', 'pct_cell', 'feature_select', 'feature_select_var' uns: 'workdir', 'label_color', 'X_pca_var_ratios', 'Cell_type_colors', 'Treatment_colors', 'neighbors', 'umap', 'HVM_pca_var_ratios', 'leiden' obsm: 'X_pca', 'X_tsne', 'X_umap', 'HVM_pca' layers: 'relative' obsp: 'distances', 'connectivities')
In [26]:
Copied!
scm.pl.dendrogram(adata,groupby='leiden')
scm.pl.dendrogram(adata,groupby='leiden')
/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/tools/_dendrogram.py:135: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. mean_df = rep_df.groupby(level=0).mean()
Out[26]:
<Axes: >
In [34]:
Copied!
scm.pl.umap(adata, frameon='small',color = ['Label','leiden'],size=100)
scm.pl.umap(adata, frameon='small',color = ['Label','leiden'],size=100)
In [35]:
Copied!
scm.pl.propotion(adata,groupby='leiden',color_by='Label')
scm.pl.propotion(adata,groupby='leiden',color_by='Label')
/tmp/ipykernel_70827/1813127376.py:23: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation. b=pd.concat([b,b1])
7. DMR Analysis¶
In [36]:
Copied!
df_other = scm.dm.mdiff(adata,group_by = "leiden", method='wilcoxon',key_added = "leiden", top_n =100 ,matrix=True)
df_other = scm.dm.mdiff(adata,group_by = "leiden", method='wilcoxon',key_added = "leiden", top_n =100 ,matrix=True)
/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86: FutureWarning: The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis) return reduction(axis=axis, out=out, **passkwargs)
In [37]:
Copied!
df_other.head()
df_other.head()
Out[37]:
group | names | scores | logfoldchanges | pvals | pvals_adj | |
---|---|---|---|---|---|---|
0 | 0 | chr2:91081390-91083390 | 5.676036 | 3.337860 | 1.378513e-08 | 1.861232e-07 |
1 | 0 | chr9:46241032-46243032 | 5.676036 | 3.596854 | 1.378513e-08 | 1.861232e-07 |
2 | 0 | chr5:24539978-24541978 | 5.676036 | 3.794176 | 1.378513e-08 | 1.861232e-07 |
3 | 0 | chr4:43482734-43484734 | 5.676036 | 4.069543 | 1.378513e-08 | 1.861232e-07 |
4 | 0 | chr11:120360534-120362534 | 5.676036 | 3.491505 | 1.378513e-08 | 1.861232e-07 |
In [39]:
Copied!
sc.tl.dendrogram(adata,groupby='leiden')
sc.tl.dendrogram(adata,groupby='leiden')
/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/tools/_dendrogram.py:135: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. mean_df = rep_df.groupby(level=0).mean()
In [40]:
Copied!
sc.pl.rank_genes_groups_heatmap(adata,groupby='leiden', n_genes=50, show_gene_labels=True,cmap='Spectral_r',key='leiden',save=True,swap_axes=True)
sc.pl.rank_genes_groups_heatmap(adata,groupby='leiden', n_genes=50, show_gene_labels=True,cmap='Spectral_r',key='leiden',save=True,swap_axes=True)
/p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/get/get.py:69: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning. d = d.stack(level=1).reset_index() /p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/get/get.py:69: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning. d = d.stack(level=1).reset_index() /p300s/baoym_group/zongwt/zongwt/software/miniconda3/envs/py39/lib/python3.9/site-packages/scanpy/get/get.py:69: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning. d = d.stack(level=1).reset_index()
WARNING: saving figure to file figures/heatmap.pdf
In [ ]:
Copied!