0 Welcome to SIGEL
We develop the Spatially Informed Gene Embedding Learning (SIGEL) that can simultaneously identify spatially co-expressed genes and learn semantically meaningful gene embeddings from SRT data through a pretext task of gene clustering. SIGEL first employs an image encoder to transform the spatial expression maps of genes into gene embeddings modeled by a Student’s t mixture distribution (SMM). Subsequently, a discriminatively boosted gene clustering algorithm is applied on the posterior soft assignments of genes to the mixture components, iteratively adapting the parameters of the encoder and the SMM.

0.1 Introduction
The limited exploration into spatial gene co-expression within tissues has been a significant bottleneck in fully harnessing the spatial genomic context for more insightful gene representations. To bridge this gap, we introduce SIGEL, a novel few-shot, self-supervised learning model tailored for the genomic field.
As shown in the figure above, SIGEL generates semantically meaningful gene Representations (SGRs) by identifying spatial gene co-expression patterns. Cofunctional and enrichment analyses of SGRs endorse their utility as genomic contexts, validated through relational semantics and exploration of gene functional ontology. Three novel SGR-based methods are proposed for enhancing FISH-based spatial transcriptomics, detecting spatially variable genes, and spatial clustering. Extensive real data results affirm the superior performance of these methods, highlighting the utility of SGRs for downstream tasks.
1 Preparation
1.1 Installation
To use SIGEL, please download code from https://github.com/WLatSunLab/SIGEL or conduct code below:
git clone https://github.com/WLatSunLab/SIGEL.git
cd SIGEL/
python3 setup.py install
1.2 Getting Help
For inquiries related to SIGEL's codebase or experimental setup, please feel free to post your questions in the GitHub issues section of the SIGEL repository.
1.3 Datasets
You can access a variety of datasets for spatial gene expression analysis:
- The Slide-seqV2 dataset of Mouse Hippocampus (ssq-mHippo)
- The 10x-Visium datasets of Human Dorsolateral Prefrontal Cortex (10x-hDLPFC)
- The 10x-Visium dataset of Human Breast Cancer (10x-hBC)
- The 10x-Visium datasets of Human Middle Temporal Gyrus (10x-hMTG)
- The 10x-Visium dataset of Mouse Embryo (10x-mEmb)
- The SeqFISH dataset of Mouse Embryo (sqf-mEmb)
Example data required for SIGEL is available here. Please ensure that these data are properly organized as followes:
. <SIGEL>
├── ...
├── <data>
│ ├── 151676_10xvisium.h5ad
│ ├── DLPFC_matrix_151676.dat
│ └── <mEmb>
│ ├── 10x_mEmb_matrix.dat
│ ├── sqf_mEmb_adata.h5ad
│ └── qf_mEmb_matrix.dat
├── <model_pretrained>
│ │
└── ...
2 Obtain SGRs
In this task, we use 10x-hDLPFC-151676 dataset to generate SGRs.
2.1 Data preprocessing
We follow standard preprocessing for ST data using SCANPY, removing mitochondrial and ERCC spike-in genes initially. Genes present in fewer than 10 spots are excluded. Quality control is not applied to spatial spots to maintain data integrity. Gene expression counts are normalized by library size and log-transformed. Afterward, spatial coordinates from ST data are converted into images based on quality control outcomes.
from src.SIGEL.SIGEL import SIGEL
import warnings
warnings.filterwarnings("ignore")
adata= SIGEL.get_data(sample_id='151676', data_type='adata')
dataset, adata = SIGEL.data_process(adata)
gene_name = adata.var.index.values
2.2 Display gene images
Display the visual effects of 8 genes converted into images using the plt.imshow()
function.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 3,figsize=(15, 10), dpi=100)
for i in range(2):
for j in range(3):
axs[i, j].imshow(dataset[12+i+j])
axs[i, j].set_title(f'{gene_name[12+i+j]}')
plt.tight_layout()
plt.show()
2.3 Getting SGRs
labels, SGRs, model = SIGEL.train(dataset, pretrain=True)
use cuda: True
Pretrain: 100%|██████████| 30/30 [02:56<00:00, 5.87s/it]
model saved to model_pretrained/SIGEL.pkl.
load pretrained mae from model_pretrained/SIGEL.pkl
Clustering: 100%|██████████| 30/30 [07:24<00:00, 14.81s/it]
torch.Size([16060, 32])
2.4 Gene co-expression information
The effects of clustering as an auxiliary task will be demonstrated, and it will be evaluated whether SGRs can capture gene co-expression information.
import squidpy as sq
adata.var['cluster_id']=labels
name = list(adata.var_names[adata.var['cluster_id']==1])
sq.pl.spatial_scatter(adata, color=name[:8])
3 Imputing missing genes in FISH-based ST to enhance transcriptomic coverage
In this section, We introduce SIGEL-enhanced-transcriptomics-coverage (SIGEL-ETC), a novel SGR-based Generative Adversarial Network (GAN) model. It assumes gene relational semantics are consistent across ST data for the same tissue, allowing it to extrapolate spatial profiles from covered to uncovered genes using comprehensive datasets like 10x Visium. Its effectiveness is demonstrated by reproducing spatial gene profiles from a mouse embryo SeqFISH dataset (sqf-mEmb), guided by SGRs from a 10x Visium dataset (10x-mEmb).
3.1 Load 10x-/sqf-mEmb
We begin by loading and visualizing SeqFISH and 10x Visium data from mice embryos that share similar health statuses and developmental stages.
from src.SIGEL_ETC.src.SIGEL_ETC import SIGEL_ETC
adata = SIGEL_ETC.get_data(data='sqf', data_type='adata')
adata, key_m, dataset_m = SIGEL_ETC.data_process(adata)
#key_m, dataset_m = SIGEL_ETC.get_data(data='sqf', data_type='image')
key_v, dataset_v = SIGEL_ETC.get_data(data='10x', data_type='image')
adata
refers to the sqf-mEmb dataset.key_m
represents the gene names in the sqf-mEmb dataset.dataset_m
denotes the imagified version of the sqf-mEmb dataset.key_v
is the gene name in the 10x-mEmb dataset.dataset_v
refers to the imagified 10x-mEmb dataset.
3.2 Display gene images
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 4,figsize=(15, 10), dpi=100)
for i in range(2):
for j in range(4):
axs[i, j].imshow(dataset_m[12+i+j])
axs[i, j].set_title(f'{key_m[12+i+j]}')
plt.tight_layout()
plt.show()
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 4,figsize=(15, 10), dpi=100)
for i in range(2):
for j in range(4):
axs[i, j].imshow(dataset_v[12+i+j])
axs[i, j].set_title(f'{key_v[12+i+j]}')
plt.tight_layout()
plt.show()
3.3 Train SIGEL
We train SIGEL using all genes from the 10x-mEmb dataset.
_, model = SIGEL_ETC.train(dataset_v, pretrain=True)
model = SIGEL_ETC.load_model()
all_gmat_v2m, all_gmat_m2v = SIGEL_ETC.data_filter(key_v, dataset_v, key_m, dataset_m)
use cuda: True
Pretraining: 100%|██████████| 30/30 [01:11<00:00, 2.37s/it]
model saved to model_pretrained/SIGEL_ETC.pkl.
load pretrained mae from model_pretrained/SIGEL_ETC.pkl
Clustering: 100%|██████████| 10/10 [02:03<00:00, 12.32s/it]
model
: Refers to the SIGEL model that has been trained using the imagified data from the 10x-mEmb dataset.all_gamt_v2m
: A dictionary mapping gene names to sqf-mEmb image pairs, derived from intersecting genes between the sqf-mEmb and 10x-mEmb datasets.all_gamt_m2v
: A dictionary mapping gene names to 10x-mEmb image pairs, also derived from intersecting genes between the sqf-mEmb and 10x-mEmb datasets.
3.4 Train ETC-GAN
We use 80% of the intersecting genes between the sqf-mEmb and 10x-mEmb datasets as the training set to train SIGEL-ETC.
ETC_GAN, model, SGRs = SIGEL_ETC.train_ETC(adata, all_gmat_v2m, all_gmat_m2v, model)
100%|██████████| 100/100 [00:52<00:00, 1.91it/s]
3.5 Imputating gene expression
We evaluate the SIGEL-ETC on the test set.
import matplotlib.pyplot as plt
import torch
import numpy as np
gene_imp = SIGEL_ETC.sqf_gen(ETC_GAN, SGRs, adata)
def prepare_and_show_image(ax, tensor_data, title, cmap='Greys', interpolation='nearest'):
img = torch.tensor(tensor_data)
img = img.squeeze(0).cpu().detach().numpy()
ax.imshow(img, cmap=cmap, interpolation=interpolation)
ax.set_title(title)
ax.axis('off')
# Initialize plot
fig, axs = plt.subplots(3, 3, figsize=(12, 15), dpi=300)
# Data and indices
key_m = np.array(list(all_gmat_m2v.keys()))
indices = [330, 308, 325]
for row, idx in enumerate(indices):
# Show real 10x-mEmv
prepare_and_show_image(axs[row, 0], all_gmat_v2m[key_m[idx]], key_m[idx])
# Show real sqf-mEmb
prepare_and_show_image(axs[row, 1], all_gmat_m2v[key_m[idx]], f'real_{key_m[idx]}')
# Show generated sqf-mEmb
img_fake = gene_imp[idx]
img_real = torch.tensor(all_gmat_m2v[key_m[idx]]).squeeze(0).cpu().detach().numpy()
prepare_and_show_image(axs[row, 2], img_fake, f'gen_{key_m[idx]}')
plt.show()
3.6 Enhanced spatial clustering
import squidpy as sq
from src.SIGEL_ETC.src.utils import sequential_gen
dataset_m_660, key_m_660 = sequential_gen(model, ETC_GAN, all_gmat_v2m, all_gmat_m2v)
Shape of the non-overlapping gene set: (16569, 35, 30)
Shape of SGRs: torch.Size([16569, 64])
The first imputation finished, get 110 new genes
The second imputation finished, get 220 new genes
The third imputation finished, get 330 new genes
3.7 Display ground truth
sq.pl.spatial_scatter(
adata, color="celltype_mapped_refined", shape=None, figsize=(5, 10), dpi=500, frameon=False, legend_loc=None, save = 'grouns_truth.png', title=None
)
3.8 Display origin leigen result
from src.SIGEL_ETC.src.utils import ari_evalution
adata = adata[:,list(all_gmat_m2v.keys())]
ari_evalution(adata)
Iteration 0: Resolution 0.55 -> 20 clusters
Iteration 1: Resolution 0.775 -> 22 clusters
ARI: 0.3612215326878697
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`
3.9 Display enhanced leigen result
X_imp = np.zeros([adata.X.shape[0], len(dataset_m_660)])
for i in range(X_imp.shape[1]):
X_imp[:, i] = dataset_m_660[i][adata.obs['array_y'], adata.obs['array_x']]
print(X_imp.shape)
import pandas as pd
import anndata as ad
n_obs = X_imp.shape[1]
var_names = key_m_660[:n_obs]
var = pd.DataFrame(index=var_names)
adata_imp = ad.AnnData(X_imp, obs=adata.obs, var=var, obsm = adata.obsm, uns = adata.uns)
ari_evalution(adata_imp)
Iteration 0: Resolution 0.55 -> 41 clusters
Iteration 1: Resolution 0.325 -> 31 clusters
Iteration 2: Resolution 0.21250000000000002 -> 22 clusters
ARI: 0.4421413788706413
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`
4 Spatial variability genes detection
4.1 Load dataset
```bash from src.SIGEL_SVG import simu_zinb from src.SIGEL_SVG import get_svg_score from src.SIGEL.SIGEL import SIGEL import matplotlib.pyplot as plt import numpy as np import pandas as pd import scanpy as sc import warnings warnings.filterwarnings("ignore")
adata= SIGEL.get_data(sample_id='151676', data_type='adata') dataset, adata = SIGEL.data_process(adata) gene_name = adata.var.index.values
## 4.2 Train SIGEL
```bash
labels, SGRs, model = SIGEL.train(dataset=dataset, pretrain=False)
4.3 Calculate SVGs score
svg_score = get_svg_score(SGRs, dataset, adata, model, sim_times=10)
ascending_indices = np.argsort(svg_score)
descending_indices = ascending_indices[::-1]
4.4 Display SVGs
## gene expression whitin high svg score
plot_gene = gene_name[descending_indices[:4]]
fig, axs = plt.subplots(2, 2, figsize=(8, 8))
sc.pl.spatial(adata, img_key="hires", color=plot_gene[0], show=False, ax=axs[0, 0], title=plot_gene[0], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[1], show=False, ax=axs[0, 1], title=plot_gene[1], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[2], show=False, ax=axs[1, 0], title=plot_gene[2], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[3], show=False, ax=axs[1, 1], title=plot_gene[3], vmax='p99')
plt.tight_layout()
plt.savefig("svg_high_6genes.png", dpi=100, bbox_inches="tight")
plt.show()
4.5 DIsplay non-SVGs
## gene expression whitin low svg score
plot_gene = gene_name[ascending_indices[2:6]]
fig, axs = plt.subplots(2, 2, figsize=(8, 8))
sc.pl.spatial(adata, img_key="hires", color=plot_gene[0], show=False, ax=axs[0, 0], title=plot_gene[0], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[1], show=False, ax=axs[0, 1], title=plot_gene[1], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[2], show=False, ax=axs[1, 0], title=plot_gene[2], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[3], show=False, ax=axs[1, 1], title=plot_gene[3], vmax='p99')
plt.tight_layout()
plt.savefig("svg_low_6genes.png", dpi=100, bbox_inches="tight")
plt.show()
5 Designated expression patterns across specific tissue regions
In this section, we use the 10x-hDLPFC-151676 dataset as an example to demonstrate gene simulation in the L1-L6 and white matter (WM) regions according to two specific expression patterns: top 65%-5% and 5%-65%.
5.1 Data loading and preprocessing
To ensure the data better conforms to the negative binomial distribution, we applied size-factor normalization to the loaded dataset.
cd SIGEL
from src.SIGEL.SIGEL import SIGEL
import warnings
from src.SIGEL_SPS import preprocess, sim_spatial_pattern
warnings.filterwarnings("ignore")
adata = SIGEL.get_data(sample_id='151676', data_type='adata')
preprocess = preprocess(seed=12)
adata = preprocess.data_process(adata=adata)
5.2 Simulate
Specify the expression levels for each domain and then simulate genes with designated expression patterns.
sim_spatial_pattern = sim_spatial_pattern(seed=12)
sim_data = sim_spatial_pattern.simulate(adata, region_key='cluster', region_list=['L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'WM'],
t=[65,55,45,35,25,15,5])
L1 has been simulated.
L2 has been simulated.
L3 has been simulated.
L4 has been simulated.
L5 has been simulated.
L6 has been simulated.
WM has been simulated.
sim_data
contains the simulated gene dataset, with the first gene generated by SIGEL-SPS.region_key
refers to the pattern set.region_list
is name list of patterns.t
represents each pattern's expression level, where 65 indicates the top 65%.
5.3 Denoising
In this section, we use STAGTE as a denoising method to more clearly present the simulation results.
import scanpy as sc
import matplotlib.pyplot as plt
import STAGATE_pyG.STAGATE_pyG as STAGATE
sc.pp.normalize_total(sim_data, target_sum=1e4)
sc.pp.log1p(sim_data)
STAGATE.Cal_Spatial_Net(sim_data, rad_cutoff=300)
STAGATE.Stats_Spatial_Net(sim_data)
------Calculating spatial graph...
The graph contains 58748 edges, 3425 cells.
17.1527 neighbors per cell on average.
sim_data = STAGATE.train_STAGATE(sim_data, save_reconstrction=True)
Size of Input: (3425, 13722)
STAGATE(
(conv1): GATConv(13722, 512, heads=1)
(conv2): GATConv(512, 30, heads=1)
(conv3): GATConv(30, 512, heads=1)
(conv4): GATConv(512, 13722, heads=1)
)
100%|██████████| 1000/1000 [00:20<00:00, 47.65it/s]
5.4 Display simulated results
gene_list = sim_data.var_names
fig, axs = plt.subplots(1, 2, figsize=(10, 3), gridspec_kw={'width_ratios': [1, 1]})
sc.pl.spatial(adata, img_key="hires", color=gene_list[0], show=False, ax=axs[0], title='simulated_result', vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=gene_list[0], show=False, ax=axs[1], title='STAGATE_simulated_result', layer='STAGATE_ReX', vmax='p99')
for ax in axs:
ax.title.set_size(9)
plt.subplots_adjust(wspace=0.5)
plt.savefig("simulated_65_5.png", dpi=100, bbox_inches="tight")
plt.show()
5.5 Another example
from src.SIGEL.SIGEL import SIGEL
import warnings
from src.SIGEL_SPS import preprocess, sim_spatial_pattern
warnings.filterwarnings("ignore")
adata = SIGEL.get_data(sample_id='151676', data_type='adata')
preprocess = preprocess(seed=12)
adata = preprocess.data_process(adata=adata)
sim_spatial_pattern = sim_spatial_pattern(seed=12)
sim_data = sim_spatial_pattern.simulate(adata, region_key='cluster', region_list=['L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'WM'],
t=[5,15,25,35,45,55,65])
L1 has been simulated.
L2 has been simulated.
L3 has been simulated.
L4 has been simulated.
L5 has been simulated.
L6 has been simulated.
WM has been simulated.
import scanpy as sc
import matplotlib.pyplot as plt
import STAGATE_pyG.STAGATE_pyG as STAGATE
sc.pp.normalize_total(sim_data, target_sum=1e4)
sc.pp.log1p(sim_data)
STAGATE.Cal_Spatial_Net(sim_data, rad_cutoff=300)
STAGATE.Stats_Spatial_Net(sim_data)
------Calculating spatial graph...
The graph contains 58748 edges, 3425 cells.
17.1527 neighbors per cell on average.
sim_data = STAGATE.train_STAGATE(sim_data, save_reconstrction=True)
Size of Input: (3425, 13722)
STAGATE(
(conv1): GATConv(13722, 512, heads=1)
(conv2): GATConv(512, 30, heads=1)
(conv3): GATConv(30, 512, heads=1)
(conv4): GATConv(512, 13722, heads=1)
)
100%|██████████| 1000/1000 [00:21<00:00, 46.53it/s]
gene_list = sim_data.var_names
fig, axs = plt.subplots(1, 2, figsize=(10, 3), gridspec_kw={'width_ratios': [1, 1]})
sc.pl.spatial(adata, img_key="hires", color=gene_list[0], show=False, ax=axs[0], title='simulated_result', vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=gene_list[0], show=False, ax=axs[1], title='STAGATE_simulated_result', layer='STAGATE_ReX', vmax='p99')
for ax in axs:
ax.title.set_size(9)
plt.subplots_adjust(wspace=0.5)
plt.savefig("simulated_5_65.png", dpi=100, bbox_inches="tight")
plt.show()
6 Improving spatial clustering
In this section, we use the 10x-hDLPFC-151676 dataset as test data to perform spatial clustering with SpaGCN. We will evaluate the clustering performance by filtering genes based on the SVG_Scores in three scenarios: selecting 60% of the genes, selecting 70% of the genes, and using all genes without any filtering.
6.1 Load dataset
from src.SIGEL.SIGEL import SIGEL
import warnings
warnings.filterwarnings("ignore")
adata= SIGEL.get_data(sample_id='151676', data_type='adata')
dataset, adata = SIGEL.data_process(adata)
gene_name = adata.var.index.values
6.2 Obtain SGRs
labels, SGRs, model = SIGEL.train(dataset, pretrain=True)
use cuda: True
Pretrain: 100%|██████████| 50/50 [02:49<00:00, 3.39s/it]
model saved to model_pretrained/SIGEL.pkl.
load pretrained mae from model_pretrained/SIGEL.pkl
Clustering: 100%|██████████| 30/30 [06:10<00:00, 12.34s/it]
6.3 Calculate SVG_Scores
from src.SIGEL_SVG import get_svg_score
svg_score = get_svg_score(SGRs, dataset, adata, model, sim_times=10)
100%|██████████| 10/10 [02:11<00:00, 13.20s/it]
6.4 60% Selection
from src.SIGEL_ISC import SIGEL_ISC
adata= SIGEL.get_data(sample_id='151676', data_type='adata')
adata.obs['Ground Truth']=adata.obs['layer_guess_reordered_short']
new_samples_indices = SIGEL_ISC.gene_select(labels, svg_score, selection_percentage=0.6)
adata = SIGEL_ISC.spatial_clustering(adata, new_samples_indices, n_clusters=7)
Calculateing adj matrix using xy only...
Run 1: l [0.01, 1000], p [0.0, 3421.9987256815807]
Run 2: l [0.01, 500.005], p [0.0, 3416.00732421875]
Run 3: l [0.01, 250.0075], p [0.0, 3392.22265625]
Run 4: l [0.01, 125.00874999999999], p [0.0, 3299.90869140625]
Run 5: l [0.01, 62.509375], p [0.0, 2971.17138671875]
un 6: l [0.01, 31.2596875], p [0.0, 2097.834228515625]
Run 7: l [0.01, 15.63484375], p [0.0, 945.4818725585938]
Run 8: l [0.01, 7.822421875], p [0.0, 304.0054626464844]
Run 9: l [0.01, 3.9162109375], p [0.0, 84.49016571044922]
Run 10: l [0.01, 1.9631054687499998], p [0.0, 21.678667068481445]
Run 11: l [0.01, 0.9865527343749999], p [0.0, 4.903018474578857]
Run 12: l [0.01, 0.49827636718749996], p [0.0, 0.5869753360748291]
Run 13: l [0.25413818359374996, 0.49827636718749996], p [0.0016840696334838867, 0.5869753360748291]
Run 14: l [0.37620727539062493, 0.49827636718749996], p [0.11651778221130371, 0.5869753360748291]
Run 15: l [0.4372418212890624, 0.49827636718749996], p [0.304052472114563, 0.5869753360748291]
Run 16: l [0.4677590942382812, 0.49827636718749996], p [0.43444645404815674, 0.5869753360748291]
recommended l = 0.48301773071289056
Start at res = 0.7 step = 0.1
Initializing cluster centers with louvain, resolution = 0.7
Epoch 0
Epoch 10
Res = 0.7 Num of clusters = 5
Initializing cluster centers with louvain, resolution = 0.7999999999999999
Epoch 0
Epoch 10
Res = 0.7999999999999999 Num of clusters = 6
Res changed to 0.7999999999999999
Initializing cluster centers with louvain, resolution = 0.8999999999999999
Epoch 0
Epoch 10
Res = 0.8999999999999999 Num of clusters = 7
recommended res = 0.8999999999999999
Initializing cluster centers with louvain, resolution = 0.8999999999999999
Epoch 0
Epoch 10
Epoch 20
Epoch 30
Epoch 40
Epoch 50
Epoch 60
Epoch 70
Epoch 80
Epoch 90
Epoch 100
Epoch 110
Epoch 120
Epoch 130
Epoch 140
Epoch 150
Epoch 160
Epoch 170
Epoch 180
Epoch 190
Calculateing adj matrix using xy only...
6.5 NMI & ARI
## calculate ARI and NMI
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics import adjusted_rand_score
label_pred = list(adata.obs['r_pred'])
label_pred = np.array(label_pred).astype(int)
ground_truth = list(adata.obs['Ground Truth'])
category_to_int = {category: idx for idx, category in enumerate(set(ground_truth))}
i = 0
for key in category_to_int:
category_to_int[key] = i
i = i+1
ground = [category_to_int[category] for category in ground_truth]
ari = adjusted_rand_score(ground, label_pred)
print("ARI:", ari)
nmi = normalized_mutual_info_score(ground, label_pred)
print("NMI:", nmi)
ARI: 0.48727462612238504
NMI: 0.6308788297383676
6.6 70% Selection
from src.SIGEL_ISC import SIGEL_ISC
adata= SIGEL.get_data(sample_id='151676', data_type='adata')
adata.obs['Ground Truth']=adata.obs['layer_guess_reordered_short']
new_samples_indices = SIGEL_ISC.gene_select(labels, svg_score, selection_percentage=0.7)
adata = SIGEL_ISC.spatial_clustering(adata, new_samples_indices, n_clusters=7)
select 11029 genes
Calculateing adj matrix using xy only...
Run 1: l [0.01, 1000], p [0.0, 3421.9987256815807]
Run 2: l [0.01, 500.005], p [0.0, 3416.00732421875]
Run 3: l [0.01, 250.0075], p [0.0, 3392.22265625]
Run 4: l [0.01, 125.00874999999999], p [0.0, 3299.90869140625]
Run 5: l [0.01, 62.509375], p [0.0, 2971.17138671875]
Run 6: l [0.01, 31.2596875], p [0.0, 2097.834228515625]
Run 7: l [0.01, 15.63484375], p [0.0, 945.4818725585938]
Run 8: l [0.01, 7.822421875], p [0.0, 304.0054626464844]
Run 9: l [0.01, 3.9162109375], p [0.0, 84.49016571044922]
Run 10: l [0.01, 1.9631054687499998], p [0.0, 21.678667068481445]
Run 11: l [0.01, 0.9865527343749999], p [0.0, 4.903018474578857]
Run 12: l [0.01, 0.49827636718749996], p [0.0, 0.5869753360748291]
Run 13: l [0.25413818359374996, 0.49827636718749996], p [0.0016840696334838867, 0.5869753360748291]
Run 14: l [0.37620727539062493, 0.49827636718749996], p [0.11651778221130371, 0.5869753360748291]
Run 15: l [0.4372418212890624, 0.49827636718749996], p [0.304052472114563, 0.5869753360748291]
Run 16: l [0.4677590942382812, 0.49827636718749996], p [0.43444645404815674, 0.5869753360748291]
recommended l = 0.48301773071289056
Start at res = 0.7 step = 0.1
Initializing cluster centers with louvain, resolution = 0.7
Epoch 0
Epoch 10
Res = 0.7 Num of clusters = 6
Initializing cluster centers with louvain, resolution = 0.7999999999999999
Epoch 0
Epoch 10
Res = 0.7999999999999999 Num of clusters = 7
recommended res = 0.7999999999999999
Initializing cluster centers with louvain, resolution = 0.7999999999999999
Epoch 0
Epoch 10
Epoch 20
Epoch 30
Epoch 40
delta_label 0.004963503649635037 < tol 0.005
Reach tolerance threshold. Stopping training.
Total epoch: 46
Calculateing adj matrix using xy only...
## calculate ARI and NMI
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics import adjusted_rand_score
label_pred = list(adata.obs['r_pred'])
label_pred = np.array(label_pred).astype(int)
ground_truth = list(adata.obs['Ground Truth'])
category_to_int = {category: idx for idx, category in enumerate(set(ground_truth))}
i = 0
for key in category_to_int:
category_to_int[key] = i
i = i+1
ground = [category_to_int[category] for category in ground_truth]
ari = adjusted_rand_score(ground, label_pred)
print("ARI:", ari)
nmi = normalized_mutual_info_score(ground, label_pred)
print("NMI:", nmi)
ARI: 0.36240751573191976
NMI: 0.4967488597318761
6.7 100% Selection
from src.SIGEL_ISC import SIGEL_ISC
adata= SIGEL.get_data(sample_id='151676', data_type='adata')
adata.obs['Ground Truth']=adata.obs['layer_guess_reordered_short']
new_samples_indices = SIGEL_ISC.gene_select(labels, svg_score, selection_percentage=1.0)
adata = SIGEL_ISC.spatial_clustering(adata, new_samples_indices, n_clusters=7)
select 16060 genes
Calculateing adj matrix using xy only...
Run 1: l [0.01, 1000], p [0.0, 3421.9987256815807]
Run 2: l [0.01, 500.005], p [0.0, 3416.00732421875]
Run 3: l [0.01, 250.0075], p [0.0, 3392.22265625]
Run 4: l [0.01, 125.00874999999999], p [0.0, 3299.90869140625]
Run 5: l [0.01, 62.509375], p [0.0, 2971.17138671875]
Run 6: l [0.01, 31.2596875], p [0.0, 2097.834228515625]
Run 7: l [0.01, 15.63484375], p [0.0, 945.4818725585938]
Run 8: l [0.01, 7.822421875], p [0.0, 304.0054626464844]
Run 9: l [0.01, 3.9162109375], p [0.0, 84.49016571044922]
Run 10: l [0.01, 1.9631054687499998], p [0.0, 21.678667068481445]
Run 11: l [0.01, 0.9865527343749999], p [0.0, 4.903018474578857]
Run 12: l [0.01, 0.49827636718749996], p [0.0, 0.5869753360748291]
Run 13: l [0.25413818359374996, 0.49827636718749996], p [0.0016840696334838867, 0.5869753360748291]
Run 14: l [0.37620727539062493, 0.49827636718749996], p [0.11651778221130371, 0.5869753360748291]
Run 15: l [0.4372418212890624, 0.49827636718749996], p [0.304052472114563, 0.5869753360748291]
Run 16: l [0.4677590942382812, 0.49827636718749996], p [0.43444645404815674, 0.5869753360748291]
recommended l = 0.48301773071289056
Start at res = 0.7 step = 0.1
Initializing cluster centers with louvain, resolution = 0.7
Epoch 0
Epoch 10
Res = 0.7 Num of clusters = 7
recommended res = 0.7
Initializing cluster centers with louvain, resolution = 0.7
Epoch 0
Epoch 10
Epoch 20
Epoch 30
Epoch 40
Epoch 50
Epoch 60
Epoch 70
Epoch 80
Epoch 90
Epoch 100
Epoch 110
Epoch 120
Epoch 130
delta_label 0.004087591240875913 < tol 0.005
Reach tolerance threshold. Stopping training.
Total epoch: 136
Calculateing adj matrix using xy only...
## calculate ARI and NMI
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics import adjusted_rand_score
label_pred = list(adata.obs['r_pred'])
label_pred = np.array(label_pred).astype(int)
ground_truth = list(adata.obs['Ground Truth'])
category_to_int = {category: idx for idx, category in enumerate(set(ground_truth))}
i = 0
for key in category_to_int:
category_to_int[key] = i
i = i+1
ground = [category_to_int[category] for category in ground_truth]
ari = adjusted_rand_score(ground, label_pred)
print("ARI:", ari)
nmi = normalized_mutual_info_score(ground, label_pred)
print("NMI:", nmi)
ARI: 0.32108580310325163
NMI: 0.487463199451013
Everything else is prepared and ready for deployment. If you have any other needs, please directly contact Wenlin Li at the email: zipging@gmail.com.