0 Welcome to SIGEL

We develop the Spatially Informed Gene Embedding Learning (SIGEL) that can simultaneously identify spatially co-expressed genes and learn semantically meaningful gene embeddings from SRT data through a pretext task of gene clustering. SIGEL first employs an image encoder to transform the spatial expression maps of genes into gene embeddings modeled by a Student’s t mixture distribution (SMM). Subsequently, a discriminatively boosted gene clustering algorithm is applied on the posterior soft assignments of genes to the mixture components, iteratively adapting the parameters of the encoder and the SMM.

0.1 Introduction

The limited exploration into spatial gene co-expression within tissues has been a significant bottleneck in fully harnessing the spatial genomic context for more insightful gene representations. To bridge this gap, we introduce SIGEL, a novel few-shot, self-supervised learning model tailored for the genomic field.

png

As shown in the figure above, SIGEL generates semantically meaningful gene Representations (SGRs) by identifying spatial gene co-expression patterns. Cofunctional and enrichment analyses of SGRs endorse their utility as genomic contexts, validated through relational semantics and exploration of gene functional ontology. Three novel SGR-based methods are proposed for enhancing FISH-based spatial transcriptomics, detecting spatially variable genes, and spatial clustering. Extensive real data results affirm the superior performance of these methods, highlighting the utility of SGRs for downstream tasks.

1 Preparation

1.1 Installation

To use SIGEL, please download code from https://github.com/WLatSunLab/SIGEL or conduct code below:

git clone https://github.com/WLatSunLab/SIGEL.git
cd SIGEL/
python3 setup.py install

1.2 Getting Help

For inquiries related to SIGEL's codebase or experimental setup, please feel free to post your questions in the GitHub issues section of the SIGEL repository.

1.3 Datasets

You can access a variety of datasets for spatial gene expression analysis:

The Slide-seqV2 dataset of Mouse Hippocampus (ssq-mHippo)
The 10x-Visium datasets of Human Dorsolateral Prefrontal Cortex (10x-hDLPFC)
The 10x-Visium dataset of Human Breast Cancer (10x-hBC)
The 10x-Visium datasets of Human Middle Temporal Gyrus (10x-hMTG)
The 10x-Visium dataset of Mouse Embryo (10x-mEmb)
The SeqFISH dataset of Mouse Embryo (sqf-mEmb)

Example data required for SIGEL is available here. Please ensure that these data are properly organized as followes:

 . <SIGEL>
        ├── ...
        ├── <data>
        │   ├── 151676_10xvisium.h5ad
        │   ├── DLPFC_matrix_151676.dat
        │   └── <mEmb>
        │       ├── 10x_mEmb_matrix.dat
        │       ├── sqf_mEmb_adata.h5ad
        │       └── qf_mEmb_matrix.dat
        ├── <model_pretrained>
        │   │
        └── ...

2 Obtain SGRs

In this task, we use 10x-hDLPFC-151676 dataset to generate SGRs.

2.1 Data preprocessing

We follow standard preprocessing for ST data using SCANPY, removing mitochondrial and ERCC spike-in genes initially. Genes present in fewer than 10 spots are excluded. Quality control is not applied to spatial spots to maintain data integrity. Gene expression counts are normalized by library size and log-transformed. Afterward, spatial coordinates from ST data are converted into images based on quality control outcomes.

from src.SIGEL.SIGEL import SIGEL
import warnings
warnings.filterwarnings("ignore")

adata= SIGEL.get_data(sample_id='151676', data_type='adata')
dataset, adata = SIGEL.data_process(adata)
gene_name = adata.var.index.values

2.2 Display gene images

Display the visual effects of 8 genes converted into images using the plt.imshow() function.

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 3,figsize=(15, 10), dpi=100)

for i in range(2):
    for j in range(3):
        axs[i, j].imshow(dataset[12+i+j])
        axs[i, j].set_title(f'{gene_name[12+i+j]}')

plt.tight_layout()
plt.show()

png

2.3 Getting SGRs

labels, SGRs, model = SIGEL.train(dataset, pretrain=True)

use cuda: True
Pretrain: 100%|██████████| 30/30 [02:56<00:00,  5.87s/it]
model saved to model_pretrained/SIGEL.pkl.
load pretrained mae from model_pretrained/SIGEL.pkl
Clustering: 100%|██████████| 30/30 [07:24<00:00, 14.81s/it]
torch.Size([16060, 32])

2.4 Gene co-expression information

The effects of clustering as an auxiliary task will be demonstrated, and it will be evaluated whether SGRs can capture gene co-expression information.

import squidpy as sq
adata.var['cluster_id']=labels
name = list(adata.var_names[adata.var['cluster_id']==1])
sq.pl.spatial_scatter(adata, color=name[:8])

png

3 Imputing missing genes in FISH-based ST to enhance transcriptomic coverage

In this section, We introduce SIGEL-enhanced-transcriptomics-coverage (SIGEL-ETC), a novel SGR-based Generative Adversarial Network (GAN) model. It assumes gene relational semantics are consistent across ST data for the same tissue, allowing it to extrapolate spatial profiles from covered to uncovered genes using comprehensive datasets like 10x Visium. Its effectiveness is demonstrated by reproducing spatial gene profiles from a mouse embryo SeqFISH dataset (sqf-mEmb), guided by SGRs from a 10x Visium dataset (10x-mEmb).

3.1 Load 10x-/sqf-mEmb

We begin by loading and visualizing SeqFISH and 10x Visium data from mice embryos that share similar health statuses and developmental stages.

from src.SIGEL_ETC.src.SIGEL_ETC import SIGEL_ETC
adata = SIGEL_ETC.get_data(data='sqf', data_type='adata')
adata, key_m, dataset_m = SIGEL_ETC.data_process(adata)
#key_m, dataset_m = SIGEL_ETC.get_data(data='sqf', data_type='image')
key_v, dataset_v = SIGEL_ETC.get_data(data='10x', data_type='image')

adata refers to the sqf-mEmb dataset.
key_m represents the gene names in the sqf-mEmb dataset.
dataset_m denotes the imagified version of the sqf-mEmb dataset.
key_v is the gene name in the 10x-mEmb dataset.
dataset_v refers to the imagified 10x-mEmb dataset.

3.2 Display gene images

import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 4,figsize=(15, 10), dpi=100)

for i in range(2):
    for j in range(4):
        axs[i, j].imshow(dataset_m[12+i+j])
        axs[i, j].set_title(f'{key_m[12+i+j]}')

plt.tight_layout()
plt.show()

png

import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 4,figsize=(15, 10), dpi=100)

for i in range(2):
    for j in range(4):
        axs[i, j].imshow(dataset_v[12+i+j])
        axs[i, j].set_title(f'{key_v[12+i+j]}')

plt.tight_layout()
plt.show()

png

3.3 Train SIGEL

We train SIGEL using all genes from the 10x-mEmb dataset.

_, model = SIGEL_ETC.train(dataset_v, pretrain=True)
model = SIGEL_ETC.load_model()
all_gmat_v2m, all_gmat_m2v = SIGEL_ETC.data_filter(key_v, dataset_v, key_m, dataset_m)

use cuda: True
Pretraining: 100%|██████████| 30/30 [01:11<00:00,  2.37s/it]
model saved to model_pretrained/SIGEL_ETC.pkl.
load pretrained mae from model_pretrained/SIGEL_ETC.pkl
Clustering: 100%|██████████| 10/10 [02:03<00:00, 12.32s/it]

model: Refers to the SIGEL model that has been trained using the imagified data from the 10x-mEmb dataset.
all_gamt_v2m: A dictionary mapping gene names to sqf-mEmb image pairs, derived from intersecting genes between the sqf-mEmb and 10x-mEmb datasets.
all_gamt_m2v: A dictionary mapping gene names to 10x-mEmb image pairs, also derived from intersecting genes between the sqf-mEmb and 10x-mEmb datasets.

3.4 Train ETC-GAN

We use 80% of the intersecting genes between the sqf-mEmb and 10x-mEmb datasets as the training set to train SIGEL-ETC.

ETC_GAN, model, SGRs = SIGEL_ETC.train_ETC(adata, all_gmat_v2m, all_gmat_m2v, model)

100%|██████████| 100/100 [00:52<00:00,  1.91it/s]

3.5 Imputating gene expression

We evaluate the SIGEL-ETC on the test set.

import matplotlib.pyplot as plt
import torch
import numpy as np
gene_imp = SIGEL_ETC.sqf_gen(ETC_GAN, SGRs, adata)

def prepare_and_show_image(ax, tensor_data, title, cmap='Greys', interpolation='nearest'):
    img = torch.tensor(tensor_data)
    img = img.squeeze(0).cpu().detach().numpy()
    ax.imshow(img, cmap=cmap, interpolation=interpolation)
    ax.set_title(title)
    ax.axis('off')

# Initialize plot
fig, axs = plt.subplots(3, 3, figsize=(12, 15), dpi=300)

# Data and indices
key_m = np.array(list(all_gmat_m2v.keys()))
indices = [330, 308, 325]

for row, idx in enumerate(indices):
    # Show real 10x-mEmv
    prepare_and_show_image(axs[row, 0], all_gmat_v2m[key_m[idx]], key_m[idx])

    # Show real sqf-mEmb
    prepare_and_show_image(axs[row, 1], all_gmat_m2v[key_m[idx]], f'real_{key_m[idx]}')

    # Show generated sqf-mEmb
    img_fake = gene_imp[idx]
    img_real = torch.tensor(all_gmat_m2v[key_m[idx]]).squeeze(0).cpu().detach().numpy()
    prepare_and_show_image(axs[row, 2], img_fake, f'gen_{key_m[idx]}')

plt.show()

png

3.6 Enhanced spatial clustering

import squidpy as sq
from src.SIGEL_ETC.src.utils import sequential_gen
dataset_m_660, key_m_660 = sequential_gen(model, ETC_GAN, all_gmat_v2m, all_gmat_m2v)

Shape of the non-overlapping gene set: (16569, 35, 30)
Shape of SGRs: torch.Size([16569, 64])
The first imputation finished, get 110 new genes
The second imputation finished, get 220 new genes
The third imputation finished, get 330 new genes

3.7 Display ground truth

sq.pl.spatial_scatter(
    adata, color="celltype_mapped_refined", shape=None, figsize=(5, 10), dpi=500, frameon=False, legend_loc=None, save = 'grouns_truth.png', title=None
)

3.8 Display origin leigen result

from src.SIGEL_ETC.src.utils import ari_evalution
adata = adata[:,list(all_gmat_m2v.keys())]
ari_evalution(adata)

Iteration 0: Resolution 0.55 -> 20 clusters
Iteration 1: Resolution 0.775 -> 22 clusters
ARI: 0.3612215326878697
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`

3.9 Display enhanced leigen result

X_imp = np.zeros([adata.X.shape[0], len(dataset_m_660)])
for i in range(X_imp.shape[1]):
    X_imp[:, i] = dataset_m_660[i][adata.obs['array_y'], adata.obs['array_x']]
print(X_imp.shape)

import pandas as pd
import anndata as ad
n_obs = X_imp.shape[1]
var_names = key_m_660[:n_obs]
var = pd.DataFrame(index=var_names)
adata_imp = ad.AnnData(X_imp, obs=adata.obs, var=var, obsm = adata.obsm, uns = adata.uns)

ari_evalution(adata_imp)

Iteration 0: Resolution 0.55 -> 41 clusters
Iteration 1: Resolution 0.325 -> 31 clusters
Iteration 2: Resolution 0.21250000000000002 -> 22 clusters
ARI: 0.4421413788706413
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`

4 Spatial variability genes detection

4.1 Load dataset

```bash from src.SIGEL_SVG import simu_zinb from src.SIGEL_SVG import get_svg_score from src.SIGEL.SIGEL import SIGEL import matplotlib.pyplot as plt import numpy as np import pandas as pd import scanpy as sc import warnings warnings.filterwarnings("ignore")

adata= SIGEL.get_data(sample_id='151676', data_type='adata') dataset, adata = SIGEL.data_process(adata) gene_name = adata.var.index.values


## 4.2 Train SIGEL
```bash
labels, SGRs, model = SIGEL.train(dataset=dataset, pretrain=False)

4.3 Calculate SVGs score

svg_score = get_svg_score(SGRs, dataset, adata, model, sim_times=10)
ascending_indices = np.argsort(svg_score)
descending_indices = ascending_indices[::-1]

4.4 Display SVGs

## gene expression whitin high svg score
plot_gene = gene_name[descending_indices[:4]]

fig, axs = plt.subplots(2, 2, figsize=(8, 8))

sc.pl.spatial(adata, img_key="hires", color=plot_gene[0], show=False, ax=axs[0, 0], title=plot_gene[0], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[1], show=False, ax=axs[0, 1], title=plot_gene[1], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[2], show=False, ax=axs[1, 0], title=plot_gene[2], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[3], show=False, ax=axs[1, 1], title=plot_gene[3], vmax='p99')

plt.tight_layout()
plt.savefig("svg_high_6genes.png", dpi=100, bbox_inches="tight")
plt.show()

png

4.5 DIsplay non-SVGs

## gene expression whitin low svg score
plot_gene = gene_name[ascending_indices[2:6]]

fig, axs = plt.subplots(2, 2, figsize=(8, 8))

sc.pl.spatial(adata, img_key="hires", color=plot_gene[0], show=False, ax=axs[0, 0], title=plot_gene[0], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[1], show=False, ax=axs[0, 1], title=plot_gene[1], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[2], show=False, ax=axs[1, 0], title=plot_gene[2], vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene[3], show=False, ax=axs[1, 1], title=plot_gene[3], vmax='p99')

plt.tight_layout()
plt.savefig("svg_low_6genes.png", dpi=100, bbox_inches="tight")
plt.show()

png

5 Designated expression patterns across specific tissue regions

In this section, we use the 10x-hDLPFC-151676 dataset as an example to demonstrate gene simulation in the L1-L6 and white matter (WM) regions according to two specific expression patterns: top 65%-5% and 5%-65%.

5.1 Data loading and preprocessing

To ensure the data better conforms to the negative binomial distribution, we applied size-factor normalization to the loaded dataset.

cd SIGEL
from src.SIGEL.SIGEL import SIGEL
import warnings
from src.SIGEL_SPS import preprocess, sim_spatial_pattern
warnings.filterwarnings("ignore")
adata = SIGEL.get_data(sample_id='151676', data_type='adata')
preprocess = preprocess(seed=12)
adata = preprocess.data_process(adata=adata)

5.2 Simulate

Specify the expression levels for each domain and then simulate genes with designated expression patterns.

sim_spatial_pattern = sim_spatial_pattern(seed=12)
sim_data = sim_spatial_pattern.simulate(adata, region_key='cluster', region_list=['L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'WM'],
                   t=[65,55,45,35,25,15,5])

L1 has been simulated.
L2 has been simulated.
L3 has been simulated.
L4 has been simulated.
L5 has been simulated.
L6 has been simulated.
WM has been simulated.

sim_data contains the simulated gene dataset, with the first gene generated by SIGEL-SPS.
region_key refers to the pattern set.
region_list is name list of patterns.
t represents each pattern's expression level, where 65 indicates the top 65%.

5.3 Denoising

In this section, we use STAGTE as a denoising method to more clearly present the simulation results.

import scanpy as sc
import matplotlib.pyplot as plt
import STAGATE_pyG.STAGATE_pyG as STAGATE
sc.pp.normalize_total(sim_data, target_sum=1e4)
sc.pp.log1p(sim_data)
STAGATE.Cal_Spatial_Net(sim_data, rad_cutoff=300)
STAGATE.Stats_Spatial_Net(sim_data)

------Calculating spatial graph...
The graph contains 58748 edges, 3425 cells.
17.1527 neighbors per cell on average.

png

sim_data = STAGATE.train_STAGATE(sim_data, save_reconstrction=True)

Size of Input:  (3425, 13722)
STAGATE(
  (conv1): GATConv(13722, 512, heads=1)
  (conv2): GATConv(512, 30, heads=1)
  (conv3): GATConv(30, 512, heads=1)
  (conv4): GATConv(512, 13722, heads=1)
)


100%|██████████| 1000/1000 [00:20<00:00, 47.65it/s]

5.4 Display simulated results

gene_list = sim_data.var_names
fig, axs = plt.subplots(1, 2, figsize=(10, 3), gridspec_kw={'width_ratios': [1, 1]})
sc.pl.spatial(adata, img_key="hires", color=gene_list[0], show=False, ax=axs[0], title='simulated_result', vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=gene_list[0], show=False, ax=axs[1], title='STAGATE_simulated_result', layer='STAGATE_ReX', vmax='p99')
for ax in axs:
    ax.title.set_size(9)
plt.subplots_adjust(wspace=0.5)
plt.savefig("simulated_65_5.png", dpi=100, bbox_inches="tight")
plt.show()

png

5.5 Another example

from src.SIGEL.SIGEL import SIGEL
import warnings
from src.SIGEL_SPS import preprocess, sim_spatial_pattern
warnings.filterwarnings("ignore")
adata = SIGEL.get_data(sample_id='151676', data_type='adata')
preprocess = preprocess(seed=12)
adata = preprocess.data_process(adata=adata)

sim_spatial_pattern = sim_spatial_pattern(seed=12)
sim_data = sim_spatial_pattern.simulate(adata, region_key='cluster', region_list=['L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'WM'],
                   t=[5,15,25,35,45,55,65])

L1 has been simulated.
L2 has been simulated.
L3 has been simulated.
L4 has been simulated.
L5 has been simulated.
L6 has been simulated.
WM has been simulated.

import scanpy as sc
import matplotlib.pyplot as plt
import STAGATE_pyG.STAGATE_pyG as STAGATE
sc.pp.normalize_total(sim_data, target_sum=1e4)
sc.pp.log1p(sim_data)
STAGATE.Cal_Spatial_Net(sim_data, rad_cutoff=300)
STAGATE.Stats_Spatial_Net(sim_data)

------Calculating spatial graph...
The graph contains 58748 edges, 3425 cells.
17.1527 neighbors per cell on average.

png

sim_data = STAGATE.train_STAGATE(sim_data, save_reconstrction=True)

Size of Input:  (3425, 13722)
STAGATE(
  (conv1): GATConv(13722, 512, heads=1)
  (conv2): GATConv(512, 30, heads=1)
  (conv3): GATConv(30, 512, heads=1)
  (conv4): GATConv(512, 13722, heads=1)
)


100%|██████████| 1000/1000 [00:21<00:00, 46.53it/s]

gene_list = sim_data.var_names
fig, axs = plt.subplots(1, 2, figsize=(10, 3), gridspec_kw={'width_ratios': [1, 1]})
sc.pl.spatial(adata, img_key="hires", color=gene_list[0], show=False, ax=axs[0], title='simulated_result', vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=gene_list[0], show=False, ax=axs[1], title='STAGATE_simulated_result', layer='STAGATE_ReX', vmax='p99')
for ax in axs:
    ax.title.set_size(9)
plt.subplots_adjust(wspace=0.5)
plt.savefig("simulated_5_65.png", dpi=100, bbox_inches="tight")
plt.show()

png

6 Improving spatial clustering

In this section, we use the 10x-hDLPFC-151676 dataset as test data to perform spatial clustering with SpaGCN. We will evaluate the clustering performance by filtering genes based on the SVG_Scores in three scenarios: selecting 60% of the genes, selecting 70% of the genes, and using all genes without any filtering.

6.1 Load dataset

from src.SIGEL.SIGEL import SIGEL
import warnings
warnings.filterwarnings("ignore")

adata= SIGEL.get_data(sample_id='151676', data_type='adata')
dataset, adata = SIGEL.data_process(adata)
gene_name = adata.var.index.values

6.2 Obtain SGRs

labels, SGRs, model = SIGEL.train(dataset, pretrain=True)

use cuda: True
Pretrain: 100%|██████████| 50/50 [02:49<00:00,  3.39s/it]
model saved to model_pretrained/SIGEL.pkl.
load pretrained mae from model_pretrained/SIGEL.pkl
Clustering: 100%|██████████| 30/30 [06:10<00:00, 12.34s/it]

6.3 Calculate SVG_Scores

from src.SIGEL_SVG import get_svg_score
svg_score = get_svg_score(SGRs, dataset, adata, model, sim_times=10)

100%|██████████| 10/10 [02:11<00:00, 13.20s/it]

6.4 60% Selection

from src.SIGEL_ISC import SIGEL_ISC
adata= SIGEL.get_data(sample_id='151676', data_type='adata')
adata.obs['Ground Truth']=adata.obs['layer_guess_reordered_short'] 
new_samples_indices = SIGEL_ISC.gene_select(labels, svg_score, selection_percentage=0.6)
adata = SIGEL_ISC.spatial_clustering(adata, new_samples_indices, n_clusters=7)

Calculateing adj matrix using xy only...
Run 1: l [0.01, 1000], p [0.0, 3421.9987256815807]
Run 2: l [0.01, 500.005], p [0.0, 3416.00732421875]
Run 3: l [0.01, 250.0075], p [0.0, 3392.22265625]
Run 4: l [0.01, 125.00874999999999], p [0.0, 3299.90869140625]
Run 5: l [0.01, 62.509375], p [0.0, 2971.17138671875]
un 6: l [0.01, 31.2596875], p [0.0, 2097.834228515625]
Run 7: l [0.01, 15.63484375], p [0.0, 945.4818725585938]
Run 8: l [0.01, 7.822421875], p [0.0, 304.0054626464844]
Run 9: l [0.01, 3.9162109375], p [0.0, 84.49016571044922]
Run 10: l [0.01, 1.9631054687499998], p [0.0, 21.678667068481445]
Run 11: l [0.01, 0.9865527343749999], p [0.0, 4.903018474578857]
Run 12: l [0.01, 0.49827636718749996], p [0.0, 0.5869753360748291]
Run 13: l [0.25413818359374996, 0.49827636718749996], p [0.0016840696334838867, 0.5869753360748291]
Run 14: l [0.37620727539062493, 0.49827636718749996], p [0.11651778221130371, 0.5869753360748291]
Run 15: l [0.4372418212890624, 0.49827636718749996], p [0.304052472114563, 0.5869753360748291]
Run 16: l [0.4677590942382812, 0.49827636718749996], p [0.43444645404815674, 0.5869753360748291]
recommended l =  0.48301773071289056
Start at res =  0.7 step =  0.1
Initializing cluster centers with louvain, resolution =  0.7
Epoch  0
Epoch  10
Res =  0.7 Num of clusters =  5
Initializing cluster centers with louvain, resolution =  0.7999999999999999
Epoch  0
Epoch  10
Res =  0.7999999999999999 Num of clusters =  6
Res changed to 0.7999999999999999
Initializing cluster centers with louvain, resolution =  0.8999999999999999
Epoch  0
Epoch  10
Res =  0.8999999999999999 Num of clusters =  7
recommended res =  0.8999999999999999
Initializing cluster centers with louvain, resolution =  0.8999999999999999
Epoch  0
Epoch  10
Epoch  20
Epoch  30
Epoch  40
Epoch  50
Epoch  60
Epoch  70
Epoch  80
Epoch  90
Epoch  100
Epoch  110
Epoch  120
Epoch  130
Epoch  140
Epoch  150
Epoch  160
Epoch  170
Epoch  180
Epoch  190
Calculateing adj matrix using xy only...

6.5 NMI & ARI

## calculate ARI and NMI
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics import adjusted_rand_score
label_pred = list(adata.obs['r_pred'])
label_pred = np.array(label_pred).astype(int)

ground_truth = list(adata.obs['Ground Truth'])
category_to_int = {category: idx for idx, category in enumerate(set(ground_truth))}

i = 0
for key in category_to_int:
    category_to_int[key] = i
    i = i+1

ground = [category_to_int[category] for category in ground_truth]

ari = adjusted_rand_score(ground, label_pred)
print("ARI:", ari)

nmi = normalized_mutual_info_score(ground, label_pred)
print("NMI:", nmi)

ARI: 0.48727462612238504
NMI: 0.6308788297383676

6.6 70% Selection

from src.SIGEL_ISC import SIGEL_ISC
adata= SIGEL.get_data(sample_id='151676', data_type='adata')
adata.obs['Ground Truth']=adata.obs['layer_guess_reordered_short'] 
new_samples_indices = SIGEL_ISC.gene_select(labels, svg_score, selection_percentage=0.7)
adata = SIGEL_ISC.spatial_clustering(adata, new_samples_indices, n_clusters=7)

select 11029 genes
Calculateing adj matrix using xy only...
Run 1: l [0.01, 1000], p [0.0, 3421.9987256815807]
Run 2: l [0.01, 500.005], p [0.0, 3416.00732421875]
Run 3: l [0.01, 250.0075], p [0.0, 3392.22265625]
Run 4: l [0.01, 125.00874999999999], p [0.0, 3299.90869140625]
Run 5: l [0.01, 62.509375], p [0.0, 2971.17138671875]
Run 6: l [0.01, 31.2596875], p [0.0, 2097.834228515625]
Run 7: l [0.01, 15.63484375], p [0.0, 945.4818725585938]
Run 8: l [0.01, 7.822421875], p [0.0, 304.0054626464844]
Run 9: l [0.01, 3.9162109375], p [0.0, 84.49016571044922]
Run 10: l [0.01, 1.9631054687499998], p [0.0, 21.678667068481445]
Run 11: l [0.01, 0.9865527343749999], p [0.0, 4.903018474578857]
Run 12: l [0.01, 0.49827636718749996], p [0.0, 0.5869753360748291]
Run 13: l [0.25413818359374996, 0.49827636718749996], p [0.0016840696334838867, 0.5869753360748291]
Run 14: l [0.37620727539062493, 0.49827636718749996], p [0.11651778221130371, 0.5869753360748291]
Run 15: l [0.4372418212890624, 0.49827636718749996], p [0.304052472114563, 0.5869753360748291]
Run 16: l [0.4677590942382812, 0.49827636718749996], p [0.43444645404815674, 0.5869753360748291]
recommended l =  0.48301773071289056
Start at res =  0.7 step =  0.1
Initializing cluster centers with louvain, resolution =  0.7
Epoch  0
Epoch  10
Res =  0.7 Num of clusters =  6
Initializing cluster centers with louvain, resolution =  0.7999999999999999
Epoch  0
Epoch  10
Res =  0.7999999999999999 Num of clusters =  7
recommended res =  0.7999999999999999
Initializing cluster centers with louvain, resolution =  0.7999999999999999
Epoch  0
Epoch  10
Epoch  20
Epoch  30
Epoch  40
delta_label  0.004963503649635037 < tol  0.005
Reach tolerance threshold. Stopping training.
Total epoch: 46
Calculateing adj matrix using xy only...

## calculate ARI and NMI
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics import adjusted_rand_score
label_pred = list(adata.obs['r_pred'])
label_pred = np.array(label_pred).astype(int)

ground_truth = list(adata.obs['Ground Truth'])
category_to_int = {category: idx for idx, category in enumerate(set(ground_truth))}

i = 0
for key in category_to_int:
    category_to_int[key] = i
    i = i+1

ground = [category_to_int[category] for category in ground_truth]

ari = adjusted_rand_score(ground, label_pred)
print("ARI:", ari)

nmi = normalized_mutual_info_score(ground, label_pred)
print("NMI:", nmi)

ARI: 0.36240751573191976
NMI: 0.4967488597318761

6.7 100% Selection

from src.SIGEL_ISC import SIGEL_ISC
adata= SIGEL.get_data(sample_id='151676', data_type='adata')
adata.obs['Ground Truth']=adata.obs['layer_guess_reordered_short'] 
new_samples_indices = SIGEL_ISC.gene_select(labels, svg_score, selection_percentage=1.0)
adata = SIGEL_ISC.spatial_clustering(adata, new_samples_indices, n_clusters=7)

select 16060 genes
Calculateing adj matrix using xy only...
Run 1: l [0.01, 1000], p [0.0, 3421.9987256815807]
Run 2: l [0.01, 500.005], p [0.0, 3416.00732421875]
Run 3: l [0.01, 250.0075], p [0.0, 3392.22265625]
Run 4: l [0.01, 125.00874999999999], p [0.0, 3299.90869140625]
Run 5: l [0.01, 62.509375], p [0.0, 2971.17138671875]
Run 6: l [0.01, 31.2596875], p [0.0, 2097.834228515625]
Run 7: l [0.01, 15.63484375], p [0.0, 945.4818725585938]
Run 8: l [0.01, 7.822421875], p [0.0, 304.0054626464844]
Run 9: l [0.01, 3.9162109375], p [0.0, 84.49016571044922]
Run 10: l [0.01, 1.9631054687499998], p [0.0, 21.678667068481445]
Run 11: l [0.01, 0.9865527343749999], p [0.0, 4.903018474578857]
Run 12: l [0.01, 0.49827636718749996], p [0.0, 0.5869753360748291]
Run 13: l [0.25413818359374996, 0.49827636718749996], p [0.0016840696334838867, 0.5869753360748291]
Run 14: l [0.37620727539062493, 0.49827636718749996], p [0.11651778221130371, 0.5869753360748291]
Run 15: l [0.4372418212890624, 0.49827636718749996], p [0.304052472114563, 0.5869753360748291]
Run 16: l [0.4677590942382812, 0.49827636718749996], p [0.43444645404815674, 0.5869753360748291]
recommended l =  0.48301773071289056
Start at res =  0.7 step =  0.1
Initializing cluster centers with louvain, resolution =  0.7
Epoch  0
Epoch  10
Res =  0.7 Num of clusters =  7
recommended res =  0.7
Initializing cluster centers with louvain, resolution =  0.7
Epoch  0
Epoch  10
Epoch  20
Epoch  30
Epoch  40
Epoch  50
Epoch  60
Epoch  70
Epoch  80
Epoch  90
Epoch  100
Epoch  110
Epoch  120
Epoch  130
delta_label  0.004087591240875913 < tol  0.005
Reach tolerance threshold. Stopping training.
Total epoch: 136
Calculateing adj matrix using xy only...

## calculate ARI and NMI
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics import adjusted_rand_score
label_pred = list(adata.obs['r_pred'])
label_pred = np.array(label_pred).astype(int)

ground_truth = list(adata.obs['Ground Truth'])
category_to_int = {category: idx for idx, category in enumerate(set(ground_truth))}

i = 0
for key in category_to_int:
    category_to_int[key] = i
    i = i+1

ground = [category_to_int[category] for category in ground_truth]

ari = adjusted_rand_score(ground, label_pred)
print("ARI:", ari)

nmi = normalized_mutual_info_score(ground, label_pred)
print("NMI:", nmi)

ARI: 0.32108580310325163
NMI: 0.487463199451013

Everything else is prepared and ready for deployment. If you have any other needs, please directly contact Wenlin Li at the email: zipging@gmail.com.

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search