Downstream Analysis: CRG Identification, Alteration Type Disentanglement & CRR Delineation

This notebook performs three downstream analyses after SILPAG model training:

Task 1: CRG Identification — Identify condition-responsive genes.
Task 2: Alteration Type Disentanglement — Determine the nature of each CRG’s alteration.
Task 3: CRR Delineation — Delineate condition-responsive tissue regions.

Prerequisites: A trained SILPAG model saved from Getting Started with SILPAG.

0. Setup

[1]:

import warnings
warnings.filterwarnings('ignore')

import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
import torch
from scipy import sparse
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib

import SILPAG as sp

1. Load Data & Preprocessing

Same preprocessing as Getting Started with SILPAG — load reference (normal breast) and target (breast cancer), intersect genes, filter.

[2]:

ref = sc.read_h5ad('SILPAG/data/V02_nb.h5ad')
adata = sc.read_h5ad('SILPAG/data/H1_nb.h5ad')
ref.var_names_make_unique()
adata.var_names_make_unique()
sp.prefilter_specialgenes(ref)
sp.prefilter_specialgenes(adata)

gene = list(set(ref.var_names) & set(adata.var_names))
ref = ref[:, gene]
adata = adata[:, gene]

g1 = sc.pp.filter_genes(ref, min_cells=int(ref.shape[0] * 0.02), inplace=False)
g2 = sc.pp.filter_genes(adata, min_cells=int(adata.shape[0] * 0.02), inplace=False)
ref = ref[:, g1[0] | g2[0]]
adata = adata[:, g1[0] | g2[0]]

key = list(set(ref.var_names) & set(adata.var_names))
ref = ref[:, key]
adata = adata[:, key]
ref.var['hkgene'] = adata.var['hkgene']

labels = np.zeros(ref.shape[1], dtype=int)
labels[np.where(ref.var['hkgene'] == 1)[0]] = 1
del ref.raw

print(f'Reference: {ref.shape}, Target: {adata.shape}')

Reference: (1896, 7874), Target: (613, 7874)

2. Load Trained Model & Generate View-Transferred SGMs

Load the saved model, compute the codebook transport plan, and generate:

result1: view-transferred SGM (reference \(\rightarrow\) target via OT)
result2: target self-reconstruction

[3]:

# Load saved config first so that patch_size, K, img_size etc. are restored
args = sp.Config.load_config('SILPAG/saved_model/hBC_demo')
args.device = 'cuda:7'
args.train = False

# Clear size lists — GeneDataset will re-populate them from the actual data
args.orig_img_size = []
args.resized_img_size = []
args.img_size = []

gene_images = [ref.varm['gene_img'], adata.varm['gene_img']]
train_dataset = sp.GeneDataset(gene_images, labels, args)

from torch.utils.data import DataLoader
train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=False, drop_last=False)
model, dataset, PI, Q = sp.train_marker(train_loader, hist_data=None, args=args)

print(f'Model loaded. Codebook sizes: {[model.get_codebook(i).shape[0] for i in range(args.num_slice)]}')

Shape of each View:  [(66, 96), (30, 30)]
Model loaded. Codebook sizes: [51, 203]

[4]:

# Compute gene codes via OT transport plan
code = sp.get_code(model, dataset, hist_data=None, pi_init=PI, source_idx=0, target_idx=1, args=args)
adata.varm['code_ref'] = code[0]  # reference -> target (transported)
adata.varm['code_tgt'] = code[1]  # target native

# Generate SGMs
result1 = model.generate(args, adata, tgt_index=1, code_id='code_ref', gene='all')  # view-transferred
result2 = model.generate(args, adata, tgt_index=1, code_id='code_tgt', gene='all')  # target self-recon

print(f'Generated SGMs: {result1.shape}, {result2.shape}')

Generating...: 100%|██████████| 7874/7874 [00:07<00:00, 1017.53gene/s]
Generating...: 100%|██████████| 7874/7874 [00:07<00:00, 992.01gene/s]

Generated SGMs: (613, 7874), (613, 7874)

Task 1: CRG Identification

For each gene \(i\), compute the CRG score:

\[S_i = \|\hat{\mathbf{x}}^{r \to t}_i - \mathbf{x}^t_i\|_F^2\]

The anchor gene scores form an empirical null modelled as \(S \sim \Gamma(k, \theta)\). Genes with \(p < 0.05\) are designated as CRGs.

[5]:

# Compute CRG scores
scores = sp.compute_crg_scores(result1, adata)

# Anchor mask from training
anchor_mask = dataset.labels.astype(bool)
print(f'Total genes: {len(scores)}, Anchor genes: {anchor_mask.sum()}')

# Identify CRGs
# null_model='gamma': fits Gamma distribution to anchor scores
# No FDR correction — use raw p-value directly
crg_result = sp.identify_crgs(
    scores, anchor_mask,
    p_threshold=0.1,
    null_model='gamma',
    correction=None,
)

n_crg = crg_result['is_crg'].sum()
print(f'Identified {n_crg} CRGs (p < 0.05, null_model=gamma, no FDR)')
print(f'Null info: {crg_result["null_info"]}')

Total genes: 7874, Anchor genes: 1914
Identified 193 CRGs (p < 0.05, null_model=gamma, no FDR)
Null info: {'model': 'gamma', 'k': 0.4509458695722796, 'loc': 0, 'theta': np.float32(2338.6943)}

Store CRG Results in AnnData

[6]:

adata.var['crg_score'] = scores
adata.var['crg_pval'] = crg_result['pvalues']
adata.var['crg_fdr'] = crg_result['pvalues_adj']
adata.var['is_crg'] = crg_result['is_crg']

crg_genes = adata.var_names[crg_result['is_crg']]
print(f'Top 20 CRGs by score:')
adata.var.loc[crg_genes].sort_values('crg_score', ascending=False).head(20)[['crg_score', 'crg_fdr']]

Top 20 CRGs by score:

[6]:

	crg_score	crg_fdr
IGKC	6.109030e+06	0.000000e+00
IGHA1	3.070430e+06	0.000000e+00
IGHG4	1.892223e+06	0.000000e+00
IGHG3	7.547800e+05	0.000000e+00
IGLC3	4.252860e+05	0.000000e+00
TMSB10	3.985322e+05	0.000000e+00
IGLC2	1.883705e+05	0.000000e+00
RACK1	1.202756e+05	0.000000e+00
ERBB2	1.130827e+05	0.000000e+00
GAPDH	1.039025e+05	0.000000e+00
MGP	1.028431e+05	0.000000e+00
IGHG1	9.073847e+04	0.000000e+00
S100A9	8.803026e+04	0.000000e+00
PFN1	5.699932e+04	2.245537e-12
ACTB	5.231029e+04	1.744782e-11
UBA52	4.986116e+04	5.099254e-11
SSR4	3.802821e+04	9.255538e-09
CD74	3.611092e+04	2.158245e-08
KRT19	3.067662e+04	2.397699e-07
CST3	2.966526e+04	3.759009e-07

Task 2: Alteration Type Disentanglement

For each CRG, three statistical tests determine the nature of the alteration:

Test	What it detects	Method
Differential Abundance	Shift in expression magnitude	Wilcoxon rank-sum
Differential Prevalence	Change in fraction of expressing spots	Fisher’s exact (binarised at \(\epsilon\))
Differential Spatial Pattern	Structured spatial discrepancy	Moran’s I permutation on SDM

A gene can be assigned to multiple alteration types.

[7]:

df_alt = sp.disentangle_alteration_types(
    result_ref=result1,
    adata=adata,
    crg_mask=crg_result['is_crg'],
    epsilon=0.01,              # binarisation threshold for prevalence
    n_neighbors=15,
    fdr_threshold=0.05,
    abundance_effect_size=1,
    spatial_I_threshold=0.5,   # require Moran's I >= 0.1 for spatial
)

print(f'Disentanglement results for {len(df_alt)} CRGs:')
print(df_alt[['is_diff_abundance', 'is_diff_prevalence', 'is_diff_spatial']].sum())

Disentanglement results for 193 CRGs:
is_diff_abundance     95
is_diff_prevalence    47
is_diff_spatial       62
dtype: int64

Detailed Results Table

[8]:

# Merge CRG scores into the disentanglement table
df_alt['crg_score'] = adata.var.loc[df_alt.index, 'crg_score'].values
df_alt['crg_fdr'] = adata.var.loc[df_alt.index, 'crg_fdr'].values

df_display = df_alt.sort_values('crg_score', ascending=False)
df_display[['crg_score', 'crg_fdr', 'abundance_fdr', 'prevalence_fdr', 'spatial_fdr']].head(15)

[8]:

	crg_score	crg_fdr	abundance_fdr	prevalence_fdr	spatial_fdr
gene
IGKC	6.109030e+06	0.000000e+00	0.000000e+00	1.0	0.0
IGHA1	3.070430e+06	0.000000e+00	0.000000e+00	1.0	0.0
IGHG4	1.892223e+06	0.000000e+00	0.000000e+00	1.0	0.0
IGHG3	7.547800e+05	0.000000e+00	0.000000e+00	1.0	0.0
IGLC3	4.252860e+05	0.000000e+00	0.000000e+00	1.0	0.0
TMSB10	3.985322e+05	0.000000e+00	0.000000e+00	1.0	0.0
IGLC2	1.883705e+05	0.000000e+00	0.000000e+00	1.0	0.0
RACK1	1.202756e+05	0.000000e+00	0.000000e+00	1.0	0.0
ERBB2	1.130827e+05	0.000000e+00	7.353808e-286	1.0	0.0
GAPDH	1.039025e+05	0.000000e+00	0.000000e+00	1.0	0.0
MGP	1.028431e+05	0.000000e+00	3.347538e-280	1.0	0.0
IGHG1	9.073847e+04	0.000000e+00	0.000000e+00	1.0	0.0
S100A9	8.803026e+04	0.000000e+00	0.000000e+00	1.0	0.0
PFN1	5.699932e+04	2.245537e-12	1.087336e-297	1.0	0.0
ACTB	5.231029e+04	1.744782e-11	2.150568e-265	1.0	0.0

Example CRG Visualisation

Visualise the reference expression, view-transferred SGM, target self-reconstruction, and SDM.

[9]:

# Visualise
top_gene = 'S100A8'
gene_idx = list(adata.var_names).index(top_gene)

X_ref_sgm = result1.X.toarray() if sparse.issparse(result1.X) else np.array(result1.X)

# Reference slice: original expression on reference coordinates
ref_expr = ref.X.toarray()[:, gene_idx] if sparse.issparse(ref.X) else np.array(ref.X)[:, gene_idx]
ref_coords = ref.obsm['spatial']

# Target slice
tgt_coords = adata.obsm['spatial']
expr_vt = X_ref_sgm[:, gene_idx]      # view-transferred

# Original target expression
tgt_raw = adata.X.toarray()[:, gene_idx] if sparse.issparse(adata.X) else np.array(adata.X)[:, gene_idx]

# Normalize (library-size) + log1p before visualisation
def _norm_log(x):
    total = x.sum()
    if total > 0:
        x = x / total * np.median([ref_expr.sum(), expr_vt.sum(), tgt_raw.sum()])
    return np.log1p(x)

ref_expr = _norm_log(ref_expr)
expr_vt  = _norm_log(expr_vt)
tgt_raw  = _norm_log(tgt_raw)

sdm = tgt_raw - expr_vt               # SDM: original target - view-transferred

fig, axes = plt.subplots(1, 4, figsize=(22, 4.5))

# Unified color scale for expression panels
vmin_expr = 0
vmax_expr = max(ref_expr.max(), expr_vt.max(), tgt_raw.max())

# Panel 0: Reference expression (on reference coordinates)
sc0 = axes[0].scatter(ref_coords[:, 0], ref_coords[:, 1], c=ref_expr, s=8, cmap='Reds', vmin=vmin_expr, vmax=vmax_expr)
axes[0].set_title(f'{top_gene} — Reference Expression')
axes[0].set_aspect('equal')
axes[0].invert_yaxis()
plt.colorbar(sc0, ax=axes[0], shrink=0.7)

# Panel 1: View-transferred SGM (on target coordinates)
sc1 = axes[1].scatter(tgt_coords[:, 0], tgt_coords[:, 1], c=expr_vt, s=8, cmap='Reds', vmin=vmin_expr, vmax=vmax_expr)
axes[1].set_title(f'{top_gene} — View-transferred (Ref→Tgt)')
axes[1].set_aspect('equal')
axes[1].invert_yaxis()
plt.colorbar(sc1, ax=axes[1], shrink=0.7)

# Panel 2: Original target expression (on target coordinates)
sc2 = axes[2].scatter(tgt_coords[:, 0], tgt_coords[:, 1], c=tgt_raw, s=8, cmap='Reds', vmin=vmin_expr, vmax=vmax_expr)
axes[2].set_title(f'{top_gene} — Original Target Expression')
axes[2].set_aspect('equal')
axes[2].invert_yaxis()
plt.colorbar(sc2, ax=axes[2], shrink=0.7)

# Panel 3: SDM (symmetric scale)
vlim = max(abs(sdm.min()), abs(sdm.max()))
sc3 = axes[3].scatter(tgt_coords[:, 0], tgt_coords[:, 1], c=sdm, s=8, cmap='RdBu_r', vmin=-vlim, vmax=vlim)
axes[3].set_title(f'{top_gene} — SDM ($x^t - \\hat{{x}}^{{r→t}}$)')
axes[3].set_aspect('equal')
axes[3].invert_yaxis()
plt.colorbar(sc3, ax=axes[3], shrink=0.7)

for ax in axes:
    ax.set_xticks([])
    ax.set_yticks([])

if top_gene in df_alt.index:
    alt_type = df_alt.loc[top_gene, 'alteration_type']
else:
    alt_type = 'N/A'
fig.suptitle(f'Gene: {top_gene} | CRG Score: {scores[gene_idx]:.2f} | Alteration: {alt_type}', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

Task 3: CRR Delineation

Delineate condition-responsive tissue regions (CRR) using two approaches:

Marker-based: For a known condition-associated gene \(g\), compute the SDM \(D_g = |\hat{\mathbf{x}}^{r \to t}_g - \mathbf{x}^t_g|\), smooth with a Gaussian kernel (\(\sigma=1\)), then apply Otsu’s method to find the optimal threshold \(\tau^*\) that maximises between-group variance.
Marker-free: When no markers are available, use all CRGs. Each spot \(i\) gets a discrepancy vector \(\mathbf{f}_i \in \mathbb{R}^{|I_{CRG}|}\). Leiden clustering on \(\{\mathbf{f}_i\}\) yields two spatial domains; the CRR is the one with higher aggregate discrepancy.

Marker-based CRR Delineation

Use ERBB2 as the marker gene to delineate the CRR.

[10]:

# Marker-based delineation with ERBB2
marker_gene = 'ERBB2'
print(f'Marker gene for CRR delineation: {marker_gene}')

crr_marker = sp.delineate_crr_marker(
    result_ref=result1,
    adata=adata,
    gene=marker_gene,
    sigma=1.0,
)

n_crr = crr_marker['crr_mask'].sum()
print(f'CRR spots: {n_crr}/{len(crr_marker["crr_mask"])} '
      f'(Otsu threshold: {crr_marker["threshold"]:.2f})')

# Visualise
sp.plot_crr(
    coords=crr_marker['coords'],
    crr_mask=crr_marker['crr_mask'],
    sdm_smoothed=crr_marker['sdm_smoothed'],
    title=f'Marker-based CRR ({marker_gene})',
)

Marker gene for CRR delineation: ERBB2
CRR spots: 96/613 (Otsu threshold: 2.70)

Store CRR Results

[11]:

# Store CRR annotations in adata.obs
adata.obs['crr_marker'] = crr_marker['crr_mask'].astype(int)
adata.obs['crr_marker_sdm'] = crr_marker['sdm_smoothed']

print(f'Marker-based CRR: {adata.obs["crr_marker"].sum()} spots')

Marker-based CRR: 96 spots

Save All Results

[12]:

# Save disentanglement table
df_alt.to_csv('crg_disentanglement_results.csv')

# Save full gene-level results
adata.var[['crg_score', 'crg_pval', 'crg_fdr', 'is_crg']].to_csv('crg_identification_results.csv')

# Save GO enrichment results
# df_pathway.to_csv('crg_pathway_enrichment_results.csv', index=False)

# Save CRR results
adata.obs[['crr_marker', 'crr_marker_sdm']].to_csv('crr_delineation_results.csv')

print('All results saved.')

All results saved.