Tutorial 2 — NSCLC (CosMx)

Train PHAROST on bulk RNA-Seq labels + 10x Xenium spatial transcriptomics for three drugs, then run downstream analyses on the predicted cell-resolved drug-response scores.

import os
import sys
import pandas as pd
import torch
from tqdm import tqdm

sys.path.insert(0, '.')
import pharost
/users/rwang257/.conda/envs/PHAROST_env/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Model Training

Train one PHAROST model per drug. Bulk RNA-Seq is the source domain (with known sensitive/resistant labels), spatial transcriptomics is the target domain. Domain alignment is enforced by LMMD + CORAL losses on the latent representations.

Configuration

Hyperparameters and input file locations. The bulk drug-response label CSV (ALL_label_harmonized.csv) provides the per-cell-line sensitivity labels; per-drug subsets are written to Selected_drug/{drug}.csv on first use.

n_epochs = 60
batch_size = 20
selected_drugs = ['GEFITINIB', 'OSIMERTINIB', 'CRIZOTINIB']

file_dir = 'data/CosMx_NSCLC_Preprocessed'
result_dir = 'NSCLC_result'
response_dir = f'{file_dir}/Selected_drug'
os.makedirs(response_dir, exist_ok=True)

drug_response = pd.read_csv('data/Preprocessed_Bulk_All/ALL_label_harmonized.csv', index_col=0)

Train per drug

For each drug we (i) cache its label column, then (ii) call pharost.train end-to-end. Each run writes the trained model, predicted probabilities, and a full training log to NSCLC_result/{drug}/.

for drug in tqdm(selected_drugs, desc="Processing drugs"):
    torch.cuda.empty_cache()

    response_filename = f'{response_dir}/{drug}.csv'
    if not os.path.exists(response_filename):
        drug_response[[drug]].to_csv(response_filename)

    pharost.train(
        p_bulk_gene_exp=f'{file_dir}/bulk_processed.csv',
        p_bulk_label=response_filename,
        p_adata=f'{file_dir}/NSCLC_section_28.h5ad',
        out_dir=f'{result_dir}/{drug}',
        n_epochs=n_epochs,
        batch_size=batch_size,
    )
Processing drugs:   0%|          | 0/3 [00:00<?, ?it/s]
Processing drugs: 100%|██████████| 3/3 [15:34<00:00, 311.54s/it]

Downstream Analysis

Load predictions back into adata.obs and explore cell-type-resolved drug-response patterns: spatial maps, per-celltype proportions, gene-correlations, and bivariate spatial coexpression plots.

Plotting setup

Editable PDF text (fonttype=42) and global scanpy figure params.

import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.colors import LinearSegmentedColormap

mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
sc.set_figure_params(dpi_save=300, frameon=False, fontsize=7, format='eps', transparent=True)

Cell-type spatial map

Load the spatial AnnData and color each cell by its annotated cell type.

color_palette = {
    'Tumor core': "#FDBA12",       
    'Tumor surface': "#D62828",     
    'Desmoplastic stroma': "#ABD8C9",   
    'Smooth muscle structures': "#8d5749",  
    'T cell aggregates': "#E5989B",     
    'Macrophage islands': "#6A4C93",   
    'DC islands': "#08ada4",            
    'Airways': "#EEC994",               
    'Vascular stroma': "#83ccec",       
    'Alveolar spaces': "#5C677D",      
    'Excluded': "#EAEAEA"
}

adata = sc.read_h5ad(f'{file_dir}/NSCLC_section_28.h5ad')
adata.obs['niches_3D'] = adata.obs['niches_3D'].astype('category')

sc.pl.spatial(
    adata,
    color='niches_3D',
    spot_size=4.5,
    palette=[color_palette[c] for c in adata.obs['niches_3D'].cat.categories],
    title='Niche Types (Section 28)',
    legend_loc='right margin',
    show=False,
)
fig = plt.gcf()
plt.show()
_images/b9ace782a0cc3db3315b482e09a4b86cd95fb5b0fc8e63230d7e6ddc22c9488b.png

Drug-response spatial maps

pharost.analysis.load_response_prediction populates adata.obs[{drug}] from each drug’s predicted_probabilities.csv. The spatial plot then shows the probability distribution across the tissue.

adata = pharost.analysis.load_response_prediction(
    adata,
    drugs=selected_drugs,
    path_template=lambda d: f'{result_dir}/{d}/predicted_probabilities.csv',
)

drug_cmap = LinearSegmentedColormap.from_list(
    'pink_yellow_teal', ['#CADD9B', '#494C4F', '#6EB2F1'], N=256,
)

sc.pl.spatial(
    adata, color=selected_drugs, spot_size=4.5,
    cmap=drug_cmap, vmax=1, show=False,
)
fig = plt.gcf()

os.makedirs('figures_NSCLC', exist_ok=True)
plt.savefig('figures_NSCLC/01_Spatial_drug_response.png', bbox_inches='tight', dpi=500)
plt.show()
_images/8f0ffa438b8417d882765c85e856f561d7093f15843db8de0f673bdd4e09a54d.png

Marker-gene + drug-response dotplot

Compare canonical breast-cancer markers (ERBB2, ESR1, PGR) against predicted drug responses in one dotplot, grouped by cell type.

var_names = ['EGFR', 'MET'] + selected_drugs

dot_cmap = LinearSegmentedColormap.from_list(
    "blue_white_orange", ["#A0C5E3", "#F7F8F0", "#FCB55C"]
)
sc.pl.dotplot(
    adata,
    var_names=var_names,
    groupby='niches_3D',
    categories_order=list(color_palette.keys()),
    standard_scale='var',
    cmap=dot_cmap,
    show=False,
)
plt.savefig('figures_NSCLC/01_Drug_gene_dotplot_summary.png', bbox_inches='tight', dpi=300)
plt.show()
_images/391712780229d38e0684cfd243e341e3f57f21ffc2e65950e78d85a3bc31661c.png

Sensitive-cell proportion per cell type

For each drug, plot the fraction of cells with predicted probability > 0.5 within each cell type. Highlights which populations the model considers sensitive.

pharost.analysis.plot_response_celltype_prop(
    adata,
    target_drugs=selected_drugs,
    cell_type_col='niches_3D',
    save=True,
    file_format='pdf',
    palette = ['#9BE8F0', '#74CEF7', '#0180CC'], 
    save_dir='figures_NSCLC',
)
Figure saved to figures_NSCLC/response_celltype_prop.pdf
_images/703a39d5efd4c673195f5c62555266b7c1221c58d7751640db9e526d3cdfe6a9.png