
Cancer Biomarker Discovery

A machine learning pipeline for multi-class cancer classification using RNA-Seq data. The final model reports a 100% F1-macro score across 5 cancer types and identifies a 50-gene biomarker panel that includes candidate therapeutic targets.

  • 🎯 F1-Macro Score: 100% (source: model_results_summary.txt)
  • 🧬 Top Biomarkers: 50 (source: gene_coefficient_rankings.csv)
  • 📊 Cancer Types: 5 (source: labels.csv)
  • 🔬 Samples: 801 (source: labels.csv)

  • Raw Features: 20,531
  • After Selection: 1,000
  • Final Panel: 50
  • Pathways: 2

📊 Data Sources

All data displayed on this dashboard is derived directly from project output files. Biomarker rankings and coefficients come from gene_coefficient_rankings.csv. Model performance metrics are from model_results_summary.txt. Class distribution data is from labels.csv (801 samples).

PHASE 5A Model Interpretation: Coefficient Analysis

Leveraging Logistic Regression coefficients to rank genes and identify class-specific biomarkers

5.1 Methodology

Coefficient Extraction

The trained Logistic Regression model's coefficient matrix was extracted to understand how each gene contributes to cancer classification.

  • Matrix shape: 5 classes × 1,000 features
  • Each gene has 5 coefficient values (one per cancer type)
  • Positive coefficient: Higher expression → Higher probability of that class
  • Negative coefficient: Higher expression → Lower probability of that class
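
The sign rule above can be illustrated with a minimal sketch. It assumes a multinomial (softmax) formulation, which the dashboard does not confirm, and uses a hypothetical 5 × 2 coefficient matrix rather than the real 5 × 1,000 one. The names `softmax` and `class_probabilities` are illustrative only.

```python
import math

# Assumed class order (alphabetical); the real order depends on the trained model.
CLASSES = ["BRCA", "COAD", "KIRC", "LUAD", "PRAD"]


def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(coefs, intercepts, expression):
    """coefs has one row per class and one column per gene."""
    scores = [
        b + sum(w * x for w, x in zip(row, expression))
        for row, b in zip(coefs, intercepts)
    ]
    return dict(zip(CLASSES, softmax(scores)))


# Hypothetical coefficients for two genes (not values from the project).
coefs = [
    [0.05, -0.01],   # BRCA
    [-0.02, 0.00],   # COAD
    [-0.01, -0.02],  # KIRC
    [-0.01, 0.08],   # LUAD: positive weight on the second gene
    [-0.01, -0.05],  # PRAD: negative weight on the second gene
]
intercepts = [0.0] * 5

low = class_probabilities(coefs, intercepts, [1.0, 1.0])
high = class_probabilities(coefs, intercepts, [1.0, 10.0])  # raise the second gene only
print(round(low["LUAD"], 3), round(high["LUAD"], 3))
```

Raising the second gene increases the LUAD probability and decreases the PRAD probability. Under softmax the effect of a coefficient is relative to the other classes' coefficients for the same gene, so the sign rule is best read as a guide rather than a guarantee.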

Biomarker Ranking

Genes were ranked by their maximum absolute coefficient across all 5 classes, which serves as a single score of each gene's overall influence on the model's predictions.
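
A minimal sketch of this ranking rule, assuming the coefficients are held as one list of 5 values per gene. The function name `rank_by_max_abs` is hypothetical, and every value below other than the three reported top coefficients (+0.0816, +0.0513, +0.0410) is invented for illustration.

```python
def rank_by_max_abs(coef_by_gene, top_n=50):
    """Rank genes by their largest absolute coefficient across classes."""
    scored = [
        (gene, max(abs(c) for c in coefs))
        for gene, coefs in coef_by_gene.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Assumed column order: BRCA, COAD, KIRC, LUAD, PRAD.
coef_by_gene = {
    "gene_357":   [0.0513, -0.010, -0.020, -0.010, -0.010],
    "gene_15898": [-0.020, -0.010, -0.030, 0.0816, -0.020],
    "gene_9176":  [-0.010, -0.010, -0.005, -0.010, 0.0410],
}

ranking = rank_by_max_abs(coef_by_gene, top_n=2)
print(ranking)
```

Using the absolute value means a strongly negative coefficient ranks as highly as a strongly positive one, so the score captures influence rather than direction.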

Key Finding: The model relies on a sparse, specific set of genes. The top biomarker gene_15898 (STXBP3) was identified as the primary driver for LUAD classification with a coefficient of +0.0816.

📁 Generated Files (Phase 5A)

  • top_50_influential_biomarkers.txt - Ranked list of top 50 genes by coefficient
  • gene_coefficient_rankings.csv - Full 1,000-gene coefficient matrix
  • biomarker_coefficient_heatmap.png - Visual heatmap of top gene coefficients
  • top1-3_gene_*_boxplot.png - Expression distribution box plots

🎯 Top Biomarker for Each Cancer Type

Source: README.md & gene_coefficient_rankings.csv

Cancer Type     | Top Biomarker       | Coefficient | Biological Role
BRCA (Breast)   | gene_357 (AIM2)     | +0.0513     | DNA-sensing inflammasome
COAD (Colon)    | gene_12013          | +0.0285     | Upregulated
KIRC (Kidney)   | gene_3439           | +0.0398     | Upregulated
LUAD (Lung)     | gene_15898 (STXBP3) | +0.0816     | Vesicular transport
PRAD (Prostate) | gene_9176           | +0.0410     | Upregulated

PHASE 5B Pathway Enrichment & Validation

External validation of the biomarker panel using gProfiler pathway analysis

5.2 The Gene ID Mapping Challenge

⚠️ Challenge: The original dataset used custom gene IDs (gene_15898) that were unrecognized by standard bioinformatics tools like gProfiler. This is a common obstacle in RNA-Seq analysis when datasets don't use standard nomenclature.
✅ Solution: A dedicated data mapping exercise was performed, confirming that the numeric identifier corresponded to the official HGNC ID. This allowed successful conversion of the entire biomarker panel to verifiable symbols (e.g., gene_15898 → STXBP3, gene_357 → AIM2).
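
The conversion step might look like the sketch below. It assumes a lookup table from dataset IDs to gene symbols already exists. Only the two mappings stated above are included, and the names `ID_TO_SYMBOL` and `map_gene_ids` are hypothetical.

```python
# Only the two mappings reported on this page; a real table would cover the full panel.
ID_TO_SYMBOL = {
    "gene_15898": "STXBP3",
    "gene_357": "AIM2",
}


def map_gene_ids(gene_ids, id_to_symbol):
    """Split dataset IDs into those with a known symbol and those without."""
    mapped, unmapped = {}, []
    for gene_id in gene_ids:
        symbol = id_to_symbol.get(gene_id)
        if symbol is None:
            unmapped.append(gene_id)  # keep for manual review instead of guessing
        else:
            mapped[gene_id] = symbol
    return mapped, unmapped


mapped, unmapped = map_gene_ids(
    ["gene_15898", "gene_357", "gene_12013"], ID_TO_SYMBOL
)
print(mapped, unmapped)
```

Returning the unmapped IDs separately makes it clear which genes would be silently dropped before an enrichment query.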

🔬 gProfiler Analysis

The final 50-gene biomarker panel was submitted to gProfiler for pathway enrichment analysis. The tool identified statistically significant over-representation in specific biological pathways, validating the functional significance of our biomarker discoveries.
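
Over-representation of this kind is commonly assessed with a hypergeometric tail probability. The sketch below shows that calculation only; it is not gProfiler's implementation, it omits the multiple-testing correction that gProfiler applies, and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(background_size, term_size, query_size, overlap):
    """P(X >= overlap) when drawing query_size genes without replacement
    from a background in which term_size genes carry the annotation."""
    upper = min(term_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: a 50-gene panel with 12 genes annotated to a term
# that covers 1,500 of 20,000 background genes.
p = enrichment_p_value(background_size=20000, term_size=1500,
                       query_size=50, overlap=12)
print(p)
```

A small p-value indicates that the panel contains more genes from the term than random sampling of the background would be expected to produce.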

🧬 Key Pathway Findings

GO:MF

DNA-binding Transcription Factor Activity

Statistically significant enrichment in genes involved in transcriptional regulation, suggesting the biomarker panel captures fundamental gene expression control mechanisms that differ between cancer types.

Biological Significance: Transcription factors are master regulators of cellular identity and are frequently dysregulated in cancer.

GO:MF

Syntaxin Binding / Vesicular Transport

Enrichment driven by the STXBP (Syntaxin Binding Protein) gene family, particularly STXBP3 which emerged as the top biomarker for LUAD classification.

STXBP Family: These genes regulate vesicular trafficking and membrane fusion, processes that are altered in cancer metastasis.
Genes: STXBP3, STXBP-related genes

🏆 Therapeutic Target Candidates

Based on coefficient analysis and pathway validation, the following genes are identified as high-priority therapeutic targets for future research:

STXBP3 (gene_15898)

  • Highest coefficient: +0.0816 for LUAD
  • Function: Vesicular transport regulation
  • Pathway: Syntaxin binding
  • Potential: Lung adenocarcinoma biomarker

AIM2 (gene_357)

  • Top BRCA biomarker: +0.0513
  • Function: DNA-sensing inflammasome
  • Pathway: Innate immune response
  • Potential: Breast cancer immunotherapy target

📊 Discovery Zone (from gene_coefficient_rankings.csv)

Top 50 cancer biomarkers ranked by maximum absolute coefficient from Logistic Regression model

ℹ️ About the Data

  • Gene IDs: Original dataset identifiers (gene_XXXX format)
  • Coefficients: Extracted from the trained Logistic Regression model
  • Dominant Class: Cancer type with the highest absolute coefficient for each gene

All values are sourced from reports/gene_coefficient_rankings.csv.
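
The Dominant Class definition can be sketched as follows. The class order is assumed to be alphabetical, `gene_demo` is an invented gene used to show a negative dominant coefficient, and all values other than +0.0816 and +0.0513 are hypothetical.

```python
from collections import Counter

# Assumed column order of the coefficient rows.
CLASSES = ["BRCA", "COAD", "KIRC", "LUAD", "PRAD"]


def dominant_class(coefs):
    """Return (class, signed coefficient) for the largest absolute coefficient."""
    idx = max(range(len(coefs)), key=lambda i: abs(coefs[i]))
    return CLASSES[idx], coefs[idx]


rows = {
    "gene_15898": [-0.020, -0.010, -0.030, 0.0816, -0.020],
    "gene_357":   [0.0513, -0.010, -0.020, -0.010, -0.010],
    "gene_demo":  [0.010, -0.045, 0.020, 0.005, 0.010],  # invented example
}

table = {gene: dominant_class(coefs) for gene, coefs in rows.items()}
distribution = Counter(cls for cls, _ in table.values())
print(table)
print(distribution)
```

Because the rule uses the absolute value, a gene's dominant class can carry a negative coefficient, meaning higher expression argues against that cancer type.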

🧬 Top 50 Biomarkers

[Table columns: Rank, Gene ID, Max |Coef|, Dominant Cancer, Coefficient]
[Chart: 📊 Top 10 Biomarkers by Coefficient. Source: gene_coefficient_rankings.csv]
[Chart: 🎯 Dominant Cancer Distribution. Source: gene_coefficient_rankings.csv (top 50 genes)]
[Chart: 🧬 Coefficient Heatmap (Top 15 Genes), coefficients by cancer type. Source: gene_coefficient_rankings.csv]

📈 Model Performance (from model_comparison.csv)

Comparison of machine learning models trained on the 1,000 genes retained after feature selection

🏆 Model Comparison

Model                     | F1-Macro | Accuracy | Status
Logistic Regression       | 100.00%  | 100.00%  | 🥇 Best Model
Support Vector Classifier | 99.89%   | 99.88%   | 🥈
Random Forest             | 99.68%   | 99.63%   | 🥉
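
For reference, F1-macro is the unweighted mean of the per-class F1 scores, so each cancer type counts equally regardless of how many samples it has. A minimal sketch with made-up labels (not project data):

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: one BRCA sample is misclassified as LUAD.
y_true = ["BRCA", "BRCA", "LUAD", "LUAD"]
y_pred = ["BRCA", "LUAD", "LUAD", "LUAD"]
score = f1_macro(y_true, y_pred)
print(score)
```

A score of 100% therefore requires every class to have no false positives and no false negatives on the evaluated samples.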

🔬 Why Logistic Regression?

Although the Support Vector Classifier and Random Forest reached similar accuracy, Logistic Regression was chosen as the final model because of its interpretability. Its coefficient matrix allows direct extraction of gene-cancer associations, which enables the Phase 5 biomarker discovery and pathway validation. This demonstrates the value of interpretable models in biomedical research.

[Chart: 📊 Class Distribution. Source: labels.csv (801 samples)]
[Chart: 📈 Model Comparison. Source: model_comparison.csv]