User Manual

Overview

CABIgo identifies potential cancer biomarkers using deep learning that integrates protein-protein interaction (PPI) networks with Gene Ontology (GO) and knowledge graph embeddings.

Quick Start

  1. Go to the Analysis page
  2. Select your cancer type
  3. Enter a gene list OR upload PPI network files
  4. Click "Run Analysis"
  5. View results and download predictions

Example Data

Download example files to test the tool or use as templates for your own data.

nodes.csv

Example nodes file containing gene symbols and optional degree information.


edges.csv

Example edges file containing source-target gene pairs with confidence scores.


All Example Files

Download all example files as a single ZIP archive.


Sample Gene List

You can also copy-paste this gene list directly into the analysis form:

ABAT
ABCA6
ABCA9
ACAA2
ACACB
ACADL
ACADS
ACKR1
ACKR3
ACKR4
ACO1
ACSL1
ACSM5
ACSS2
ACSS3
ACVR1C
ADAM12
ADAMTS3
ADAMTS5
ADAMTSL4
ADCY4
ADCYAP1R1
ADGRA2
ADGRD1
ADGRL4
ADH1B
ADH1C
ADHFE1
ADIPOQ
ADIRF
ADM
ADRA1A
ADRA2A
ADRB1
ADRB2
AGPAT2
AGR2
AGTR1
AHNAK
AHNAK2
AIFM2
AKAP12
AKAP9
AKR1C1
AKR1C3
ALCAM
ALDH1A1
ALDH1L1
ALDH2
ALDH3A2
ALDH3B2
ALK
AMOTL2
ANG
ANGPT1
ANGPTL4
ANK2
ANK3
ANLN
ANO3
ANTXR2
ANXA1
AOC3
AP1M2
AP1S2
APCDD1
APOB
APOBEC3A
APOBEC3B
APOC1
AQP1
AQP3
ARHGEF6
ASPA
ASPH
ASPM
ATAD2
ATE1
ATG10
ATM
ATP1A2
ATP1B1
ATP8B4
ATR
ATXN7
AURKA
AVPR1A
AZGP1
AZIN1
BABAM1
BAMBI
BARD1
BCL11B
BCL2A1
BCL6
BCLAF1
BCOR
BGN
BHMT2
BICDL1
BIK
BIN1
BIRC5
BLM
BMP2
BMP6
BOK
BRCA1
BRCA2
BRIP1
BUB1
BUB1B
BUB3
C19orf12
C4orf19
C6
CA3
CA4
CALB2
CASQ2
CAT
CAV1
CAV2
CAVIN1
CAVIN2
CBX2
CBX7
CBX8
CCBE1
CCDC170
CCDC69
CCN4
CCNA2
CCNB1
CCNB2
CCNE2
CCR5
CCR7
CCT2
CD209
CD24
CD248
CD300LF
CD300LG
CD34
CD36
CD37
CD9
CD99L2
CDC14B
CDC20
CDC42EP2
CDC45
CDC7
CDCA3
CDCA5
CDCA7
CDCA8
CDCP1
CDH1
CDH11
CDH5
CDK1
CDK12
CDKN1C
CDKN2A
CDKN2B
CDKN2C
CDKN3
CDO1
CDON
CDS1
CDYL2
CEACAM6
CEBPA
CENPE
CENPF
CENPK
CENPN
CENPU
CEP41
CEP55
CERS6
CETP
CFAP298
CFD
CFH
CFL2
CGN
CHEK1
CHEK2
CHMP4C
CIDEA
CIDEC
CKMT2
CKS2
CLDN3
CLDN4
CLDN5
CLDN7
CLEC7A
CLGN
CLIC5
CLU
CNKSR2
CNR1
CNRIP1
CNTNAP2
COL10A1
COL11A1
COL1A1
COL6A6
COMP
COX11
COX7A1
CPEB1
CRABP2
CREB3L4
CREB5
CRNKL1
CRYAB
CTHRC1
CTPS1
CXADR
CXCL10
CXCL11
CXCL12
CXCL8
CXCR4
CYBRD1
CYP26B1
CYTH2
CYTIP
DBF4
DCLRE1B
DCN
DDR2
DEGS2
DEPDC1
DGAT2
DHX15
DIO2
DLC1
DLGAP5
DLX2
DMD
DMTN
DNAJC1
DPT
DSP
DST
DTL
DUSP5
E2F3
E2F5
E2F8
EBF1
EBF2
EBF3
ECM2
ECT2
EDNRB
EFEMP1
EFNA1
EFNA4
EGFLAM
EGFR
EGLN3
EHBP1
EHD2
EIF1
ELF3
ELL
EMCN
ENAH
ENC1
ENPP2
EP300
EPAS1
EPB41L2
EPB41L5
EPB42
EPCAM
EPN3
EPSTI1
ERBB2
ERBB3
ERBB4
ERG
ESPN
ESR1
ESRP1
ETFB
ETNK1
EXO1
EZH1
EZH2
EZR
F10
F12
FA2H
FABP4
FABP5
FADS3
FAM83D
FANCD2
FANCI
FAXDC2
FBLN2
FBLN5
FBN1
FBXO11
FCRL4
FDPS
FEN1
FERMT2
FEZ1
FGF1
FGF2
FGF3
FGFR2
FGFR3
FHL1
FKBP11
FKBP4
FN1
FNDC5
FOS

Input Requirements

Option 1: Gene List

Provide a list of gene symbols (HGNC format). The system automatically constructs a PPI network from the STRING database.

Minimum requirement: At least 10 genes are required for analysis.

Format

  • One gene symbol per line, OR
  • Comma-separated gene symbols

Example

BRCA1
BRCA2
TP53
EGFR
MYC
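
The two accepted formats can be checked before submitting. The following is a minimal sketch, not CABIgo's own parser: the function names are hypothetical, and dropping duplicates is an assumption rather than documented behavior. Symbols are not upper-cased, because some HGNC symbols (for example C19orf12) contain lowercase letters.

```python
import re

MIN_GENES = 10  # minimum stated in this manual


def parse_gene_list(text: str) -> list[str]:
    """Split on commas or line breaks, trim whitespace, drop blanks and repeats."""
    symbols = []
    seen = set()
    for token in re.split(r"[,\r\n]+", text):
        symbol = token.strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            symbols.append(symbol)
    return symbols


def check_gene_list(text: str) -> list[str]:
    """Parse the list and enforce the 10-gene minimum (hypothetical helper)."""
    genes = parse_gene_list(text)
    if len(genes) < MIN_GENES:
        raise ValueError(
            f"Please provide at least {MIN_GENES} genes (got {len(genes)})"
        )
    return genes
```

Calling `check_gene_list` on the five-gene example above raises an error, since the list is below the 10-gene minimum.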

Option 2: PPI Network Files

Upload pre-constructed network files in CSV format.

Note: The tool does not accept .tsv files exported from STRING directly. Convert them to the CSV format described below.

nodes.csv

Column    Required    Description
SYMBOL    Yes         Gene symbol (HGNC format)
degree    No          Node degree (calculated if not provided)

Example nodes.csv:

SYMBOL,degree
BRCA1,15
BRCA2,12
TP53,25
EGFR,18

edges.csv

Column    Required    Description
source    Yes         Source gene symbol
target    Yes         Target gene symbol
weight    No          Edge confidence score (0-1)

Example edges.csv:

source,target,weight
BRCA1,BRCA2,0.95
BRCA1,TP53,0.87
TP53,EGFR,0.82
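
A STRING .tsv export can be converted to the two CSV files with a short script. This is a sketch under stated assumptions: it expects the TSV to carry the columns `#node1`, `node2`, and `combined_score` (adjust the names if your export differs), treats scores above 1 as being on a 0-1000 scale, and assumes each interaction is listed once. The function name is hypothetical.

```python
import csv
import io
from collections import Counter


def string_tsv_to_csv(tsv_text: str) -> tuple[str, str]:
    """Return (nodes_csv, edges_csv) text built from a STRING TSV export."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    edges = []
    degree = Counter()
    for row in reader:
        # Assumed column names; not confirmed by this manual.
        source, target = row["#node1"], row["node2"]
        weight = float(row["combined_score"])
        if weight > 1:  # assumed 0-1000 scale, rescale to 0-1
            weight = weight / 1000
        edges.append((source, target, weight))
        degree[source] += 1
        degree[target] += 1

    edges_out = io.StringIO()
    writer = csv.writer(edges_out, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    writer.writerows(edges)

    nodes_out = io.StringIO()
    writer = csv.writer(nodes_out, lineterminator="\n")
    writer.writerow(["SYMBOL", "degree"])
    for symbol in sorted(degree):
        writer.writerow([symbol, degree[symbol]])
    return nodes_out.getvalue(), edges_out.getvalue()
```

Read the export with `open(path, encoding="utf-8").read()`, pass the text to `string_tsv_to_csv`, and write the two returned strings to nodes.csv and edges.csv.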

Analysis Pipeline

1. PPI Construction

If a gene list is provided, the STRING database is queried to build the interaction network. Edges are then filtered by confidence score and nodes by degree.
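
The two filters can be illustrated as follows. This is a minimal sketch rather than CABIgo's implementation: the function name is hypothetical, the defaults are taken from the Parameters section, and the degree filter is applied once. Whether the tool repeats the degree filter until no node falls below the threshold is not stated in this manual.

```python
from collections import Counter


def filter_network(edges, min_confidence=0.7, min_degree=2):
    """edges: iterable of (source, target, weight). Returns (nodes, kept_edges)."""
    # Step 1: drop edges below the confidence threshold.
    kept = [(s, t, w) for s, t, w in edges if w >= min_confidence]

    # Step 2: compute node degree on the remaining edges.
    degree = Counter()
    for s, t, _ in kept:
        degree[s] += 1
        degree[t] += 1

    # Step 3: single pass, remove low-degree nodes and the edges touching them.
    nodes = {n for n, d in degree.items() if d >= min_degree}
    kept = [(s, t, w) for s, t, w in kept if s in nodes and t in nodes]
    return nodes, kept
```

A gene connected only through low-confidence edges loses all its edges in step 1 and is therefore removed in step 3, which is one way a list can shrink below a usable size.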

2. GO Term Embedding

Retrieve Gene Ontology annotations for each protein. Generate 768-dimensional embeddings using a fine-tuned BioBERT model.

3. GeoKG Embedding

Map proteins to UniProt IDs and retrieve pre-computed knowledge graph embeddings (50 dimensions supported).

4. Feature Assembly

Concatenate GO and GeoKG embeddings. Filter out proteins not present in both embedding sets.
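
Feature assembly amounts to an inner join followed by concatenation. The sketch below assumes each embedding set is a dictionary from gene symbol to a list of floats; that data layout and the function name are illustrative, not the tool's internal format.

```python
def assemble_features(go_emb, kg_emb):
    """Concatenate per-gene vectors, keeping only genes present in both sets."""
    shared = sorted(go_emb.keys() & kg_emb.keys())
    return {gene: list(go_emb[gene]) + list(kg_emb[gene]) for gene in shared}


# Placeholder vectors with the dimensions named in this manual (768 and 50).
go = {"BRCA1": [0.0] * 768, "TP53": [0.0] * 768}
kg = {"BRCA1": [0.0] * 50, "EGFR": [0.0] * 50}
features = assemble_features(go, kg)
print(sorted(features), len(features["BRCA1"]))
```

Only BRCA1 survives here, because TP53 lacks a GeoKG vector and EGFR lacks a GO vector. With 768-dimensional GO and 50-dimensional GeoKG embeddings, each remaining gene has an 818-dimensional feature vector.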

5. Model Prediction

Apply the cancer-specific Graph Neural Network. Output a probability and a binary prediction for each gene.

6. Enrichment Analysis

Perform GO and KEGG pathway enrichment on predicted biomarkers (requires at least 5 predicted biomarkers).
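
Over-representation of a GO term or pathway is commonly assessed with a one-sided hypergeometric test. This manual does not state which test CABIgo uses, so the sketch below illustrates the general idea only; the function names and the choice of background set are assumptions.

```python
from math import comb

MIN_BIOMARKERS = 5  # minimum stated in this manual


def hypergeom_pvalue(background, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn from `background` genes,
    of which `annotated` carry the term."""
    upper = min(annotated, selected)
    total = comb(background, selected)
    tail = sum(
        comb(annotated, i) * comb(background - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def term_enrichment(biomarkers, term_genes, background_genes):
    """p-value for one term; refuses to run on fewer than 5 biomarkers."""
    background = set(background_genes)
    hits = set(biomarkers) & background
    if len(hits) < MIN_BIOMARKERS:
        raise ValueError(f"Enrichment needs at least {MIN_BIOMARKERS} biomarkers")
    term = set(term_genes) & background
    return hypergeom_pvalue(len(background), len(term), len(hits), len(hits & term))
```

In practice the p-values for all tested terms would also need a multiple-testing correction, which this sketch omits.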

Output Interpretation

Predictions Table

Column                 Description
SYMBOL                 Gene symbol
biomarker_probability  Probability of being a biomarker (0-1)
predicted_biomarker    Binary prediction (0 = non-biomarker, 1 = biomarker)
confidence             Confidence in prediction
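
Downloaded predictions can be ranked with a few lines of code. This sketch assumes the download is a CSV file whose header row uses the column names listed above; the function name and the file name in the usage note are hypothetical.

```python
import csv
import io


def top_biomarkers(predictions_csv: str, limit: int = 10):
    """Return (SYMBOL, probability) pairs for predicted biomarkers, highest first."""
    rows = csv.DictReader(io.StringIO(predictions_csv))
    hits = [
        (row["SYMBOL"], float(row["biomarker_probability"]))
        for row in rows
        # predicted_biomarker is 1 for biomarkers and 0 otherwise
        if int(float(row["predicted_biomarker"])) == 1
    ]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:limit]
```

For example, `top_biomarkers(open("predictions.csv", encoding="utf-8").read(), limit=20)` would list the 20 highest-probability predicted biomarkers from a file saved under that name.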

Network Visualization

The network shows two kinds of nodes:

  • Predicted biomarkers
  • Non-biomarkers (neighbors of biomarkers)

Node size is proportional to network degree.

Enrichment Results

  • GO Enrichment: Biological Process, Molecular Function, Cellular Component
  • KEGG Pathways: Enriched biological pathways with links to KEGG database

Parameters

Parameter        Default         Range                        Description
Cancer Type      Breast Cancer   Breast, Lung, Glioblastoma   Cancer-specific model to use for prediction
GeoKG Dimension  50              50, 100, 200, 500, 1000      Knowledge graph embedding dimension
Min Confidence   0.7             0.4 - 0.9                    Minimum STRING confidence score for edges
Min Degree       2               1 - 5                        Minimum node degree (nodes below are removed)
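
When preparing runs in a script, the defaults and ranges above can be captured in a small validated settings object. This is an illustrative sketch only: the class and field names are hypothetical and do not correspond to a documented CABIgo API, and the range endpoints are assumed to be inclusive.

```python
from dataclasses import dataclass

CANCER_TYPES = ("Breast", "Lung", "Glioblastoma")
GEOKG_DIMENSIONS = (50, 100, 200, 500, 1000)


@dataclass(frozen=True)
class AnalysisParams:
    cancer_type: str = "Breast"
    geokg_dimension: int = 50
    min_confidence: float = 0.7
    min_degree: int = 2

    def __post_init__(self):
        # Reject values outside the ranges listed in the Parameters table.
        if self.cancer_type not in CANCER_TYPES:
            raise ValueError(f"cancer_type must be one of {CANCER_TYPES}")
        if self.geokg_dimension not in GEOKG_DIMENSIONS:
            raise ValueError(f"geokg_dimension must be one of {GEOKG_DIMENSIONS}")
        if not 0.4 <= self.min_confidence <= 0.9:
            raise ValueError("min_confidence must be between 0.4 and 0.9")
        if not 1 <= self.min_degree <= 5:
            raise ValueError("min_degree must be between 1 and 5")
```

`AnalysisParams()` yields the defaults, while `AnalysisParams(min_confidence=0.95)` raises a `ValueError` because 0.95 is outside the 0.4 - 0.9 range.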

Troubleshooting

Q: "Please provide at least 10 genes"

A: The analysis requires a minimum of 10 genes for meaningful network construction. Add more genes to your list. Lists well above this minimum are recommended for PPI construction and reliable predictions.

Q: "Network too small after filtering"

A: Too many genes were removed due to low degree or missing embeddings. Try:

  • Lowering the minimum confidence threshold
  • Lowering the minimum degree requirement
  • Adding more genes to your list

Q: "Only X genes have both GO and GeoKG embeddings"

A: Some genes in your list don't have GO annotations or aren't in the knowledge graph. This is normal for less-characterized genes.

Q: Analysis is taking too long

A: Large gene lists (>500 genes) may take 2-5 minutes. The STRING API query is usually the slowest step.
