Algorithm Overview¶

scPPIN-py detects functional modules in protein-protein interaction networks by integrating single-cell RNA sequencing (scRNA-seq) data using a Prize-Collecting Steiner Tree (PCST) approach.

Pipeline Overview¶

The algorithm consists of five main steps:

Step	Description
1. Fit BUM Model	Distinguish true signals from noise in p-value distributions using a Beta-Uniform Mixture (BUM) model
2. Compute Node Scores	Transform p-values to node weights (prizes) using the fitted BUM model
3. Prepare Edge Costs	Compute edge costs from network topology, optionally incorporating edge weights (confidence scores)
4. Solve PCST	Find optimal subgraph that maximizes prize (node scores) minus cost (edge costs)
5. Return Module	Extract and return the connected functional module

Step 1: Fit BUM Model¶

The Beta-Uniform Mixture (BUM) model is used to separate true signals from noise in p-value distributions from differential expression analysis. The model assumes p-values come from a mixture of:

Uniform component (noise): P-values uniformly distributed in [0, 1]
Beta component (signal): P-values following a Beta distribution

The model is parameterized by:

\(\lambda\): Mixing parameter (proportion of uniform/noise component)
\(\alpha\): Beta distribution shape parameter (\(0 < \alpha < 1\))

Smaller \(\alpha\) values indicate stronger signal enrichment in small p-values.

Step 2: Compute Node Scores¶

Node scores (prizes) are computed from p-values using the fitted BUM model:

\[\text{score}(p) = -\log_{10}(p)\]

Where \(p\) is the p-value. Nodes with smaller p-values receive higher scores (prizes), making them more likely to be included in the detected module.

The network is automatically filtered to only include genes with available p-values before module detection.

Step 3: Prepare Edge Costs¶

Edge costs determine the penalty for including an edge in the module. The default cost is based on the network topology, but can be adjusted using edge weights (confidence scores):

\[\text{cost}(e) = \text{base\_cost} \cdot (1 - w(e) \cdot \text{scale}) + c_0\]

Where:

\(w(e)\): Edge weight (in [0, 1]), higher values indicate more confidence
\(\text{scale}\): Scaling factor for weight influence (default: 1.0)
\(c_0\): Minimum cost to prevent zero-cost edges (default: 0.1 · base_cost)

This formula ensures that:

Higher edge weights → Lower costs → More likely to include edge
Minimum cost \(c_0\) prevents numerical instability from zero-cost edges

If no edge weights are provided, uniform costs are used.

Step 4: Solve PCST¶

The Prize-Collecting Steiner Tree (PCST) problem finds a connected subgraph that maximizes:

\[\text{objective} = \sum_{v \in V'} \text{prize}(v) - \sum_{e \in E'} \text{cost}(e)\]

Where:

\(V'\) is the set of nodes in the module
\(E'\) is the set of edges in the module
\(\text{prize}(v)\) is the node score from Step 2
\(\text{cost}(e)\) is the edge cost from Step 3

The solution is guaranteed to be a connected subgraph (tree) that balances including high-scoring nodes while minimizing connectivity costs.

Step 5: Return Module¶

The detected module is returned as an igraph Graph containing:

Nodes: Genes significantly contributing to the module
Edges: Protein-protein interactions connecting module genes
Node attributes: Original p-values, computed scores, BUM parameters

False Discovery Rate (FDR) Control¶

The FDR threshold controls which nodes are considered “significant” for module detection:

Lower FDR (e.g., 0.001): Only genes with very small p-values are included → Smaller, more stringent modules
Higher FDR (e.g., 0.05): Genes with moderate p-values can be included → Larger, less stringent modules

Note: FDR guarantees are approximate when using edge weights, as the cost formula adjusts edge costs dynamically.

Original Method¶

This implementation follows the method described in:

Klimm et al. (2020): Functional module detection through integration of single-cell RNA sequencing data with protein–protein interaction networks. BMC Genomics 21, Article number: 756.

DOI: 10.1186/s12864-020-07144-2

Key Differences from R Implementation¶

Class-based API: Object-oriented design with setup method chaining
Edge weight formula: Uses the author’s recommended formula from GitHub Issue #10
Automatic filtering: Network automatically filtered to genes with p-values
Performance: Vectorized NumPy operations provide ~5-10x speedup

Next Steps¶

See Basic Usage Tutorial for step-by-step usage guide
Check API Reference for implementation details
Explore Basic Example for complete working examples