# Performance Benchmarking Guide
This document describes how to benchmark similaripy and track performance across versions.
## Quick Start

### Run All Benchmarks
```bash
# Using tox (recommended for CI)
make benchmark

# Or run locally
make benchmark-local
```
### Run Specific Benchmarks
```bash
# Run only normalization benchmarks (20 tests)
make benchmark-norm

# Run only similarity benchmarks (4 tests)
make benchmark-similarity

# Or use pytest directly for more control
uv run pytest tests/benchmarks.py::TestNormalizationPerformance --benchmark-only
uv run pytest tests/benchmarks.py::TestSPlusPerformance --benchmark-only

# Run a specific size (e.g., only small matrices)
uv run pytest tests/benchmarks.py -k "small" --benchmark-only
```
## Comparing Versions

### Method 1: pytest-benchmark (Recommended)

Save a baseline and compare:
```bash
# 1. Save a baseline from the current version
uv run pytest tests/benchmarks.py --benchmark-only --benchmark-save=v0.2.4

# 2. Make your changes...

# 3. Compare against the baseline
uv run pytest tests/benchmarks.py --benchmark-only --benchmark-compare=v0.2.4

# 4. Fail if performance degrades by more than 10%
uv run pytest tests/benchmarks.py --benchmark-only \
    --benchmark-compare=v0.2.4 \
    --benchmark-compare-fail=mean:10%
```
View all saved benchmarks:

```bash
ls -la .benchmarks/
```
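The `--benchmark-compare-fail=mean:10%` gate boils down to a percent-change check on the mean timing. As a plain-Python sketch of that arithmetic (illustrative function names, not pytest-benchmark's internals):

```python
# Sketch of the regression check behind --benchmark-compare-fail=mean:10%.
# Function names and threshold handling are illustrative, not
# pytest-benchmark's actual implementation.

def mean_regression_pct(baseline_mean: float, current_mean: float) -> float:
    """Percent change of the current mean vs. the baseline mean."""
    return (current_mean - baseline_mean) / baseline_mean * 100.0

def fails_threshold(baseline_mean: float, current_mean: float,
                    max_pct: float = 10.0) -> bool:
    """True when the current run is more than max_pct slower than baseline."""
    return mean_regression_pct(baseline_mean, current_mean) > max_pct

# Example: baseline 12.5 ms, current 14.0 ms -> 12% slower, fails a 10% gate
print(fails_threshold(12.5e-3, 14.0e-3))  # True
```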
### Quick Comparison (Small Matrices Only)

For faster comparison, test only small matrices:
```bash
# Save a baseline with small matrices only
uv run pytest tests/benchmarks.py -k "small" --benchmark-only --benchmark-save=baseline

# Compare after changes
uv run pytest tests/benchmarks.py -k "small" --benchmark-only --benchmark-compare=baseline

# Fail if >15% slower
uv run pytest tests/benchmarks.py -k "small" --benchmark-only \
    --benchmark-compare=baseline \
    --benchmark-compare-fail=mean:15%
```
## Benchmark Structure

### Test Organization
```
tests/
├── benchmarks.py            # Main benchmark suite (24 tests total)
├── test_normalization.py    # Unit tests for normalization
├── test_similarity.py       # Unit tests for similarity
└── conftest.py              # pytest configuration (registers the perf marker)
```
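Registering the `perf` marker in `conftest.py` keeps pytest from warning about an unknown marker. A minimal sketch of such a hook (the project's actual `conftest.py` may do more than this):

```python
# Minimal sketch of a conftest.py registering the custom `perf` marker;
# the suite's real conftest.py may differ.

def pytest_configure(config):
    """Register the `perf` marker so pytest does not warn about it."""
    config.addinivalue_line("markers", "perf: marks performance benchmarks")
```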
### Benchmark Categories

- **Normalization Benchmarks (20 tests)**: L1, L2, max, TF-IDF, and BM25 normalization across 3 matrix sizes
- **Similarity Benchmarks (4 tests)**: s_plus computations with realistic datasets
  - Basic s_plus (2 tests): default parameters on MovieLens-like datasets
  - Complex s_plus (2 tests): full normalization parameters (l1, l2, l3, depopularization, Bayesian shrinkage)
### Matrix Sizes

- Small: 1,000 × 500 (0.05 density), ~25K non-zeros
- Medium: 10,000 × 5,000 (0.01 density), ~500K non-zeros
- Large: 50,000 × 10,000 (0.005 density), ~2.5M non-zeros
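Matrices at these sizes can be reproduced with `scipy.sparse.random`; this is a sketch assuming scipy, and the benchmark suite's own generator may differ in value distribution and dtype:

```python
# Sketch of generating the test matrices above with scipy;
# the suite's own generator may differ in distribution and dtype.
import scipy.sparse as sps

def make_matrix(rows: int, cols: int, density: float, seed: int = 42):
    """Random CSR matrix with ~rows * cols * density non-zeros."""
    return sps.random(rows, cols, density=density, format="csr", random_state=seed)

small = make_matrix(1_000, 500, 0.05)  # ~25K non-zeros
print(small.nnz)  # 25000
```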
## Interpreting Results

### pytest-benchmark Output
```
Name (time in ms)        Mean     StdDev   Min      Max      Rounds
test_l2_norm[small]      12.5     0.3      12.1     13.2     50
test_l2_norm[medium]     245.2    5.1      238.4    255.3    20
```
### What to Look For

- **Mean time**: the primary metric
- **StdDev**: lower is better (more consistent)
- **Regression**: a >10% slowdown should be investigated
- **Improvement**: a >20% speedup is significant
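One quick way to judge whether a standard deviation is "low" is the coefficient of variation (stdev divided by mean). A small sketch with an illustrative 5% threshold:

```python
# Noise check on benchmark output: the coefficient of variation
# (stdev / mean) flags unstable measurements. The 5% threshold is
# illustrative, not a pytest-benchmark setting.

def is_stable(mean: float, stdev: float, max_cv: float = 0.05) -> bool:
    """True when stdev is within max_cv (5%) of the mean."""
    return stdev / mean <= max_cv

# From the sample output above: test_l2_norm[small] has CV = 0.3 / 12.5 = 2.4%
print(is_stable(12.5, 0.3))  # True
```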
## Best Practices

- **Consistent Environment**
  - Close other applications
  - Disable CPU frequency scaling
  - Run on the same hardware for comparisons
- **Warmup Runs**
  - The first run is always slower (compilation, caching)
  - Benchmarks include warmup rounds
- **Multiple Runs**
  - Default: 5-10 rounds per benchmark
  - More rounds = more reliable statistics
- **Realistic Data**
  - Use matrix sizes matching your use case
  - Test with actual data sparsity patterns
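Warmup and repeated rounds are what pytest-benchmark handles for you; a toy harness showing the same two ideas (not pytest-benchmark's actual implementation):

```python
# Toy harness showing warmup rounds plus repeated measurement, the same
# ideas pytest-benchmark applies automatically; not its real internals.
import statistics
import time

def bench(fn, *args, warmup: int = 2, rounds: int = 10):
    """Run fn `warmup` times untimed, then time `rounds` executions."""
    for _ in range(warmup):          # absorb compilation / cache effects
        fn(*args)
    times = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return {"mean": statistics.mean(times), "stdev": statistics.stdev(times)}

stats = bench(sum, range(10_000))
print(sorted(stats))  # ['mean', 'stdev']
```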
## Advanced Usage

### Custom Benchmark
```python
import pytest

@pytest.mark.perf
def test_my_custom_benchmark(benchmark):
    # Build a realistic input (generate_sparse_matrix is the suite's helper)
    mat = generate_sparse_matrix(20_000, 10_000, 0.01)
    # benchmark() times the call over multiple rounds and returns its result
    result = benchmark(my_function, mat, my_param=42)
    assert result is not None
```
### Profiling

For detailed profiling:
```bash
# Using pytest-benchmark
uv run pytest tests/benchmarks.py::test_specific_benchmark \
    --benchmark-only \
    --benchmark-cprofile

# Using Python's profiler
uv run python -m cProfile -o profile.stats your_script.py
uv run python -m pstats profile.stats
```
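Profiling can also be driven from code with the stdlib `cProfile` and `pstats` modules, which is handy for profiling a single call inside a script (the `workload` function below is a stand-in for your own code):

```python
# Profiling a single call programmatically with the stdlib, as an
# alternative to --benchmark-cprofile. `workload` is a placeholder.
import cProfile
import io
import pstats

def workload():
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Render the 5 most expensive entries by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print("function calls" in report)  # True
```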
## Troubleshooting

### "Benchmark runs too fast"
- Increase the matrix size
- Use the `rounds` parameter to force more iterations
"Results vary too much"
- Ensure no background processes
- Increase warmup rounds
- Check CPU throttling
"Out of memory"
- Reduce matrix size
- Use smaller
k
parameter for similarity
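Why a smaller `k` helps: a top-k similarity matrix stores roughly `n_items * k` non-zeros, so memory scales linearly with `k`. A back-of-envelope estimate assuming CSR storage with float64 data and int32 indices (actual dtypes may differ):

```python
# Back-of-envelope memory estimate for a top-k similarity matrix in CSR;
# assumes float64 data and int32 indices, which may differ in practice.

def csr_topk_bytes(n_items: int, k: int) -> int:
    nnz = n_items * k             # k neighbours kept per item
    data = nnz * 8                # float64 similarity values
    indices = nnz * 4             # int32 column indices
    indptr = (n_items + 1) * 4    # int32 row pointers
    return data + indices + indptr

# 10,000 items: k=100 is ~12 MB; k=500 would be ~5x that
print(csr_topk_bytes(10_000, 100))  # 12040004
```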