🔍 Similarity Functions

SimilariPy provides a suite of similarity functions for sparse matrices, all implemented in Cython and parallelized with OpenMP. These functions compute item-to-item or user-to-user similarity based on vector math or graph-based transformations.

Similarities

Function Description
dot_product() Simple raw inner product between vectors.
cosine() Cosine similarity with optional shrinkage.
asymmetric_cosine(alpha=0.5) Asymmetric variant of cosine similarity, where alpha controls the weighting between vectors.
jaccard() Set-based similarity defined as the intersection over union.
dice() Set-based similarity defined as the overlap divided by the mean of the two vectors' sizes.
tversky(alpha=1.0, beta=1.0) Tversky similarity, a generalization of Jaccard and Dice.
p3alpha(alpha=1.0) Graph-based similarity computed as normalized matrix multiplication with alpha exponentiation.
rp3beta(alpha=1.0, beta=1.0) P3alpha variant that penalizes popular items with a beta exponent.
s_plus(l1=0.5, l2=0.5, l3=0.0, t1=1.0, t2=1.0, c1=0.5, c2=0.5, alpha=1, pop1='none', pop2='none', beta1=0.0, beta2=0.0) Hybrid similarity model that combines Tversky and cosine normalizations with RP3β-style depopularization, controlled by tunable weights. The pop1 and pop2 parameters define item popularity weights. Each accepts either a custom array of values or one of the built-in options: 'sum' (use the sum of interactions per item) or 'none' (disable popularity weighting). (default: 'none')

Common Parameters

All similarity functions in SimilariPy share the following parameters:

Parameter Description
m1 Input sparse matrix for which to calculate the similarity.
m2 Optional transpose matrix. If None, uses m1.T. (default: None)
k Number of top-k items per row. (default: 100)
h Shrinkage coefficient applied during normalization.
threshold Minimum similarity value to retain. Values below are set to zero. (default: 0)
shrink_type Shrinkage type: 'stabilized', 'bayesian', or 'additive'. (default: 'stabilized')
binary If True, binarizes the input matrix. (default: False)
target_rows List or array of row indices to compute. If None, computes for all rows. (default: None)
target_cols Subset of columns to consider before applying top-k. Can be an array (applied to all rows) or a sparse matrix (row-specific). (default: None)
filter_cols Subset of columns to exclude before applying top-k. Can be an array (applied to all rows) or a sparse matrix (row-specific). (default: None)
verbose If True, shows a progress bar. (default: True)
format_output Output format: 'coo' or 'csr'. (default: 'coo') Note: 'csr' is not currently supported on Windows.
num_threads Number of threads to use. 0 means use all available cores. (default: 0)
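
To make the k and threshold parameters concrete, here is a minimal pure-Python sketch of pruning one sparse result row. The helper name prune_row and the dict-based row are hypothetical, and the order of operations (threshold first, then top-k) is an assumption; the library's compiled code may differ in detail.

```python
import heapq

def prune_row(scores, k=100, threshold=0.0):
    """Keep at most the k largest scores of one sparse row.

    scores: dict mapping column index -> similarity value.
    Assumption: values below `threshold` are discarded before top-k.
    """
    # Drop explicit zeros and anything below the threshold.
    kept = {col: s for col, s in scores.items() if s != 0 and s >= threshold}
    # Select the k highest-scoring columns.
    top = heapq.nlargest(k, kept.items(), key=lambda item: item[1])
    return dict(top)

row = {0: 0.9, 3: 0.05, 4: 0.4, 7: 0.6}
print(prune_row(row, k=2, threshold=0.1))  # {0: 0.9, 7: 0.6}
```

With k=2 only the two strongest neighbours survive; raising k to 10 would also keep column 4, while column 3 stays below the threshold either way.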

Shrinkage Types

The shrinkage equations are shown with the cosine normalization for simplicity, but they are available in all the similarities.

Stabilized Shrinkage

\(s_{xy} = \frac{x \cdot y}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2} + h}\)

  • Prevents instability when norms are small.
  • h acts as the shrinkage strength.
  • shrink_type = 'stabilized'

Bayesian Shrinkage

\(s_{xy} = \frac{x \cdot y}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2}} \cdot \frac{x \cdot y}{x \cdot y + h}\)

  • Penalizes similarities between items whose overlap has low support.
  • h acts as the shrinkage strength.
  • shrink_type = 'bayesian'

Additive Shrinkage

\(s_{xy} = \frac{x \cdot y}{\sqrt{\sum_i (x_i^2 + h)} \cdot \sqrt{\sum_i (y_i^2 + h)}}\)

  • Penalizes similarities that involve items with low support.
  • Adds shrinkage directly into the cosine denominator norms.
  • h acts as the shrinkage strength.
  • shrink_type = 'additive'
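
The three equations above can be compared side by side with a small pure-Python sketch on dense vectors. The function name cosine_shrunk is hypothetical and the code simply transcribes the formulas as printed; it is not the library's implementation.

```python
import math

def cosine_shrunk(x, y, h=0.0, shrink_type="stabilized"):
    """Cosine similarity of two dense vectors with one of three shrinkages."""
    dot = sum(a * b for a, b in zip(x, y))
    if dot == 0:
        return 0.0  # no overlap: similarity is zero under every variant
    if shrink_type == "additive":
        # h is added to every squared component inside both norms.
        nx = math.sqrt(sum(a * a + h for a in x))
        ny = math.sqrt(sum(b * b + h for b in y))
        return dot / (nx * ny)
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if shrink_type == "stabilized":
        # h is added once to the product of the norms.
        return dot / (nx * ny + h)
    if shrink_type == "bayesian":
        # Plain cosine scaled by dot / (dot + h).
        return dot / (nx * ny) * dot / (dot + h)
    raise ValueError(f"unknown shrink_type: {shrink_type!r}")

x = [1, 1, 0, 0]
y = [1, 0, 1, 0]
for kind in ("stabilized", "bayesian", "additive"):
    print(kind, round(cosine_shrunk(x, y, h=2.0, shrink_type=kind), 4))
```

For these two vectors the unshrunk cosine is 0.5. With h = 2 the stabilized variant gives 0.25, the Bayesian variant about 0.1667, and the additive variant 0.1, which shows how differently the same h acts in each form.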

Notes

  • All similarity functions are implemented in Cython + OpenMP for high-performance computation on CSR matrices.
  • Computations are fully multi-threaded and scale with CPU cores.
  • Supports CSR and COO sparse matrix formats as output.
  • ⚠️ Windows: use format_output='coo' (CSR output is not supported on Windows due to a platform data type mismatch).

Math Equations

Dot Product

\(s_{xy} = x \cdot y\)

Cosine

\(s_{xy} = \frac{x \cdot y}{\|x\| \cdot \|y\|}\)

Asymmetric Cosine

\(s_{xy} = \frac{x \cdot y}{\left(\sum x_i^2\right)^\alpha \left(\sum y_i^2\right)^{1 - \alpha}}\)

  • α: Asymmetry coefficient ∈ [0, 1]
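
As a quick check of the role of α, the following sketch evaluates the asymmetric cosine formula on dense vectors. The function name mirrors the library's, but this is only a toy transcription of the equation above.

```python
def asymmetric_cosine(x, y, alpha=0.5):
    """Asymmetric cosine of two dense vectors, following the formula above."""
    dot = sum(a * b for a, b in zip(x, y))
    if dot == 0:
        return 0.0
    sx = sum(a * a for a in x)  # sum of squares of x
    sy = sum(b * b for b in y)  # sum of squares of y
    return dot / (sx ** alpha * sy ** (1 - alpha))

x = [1, 1, 1, 1]
y = [1, 0, 0, 0]
print(asymmetric_cosine(x, y, alpha=0.5))  # 0.5, the ordinary cosine
print(asymmetric_cosine(x, y, alpha=1.0))  # 0.25, normalized by x only
print(asymmetric_cosine(x, y, alpha=0.0))  # 1.0, normalized by y only
```

At α = 0.5 both exponents equal 1/2 and the expression reduces to the standard cosine; moving α toward 0 or 1 shifts the normalization onto one of the two vectors, so the score is no longer symmetric in x and y.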

Jaccard

\(s_{xy} = \frac{x \cdot y}{|x| + |y| - x \cdot y}\)

Dice

\(s_{xy} = \frac{x \cdot y}{\frac{1}{2}|x| + \frac{1}{2}|y|}\)

Tversky

\(s_{xy} = \frac{x \cdot y}{\alpha(|x| - x \cdot y) + \beta(|y| - x \cdot y) + x \cdot y}\)

  • α, β: Tversky coefficients ∈ [0, 1]
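
The relationship between the three set-based measures can be verified with a short sketch. It assumes binary vectors, so that |x| is the number of non-zero entries and x · y is the size of the overlap; the function below transcribes the Tversky formula and is not the library's code.

```python
def tversky(x, y, alpha=1.0, beta=1.0):
    """Tversky similarity of two binary vectors, following the formula above."""
    xy = sum(a * b for a, b in zip(x, y))  # overlap size
    nx = sum(x)                            # |x| for a binary vector
    ny = sum(y)                            # |y| for a binary vector
    denom = alpha * (nx - xy) + beta * (ny - xy) + xy
    return xy / denom if denom else 0.0

x = [1, 1, 1, 0]
y = [1, 1, 0, 1]

jaccard = tversky(x, y, alpha=1.0, beta=1.0)  # intersection over union
dice = tversky(x, y, alpha=0.5, beta=0.5)     # 2 * overlap / (|x| + |y|)
print(jaccard, dice)
```

Here the overlap is 2 and both vectors have size 3, so α = β = 1 gives the Jaccard value 2 / 4 = 0.5 and α = β = 0.5 gives the standard Dice coefficient 4 / 6 ≈ 0.667.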

P3α

\(s_{xy} = x^\alpha \cdot y^\alpha\)

  • α: P3α coefficient ∈ [0, 1]
  • Row-wise (L1) normalization is applied before exponentiation

RP3β

\(s_{xy} = \frac{x^\alpha \cdot y^\alpha}{{pop}(y)^\beta}\)

  • α: P3α coefficient ∈ [0, 1]
  • β: Popularity penalization coefficient ∈ [0, 1]
  • pop(y): Number of interactions for item y
  • Row-wise (L1) normalization is applied before exponentiation
  • Penalization is applied before the top k selection
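
The graph-based formulas can be illustrated on a tiny dense matrix. The sketch below is one reading of the equations above under stated assumptions: m1 is an item-by-user 0/1 matrix, the second operand is its transpose, each is L1-normalized by row before being raised to α, and pop(y) is the interaction count of item y. The helper names are hypothetical, and the library operates on sparse matrices instead.

```python
def l1_normalize_rows(m):
    """Divide every row by its sum (rows that sum to zero stay zero)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def rp3beta_dense(m1, alpha=1.0, beta=0.0):
    """Toy RP3-beta on a dense item-by-user matrix; beta=0 gives P3-alpha."""
    m2 = [list(col) for col in zip(*m1)]  # user-by-item transpose
    # Normalize first, then exponentiate; zeros are kept as zeros.
    p1 = [[v ** alpha if v else 0.0 for v in r] for r in l1_normalize_rows(m1)]
    p2 = [[v ** alpha if v else 0.0 for v in r] for r in l1_normalize_rows(m2)]
    pop = [sum(row) for row in m1]        # interactions per item
    n_items, n_users = len(m1), len(m2)
    sim = [[0.0] * n_items for _ in range(n_items)]
    for x in range(n_items):
        for y in range(n_items):
            s = sum(p1[x][u] * p2[u][y] for u in range(n_users))
            # Popularity penalty on the target item y.
            sim[x][y] = s / (pop[y] ** beta) if pop[y] else 0.0
    return sim

m1 = [
    [1, 1, 0],  # item 0, seen by users 0 and 1
    [1, 0, 0],  # item 1, seen by user 0
    [0, 1, 1],  # item 2, seen by users 1 and 2
]
p3 = rp3beta_dense(m1, alpha=1.0, beta=0.0)
rp3 = rp3beta_dense(m1, alpha=1.0, beta=1.0)
print(p3[0])   # item 0 vs. items 0, 1, 2 without penalty
print(rp3[0])  # the same row with the popularity penalty
```

Without the penalty, items 1 and 2 are equally similar to item 0 (0.25 each). With β = 1 the more popular item 2 drops to 0.125 while item 1 stays at 0.25, which is the depopularization effect the β exponent is meant to provide.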

S-Plus

\(s_{xy} = \frac{(x \cdot y)^\alpha}{l_1 \left(t_1(|x| - x \cdot y) + t_2(|y| - x \cdot y) + x \cdot y\right) + l_2\left(\sum x_i^2\right)^{c_1} \cdot \left(\sum y_i^2\right)^{c_2} + l_3(pop_1(x)^{\beta_1} \cdot pop_2(y)^{\beta_2})}\)

  • l1, l2: Tversky and cosine normalization strengths ∈ [0, 1]
  • l3: Popularity penalization strength ∈ [0, 1]
  • t1, t2: Tversky coefficients ∈ [0, 1]
  • c1, c2: Cosine weighting exponent ∈ [0, 1]
  • α: Coefficient for the raw interaction overlap (different from P3α)
  • β1, β2: Popularity penalization coefficient for the item x/y ∈ [0, 1]
  • pop1(x), pop2(y): Popularity values for the items x and y
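
To see how the weights interact, the S-Plus formula can be transcribed for a single pair of dense vectors. This is a sketch only: pop_x and pop_y are hypothetical scalar arguments standing in for the popularity values of x and y, |x| is taken as the sum of absolute values, and the library's sparse implementation may handle edge cases differently.

```python
def s_plus_pair(x, y, l1=0.5, l2=0.5, l3=0.0, t1=1.0, t2=1.0,
                c1=0.5, c2=0.5, alpha=1.0,
                pop_x=1.0, pop_y=1.0, beta1=0.0, beta2=0.0):
    """S-Plus similarity of two dense vectors, following the formula above."""
    xy = sum(a * b for a, b in zip(x, y))
    if xy == 0:
        return 0.0
    size_x = sum(abs(a) for a in x)  # |x|
    size_y = sum(abs(b) for b in y)  # |y|
    sq_x = sum(a * a for a in x)     # sum of squares of x
    sq_y = sum(b * b for b in y)     # sum of squares of y
    tversky_term = t1 * (size_x - xy) + t2 * (size_y - xy) + xy
    cosine_term = sq_x ** c1 * sq_y ** c2
    pop_term = pop_x ** beta1 * pop_y ** beta2
    denom = l1 * tversky_term + l2 * cosine_term + l3 * pop_term
    return xy ** alpha / denom if denom else 0.0

x = [1, 1, 1, 0]
y = [1, 1, 0, 1]
print(s_plus_pair(x, y, l1=1.0, l2=0.0))  # pure Tversky (Jaccard here): 0.5
print(s_plus_pair(x, y, l1=0.0, l2=1.0))  # pure cosine: about 0.667
print(s_plus_pair(x, y))                  # equal blend of the two terms
```

Setting l1 = 1 and l2 = 0 recovers Tversky, while l1 = 0 and l2 = 1 with c1 = c2 = 0.5 recovers cosine. The default weights blend both normalizations in the denominator, giving 2 / 3.5 ≈ 0.571 for these vectors.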