
🔍 Similarity Functions

SimilariPy provides a suite of similarity functions for sparse matrices, all implemented in Cython and parallelized with OpenMP. These functions compute item-to-item or user-to-user similarity using vector math or graph-based transformations.

Similarities

| Function | Description |
|---|---|
| dot_product() | Raw inner product between vectors. |
| cosine() | Cosine similarity with optional shrinkage. |
| asymmetric_cosine(alpha=0.5) | Asymmetric variant of cosine similarity, where alpha controls the weighting between the two vectors' norms. |
| jaccard() | Set-based similarity defined as the intersection over the union. |
| dice() | Dice (Sørensen-Dice) coefficient, a set-overlap measure closely related to Jaccard. |
| tversky(alpha=1.0, beta=1.0) | Tversky similarity, a generalization of Jaccard and Dice. |
| p3alpha(alpha=1.0) | Graph-based similarity computed as a normalized matrix multiplication with alpha exponentiation. |
| rp3beta(alpha=1.0, beta=1.0) | P3alpha variant that penalizes popular items with a beta exponent. |
| s_plus(l=0.5, t1=1.0, t2=1.0, c=0.5) | Hybrid model combining Tversky and Cosine with tunable weights. |

Common Parameters

All similarity functions in SimilariPy share the following parameters:

| Parameter | Description |
|---|---|
| m1 | Input sparse matrix for which to calculate the similarity. |
| m2 | Optional transpose matrix. If None, uses m1.T. (default: None) |
| k | Number of top-k items per row. (default: 100) |
| h | Shrinkage coefficient applied during normalization. |
| threshold | Minimum similarity value to retain. Values below are set to zero. (default: 0) |
| binary | If True, binarizes the input matrix. (default: False) |
| target_rows | List or array of row indices to compute. If None, computes for all rows. (default: None) |
| target_cols | Subset of columns to consider before applying top-k. Can be an array (applied to all rows) or a sparse matrix (row-specific). (default: None) |
| filter_cols | Subset of columns to filter before applying top-k. Can be an array (applied to all rows) or a sparse matrix (row-specific). (default: None) |
| verbose | If True, shows a progress bar. (default: True) |
| format_output | Output format: 'coo' or 'csr'. (default: 'coo') Note: 'csr' is not currently supported on Windows. |
| num_threads | Number of threads to use. 0 means use all available cores. (default: 0) |
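
To make the effect of k and threshold concrete, here is a minimal pure-Python sketch of the post-processing they describe, applied to a single row of scores. The helper `keep_top_k` is hypothetical and is not part of SimilariPy; the library performs this step internally in Cython, and the exact order in which it applies the two filters is an assumption here.

```python
def keep_top_k(row_scores, k=100, threshold=0.0):
    """Keep at most k of the largest scores that are >= threshold.

    row_scores maps a column index to its similarity score.
    Dropped entries are simply absent, as they would be in a sparse row.
    (Hypothetical helper for illustration, not a SimilariPy function.)
    """
    kept = [(col, s) for col, s in row_scores.items() if s != 0 and s >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return dict(kept[:k])


row = {0: 0.90, 3: 0.05, 7: 0.40, 9: 0.75}
top = keep_top_k(row, k=2, threshold=0.1)  # columns 0 and 9 survive
```

Column 3 is removed by the threshold, and column 7 is removed because only the two highest scores are kept.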

Notes

  • All similarity functions are implemented in Cython + OpenMP for high-performance computation on CSR matrices.
  • Computations are fully multi-threaded and scale with CPU cores.
  • Supports CSR and COO sparse matrix formats as output.
  • ⚠️ Windows: use format_output='coo' (CSR output is not supported on Windows due to a platform data type mismatch).

Math Equations

Dot Product

\(s_{xy} = x \cdot y\)

Cosine

\(s_{xy} = \frac{x \cdot y}{\|x\| \cdot \|y\| + h}\)
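
As a worked illustration of the formula above, the following sketch evaluates it on two small dense vectors with and without shrinkage. The function name `cosine_shrunk` is made up for this example; it is a plain-Python reference, not the library's implementation.

```python
import math


def cosine_shrunk(x, y, h=0.0):
    """Cosine similarity with an additive shrinkage term h in the denominator."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    denom = norm_x * norm_y + h
    return dot / denom if denom else 0.0


x = [3.0, 4.0]
y = [4.0, 3.0]
plain = cosine_shrunk(x, y)          # 24 / 25
shrunk = cosine_shrunk(x, y, h=5.0)  # 24 / 30
```

A positive h pulls every score toward zero, and the effect is strongest for vectors with small norms, which is why shrinkage is typically used to discount similarities supported by few interactions.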

Asymmetric Cosine

\(s_{xy} = \frac{x \cdot y}{\left(\sum x_i^2\right)^\alpha \left(\sum y_i^2\right)^{1 - \alpha} + h}\)

  • α: Asymmetry coefficient ∈ [0, 1]
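
The sketch below evaluates the asymmetric cosine formula on a small example to show that swapping the arguments changes the score when α ≠ 0.5. The function is a plain-Python reference written for this page, not SimilariPy code.

```python
def asymmetric_cosine(x, y, alpha=0.5, h=0.0):
    """Asymmetric cosine: the squared norms are weighted by alpha and 1 - alpha."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    denom = sq_x ** alpha * sq_y ** (1 - alpha) + h
    return dot / denom if denom else 0.0


x = [1.0, 1.0, 1.0, 1.0]  # "long" vector, squared norm 4
y = [1.0, 0.0, 0.0, 0.0]  # "short" vector, squared norm 1

symmetric = asymmetric_cosine(x, y, alpha=0.5)  # equals ordinary cosine
x_to_y = asymmetric_cosine(x, y, alpha=1.0)     # only x's norm counts
y_to_x = asymmetric_cosine(y, x, alpha=1.0)     # only y's norm counts
```

With α = 0.5 the formula reduces to ordinary cosine similarity; with α = 1 the score is normalized by the first vector alone, so the direction of the comparison matters.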

Jaccard

\(s_{xy} = \frac{x \cdot y}{|x| + |y| - x \cdot y + h}\)

Dice

\(s_{xy} = \frac{x \cdot y}{\frac{1}{2}|x| + \frac{1}{2}|y| + h}\)

Tversky

\(s_{xy} = \frac{x \cdot y}{\alpha(|x| - x \cdot y) + \beta(|y| - x \cdot y) + x \cdot y + h}\)

  • α, β: Tversky coefficients ∈ [0, 1]
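
To see how Tversky generalizes the other set-based measures, this sketch evaluates the Tversky formula on two small sets, where x · y is the size of the intersection and |x| is the set size. It assumes binary data; the function is an illustrative reference, not the library's code.

```python
def tversky(x, y, alpha=1.0, beta=1.0, h=0.0):
    """Tversky index on two Python sets, with shrinkage h."""
    inter = len(x & y)
    denom = alpha * (len(x) - inter) + beta * (len(y) - inter) + inter + h
    return inter / denom if denom else 0.0


x = {"a", "b", "c"}
y = {"b", "c", "d", "e"}

jaccard_like = tversky(x, y, alpha=1.0, beta=1.0)  # 2 / (3 + 4 - 2)
dice_like = tversky(x, y, alpha=0.5, beta=0.5)     # 2 / (1.5 + 2)
```

Setting α = β = 1 gives the Jaccard index, and α = β = 0.5 gives the standard Sørensen-Dice coefficient 2|x ∩ y| / (|x| + |y|). Unequal α and β make the measure asymmetric.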

P3α

\(s_{xy} = x^\alpha \cdot y^\alpha\)

  • α: P3α coefficient ∈ [0, 1]
  • Row-wise L1 normalization is applied before exponentiation

RP3β

\(s_{xy} = \frac{x^\alpha \cdot y^\alpha}{\mathrm{pop}(y)^\beta}\)

  • α: P3α coefficient ∈ [0, 1]
  • β: Popularity penalization coefficient ∈ [0, 1]
  • pop(y): Number of interactions for item y
  • Row-wise L1 normalization is applied before exponentiation
  • Penalization is applied before the top-k selection
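
The P3α and RP3β steps listed above can be sketched on a tiny dense matrix. This is a toy reference under stated assumptions: m1 is an items × users matrix, m2 is its users × items transpose, and popularity is taken as the column sum of m2. The names `l1_normalize_rows` and `rp3beta_dense` are hypothetical, and the real functions operate on sparse matrices in Cython.

```python
def l1_normalize_rows(m):
    """Divide each row by its sum (rows that sum to zero stay zero)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def rp3beta_dense(m1, m2, alpha=1.0, beta=1.0):
    """Toy RP3beta: m1 is (items x users), m2 is (users x items).

    With beta=0 the popularity penalty disappears and this is P3alpha.
    """
    pop = [sum(col) for col in zip(*m2)]  # interactions per target item
    # Exponentiate only the non-zero entries, as a sparse implementation would.
    a = [[v ** alpha if v else 0.0 for v in row] for row in l1_normalize_rows(m1)]
    b = [[v ** alpha if v else 0.0 for v in row] for row in l1_normalize_rows(m2)]
    sim = []
    for row in a:
        scores = []
        for j in range(len(pop)):
            s = sum(row[u] * b[u][j] for u in range(len(b)))
            scores.append(s / pop[j] ** beta if pop[j] else 0.0)
        sim.append(scores)
    return sim


# 3 users x 3 items; item 0 is the most popular, item 2 the least.
urm = [[1, 1, 0],
       [1, 1, 1],
       [1, 0, 0]]
item_user = [list(col) for col in zip(*urm)]

p3 = rp3beta_dense(item_user, urm, alpha=1.0, beta=0.0)
rp3 = rp3beta_dense(item_user, urm, alpha=1.0, beta=1.0)
```

Starting from item 1, the unpenalized scores rank popular item 0 above niche item 2, while with β = 1 the ranking flips. This is the popularity correction that RP3β adds to P3α.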

S-Plus

\(s_{xy} = \frac{x \cdot y}{l \left(t_1(|x| - x \cdot y) + t_2(|y| - x \cdot y) + x \cdot y\right) + (1 - l)\left(\sum x_i^2\right)^c \left(\sum y_i^2\right)^{1 - c} + h}\)

  • l: Balance between Tversky and Cosine parts ∈ [0, 1]
  • t1, t2: Tversky coefficients ∈ [0, 1]
  • c: Cosine weighting exponent ∈ [0, 1]
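
The following sketch evaluates the S-Plus formula directly to show how l interpolates between its two parts. It assumes binary vectors, so that |x| equals the sum of x_i²; the function is a plain-Python reference for the equation above, not the library's implementation.

```python
def s_plus(x, y, l=0.5, t1=1.0, t2=1.0, c=0.5, h=0.0):
    """S-Plus on binary vectors: a blend of a Tversky and a cosine denominator."""
    xy = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)  # equals |x| for binary data (assumption)
    sq_y = sum(b * b for b in y)
    tversky_part = t1 * (sq_x - xy) + t2 * (sq_y - xy) + xy
    cosine_part = sq_x ** c * sq_y ** (1 - c)
    denom = l * tversky_part + (1 - l) * cosine_part + h
    return xy / denom if denom else 0.0


x = [1, 1, 1, 0, 0]
y = [0, 1, 1, 1, 1]

as_tversky = s_plus(x, y, l=1.0)  # pure Tversky part (Jaccard when t1 = t2 = 1)
as_cosine = s_plus(x, y, l=0.0)   # pure (asymmetric) cosine part
blended = s_plus(x, y, l=0.5)     # halfway between the two denominators
```

With l = 1 the formula collapses to Tversky and with l = 0 to asymmetric cosine with exponent c, so intermediate values of l trade off between the two.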