🔍 Similarity Functions

SimilariPy provides a suite of similarity functions for sparse matrixes, all implemented in Cython and parallelized with OpenMP. These models compute item-to-item or user-to-user similarity based on vector math or graph-based transformations.

Similarities

Function	Description
`dot_product()`	Simple raw inner product between vectors.
`cosine()`	Cosine similarity with optional shrinkage.
`asymmetric_cosine(alpha=0.5)`	Asymmetric variant of cosine similarity, where `alpha` controls the weighting between vectors.
`jaccard()`	Set-based similarity defined as the intersection over union.
`dice()`	Harmonic mean of two vectors' lengths.
`tversky(alpha=1.0, beta=1.0)`	Tversky similarity, a generalization of Jaccard and Dice.
`p3alpha(alpha=1.0)`	Graph-based similarity computed as normalized matrix multiplication with `alpha` exponentiation.
`rp3beta(alpha=1.0, beta=1.0)`	P3alpha variant that penalizes popular items with a `beta` exponent.
`s_plus(l=0.5, t1=1.0, t2=1.0, c=0.5)`	Hybrid model combining Tversky and Cosine with tunable weights.

Common Parameters

All similarity functions in Similaripy share the following parameters:

Parameter	Description
`m1`	Input sparse matrix for which to calculate the similarity.
`m2`	Optional transpose matrix. If `None`, uses `m1.T`. (default: `None`)
`k`	Number of top-k items per row. (default: `100`)
`h`	Shrinkage coefficient applied during normalization.
`threshold`	Minimum similarity value to retain. Values below are set to zero. (default: `0`)
`binary`	If `True`, binarizes the input matrix. (default: `False`)
`target_rows`	List or array of row indices to compute. If `None`, computes for all rows. (default: `None`)
`target_cols`	Subset of columns to consider before applying top-k. Can be an array (applied to all rows) or a sparse matrix (row-specific). (default: `None`)
`filter_cols`	Subset of columns to filter before applying top-k. Can be an array (applied to all rows) or a sparse matrix (row-specific). (default: `None`)
`verbose`	If `True`, shows a progress bar. (default: `True`)
`format_output`	Output format: `'coo'` or `'csr'`. (default: `'coo'`) Note: `'csr'` not currently supported on Windows.
`num_threads`	Number of threads to use. `0` means use all available cores. (default: `0`)

Notes

All similarity functions are implemented in Cython + OpenMP for high-performance computation on CSR matrixes.
Computations are fully multi-threaded and scale with CPU cores.
Supports CSR and COO sparse matrix formats as output.
⚠️ Windows: use format_output='coo' (CSR output is not supported on Windows due to a platform data type mismatch).

Math Equations

Dot Product

\(s_{xy} = x \cdot y\)

Cosine

\(s_{xy} = \frac{x \cdot y}{\|x\| \cdot \|y\| + h}\)

Asymmetric Cosine

\(s_{xy} = \frac{x \cdot y}{\left(\sum x_i^2\right)^\alpha \left(\sum y_i^2\right)^{1 - \alpha} + h}\)

α: Asymmetry coefficient ∈ [0, 1]

Jaccard

\(s_{xy} = \frac{x \cdot y}{|x| + |y| - x \cdot y + h}\)

Dice

\(s_{xy} = \frac{x \cdot y}{\frac{1}{2}|x| + \frac{1}{2}|y| - x \cdot y + h}\)

Tversky

\(s_{xy} = \frac{x \cdot y}{\alpha(|x| - x \cdot y) + \beta(|y| - x \cdot y) + x \cdot y + h}\)

α, β: Tversky coefficients ∈ [0, 1]

P3α

\(s_{xy} = x^\alpha \cdot y^\alpha\)

α: P3α coefficient ∈ [0, 1]
Normalizion row-wise (L1) is applied before exponentiation

RP3β

\(s_{xy} = \frac{x^\alpha \cdot y^\alpha}{{pop}(y)^\beta}\)

α: P3α coefficient ∈ [0, 1]
β: Popularity penalization coefficient ∈ [0, 1]
pop(j) Number of interactions for item j
Normalizion row-wise (L1) is applied before exponentiation
Penalization is applied before the top k selection

S-Plus

\(s_{xy} = \frac{x \cdot y}{l \left(t_1(|x| - x \cdot y) + t_2(|y| - x \cdot y) + x \cdot y\right) + (1 - l)\left(\sum x_i^2\right)^c \left(\sum y_i^2\right)^{1 - c} + h}\)

l: Balance between Tversky and Cosine parts ∈ [0, 1]
t1, t2: Tversky coefficients ∈ [0, 1]
c: Cosine weighting exponent ∈ [0, 1]