🔍 Similarity Functions
SimilariPy provides a suite of similarity functions for sparse matrices, all implemented in Cython and parallelized with OpenMP. These functions compute item-to-item or user-to-user similarity based on vector math or graph-based transformations.
Similarities
| Function | Description |
|---|---|
| `dot_product()` | Simple raw inner product between vectors. |
| `cosine()` | Cosine similarity with optional shrinkage. |
| `asymmetric_cosine(alpha=0.5)` | Asymmetric variant of cosine similarity, where `alpha` controls the weighting between vectors. |
| `jaccard()` | Set-based similarity defined as the intersection over union. |
| `dice()` | Set-based similarity defined as twice the intersection over the sum of the two set sizes. |
| `tversky(alpha=1.0, beta=1.0)` | Tversky similarity, a generalization of Jaccard and Dice. |
| `p3alpha(alpha=1.0)` | Graph-based similarity computed as normalized matrix multiplication with `alpha` exponentiation. |
| `rp3beta(alpha=1.0, beta=1.0)` | P3alpha variant that penalizes popular items with a `beta` exponent. |
| `s_plus(l1=0.5, l2=0.5, l3=0.0, t1=1.0, t2=1.0, c1=0.5, c2=0.5, alpha=1, pop1='none', pop2='none', beta1=0.0, beta2=0.0)` | Hybrid similarity model that combines Tversky and cosine normalizations with RP3Beta-style depopularization, controlled by tunable weights. The `pop1` and `pop2` parameters define item popularity weights and may be provided as custom arrays of arbitrary values, or initialized with a built-in option: `'sum'` (use the sum of interactions per item) or `'none'` (disable popularity weighting; the default). |
Common Parameters
All similarity functions in SimilariPy share the following parameters:
| Parameter | Description |
|---|---|
| `m1` | Input sparse matrix for which to calculate the similarity. |
| `m2` | Optional transpose matrix. If `None`, uses `m1.T`. (default: `None`) |
| `k` | Number of top-k items per row. (default: `100`) |
| `h` | Shrinkage coefficient applied during normalization. |
| `threshold` | Minimum similarity value to retain. Values below are set to zero. (default: `0`) |
| `shrink_type` | Shrinkage type: `'stabilized'`, `'bayesian'`, or `'additive'`. (default: `'stabilized'`) |
| `binary` | If `True`, binarizes the input matrix. (default: `False`) |
| `target_rows` | List or array of row indices to compute. If `None`, computes for all rows. (default: `None`) |
| `target_cols` | Subset of columns to consider before applying top-k. Can be an array (applied to all rows) or a sparse matrix (row-specific). (default: `None`) |
| `filter_cols` | Subset of columns to filter before applying top-k. Can be an array (applied to all rows) or a sparse matrix (row-specific). (default: `None`) |
| `verbose` | If `True`, shows a progress bar. (default: `True`) |
| `format_output` | Output format: `'coo'` or `'csr'`. (default: `'coo'`) Note: `'csr'` is not currently supported on Windows. |
| `num_threads` | Number of threads to use. `0` means use all available cores. (default: `0`) |
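To make the effect of `k` and `threshold` on a single output row concrete, here is a minimal pure-Python sketch. The helper `prune_row` is hypothetical and is not part of SimilariPy; it only mirrors the parameter descriptions in the table above.

```python
def prune_row(scores, k=100, threshold=0.0):
    """Keep the k largest scores of one row and drop values below threshold.

    `scores` maps column index -> similarity value. Hypothetical helper for
    illustration only; this is not SimilariPy's implementation.
    """
    # Dropping an entry is the sparse equivalent of setting it to zero.
    kept = {col: s for col, s in scores.items() if s >= threshold}
    # Sort by score, highest first, and keep the first k entries.
    top = sorted(kept.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top)


row = {0: 0.9, 1: 0.05, 2: 0.4, 3: 0.7}
pruned = prune_row(row, k=2, threshold=0.1)
print(pruned)
```

Column 1 falls below the threshold and column 2 is cut by the top-k limit, so only columns 0 and 3 survive.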
Shrinkage Types
The shrinkage equations below are shown with cosine normalization for simplicity; the same shrinkage types are available in all the similarity functions.
Stabilized Shrinkage
\(s_{xy} = \frac{x \cdot y}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2} + h}\)
- Prevents instability when norms are small.
- `h` acts as the shrinkage strength.
- `shrink_type = 'stabilized'`
Bayesian Shrinkage
\(s_{xy} = \frac{x \cdot y}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2}} \cdot \frac{x \cdot y}{x \cdot y + h}\)
- Penalizes similarities between items with low overlap support.
- `h` acts as the shrinkage strength.
- `shrink_type = 'bayesian'`
Additive Shrinkage
\(s_{xy} = \frac{x \cdot y}{\sqrt{\sum_i (x_i^2 + h)} \cdot \sqrt{\sum_i (y_i^2 + h)}}\)
- Penalizes similarities between items with low support.
- Adds shrinkage directly into the cosine denominator norms.
- `h` acts as the shrinkage strength.
- `shrink_type = 'additive'`
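The three shrinkage equations can be compared side by side on a pair of small dense vectors. This is a minimal sketch that follows the formulas as written in this section; the function name `shrunk_cosine` is hypothetical, and the additive branch adds `h` to every component (including zeros), which a sparse implementation may handle differently.

```python
import math


def shrunk_cosine(x, y, h=0.0, shrink_type="stabilized"):
    """Cosine similarity of two dense vectors with one shrinkage variant."""
    dot = sum(a * b for a, b in zip(x, y))
    if dot == 0:
        return 0.0  # no overlap: the similarity is zero under every variant
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    if shrink_type == "stabilized":
        # h is added to the product of the two norms.
        return dot / (math.sqrt(sq_x) * math.sqrt(sq_y) + h)
    if shrink_type == "bayesian":
        # Plain cosine, scaled by dot / (dot + h).
        cosine = dot / (math.sqrt(sq_x) * math.sqrt(sq_y))
        return cosine * dot / (dot + h)
    if shrink_type == "additive":
        # Formula as written: h is added to each squared component.
        norm_x = math.sqrt(sq_x + h * len(x))
        norm_y = math.sqrt(sq_y + h * len(y))
        return dot / (norm_x * norm_y)
    raise ValueError(f"unknown shrink_type: {shrink_type!r}")


x = [1, 1, 0, 0]
y = [1, 0, 1, 0]
results = {
    kind: shrunk_cosine(x, y, h=2.0, shrink_type=kind)
    for kind in ("stabilized", "bayesian", "additive")
}
print(results)
```

With `h=0` all three branches reduce to the plain cosine (0.5 for these vectors); with `h=2.0` each variant lowers the score by a different amount.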
Notes
- All similarity functions are implemented in Cython + OpenMP for high-performance computation on CSR matrices.
- Computations are fully multi-threaded and scale with CPU cores.
- Supports CSR and COO sparse matrix formats as output.
- ⚠️ Windows: use `format_output='coo'` (CSR output is not supported on Windows due to a platform data type mismatch).
Math Equations
Dot Product
\(s_{xy} = x \cdot y\)
Cosine
\(s_{xy} = \frac{x \cdot y}{\|x\| \cdot \|y\|}\)
Asymmetric Cosine
\(s_{xy} = \frac{x \cdot y}{\left(\sum x_i^2\right)^\alpha \left(\sum y_i^2\right)^{1 - \alpha}}\)
α: Asymmetry coefficient ∈ [0, 1]
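The role of α is easiest to see on two vectors of very different sizes. The sketch below evaluates the asymmetric cosine formula directly on dense lists; the helper name `asym_cosine_pair` is hypothetical and is not the library function.

```python
def asym_cosine_pair(x, y, alpha=0.5):
    """Asymmetric cosine of two dense vectors, following the formula above."""
    dot = sum(a * b for a, b in zip(x, y))
    if dot == 0:
        return 0.0
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    # alpha weights the squared norm of x, (1 - alpha) that of y.
    return dot / (sq_x ** alpha * sq_y ** (1 - alpha))


big = [1, 1, 1, 1]    # vector with many interactions
small = [1, 0, 0, 0]  # vector with a single interaction

symmetric = asym_cosine_pair(big, small, alpha=0.5)  # plain cosine
big_to_small = asym_cosine_pair(big, small, alpha=1.0)
small_to_big = asym_cosine_pair(small, big, alpha=1.0)
print(symmetric, big_to_small, small_to_big)
```

At α = 0.5 the score equals the ordinary cosine and is symmetric; at α = 1 only the first vector's norm is used, so swapping the arguments changes the result.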
Jaccard
\(s_{xy} = \frac{x \cdot y}{|x| + |y| - x \cdot y}\)
Dice
\(s_{xy} = \frac{x \cdot y}{\frac{1}{2}|x| + \frac{1}{2}|y|}\)
Tversky
\(s_{xy} = \frac{x \cdot y}{\alpha(|x| - x \cdot y) + \beta(|y| - x \cdot y) + x \cdot y}\)
α,β: Tversky coefficients ∈ [0, 1]
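Tversky's relation to the two set-based measures can be checked on binary data, where \(|x|\) is the number of non-zero entries and \(x \cdot y\) is the size of the intersection. The following sketch represents each binary vector as a Python set of its non-zero indices; `tversky_sets` is a hypothetical helper, not the library function.

```python
def tversky_sets(x, y, alpha=1.0, beta=1.0):
    """Tversky index of two binary vectors given as sets of non-zero indices."""
    inter = len(x & y)  # plays the role of x . y
    denom = alpha * (len(x) - inter) + beta * (len(y) - inter) + inter
    return inter / denom if denom else 0.0


x = {1, 2, 3, 4}
y = {3, 4, 5}

jaccard = tversky_sets(x, y, alpha=1.0, beta=1.0)  # intersection over union
dice = tversky_sets(x, y, alpha=0.5, beta=0.5)     # 2 * intersection / (|x| + |y|)
print(jaccard, dice)
```

Setting α = β = 1 gives the Jaccard index, and α = β = 0.5 gives the Dice coefficient.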
P3α
\(s_{xy} = x^\alpha \cdot y^\alpha\)
- α: P3α coefficient ∈ [0, 1]
- Row-wise (L1) normalization is applied before exponentiation
RP3β
\(s_{xy} = \frac{x^\alpha \cdot y^\alpha}{{pop}(y)^\beta}\)
- α: P3α coefficient ∈ [0, 1]
- β: Popularity penalization coefficient ∈ [0, 1]
- pop(y): Number of interactions for item y
- Row-wise (L1) normalization is applied before exponentiation
- Penalization is applied before the top-k selection
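The P3α and RP3β steps can be traced on a tiny dense matrix. This is a sketch under stated assumptions, not the library's implementation: it takes a user-by-item matrix, uses its transpose as the first operand and the matrix itself as the second, L1-normalizes the rows of both, and only then applies the exponent. The names `l1_normalize_rows` and `rp3beta_dense` are hypothetical.

```python
def l1_normalize_rows(m):
    """Divide every row by its sum (rows that sum to zero stay zero)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def rp3beta_dense(urm, alpha=1.0, beta=0.0):
    """Item-item RP3beta scores for a dense user x item matrix.

    With beta=0 this reduces to P3alpha. Assumed operand layout: items x users
    on the left, users x items on the right.
    """
    n_users, n_items = len(urm), len(urm[0])
    urm_t = [[urm[u][i] for u in range(n_users)] for i in range(n_items)]
    p_iu = l1_normalize_rows(urm_t)  # item -> user transition weights
    p_ui = l1_normalize_rows(urm)    # user -> item transition weights
    # pop(y): number of interactions recorded for item y.
    pop = [sum(1 for u in range(n_users) if urm[u][i]) for i in range(n_items)]
    sim = [[0.0] * n_items for _ in range(n_items)]
    for x in range(n_items):
        for y in range(n_items):
            score = 0.0
            for u in range(n_users):
                # Skip zeros so that 0 ** 0 never contributes a spurious 1.
                if p_iu[x][u] and p_ui[u][y]:
                    score += p_iu[x][u] ** alpha * p_ui[u][y] ** alpha
            sim[x][y] = score / pop[y] ** beta if pop[y] else 0.0
    return sim


# Three users, three items; item 0 is the popular one.
urm = [[1, 1, 0],
       [1, 0, 1],
       [1, 0, 0]]
p3 = rp3beta_dense(urm, alpha=1.0, beta=0.0)
rp3 = rp3beta_dense(urm, alpha=1.0, beta=1.0)
print(p3[1], rp3[1])
```

For item 1, the unpenalized scores toward items 0 and 1 are equal; with β = 1 the score toward the popular item 0 is divided by its three interactions, while the score toward item 1 is unchanged.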
S-Plus
\(s_{xy} = \frac{(x \cdot y)^\alpha}{l_1 \left(t_1(|x| - x \cdot y) + t_2(|y| - x \cdot y) + x \cdot y\right) + l_2\left(\sum x_i^2\right)^{c_1} \cdot \left(\sum y_i^2\right)^{c_2} + l_3(pop_1(x)^{\beta_1} \cdot pop_2(y)^{\beta_2})}\)
- l1, l2: Tversky and cosine normalization strengths ∈ [0, 1]
- l3: Popularity penalization strength ∈ [0, 1]
- t1, t2: Tversky coefficients ∈ [0, 1]
- c1, c2: Cosine weighting exponents ∈ [0, 1]
- α: Coefficient for the raw interaction overlap (different from P3α)
- β1, β2: Popularity penalization coefficients for items x and y ∈ [0, 1]
- pop1(x), pop2(y): Popularity values for items x and y
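Because the S-Plus denominator is a weighted sum of three terms, switching individual weights off recovers the simpler measures. The sketch below evaluates the formula for one pair of dense vectors; `s_plus_pair` is a hypothetical helper, \(|x|\) is assumed to be the L1 norm (the number of non-zeros for binary data), and the popularity values are passed in directly rather than derived from a matrix.

```python
def s_plus_pair(x, y, l1=0.5, l2=0.5, l3=0.0, t1=1.0, t2=1.0,
                c1=0.5, c2=0.5, alpha=1.0,
                pop_x=1.0, pop_y=1.0, beta1=0.0, beta2=0.0):
    """S-Plus score of two dense vectors, following the formula above."""
    dot = sum(a * b for a, b in zip(x, y))
    if dot == 0:
        return 0.0
    size_x = sum(abs(a) for a in x)  # |x|, assumed to be the L1 norm
    size_y = sum(abs(b) for b in y)
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    tversky_term = t1 * (size_x - dot) + t2 * (size_y - dot) + dot
    cosine_term = sq_x ** c1 * sq_y ** c2
    popularity_term = pop_x ** beta1 * pop_y ** beta2
    denom = l1 * tversky_term + l2 * cosine_term + l3 * popularity_term
    return dot ** alpha / denom if denom else 0.0


x = [1, 1, 1, 1, 0]
y = [0, 0, 1, 1, 1]

as_jaccard = s_plus_pair(x, y, l1=1.0, l2=0.0)  # Tversky term only, t1 = t2 = 1
as_cosine = s_plus_pair(x, y, l1=0.0, l2=1.0)   # cosine term only, c1 = c2 = 0.5
blended = s_plus_pair(x, y)                     # default equal mix of both terms
print(as_jaccard, as_cosine, blended)
```

With only the Tversky term active the score equals the Jaccard index of the two vectors, and with only the cosine term active it equals their cosine similarity; the default weights give a value between the two.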