Semantic Similarity
Semantic similarity measures how closely products are related in meaning — based on their text descriptions, attributes, and metadata — rather than behavioral signals.
How it is calculated
Each product's description and metadata are encoded into a dense vector (embedding) using multilingual transformer models (e.g., SBERT, FastText). The algorithm then computes the cosine similarity between two product vectors:
$$\text{similarity}(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$$

Where:
- v_i, v_j = product embedding vectors
- Similarity ∈ [0, 1], with 1 meaning semantically identical.
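A minimal sketch of this computation, assuming a SentenceTransformer model is used for encoding; the model name, preprocessing, and helper functions shown here are illustrative assumptions, not the production implementation.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Example multilingual embedding model (assumed for illustration).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def cosine_similarity(v_i: np.ndarray, v_j: np.ndarray) -> float:
    """Cosine similarity between two product embedding vectors."""
    return float(np.dot(v_i, v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j)))

def product_similarity(text_a: str, text_b: str) -> float:
    """Encode two product descriptions and compare their embeddings."""
    v_i, v_j = model.encode([text_a, text_b])
    return cosine_similarity(v_i, v_j)
```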
Example
Source product: “Nike Air Zoom Pegasus 40 running shoe, red” → Semantically similar products might include:
“Adidas Ultraboost 22 running shoe, blue” (same purpose and category)
“Salomon Speedcross trail shoe” (related usage, similar function)
❌ “Red dress” (similar color but irrelevant meaning — model filters this out).
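Continuing the sketch above (reusing the hypothetical `product_similarity` helper), the same example can be reproduced by ranking candidate products against the source description; semantically unrelated items such as the red dress receive a noticeably lower score.

```python
source = "Nike Air Zoom Pegasus 40 running shoe, red"
candidates = [
    "Adidas Ultraboost 22 running shoe, blue",
    "Salomon Speedcross trail shoe",
    "Red dress",
]

# Rank candidates by semantic similarity to the source product.
scores = sorted(
    ((text, product_similarity(source, text)) for text in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
for text, score in scores:
    print(f"{score:.2f}  {text}")
```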
Multilingual model
The embedding model is trained on 109 languages, including those without spaces (Japanese, Chinese, Thai, etc.), allowing semantic matching across all markets. This differs from the Content Interest Criterion used in the Segment Builder, which only supports Western languages.
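Because the embedding space is shared across languages, descriptions in different languages can be compared directly, with no translation step. A hedged example, again reusing the `product_similarity` helper from the sketch above with an assumed multilingual model:

```python
# Cross-lingual check: an English and a Japanese description of the same
# product should score highly despite sharing no surface-level tokens.
score = product_similarity(
    "Nike Air Zoom Pegasus 40 running shoe, red",
    "ナイキ エア ズーム ペガサス 40 ランニングシューズ レッド",
)
print(f"{score:.2f}")
```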
Key takeaways
Works from day one — no behavioral data needed.
Ideal for new, long-tail, or low-traffic products.
Enables “Similar products” and “Alternative discovery” strategies.