Exploring Various Matching Techniques Used in E-commerce Advertising Systems

1 Understanding the Matching Stage
In the cascade architecture of industrial advertising systems, the matching phase, also known as the candidate retrieval stage, serves as the foundational step in the ad selection process. Its main objective is to efficiently retrieve a large but relevant subset of ads from an enormous inventory—often containing billions of ad items—to form a candidate pool for further refinement.
This phase is distinct from the ranking phase, which comes later in the pipeline. While ranking focuses on scoring and ordering a smaller, more curated list of ads based on predicted performance (e.g., click-through rate or conversion likelihood), the matching phase is primarily about surfacing ads that are broadly aligned with the user’s intent, context, or historical behavior.
Role and Characteristics of the Matching Phase:
Scale: Operates at massive scale, scanning the entire ad corpus across potentially billions of items.
Speed: Must deliver results within milliseconds to support real-time ad serving requirements.
Personalization: While not as fine-grained as the ranking stage, it should incorporate broad personalization signals like user interests, browsing history, or coarse-grained demographic data.
Precision vs. Recall: Prioritizes high recall—retrieving a wide range of possibly relevant ads to avoid missing good candidates. Fine-tuned precision is handled downstream during ranking.
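To make this division of labor concrete, here is a minimal sketch of the two-stage cascade in Python. Everything in it is a hypothetical stand-in: the cheap relevance test in the matching stage and the `predicted_ctr` field in the ranking stage are placeholders for real models and feature stores.

```python
# A toy two-stage cascade: matching keeps a broad, high-recall pool cheaply;
# ranking applies a (notionally) expensive model to order the survivors.
# All scores are random placeholders, not real predictions.
import random

def match_stage(ad_corpus, top_k=1000):
    # High recall, low cost: keep any ad passing a cheap relevance test.
    return [ad for ad in ad_corpus if ad["cheap_relevance"] > 0.5][:top_k]

def rank_stage(candidates, top_k=10):
    # High precision, higher cost: order the small pool by a learned score.
    return sorted(candidates, key=lambda a: a["predicted_ctr"], reverse=True)[:top_k]

ad_corpus = [{"id": i, "cheap_relevance": random.random(),
              "predicted_ctr": random.random()} for i in range(100_000)]
final_ads = rank_stage(match_stage(ad_corpus))
```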
2 Rule-Based Matching
Rule-based matching remains one of the foundational techniques in e-commerce advertising systems, where explicit relationships and behaviors can be leveraged to infer user preferences. These approaches rely on predefined heuristics or business rules to match ads (or products) to users, often offering simplicity, interpretability, and control over the matching logic.
2.1 Fine-Grained Matching with Core Product Terms
Product categories are often too coarse-grained to capture specific user interests. For example, a user might show a preference for the "electronics" category but may actually be interested in only a small subset of products, such as wireless earbuds. To enable more precise targeting, e-commerce advertising systems often extract core product terms from item titles. These terms represent the most descriptive and meaningful keywords that define a product—such as “earbuds,” “laptop,” or “wallet.”
The system then builds a user profile based on the core product terms the user has interacted with, allowing for more fine-grained matching. In this approach, ads are selected if they contain core product terms that match the user's known interests. This allows the system to surface items that may belong to different sellers but share the same core term.
Example: A user frequently views “earbuds” → The system matches other ads that emphasize “earbuds,” even from different brands or price segments.
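A sketch of how such a rule can be served efficiently: an inverted index, built offline, maps each core product term to the ads containing it, and retrieval is a union over the terms in the user's profile. The data layout and field names below are illustrative, not taken from any particular system.

```python
# Rule-based retrieval over core product terms via an inverted index.
from collections import defaultdict

ads = [
    {"ad_id": 1, "core_terms": {"earbuds"}},
    {"ad_id": 2, "core_terms": {"laptop", "sleeve"}},
    {"ad_id": 3, "core_terms": {"earbuds", "case"}},
]

# Offline: map each core term to the set of ads that contain it.
index = defaultdict(set)
for ad in ads:
    for term in ad["core_terms"]:
        index[term].add(ad["ad_id"])

def match_by_core_terms(user_profile_terms):
    """Return ids of ads sharing at least one core term with the user."""
    candidates = set()
    for term in user_profile_terms:
        candidates |= index.get(term, set())
    return candidates

print(match_by_core_terms({"earbuds"}))  # -> {1, 3}
```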
2.2 Seller-Based Matching via Follow Relationships
Another effective rule-based strategy utilizes seller followship data. In many e-commerce platforms, users can follow specific sellers to stay updated on their latest offerings. This follow relationship is a strong, explicit signal of interest.
When a followed seller adds a new item to their catalog, the system can automatically include that item as a matching candidate for all their followers. This ensures timely exposure of new products to a highly relevant audience.
Rule: If user A follows seller B, and seller B lists a new item, push the item to A's candidate pool.
This technique not only increases user engagement but also supports sellers by giving their new listings immediate visibility.
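The rule is simple enough to state directly in code. In the sketch below, in-memory dictionaries stand in for the follower store and per-user candidate pools that a production system would keep in a database or cache.

```python
# Fan a seller's new listing out to every follower's candidate pool.
from collections import defaultdict

followers_of = {"seller_B": {"user_A", "user_C"}}  # seller -> followers
candidate_pool = defaultdict(list)                 # user -> candidate items

def on_new_listing(seller_id, item_id):
    # Rule: if a user follows the seller, push the new item to their pool.
    for user_id in followers_of.get(seller_id, ()):
        candidate_pool[user_id].append(item_id)

on_new_listing("seller_B", "item_42")
print(candidate_pool["user_A"])  # -> ['item_42']
```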
3 Embedding-Based Similarity Matching
Embedding-based similarity matching is a widely adopted technique in advertising and recommendation systems. The core idea is to represent both users and items (ads, products, or videos) as dense, high-dimensional embedding vectors in a shared semantic space. These vectors capture complex relationships based on user behavior, item features, and contextual signals.
Once the embedding models are trained, a nearest neighbor search is performed between the user embedding and item embeddings to retrieve a set of candidate items that are most similar to the user in that vector space. This technique offers both scalability and personalization, making it highly suitable for large-scale systems.
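Before turning to concrete models, here is the retrieval step in its simplest form: a brute-force dot-product scan of the user vector against every item embedding, with random vectors standing in for the output of trained models. The approximate variant used at production scale is discussed in Section 3.1.

```python
# Brute-force nearest neighbor retrieval in a shared embedding space.
import numpy as np

rng = np.random.default_rng(0)
item_embs = rng.normal(size=(100_000, 64)).astype("float32")  # item corpus
user_emb = rng.normal(size=64).astype("float32")              # one user

scores = item_embs @ user_emb       # dot-product similarity to every item
top_k = np.argsort(-scores)[:500]   # indices of the 500 best candidates
```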
3.1 YouTubeDNN
One of the most influential architectures in this domain is YouTubeDNN [1], originally designed for large-scale video recommendation. YouTubeDNN formulates the recommendation task as an extreme multi-class classification problem, where the objective is to predict the probability of a user watching a specific video among millions of videos.
Given a user \(U\) and context \(C\), the goal is to model the probability that the video watched at time \(t\), denoted \(w_t\), is a specific video \(i\) from the corpus:
$$P(w_t=i \mid U,C) = \frac{e^{v_i^{\top} u}}{\sum_{j\in V} e^{v_j^{\top} u}}$$
where
\(u\): The embedding vector of the user-context pair.
\(v_j\): The embedding vector of candidate video \(j\).
\(V\): The full set of videos in the corpus.
Here, the model uses a softmax classifier over all candidate items, learning user embeddings that are highly discriminative across millions of options.
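The softmax above translates directly into a few lines of numpy; the embeddings below are random stand-ins for learned vectors. Keep in mind that at training time the paper avoids normalizing over millions of classes by sampling negative classes from the corpus rather than computing the full sum.

```python
# Numerically stable softmax over the corpus: P(w_t = i | U, C).
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(1_000, 32))  # one embedding v_j per video in the corpus
u = rng.normal(size=32)           # user-context embedding from the network

logits = V @ u                    # v_j . u for every video j
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()
p_video_i = probs[42]             # P(w_t = 42 | U, C)
```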
Inspired by the Continuous Bag-of-Words (CBOW) model from NLP, YouTubeDNN:
Learns a fixed embedding for each video in the catalog.
Builds user embeddings by aggregating signals from their interaction history, along with contextual features like time, device type, or location.
Passes these features through a deep feedforward neural network to generate the user vector (see the sketch below).
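A sketch of that user tower, assuming PyTorch; the layer sizes, feature dimensions, and pooling choice are illustrative rather than the paper's exact configuration.

```python
# User tower: average-pool watch-history embeddings (CBOW-style), append
# context features, and map through an MLP to the user vector u.
import torch
import torch.nn as nn

class UserTower(nn.Module):
    def __init__(self, num_videos=1_000_000, emb_dim=64, ctx_dim=8):
        super().__init__()
        self.video_emb = nn.Embedding(num_videos, emb_dim)  # per-video table
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + ctx_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, watch_history, context):
        # watch_history: (batch, n_watched) video ids; context: (batch, ctx_dim)
        pooled = self.video_emb(watch_history).mean(dim=1)  # CBOW-style average
        return self.mlp(torch.cat([pooled, context], dim=-1))

tower = UserTower()
u = tower(torch.randint(0, 1_000_000, (4, 10)), torch.randn(4, 8))  # (4, 64)
```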
At inference time (during serving), the system:
Computes the user embedding using the trained model and current context.
Performs an Approximate Nearest Neighbor (ANN) search to find the top-k most similar item embeddings based on vector similarity (typically cosine or dot product).
This architecture allows for highly scalable and responsive retrieval, even when the item corpus is extremely large.
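At serving time, the ANN step might look like the sketch below, which assumes the faiss library. Note that `IndexFlatIP` is actually an exact inner-product index, used here for brevity; a deployment at corpus scale would swap in an approximate index such as IVF or HNSW.

```python
# Top-k retrieval by inner product: index item embeddings offline, then
# search with the freshly computed user embedding at request time.
import numpy as np
import faiss

d = 64
rng = np.random.default_rng(0)
item_embs = rng.normal(size=(100_000, d)).astype("float32")

index = faiss.IndexFlatIP(d)  # exact inner product; swap for IVF/HNSW at scale
index.add(item_embs)

user_emb = rng.normal(size=(1, d)).astype("float32")
scores, ids = index.search(user_emb, 500)  # top-500 candidate item ids
```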
3.2 DSSM
Deep Structured Semantic Models (DSSM) [2] were originally proposed for semantic matching in web search, where the goal is to map a user query to its most relevant documents. Over time, DSSM has become a foundational architecture in advertising and recommendation systems, particularly for embedding-based candidate retrieval.
At the heart of DSSM is the two-tower (or dual-tower) architecture, which has heavily influenced many modern matching systems. This architecture consists of:
A user tower: Encodes user features, such as browsing history, demographic data, or contextual signals, into a dense embedding.
An item tower: Encodes item-specific features, such as product description, category, or brand, into another dense embedding.
The user and item embeddings are trained to lie in a shared semantic space such that relevant user-item pairs are close together under a similarity measure (typically dot product or cosine similarity).
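A compact sketch of the two-tower layout, assuming PyTorch; the feature dimensions and layer sizes are illustrative. The key property is that the two towers never interact until the final similarity, which is what allows item embeddings to be precomputed and indexed for retrieval.

```python
# Two towers map user and item features into one shared space; relevance is
# the dot product (cosine similarity is the other common choice).
import torch
import torch.nn as nn

def make_tower(in_dim, emb_dim=64):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, emb_dim))

user_tower = make_tower(in_dim=40)  # browsing history, demographics, context
item_tower = make_tower(in_dim=30)  # description, category, brand features

user_vec = user_tower(torch.randn(8, 40))   # (batch, 64)
item_vec = item_tower(torch.randn(8, 30))   # (batch, 64)
score = (user_vec * item_vec).sum(dim=-1)   # dot-product relevance
```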
DSSM is trained discriminatively on clickthrough data. Given a user query (or context) and a small set of candidate documents (or items), the model is trained to maximize the conditional likelihood of the clicked item, which pulls the user vector and the clicked item vector together in the shared space. This is often formulated as a list-wise loss over a small batch of candidates:
$$L=- \log \prod_{(Q,D^{+})} P(D^{+}|Q)$$
where \(Q\) is a query, \(D^{+}\) is the clicked document, and \(P(D^{+}|Q)\) is the posterior probability of the document given the query, derived from their semantic relevance scores via a softmax function.
The model parameters are optimized to minimize the negative log-likelihood of the clicked item across the training set.
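In code, this objective is an ordinary softmax cross-entropy over a small candidate list per training instance. The sketch below follows the original setup of one clicked item plus four sampled negatives, and uses a plain dot product in place of DSSM's smoothed cosine similarity.

```python
# List-wise loss: -log P(D+ | Q), with the clicked item at index 0.
import torch
import torch.nn.functional as F

batch, num_candidates, dim = 8, 5, 64        # 1 click + 4 negatives each
q = torch.randn(batch, dim)                  # query/user embeddings
d = torch.randn(batch, num_candidates, dim)  # clicked + negative item embeddings

logits = torch.einsum("bd,bnd->bn", q, d)      # similarity to each candidate
labels = torch.zeros(batch, dtype=torch.long)  # clicked item is index 0
loss = F.cross_entropy(logits, labels)         # averaged -log P(D+ | Q)
```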
3.3 Comparison of the Two Models
Although DSSM and YouTubeDNN share the core idea of learning embeddings for users and items in a shared semantic space, they differ significantly in both their architectures and loss functions.
YouTubeDNN uses a single-tower architecture, where the model is trained with a point-wise loss. It treats each user-item interaction independently and samples negative items from the global item corpus. To address sampling bias, importance weighting is applied during training.
In contrast, DSSM follows a two-tower architecture, with separate neural networks for encoding users and items. It employs a list-wise loss function, where a clicked item and multiple unclicked items are considered together in a single training instance. In the original implementation, for each query or user session, the training batch consists of one clicked document and several randomly selected unclicked documents (e.g., four negatives).
In our practical implementation of DSSM for the matching scenario, we extend this approach by sampling negative items from the entire global item corpus, enabling a more diverse and representative set of contrasts during training.
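A sketch of that sampling step, under the simplifying assumption that negatives are drawn uniformly from the corpus; production systems often correct for item popularity so that head items are not over-represented.

```python
# Draw negatives for one positive pair from the full item corpus.
import numpy as np

rng = np.random.default_rng(0)
corpus_size, num_negatives = 1_000_000, 4

def sample_negatives(positive_id):
    # Resample on the (rare) chance a negative collides with the positive.
    negs = rng.integers(0, corpus_size, size=num_negatives)
    while positive_id in negs:
        negs = rng.integers(0, corpus_size, size=num_negatives)
    return negs

print(sample_negatives(positive_id=12345))  # four corpus-wide item ids
```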
Hard mining, as described in [3], is a complementary technique for constructing training data that helps an embedding model learn efficiently and effectively in the embedding space; it includes both hard negative mining and hard positive mining. This article will not delve into its details.
4 References
[1] Covington, Paul, Jay Adams, and Emre Sargin. "Deep neural networks for YouTube recommendations." Proceedings of the 10th ACM Conference on Recommender Systems. 2016.
[2] Huang, Po-Sen, et al. "Learning deep structured semantic models for web search using clickthrough data." Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2013.
[3] Huang, Jui-Ting, et al. "Embedding-based retrieval in Facebook search." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.