Practical Lessons for Building Reranking Models in Advertising Systems

1 Multi-Stage Architecture of Online Advertising Systems
Modern online advertising systems typically follow a multi-stage cascade architecture designed to balance scalability, latency, and relevance. Each stage progressively refines a large pool of candidate ads down to the final set shown to the user (a minimal code sketch of this funnel appears after the list):
Matching Stage: This is the first retrieval layer, responsible for generating a broad set of candidate ad items from a massive inventory. Techniques like preference matching, embedding-based similarity, and approximate nearest neighbor (ANN) search are commonly used to ensure high recall and maintain computational efficiency.
Pre-Ranking Stage: Also known as the filtering or rough scoring phase, this stage applies a lightweight model or strategy to assign preliminary scores to the candidate set. Its primary goal is to efficiently filter out low-quality or irrelevant ads, reducing the pool from tens of thousands to a more manageable size for downstream processing.
Ranking Stage: At this point, a more fine-grained and complex model is applied to the remaining candidates. This model produces precise relevance scores by leveraging rich user, context, and ad features. The top-scoring items from this stage represent the most relevant ads according to the system's current optimization goals (e.g., CTR, conversion rate, revenue).
Reranking Stage: The reranking phase operates on the top-K ad items output from the ranking model. This final stage is designed to further refine the ordering, often incorporating additional signals (e.g., diversity, fairness, recency, or multi-objective trade-offs) that may not have been fully captured earlier. Reranking models help optimize the final presentation list, ensuring that the most effective and contextually appropriate ads are delivered to the user.
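To make the funnel concrete, here is a minimal, runnable sketch of the four stages. Every function body, the field names (`rough_score`, `fine_score`), and the candidate counts are illustrative assumptions, not taken from any specific production system:

```python
import random

K = 10  # final slate size (illustrative)

def matching(inventory, n=10_000):
    # High-recall retrieval; stands in for embedding similarity / ANN search.
    return random.sample(inventory, min(n, len(inventory)))

def pre_ranking(candidates, n=500):
    # Lightweight rough scoring: cheaply filter to a manageable pool.
    return sorted(candidates, key=lambda ad: ad["rough_score"], reverse=True)[:n]

def ranking(candidates):
    # Heavy model with rich user/context/ad features produces precise scores.
    return sorted(candidates, key=lambda ad: ad["fine_score"], reverse=True)

def reranking(top_k):
    # Final listwise refinement (see the DLCM/PRM models in Section 2).
    return top_k  # identity placeholder

def serve(inventory):
    cands = matching(inventory)      # millions -> tens of thousands
    cands = pre_ranking(cands)       # tens of thousands -> hundreds
    ranked = ranking(cands)          # hundreds, precisely scored
    return reranking(ranked[:K])     # slate of K ads shown to the user
```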
Figure: the cascade architecture of an industrial advertising system, as presented in [3].
2 Models
Reranking plays a critical role in online advertising systems, serving as the final stage in the multi-stage ranking pipeline. Unlike earlier stages that focus on relevance scoring based on individual user-item features, reranking has the opportunity to incorporate contextual relationships among the top-ranked items. However, many traditional reranking models focus solely on user-item pairwise interactions, ignoring the mutual influences among items displayed together in the final ad list.
While pairwise and listwise Learning-to-Rank (LTR) models aim to improve ranking by considering item pairs or item lists as inputs, they typically focus on optimizing the loss function—using labels like click-through data or relevance scores—rather than explicitly modeling inter-item dependencies in the feature space. This limitation can reduce the effectiveness of reranking in scenarios where the presence of one item affects the likelihood of another being clicked.
To address this, models such as the Deep Listwise Context Model (DLCM) [1] and the Personalized Reranking Model (PRM) [2] have been proposed. These models explicitly incorporate mutual item influence within the reranking process, aiming to refine the initial ranked list produced by the earlier ranking stage.
2.1 Deep Listwise Context Model (DLCM)
Originally developed for ranking refinement in information retrieval systems, DLCM is also highly applicable to ad reranking. The key idea behind DLCM is to encode intra-list dependencies using a recurrent neural network (RNN), specifically a Gated Recurrent Unit (GRU).
In DLCM, the top-N ranked list (produced by the ranking stage) is fed sequentially into a GRU in reverse order. Each item in the list is input at a different time step, and the GRU produces a hidden state representing the accumulated context. The final hidden state acts as a compact representation of the entire list, capturing the overall inter-item context.
DLCM also introduces a local ranking function, which scores each item from both the GRU's output for that item and the final context vector. This mechanism is conceptually similar to the attention used in RNN-based machine translation models, and it allows the model to adjust item scores based on their position and surrounding items.
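Below is a minimal PyTorch sketch of this idea. The hidden size and the exact form of the local scoring function are our assumptions; the paper's parameterization differs in details:

```python
import torch
import torch.nn as nn

class DLCM(nn.Module):
    """Sketch of DLCM: a GRU encodes the ranked list in reverse order,
    and a local ranking function rescores each item using the list context."""
    def __init__(self, feat_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Attention-like local ranking function over (item output, list context).
        self.score = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, items):                          # items: (B, N, feat_dim)
        rev = torch.flip(items, dims=[1])              # feed the list in reverse
        outputs, final_state = self.gru(rev)
        outputs = torch.flip(outputs, dims=[1])        # restore original order
        context = final_state[-1]                      # (B, H): whole-list context
        ctx = context.unsqueeze(1).expand_as(outputs)  # broadcast to every item
        return self.score(torch.cat([outputs, ctx], dim=-1)).squeeze(-1)  # (B, N)

# scores = DLCM(feat_dim=32)(torch.randn(4, 10, 32))  # 4 sessions, 10 ads each
```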
2.2 Personalized Reranking Model (PRM)
Building on the ideas from DLCM, the Personalized Reranking Model (PRM) adopts a more advanced architecture, replacing the GRU with a Transformer-style encoder. The self-attention mechanism enables PRM to capture long-range dependencies and model interactions among all items simultaneously, overcoming the sequential processing limitations of RNNs.
In addition to modeling inter-item influence, PRM introduces personalized user embeddings at the input layer. These embeddings allow the model to tailor reranking decisions to individual user preferences, enhancing personalization and improving user engagement.
Thanks to the parallelizable nature of Transformers, PRM offers not only better performance in modeling context but also improved efficiency during training and inference. This makes it more scalable and practical for real-world advertising systems where both accuracy and latency are crucial.
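A minimal PRM-style sketch follows. Layer sizes, the way the user embedding is injected (concatenated to every item), and the learned position embedding over the initial ranking order are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PRM(nn.Module):
    """Sketch of PRM: Transformer encoder over the top-N list,
    personalized with a user embedding at the input layer."""
    def __init__(self, feat_dim, user_dim, d_model=64, n_heads=4,
                 n_layers=2, max_len=50):
        super().__init__()
        self.proj = nn.Linear(feat_dim + user_dim, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # encodes initial rank order
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score = nn.Linear(d_model, 1)

    def forward(self, items, user):                # items: (B, N, F), user: (B, U)
        B, N, _ = items.shape
        u = user.unsqueeze(1).expand(B, N, -1)     # personalize every position
        x = self.proj(torch.cat([items, u], dim=-1))
        x = x + self.pos(torch.arange(N, device=items.device))
        h = self.encoder(x)                        # self-attention over all items
        return self.score(h).squeeze(-1)           # (B, N) reranking scores
```

Because self-attention processes all positions in parallel, one forward pass scores the whole slate, which is where the training and inference efficiency mentioned above comes from.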
3 Evaluation
3.1 Dataset Construction
As far as we know, there is no standardized reranking dataset or widely adopted approach tailored for online advertising systems. To bridge this gap, we construct a dataset that adapts the relevance-based labeling strategy from information retrieval while incorporating key advertising metrics such as effective Cost Per Mille (eCPM).
We define the reranking labels as follows:
Label 0: The ad was shown but not clicked.
Label 1 to N: The ad was clicked, with labels assigned based on the product of predicted CTR and bid price (i.e., eCPM). Higher labels correspond to higher eCPM values, indicating more valuable clicks.
This labeling scheme enables the reranking model to focus on not just clicks, but value-weighted clicks, aligning the objective more closely with business goals like revenue optimization.
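One way to implement this labeling is sketched below. Bucketing clicked impressions into N grades by eCPM quantile is our illustrative choice; any monotone mapping from eCPM to grades fits the scheme:

```python
import numpy as np

def assign_labels(clicked, pctr, bid, n_grades=4):
    """Label 0 = shown but not clicked; clicked ads get grades 1..n_grades
    by eCPM quantile among clicked impressions (bucketing is an assumption)."""
    clicked = np.asarray(clicked, dtype=bool)
    ecpm = np.asarray(pctr) * np.asarray(bid)   # eCPM ~ predicted CTR * bid
    labels = np.zeros(len(ecpm), dtype=int)
    if clicked.any():
        edges = np.quantile(ecpm[clicked], np.linspace(0, 1, n_grades + 1)[1:-1])
        labels[clicked] = 1 + np.searchsorted(edges, ecpm[clicked])
    return labels
```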
Session Definition and Filtering: Each training sample corresponds to a session, defined as a set of ad impressions served in a single user request. Similar to preprocessing pipelines used for pCTR models, we apply strict filtering rules:
Sessions with low-quality signals (e.g., bots or anomalous behaviors) are removed.
Sessions with no clicks at all are excluded to ensure training focuses on meaningful user interactions.
This yields a high-quality dataset suitable for training and evaluating reranking models in advertising environments.
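A pandas sketch of the session filtering, assuming a hypothetical log schema with `request_id`, `is_click`, and `is_bot` columns:

```python
import pandas as pd

def build_sessions(log: pd.DataFrame) -> pd.DataFrame:
    """Keep only clean sessions: drop bot traffic, then drop sessions
    (one request_id = one session) that contain no clicks at all."""
    log = log[~log["is_bot"]]                                    # low-quality signals out
    session_clicks = log.groupby("request_id")["is_click"].transform("sum")
    return log[session_clicks > 0]                               # >=1 click per session
```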
3.2 Evaluation Metrics
We adopt both offline ranking metrics and online business KPIs to assess model performance comprehensively.
Offline Metrics
We use Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP) as our primary offline evaluation metrics.
NDCG@K
Normalized Discounted Cumulative Gain (NDCG) is a standard metric used to evaluate the quality of ranked lists by considering both the relevance of items and their positions. The core idea is that highly relevant items are more valuable when they appear near the top of the list, where users are more likely to engage with them.
The metric is based on Discounted Cumulative Gain (DCG), which applies a logarithmic discount to relevance scores according to their rank positions. DCG is calculated as follows:
$$DCG = \sum_{i=1}^{N} \frac{2^{r_i} - 1}{\log_2 (i+1)}$$
Where \(r_i\) is the relevance label (e.g., our reranking label) of the item at position \(i\), and \(N\) is the number of items in the list.
To enable a fair comparison across different sessions or queries, DCG is normalized by dividing it by IDCG (Ideal DCG), which represents the DCG of a perfectly ranked list where items are sorted in descending order of relevance.
$$NDCG = \frac{DCG}{IDCG}$$
In our evaluation, we use NDCG@K, which focuses on the top-K ranked items. This is especially relevant in advertising systems, where the top few positions capture the majority of user attention and generate the highest revenue impact.
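A small NumPy implementation of NDCG@K, following the formulas above (the `labels` argument is the reranking labels in the model's predicted order):

```python
import numpy as np

def dcg_at_k(labels, k):
    """DCG over the first k labels, with the log2(i+1) position discount."""
    labels = np.asarray(labels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, labels.size + 2))  # log2(i+1) for i = 1..k
    return np.sum((2 ** labels - 1) / discounts)

def ndcg_at_k(labels, k):
    """Normalize by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

# ndcg_at_k([3, 0, 2, 1], k=4)  # one session, labels in ranked order
```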
MAP@K
Mean Average Precision at K (MAP@K) is a widely used metric that evaluates the ranking quality of binary relevance tasks—in this case, distinguishing between clicked and non-clicked items. Unlike NDCG, MAP treats all relevant items (clicks) equally, without considering varying degrees of relevance. MAP@K emphasizes early precision, rewarding models that place clicked (i.e., relevant) items near the top of the ranked list.
First, we define Precision@k as the fraction of clicked items within the top-k positions:
$$Precision@k = \frac{\text{# clicked items in top-k}}{k}$$
Then, Average Precision at k (AP@k) for a single session is computed by averaging the precision at each position \(i\) where a click occurs:
$$AP@k = \frac{\sum_{i=1}^{k} Precision@i \cdot l_i}{\text{# clicked items in top-k}}$$
Where \(l_i\) is 1 if the item at rank \(i\) was clicked, 0 otherwise. MAP@k is then the mean of AP@k across all sessions.
This metric is particularly useful for evaluating how well the model surfaces relevant ads early in the result list.
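The corresponding NumPy sketch, with each session given as a 0/1 click vector in the model's ranked order:

```python
import numpy as np

def ap_at_k(clicks, k):
    """Average Precision@k for one session (clicks: 0/1 in ranked order)."""
    clicks = np.asarray(clicks)[:k]
    if clicks.sum() == 0:
        return 0.0
    precisions = np.cumsum(clicks) / np.arange(1, clicks.size + 1)  # Precision@i
    return float(np.sum(precisions * clicks) / clicks.sum())

def map_at_k(sessions, k):
    """Mean of AP@k over all sessions."""
    return float(np.mean([ap_at_k(s, k) for s in sessions]))

# map_at_k([[1, 0, 1, 0], [0, 1, 0, 0]], k=4)
```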
3.3 Online Metrics
For online A/B testing, we focus on standard business-oriented KPIs in advertising systems (a small computation sketch follows the list):
CTR (Click-Through Rate): The ratio of clicks to total impressions, measuring user engagement.
ACP (Average Click Price): The average price paid per click, indicating bid competitiveness and advertiser cost.
RPM (Revenue Per Mille): Revenue generated per thousand impressions, reflecting monetization efficiency and the platform’s revenue from traffic.
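For reference, the arithmetic behind these KPIs (assuming `revenue` is the total amount advertisers paid for clicks in the period):

```python
def online_kpis(impressions: int, clicks: int, revenue: float) -> dict:
    """Standard advertising KPIs; assumes impressions > 0 and clicks > 0."""
    return {
        "CTR": clicks / impressions,              # engagement
        "ACP": revenue / clicks,                  # average price per click
        "RPM": revenue / impressions * 1000,      # revenue per 1,000 impressions
    }
```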
4 References
[1] Ai, Qingyao, et al. "Learning a deep listwise context model for ranking refinement." The 41st international ACM SIGIR conference on research & development in information retrieval. 2018.
[2] Pei, Changhua, et al. "Personalized re-ranking for recommendation." Proceedings of the 13th ACM conference on recommender systems. 2019.
[3] Wang, Zhe, et al. "Cold: Towards the next generation of pre-ranking system." arXiv preprint arXiv:2007.16122 (2020).