DIN: Deep Interest Network for Click-Through Rate Prediction

Abhay Shukla

DIN was published by Alibaba (Taobao) at KDD 2018. The paper mentions that DIN was successfully deployed in Alibaba's online display advertising system, serving the main traffic.

Full Paper: Deep Interest Network for Click-Through Rate Prediction

Motivation Behind DIN

DNN models with the Embedding&MLP paradigm:

  • Map sparse inputs to low-dimensional embedding vectors

  • Transform embedding vectors into fixed-length vectors for each feature group

  • Concatenate all the resulting vectors before passing them to MLP layers

The bottleneck with this approach:

  • The user feature vector learned from behavior sequences is the same irrespective of the target item.

  • However, for a given target item, only part of the historical user behavior is relevant; the entire user history does not contribute equally to capturing user interest w.r.t. that item.

The example in the paper explains it as: “a female swimmer will click a recommended goggle mostly due to the purchase of bathing suit rather than the shoes in her last week’s shopping list”.

Let's see what DIN proposes.

Deep Interest Network (DIN)

Feature sets used in the online display advertising system at Alibaba fall into four groups: user profile features, user behavior features, ad features, and context features.

  • In the Embedding&MLP paradigm:

    • One-hot features get converted to fixed-length feature vectors directly by the embedding layer

    • Variable-length sequences, such as item ids visited by the user, are represented as multi-hot features; embeddings are looked up for each index and then a pooling operation (typically sum or average) produces a fixed-length feature vector (see the sketch after this list)
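A minimal PyTorch sketch of this baseline, assuming a shared item embedding table and sum pooling; the model name, layer sizes, and mask convention are illustrative, not the paper's production configuration:

```python
import torch
import torch.nn as nn

class EmbeddingMLPBaseline(nn.Module):
    """Sketch of the Embedding&MLP paradigm (illustrative sizes)."""

    def __init__(self, n_users: int, n_items: int, emb_dim: int = 16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)   # one-hot feature group
        self.item_emb = nn.Embedding(n_items, emb_dim)   # shared by behaviors and the ad
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim * 3, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user_id, behavior_ids, behavior_mask, ad_id):
        # user_id: (B,), behavior_ids: (B, H) padded item ids,
        # behavior_mask: (B, H) 0/1 float tensor, ad_id: (B,)
        u = self.user_emb(user_id)                                # (B, D)
        seq = self.item_emb(behavior_ids)                         # (B, H, D)
        # Sum pooling collapses the variable-length sequence into a
        # fixed-length vector that is independent of the candidate ad --
        # exactly the bottleneck DIN addresses.
        pooled = (seq * behavior_mask.unsqueeze(-1)).sum(dim=1)   # (B, D)
        ad = self.item_emb(ad_id)                                 # (B, D)
        x = torch.cat([u, pooled, ad], dim=-1)                    # concat feature groups
        return torch.sigmoid(self.mlp(x)).squeeze(-1)             # predicted CTR
```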

In DIN, multi-hot features, which contain rich user behaviors, are converted to a fixed-length representation while capturing the information relevant to the target item.

DIN Architecture

Notice the changes in the architecture w.r.t. the Embedding&MLP paradigm.

DIN introduces a local activation unit, which is applied to the user behavior features and performs a weighted sum pooling to calculate the user representation for a given candidate item A.
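The user representation is computed as a weighted sum over the behavior embeddings:

\[ v_U(A) = f(v_A, e_1, e_2, \ldots, e_H) = \sum_{j=1}^{H} a(e_j, v_A)\, e_j = \sum_{j=1}^{H} w_j\, e_j \]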

where:

  • \(v_U(A)\) is the user vector given a target item A

  • \(e_1, e_2, \ldots, e_H\) are the embeddings of the H items the user has interacted with

  • \(v_A\) is the embedding of the target item A

  • \(a(\cdot)\) is a feed-forward network whose output is the activation weight

The rest of the architecture remains the same as in the Embedding&MLP paradigm.

IMP: Compared with traditional attention, where the sum of attention weights is constrained to 1 by a softmax operation, this normalization is relaxed in DIN. Instead, the sum of attention weights is treated as “an approximation of the intensity of activated user interests”.
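A minimal PyTorch sketch of the local activation unit and the relaxed weighted sum pooling. The paper feeds the behavior embedding, the candidate embedding, and their interaction (out product) into a small feed-forward network; here an element-wise product stands in for that interaction, and the layer sizes and helper names (`LocalActivationUnit`, `din_user_vector`) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LocalActivationUnit(nn.Module):
    """Scores each behavior embedding against the candidate ad embedding."""

    def __init__(self, emb_dim: int, hidden: int = 36):
        super().__init__()
        # Input: behavior emb, candidate emb, and (assumed) element-wise product
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim * 3, hidden),
            nn.PReLU(),                      # paper uses PReLU/Dice here
            nn.Linear(hidden, 1),
        )

    def forward(self, behaviors, candidate):
        # behaviors: (B, H, D); candidate: (B, D)
        cand = candidate.unsqueeze(1).expand_as(behaviors)       # (B, H, D)
        x = torch.cat([behaviors, cand, behaviors * cand], -1)   # (B, H, 3D)
        return self.mlp(x).squeeze(-1)                           # (B, H) raw weights

def din_user_vector(behaviors, candidate, att_unit, mask=None):
    # Weighted sum pooling WITHOUT softmax normalization (DIN's relaxation).
    w = att_unit(behaviors, candidate)            # (B, H)
    if mask is not None:                          # zero out padded behaviors
        w = w * mask
    return (w.unsqueeze(-1) * behaviors).sum(1)   # (B, D) = v_U(A)
```

Note that the weights `w` are used as-is, without a softmax, so their sum can reflect the intensity of activated interests.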

Two more novel contributions mentioned in the paper (both sketched after this list) are:

  • Mini-batch Aware Regularization, where the L2 penalty is applied only to the weights that are active in the forward pass, and

  • the Dice activation function, which may be viewed as a generalization of PReLU
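A rough sketch of mini-batch aware regularization under the stated idea: only embedding rows active in the current mini-batch are penalized, scaled by the inverse of each feature's global occurrence count \(n_j\). The helper name and arguments are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

def mini_batch_aware_l2(embedding: nn.Embedding,
                        batch_ids: torch.Tensor,
                        counts: torch.Tensor,
                        lam: float = 0.01) -> torch.Tensor:
    """Penalize only the embedding rows that appear in this mini-batch,
    weighted by 1 / n_j (the feature's occurrence count in the data).

    batch_ids: (N,) long tensor of feature ids seen in the current batch
    counts:    (vocab,) float tensor, counts[j] = occurrences of feature j
    """
    ids = batch_ids.unique()          # rows active in the forward pass
    rows = embedding.weight[ids]      # (|ids|, D)
    return lam * ((rows ** 2).sum(dim=1) / counts[ids]).sum()

# Usage: add this term to the loss so the L2 update touches only active rows.
# loss = bce_loss + mini_batch_aware_l2(item_emb, behavior_ids.flatten(), counts)
```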
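And a sketch of Dice, assuming mini-batch statistics via a non-affine batch norm; the gate \(p(s) = \sigma\big((s - E[s]) / \sqrt{\mathrm{Var}[s] + \epsilon}\big)\) shifts PReLU's rectification point to track the data distribution:

```python
import torch
import torch.nn as nn

class Dice(nn.Module):
    """Sketch of the Dice activation: f(s) = p(s)*s + (1 - p(s))*alpha*s.
    With p fixed as a step function at 0, this reduces to PReLU."""

    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(dim))           # learned negative slope
        # Non-affine batch norm computes (s - E[s]) / sqrt(Var[s] + eps)
        self.bn = nn.BatchNorm1d(dim, eps=eps, affine=False)

    def forward(self, s):                 # s: (B, dim)
        p = torch.sigmoid(self.bn(s))     # data-adaptive gate in (0, 1)
        return p * s + (1.0 - p) * self.alpha * s
```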

Experiments

Dataset

Experiments are reported on the Amazon (Electronics), MovieLens, and Alibaba datasets.

Metrics

AUC and RelaImpr are used to measure performance.
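The paper defines RelaImpr as the relative improvement over a base model, taking AUC = 0.5 (random guessing) as the reference point:

\[ \text{RelaImpr} = \left( \frac{\text{AUC(measured model)} - 0.5}{\text{AUC(base model)} - 0.5} - 1 \right) \times 100\% \]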

Offline Results on Alibaba Dataset

DIN outperforms the BaseModel in AUC, and training with the proposed mini-batch aware regularizer and Dice activation yields further gains.

Online Results

Online A/B testing in Alibaba's display advertising system was conducted from 2017-05 to 2017-06. DIN trained with the proposed regularizer and activation function achieved up to 10.0% CTR and 3.8% RPM (Revenue Per Mille) improvement compared with the BaseModel.

Other Insights

  • Ads are ranked by \(\mathrm{CTR}^{\alpha} \times \mathrm{bid\_price}\) with \(\alpha > 1\), where \(\alpha\) controls the balance between CTR and RPM

  • Inference time is less than 10ms for hundreds of ads per visitor

  • Request batching is used to take full advantage of the GPU

  • Concurrent kernel computation allows matrix computations to be executed by multiple CUDA kernels in parallel
