Luong Attention

Abhay Shukla

Luong attention was proposed in the paper "Effective Approaches to Attention-based Neural Machine Translation" (Luong et al., 2015). It comes in two variants:

  • Global Attention

  • Local Attention

The proposed NMT architecture consists of an LSTM encoder that encodes the input sequence and an LSTM decoder that predicts the translation.

At each time step t of the decoding phase, given the target hidden state \(h_t\) and the source-side context vector \(c_t\), the attentional hidden state is computed as

$$\mathbf{\tilde{h}}_t = \tanh(\mathbf{W}_c[\mathbf{c}_t;\mathbf{h}_t])$$

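and the predictive distribution over target tokens is then obtained as

$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}(\mathbf{W}_s\mathbf{\tilde{h}}_t)$$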

Global and local attention differ only in how the source-side context vector \(\mathbf{c}_t\) is computed.
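
The decoding step described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy dimensions and the random weights are assumptions, and the output layer (a softmax over \(W_s\tilde{h}_t\)) follows the formulation in the paper. The context vector is taken as given here, since how it is computed depends on the attention variant.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attentional_hidden_state(c_t, h_t, W_c):
    # h~_t = tanh(W_c [c_t; h_t])
    return np.tanh(W_c @ np.concatenate([c_t, h_t]))

def predict_distribution(h_tilde, W_s):
    # p(y_t | y_<t, x) = softmax(W_s h~_t)
    return softmax(W_s @ h_tilde)

# Toy sizes and random parameters (illustrative assumptions only).
rng = np.random.default_rng(0)
d, vocab = 4, 10
c_t = rng.normal(size=d)            # source-side context vector
h_t = rng.normal(size=d)            # target hidden state
W_c = rng.normal(size=(d, 2 * d))   # maps [c_t; h_t] back to d dimensions
W_s = rng.normal(size=(vocab, d))   # output projection

h_tilde = attentional_hidden_state(c_t, h_t, W_c)
p = predict_distribution(h_tilde, W_s)
print(h_tilde.shape, p.shape, p.sum())
```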

Global Attention

The global attentional model considers all hidden states of the encoder when deriving the context vector \(c_t\).

  • For the current target hidden state \(h_t\) and each source hidden state \(\bar{h}_s\), the alignment (attention) weight is computed as a softmax over the scores:

$$\begin{align} a_t(s) &= \mathrm{align}(h_t, \bar{h}_s) \\ &= \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'}\exp(\mathrm{score}(h_t, \bar{h}_{s'}))} \end{align}$$

  • The authors propose three alternative ways to compute the score:

$$\begin{equation} \mathrm{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top\bar{h}_s & \text{dot}\\ h_t^\top W_a\bar{h}_s & \text{general}\\ v_a^\top\tanh(W_a[h_t;\bar{h}_s]) & \text{concat} \end{cases} \end{equation}$$

  • The global context vector is then computed as the weighted average over all source hidden states:

$$c_t = \sum_s{a_t(s)\bar{h}_s}$$
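
The three global attention steps above (score, softmax normalization, weighted average) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names, the `method` argument, the hidden size `k` used for the concat variant, and the random toy inputs are all hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score(h_t, h_s, method="dot", W_a=None, v_a=None):
    if method == "dot":        # h_t^T h_s
        return h_t @ h_s
    if method == "general":    # h_t^T W_a h_s, with W_a of shape (d, d)
        return h_t @ W_a @ h_s
    if method == "concat":     # v_a^T tanh(W_a [h_t; h_s]), W_a of shape (k, 2d)
        return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))
    raise ValueError(f"unknown score method: {method}")

def global_attention(h_t, H_s, method="dot", W_a=None, v_a=None):
    # H_s holds one source hidden state per row, shape (S, d).
    scores = np.array([score(h_t, h_s, method, W_a, v_a) for h_s in H_s])
    a_t = softmax(scores)      # alignment weights over all S source positions
    c_t = a_t @ H_s            # weighted average of the source states
    return a_t, c_t

# Toy inputs (illustrative assumptions only).
rng = np.random.default_rng(0)
S, d, k = 5, 4, 3
H_s = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_general = rng.normal(size=(d, d))
W_concat = rng.normal(size=(k, 2 * d))
v_a = rng.normal(size=k)

a_dot, c_dot = global_attention(h_t, H_s, "dot")
a_gen, c_gen = global_attention(h_t, H_s, "general", W_a=W_general)
a_cat, c_cat = global_attention(h_t, H_s, "concat", W_a=W_concat, v_a=v_a)
print(a_dot.sum(), c_dot.shape)
```

In practice the per-position loop would usually be replaced by a single matrix product; it is kept here so that each line maps directly onto one of the formulas.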

Local Attention

The local attentional mechanism focuses on only a small subset of the source positions for each target word.

  • For each decoding step t, an aligned position \(p_t\) in the source sequence can be estimated by one of the following methods:

    • Monotonic alignment (local-m): simply set \(p_t = t\), assuming that the source and target sequences are roughly monotonically aligned

    • Predictive alignment (local-p): for a given source sequence length S, predict the aligned position as a real number in \([0, S]\),

$$p_t = S \cdot \mathrm{sigmoid}(v_p^\top\tanh(W_p h_t))$$

  • The context vector \(c_t\) is then derived as a weighted average of the encoder hidden states in the window \([p_t-D, p_t+D]\), where D is chosen empirically

  • Further, to favor alignment points near \(p_t\), a Gaussian distribution centered at \(p_t\) is applied to the weights. As a result, the alignment weights are defined as

$$\begin{align} a_t(s) &= \mathrm{align}(h_t, \bar{h}_s) \exp\left(-\frac{(s-p_t)^2}{2\sigma^2}\right) \\ \sigma &= \frac{D}{2} \end{align}$$

Here s is an integer within the context window centered at \(p_t\), and \(align(h_t, \bar{h}_s)\) is the same as defined for Global Attention.
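
A rough NumPy sketch of the predictive (local-p) variant is given below. Several choices here are assumptions made for illustration rather than details stated above: source positions are indexed from 0, D is an integer of at least 1, the window is clipped at the sentence boundaries, the dot score is used, and the softmax inside `align` is taken over the window positions only. The function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the maximum for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def local_p_attention(h_t, H_s, W_p, v_p, D):
    S = H_s.shape[0]
    # Predicted aligned position: p_t = S * sigmoid(v_p^T tanh(W_p h_t))
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))
    # Integer positions inside [p_t - D, p_t + D], clipped to the sentence.
    lo = max(0, int(np.ceil(p_t - D)))
    hi = min(S - 1, int(np.floor(p_t + D)))
    positions = np.arange(lo, hi + 1)
    window = H_s[positions]
    # align(h_t, h_s) with the dot score, normalized over the window.
    align = softmax(window @ h_t)
    # Gaussian centered at p_t with sigma = D / 2.
    sigma = D / 2.0
    gauss = np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    a_t = align * gauss        # not renormalized, matching the formula
    c_t = a_t @ window         # weighted combination of the window states
    return p_t, positions, a_t, c_t

# Toy inputs (illustrative assumptions only).
rng = np.random.default_rng(0)
S, d, k, D = 12, 4, 3, 2
H_s = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_p = rng.normal(size=(k, d))
v_p = rng.normal(size=k)

p_t, positions, a_t, c_t = local_p_attention(h_t, H_s, W_p, v_p, D)
print(p_t, positions, a_t.sum())
```

Because the Gaussian factor is at most 1, the resulting weights sum to at most 1 rather than exactly 1. For local-m, the same function could be reused with `p_t` simply set to the decoding step t instead of being predicted.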
