Luong Attention


Luong Attention was proposed in the paper Effective Approaches to Attention-based Neural Machine Translation, with two variants:

  • Global Attention

  • Local Attention

The proposed NMT model architecture consists of an LSTM encoder to encode the input sequence and an LSTM decoder to predict the translation.
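To make the setup concrete, here is a minimal sketch of such an encoder-decoder pair in PyTorch. The dimensions, single-layer LSTMs, and variable names are illustrative assumptions, not the paper's exact configuration (the paper uses stacked LSTMs).

```python
import torch
import torch.nn as nn

hidden, embed = 512, 256          # assumed sizes for illustration
src_vocab, tgt_vocab = 10000, 10000

encoder_emb = nn.Embedding(src_vocab, embed)
decoder_emb = nn.Embedding(tgt_vocab, embed)
encoder = nn.LSTM(embed, hidden, batch_first=True)   # encodes the source sequence
decoder = nn.LSTM(embed, hidden, batch_first=True)   # predicts the translation step by step

src = torch.randint(0, src_vocab, (1, 7))             # toy source token ids
enc_out, (h_n, c_n) = encoder(encoder_emb(src))        # enc_out holds all source hidden states
```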

At each time step t of the decoding phase, given the target hidden state \(h_t\) and the source-side context vector \(c_t\), the attentional hidden state is computed as,

$$\mathbf{\tilde{h}}_t = \tanh(\mathbf{W}_c[\mathbf{c}_t;\mathbf{h}_t])$$

and the next token is then predicted as,

$$p(y_t|y_{&lt;t}, x) = softmax(\mathbf{W}_s\mathbf{\tilde{h}}_t)$$
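A minimal sketch of these two steps, assuming random tensors and hypothetical weight matrices W_c and W_s:

```python
import torch

hidden, tgt_vocab = 512, 10000
h_t = torch.randn(hidden)                # target hidden state at step t
c_t = torch.randn(hidden)                # source-side context vector at step t
W_c = torch.randn(hidden, 2 * hidden)    # assumed attentional weight matrix
W_s = torch.randn(tgt_vocab, hidden)     # assumed output projection

h_tilde = torch.tanh(W_c @ torch.cat([c_t, h_t]))   # tanh(W_c [c_t; h_t])
p_y = torch.softmax(W_s @ h_tilde, dim=-1)           # softmax(W_s h_tilde) over the vocabulary
```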

The difference between the global and local attention mechanisms lies in how the source-side context vector \(\mathbf{c}_t\) is computed.

Global Attention

The global attentional model considers all the hidden states of the encoder when deriving the context vector \(c_t\).

  • For the current target hidden state \(h_t\) and each source hidden state \(\bar{h}_s\), the alignment weight is computed as,

$$\begin{align} a_t(s) &= align(h_t, \bar{h}_s) \\ &= \frac{exp(score(h_t, \bar{h}_s))}{\sum_{s'}exp(score(h_t, \bar{h}_{s'}))} \end{align}$$

  • The authors propose different ways to compute the score function,

$$\begin{equation} score(h_t, \bar{h}_s) = \begin{cases} h_t^T\bar{h}_s & dot\\ h_{t}^{T}W_a\bar{h}_s & general\\ v_{a}^{T}tanh(W_a[h_t;\bar{h}_s]) & concat \end{cases} \end{equation}$$

  • The context vector \(c_t\) is then computed as a weighted average over all the source hidden states (see the sketch after this list),

$$c_t = \sum_s{a_t(s)\bar{h}_s}$$
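Putting the pieces together, here is a minimal sketch of global attention for a single decoding step. All tensors, dimensions, and the choice of the general score are illustrative assumptions.

```python
import torch

hidden, S = 512, 7
h_t = torch.randn(hidden)                 # current target hidden state
h_bar = torch.randn(S, hidden)            # all source hidden states
W_a = torch.randn(hidden, hidden)         # assumed weights for the "general" score
W_cat = torch.randn(hidden, 2 * hidden)   # assumed weights for the "concat" score
v_a = torch.randn(hidden)

def score(h_t, h_bar, variant="general"):
    if variant == "dot":                  # h_t^T h_bar_s
        return h_bar @ h_t
    if variant == "general":              # h_t^T W_a h_bar_s
        return h_bar @ (h_t @ W_a)
    # concat: v_a^T tanh(W_cat [h_t; h_bar_s])
    h_rep = h_t.expand(h_bar.size(0), -1)
    return torch.tanh(torch.cat([h_rep, h_bar], dim=-1) @ W_cat.T) @ v_a

a_t = torch.softmax(score(h_t, h_bar, "general"), dim=-1)  # alignment weights a_t(s)
c_t = a_t @ h_bar                                          # c_t = sum_s a_t(s) h_bar_s
```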

Local Attention

The local attention mechanism focuses only on a small subset of the source positions per target word.

  • For each decoding step t, an aligned position \(p_t\) in the source sequence can be estimated by one of the following methods,

    • Monotonic alignment (local-m): simply set \(p_t = t\) assuming that source and target sequences are roughly monotonically aligned

    • Predictive alignment (local-p): for a given source sequence length S, predict the aligned position as,

$$p_t = S \cdot sigmoid(v_p^T tanh(W_ph_t))$$

  • The context vector \(c_t\) is then derived as a weighted average of the encoder hidden states in the window \([p_t-D, p_t+D]\), where the window size D is chosen empirically

  • Further, to favor alignment points near \(p_t\), a Gaussian distribution centered at \(p_t\) is placed over the window. As a result, the alignment weights are defined as,

$$\begin{align} a_t(s) &= align(h_t, \bar{h}_s) exp\left(-\frac{(s-p_t)^2}{2\sigma^2}\right) \\ \sigma &= \frac{D}{2} \end{align}$$

where s is an integer position within the context window centered at \(p_t\), and \(align(h_t, \bar{h}_s)\) is the same as defined for global attention.
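A minimal sketch of local-p attention for one decoding step, reusing the general score from the global case. The window clipping at the sequence boundaries, the value of D, and all dimensions are illustrative assumptions.

```python
import torch

hidden, S, D = 512, 20, 4
h_t = torch.randn(hidden)                 # current target hidden state
h_bar = torch.randn(S, hidden)            # source hidden states
W_p = torch.randn(hidden, hidden)         # assumed predictive-alignment weights
v_p = torch.randn(hidden)
W_a = torch.randn(hidden, hidden)         # assumed weights for the "general" score

# Predictive alignment: p_t = S * sigmoid(v_p^T tanh(W_p h_t))
p_t = S * torch.sigmoid(v_p @ torch.tanh(W_p @ h_t))

# Restrict attention to the window [p_t - D, p_t + D], clipped to the sequence
lo = max(0, int(p_t.item()) - D)
hi = min(S, int(p_t.item()) + D + 1)
positions = torch.arange(lo, hi, dtype=torch.float32)
window = h_bar[lo:hi]

# align(h_t, h_bar_s) within the window, damped by a Gaussian with sigma = D / 2
align = torch.softmax(window @ (h_t @ W_a), dim=-1)
sigma = D / 2
a_t = align * torch.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))

c_t = a_t @ window                         # local context vector
```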
