Sequence Models and Attention Mechanism

Yichun Zhao

Augment your sequence models using an attention mechanism, an algorithm that helps your model decide where to focus its attention given a sequence of inputs. Then, explore speech recognition and how to deal with audio data.

Learning Objectives


  • Describe a basic sequence-to-sequence model

  • Compare and contrast several different algorithms for language translation

  • Optimize beam search and analyze it for errors

  • Use beam search to identify likely translations

  • Apply BLEU score to machine-translated text

  • Implement an attention model

  • Train a trigger word detection model and make predictions

  • Synthesize and process audio recordings to create train/dev datasets

  • Structure a speech recognition project

Various Sequence To Sequence Architectures

Basic Models

For instance, consider translating a French sentence into English.

Encoder: an RNN, trained on a corpus of French text, reads the French sentence from \(x^{<1>}\) to \(x^{<T_x>}\) and encodes it as an embedding vector.

Decoder: a second RNN takes this embedding vector as its initial state \(a^{<0>}\) and then outputs the English word sequence corresponding to the French one.
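As a rough illustration of the encoder half, the sketch below runs a plain tanh RNN over a sequence of input vectors and hands its final hidden state to the decoder as \(a^{<0>}\). The function name `rnn_encode`, the weight names, the dimensions, and the random inputs are all assumptions made for the example; a real model would use learned word embeddings and trained weights, and often an LSTM or GRU cell.

```python
import numpy as np

def rnn_encode(xs, W_ax, W_aa, b_a):
    """Run a plain tanh RNN over the input vectors and return the final hidden state."""
    a = np.zeros(W_aa.shape[0])          # initial encoder state
    for x in xs:                         # x^{<1>} ... x^{<T_x>}
        a = np.tanh(W_ax @ x + W_aa @ a + b_a)
    return a

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
n_x, n_a, T_x = 4, 3, 5
W_ax = rng.normal(size=(n_a, n_x))
W_aa = rng.normal(size=(n_a, n_a))
b_a = np.zeros(n_a)
xs = rng.normal(size=(T_x, n_x))         # stand-in for an embedded French sentence

encoding = rnn_encode(xs, W_ax, W_aa, b_a)
decoder_a0 = encoding                    # the decoder RNN would start from this as a^{<0>}
print(decoder_a0.shape)
```

Whatever the sentence length, the decoder here receives a vector of the same fixed size.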

Picking Most Likely Sentence

This section compares machine translation with a standard language model and explains why finding the best translation is a search problem. The aim is to find the globally optimal sentence, not the locally best word at each step.

1. Language Model vs. Machine Translation

  • Language Model: The goal is to estimate the probability of a sentence. It's built as a neural network that predicts the next word in a sequence based on the preceding words. This model can be used for tasks like generating new text.

  • Machine Translation Model: This is a more complex version, known as a conditional language model. Its job is to generate a sentence in a target language (e.g., English) conditioned on a sentence in a source language (e.g., French).

The model is structured as an encoder-decoder network:

  • Encoder: This part reads the entire source sentence and compresses its meaning into a single numerical representation called a "context vector."

  • Decoder: This part acts like a language model. It takes the context vector from the encoder as its starting point and then generates the translation word by word, making each prediction based on the source sentence's meaning and the words it has already generated.


2. The Problem with Random Sampling

When a machine translation model generates a sentence, it produces a probability distribution for every possible next word. A naive approach would be to randomly sample words from this distribution.

  • Result: While this might sometimes produce a good translation, it often leads to awkward, ungrammatical, or even incorrect sentences.

  • Conclusion: We don't want a random translation. We want the single most likely translation, the one that maximizes the conditional probability of the target sentence given the source sentence: \(P(\text{target sentence} \mid \text{source sentence})\).
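Written out, this conditional probability factors word by word through the chain rule. The notation is an assumption added here: \(x\) is the source sentence, \(y^{<t>}\) is the \(t\)-th target word, and \(T_y\) is the target length.

\[P(y^{<1>}, \dots, y^{<T_y>} \mid x) = \prod_{t=1}^{T_y} P(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>})\]

The decoder produces one factor of this product at each step, and the translation we are looking for is the sequence that maximizes the whole product:

\[\hat{y} = \arg\max_{y^{<1>}, \dots, y^{<T_y>}} P(y^{<1>}, \dots, y^{<T_y>} \mid x)\]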


To find the most likely translation, you might think of using greedy search: at each step, simply pick the word with the highest probability.

  • Why it Fails: A choice that is locally optimal (the best word right now) might lead to a globally suboptimal sentence. A word that seems like the most probable choice early on could force the model down a path that results in a less natural or correct overall translation.

  • The Big Picture: The total number of possible translations is astronomically large. It's impossible to check every single combination of words to find the absolute best one. For example, a 10-word sentence from a 10,000-word vocabulary has \(10{,}000^{10} = 10^{40}\) possible combinations.
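A toy example can make the failure of greedy search concrete. The probability tables below are invented for illustration (a hypothetical two-word "model"); they are chosen so that the most probable first word leads to a less probable sentence overall.

```python
from itertools import product

# Hypothetical two-step model: P(first word) and P(second word | first word).
P_FIRST = {"a": 0.6, "b": 0.4}
P_SECOND = {"a": {"x": 0.55, "y": 0.45}, "b": {"x": 0.9, "y": 0.1}}

def sentence_prob(first, second):
    """Joint probability of the two-word sentence."""
    return P_FIRST[first] * P_SECOND[first][second]

def greedy():
    """Pick the most probable word at each step, never looking ahead."""
    first = max(P_FIRST, key=P_FIRST.get)
    second = max(P_SECOND[first], key=P_SECOND[first].get)
    return (first, second)

def exhaustive():
    """Score every possible sentence and return the most probable one."""
    return max(product(P_FIRST, ["x", "y"]), key=lambda s: sentence_prob(*s))

greedy_choice = greedy()
best_choice = exhaustive()
print("greedy:    ", greedy_choice, sentence_prob(*greedy_choice))
print("exhaustive:", best_choice, sentence_prob(*best_choice))

# Exhaustive search is only possible here because there are 4 sentences.
# For 10 words over a 10,000-word vocabulary the count is:
print(10_000 ** 10)
```

Greedy commits to "a" because it is the likelier first word, while the best sentence overall starts with "b".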


Since an exhaustive search is not feasible, machine translation models use approximate search algorithms. These algorithms are not guaranteed to find the single best translation, but they do a good enough job to produce high-quality results.

The most common and effective of these algorithms is beam search, which will be the topic of the next lesson. It explores a limited number of the most promising paths at each step to find a high-quality translation without having to explore every possibility.

To avoid exhaustive search, sequence-to-sequence models for tasks such as machine translation and summarization use beam search as the decoding strategy for generating output sequences.

So we don’t search over all combinations; we select among the most probable candidates.

How beam search works:

  • Instead of keeping only one best option (like greedy), it keeps the top-k candidates (the "beam width") at each decoding step.

  • At each step:

    1. Expand each candidate by all possible next tokens.

    2. Compute their cumulative probabilities (or log-likelihood).

    3. Keep the k most probable partial sequences.

  • This continues until an end-of-sequence token is generated or max length is reached.
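One practical reason for scoring with log-likelihoods rather than raw probabilities is numerical: a product of many small probabilities underflows in floating point, while a sum of logs stays well behaved. The snippet below is a small sketch of that effect; the per-token probability of 0.01 and the length of 200 are made-up values.

```python
import math

probs = [0.01] * 200            # hypothetical per-token probabilities of a long sequence

product = 1.0
for p in probs:
    product *= p                # 0.01**200 is too small for float64 and becomes 0.0

log_score = sum(math.log(p) for p in probs)   # the same quantity in log space

print(product)                  # underflowed
print(log_score)                # a finite negative number
```

Because the logarithm is monotonic, ranking candidates by summed log-probability gives the same order as ranking by the product.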

Benefits:

  • Produces better, more coherent sequences than greedy search.

  • Approximates the global optimum sequence (most likely sentence).

  • Beam width controls a trade-off:

    • Small beam → faster, less accurate.

    • Large beam → more accurate, slower (diminishing returns after some point).

In short:
Beam search helps seq2seq models avoid short-sighted greedy decisions and generate higher-quality sequences by keeping multiple possible hypotheses in play.

Beam Search Algorithm

Step by step, keeping the “top 3” candidates (beam width k = 3):

  1. Start:

    • Instead of keeping only 1 best sequence (like greedy), beam search keeps the top k sequences (say k = 3).
  2. Expand:

    • For each of those k sequences, expand by all possible next tokens.

    • Each expanded sequence gets a cumulative probability (score):

      \(\text{new score} = \text{old score} + \log P(\text{next token} \mid \text{sequence so far})\)

  3. Prune:

    • Among all the expanded candidates, keep only the top k (e.g., 3).

    • Drop the rest (these are the “pruned” branches).

  4. Repeat:

    • Continue expanding → scoring → pruning at each step.

    • Stop when <end> token is reached or max length is hit.
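The expand → score → prune loop above can be sketched in a few lines of Python. This is a minimal version under stated assumptions, not a production decoder: `beam_search` and `toy_model` are hypothetical names, the model is an invented lookup table, and there is no length normalization. Finished hypotheses compete for the k slots at the step where they end, which is one of several common variants.

```python
import math

def beam_search(next_log_probs, k=3, max_len=10, end_token="<end>"):
    """Return up to k (tokens, score) pairs, best first.

    next_log_probs(prefix) must return {token: log P(token | prefix)}.
    """
    beams = [((), 0.0)]                      # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand: extend every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        # Prune: keep only the k best-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)                   # hypotheses cut off by max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)[:k]

def toy_model(prefix):
    """Hypothetical model: a fixed table of next-token probabilities."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    probs = table.get(prefix, {"<end>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

best_tokens, best_score = beam_search(toy_model, k=2)[0]
greedy_tokens, _ = beam_search(toy_model, k=1)[0]
print(best_tokens, math.exp(best_score))
print(greedy_tokens)
```

With k = 1 the function reduces to greedy search and commits to "a"; with k = 2 the "b" branch survives the first pruning step and wins once its strong continuation is scored.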


In summary:

  • Greedy search: keep 1 sequence only.

  • Beam search: keep k sequences alive (say 3).

  • Each time, you recompute cumulative probabilities for extensions and pick the top k again.


Attention Model Intuition

Imagine you’re reading a sentence and trying to translate or summarize it.

  • If you just look at one word at a time (like an RNN does), you may forget important earlier words.

  • Humans don’t read like that. When we process a word, we often look back at other relevant words.

👉 That’s what attention does:
It lets the model focus on the most relevant parts of the input sequence when producing each output word.


🧠 Simple Analogy

  • Think of a student writing an essay with notes.

  • For each sentence they write, they don’t memorize everything.

  • Instead, they look back at the notes and pay more attention to the relevant parts.

  • Attention is basically: “Given my current writing context, which notes are most useful?”


⚡ How It Works (Conceptually)

When generating a word in the output:

  1. The model looks at all input words.

  2. Assigns each input word a weight (attention score).

    • Relevant words get high weights.

    • Irrelevant words get low weights.

  3. Combines them into a context vector — a weighted average of input words.

  4. Uses that to predict the next output word.
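These four steps can be mimicked with a few NumPy operations. In this sketch the input vectors, the `query` vector standing in for the decoder's current state, and the use of a dot product as the relevance score are all assumptions chosen to keep the example short; other scoring functions are possible.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Three made-up input word vectors and a made-up decoder "query".
inputs = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
query = np.array([2.0, 0.0])

scores = inputs @ query        # one relevance score per input word (assumed dot-product scoring)
weights = softmax(scores)      # attention weights: positive and summing to 1
context = weights @ inputs     # context vector: weighted average of the input vectors

print(weights)
print(context)
```

The second input word is orthogonal to the query, so it receives the smallest weight and contributes least to the context vector.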

Attention Model

In the original seq2seq with attention (before Transformers came along):


🔹 Without Attention

  • Encoder compresses the whole input sequence into one fixed vector (the “context vector”).

  • Decoder uses that vector to generate outputs step by step.

  • Problem: For long sentences, that single vector can’t capture all the information → model forgets.


🔹 With Attention

We add an Attention Layer between encoder and decoder.

  • The encoder still produces hidden states for each input word (not just one vector).

  • The decoder, at each time step, asks:
    “Which encoder hidden states are most relevant right now?”

  • The attention layer computes a weighted combination of all encoder states (using attention scores).

  • That “context vector” is fed into the decoder to produce the next word.

The attention mechanism can be thought of as an extra layer sitting between encoder and decoder, helping the decoder “look back” at the encoder outputs intelligently.


🔹 What happens at decoder step t (producing \(y^{<t>}\)):

  1. Encoder side:

    • The encoder produced a sequence of hidden states: \(h^{<1>}, h^{<2>}, \dots, h^{<T_x>}\)

    • Each \(h^{<i>}\) contains context information about input token \(x^{<i>}\).

  2. Attention mechanism (the “extra layer”):

    • For the decoder hidden state carried over from the previous step, \(s^{<t-1>}\), we compute alignment scores with each encoder hidden state \(h^{<i>}\): \(e^{<t,i>} = \text{score}(s^{<t-1>}, h^{<i>})\)

    • Apply SoftMax to get attention weights: \(\alpha^{<t,i>} = \frac{\exp(e^{<t,i>})}{\sum_j \exp(e^{<t,j>})}\)

    • These weights tell us how much attention to pay to each encoder hidden state when generating \(y^{<t>}\).

  3. Context vector:

    • Weighted sum of encoder states: \(c^{<t>} = \sum_i \alpha^{<t,i>} h^{<i>}\)

    • This \(c^{<t>}\) is the “dynamic context” for generating \(y^{<t>}\).

  4. Decoder generates output:

    • Decoder combines \(c^{<t>}\) + its previous hidden state \(s^{<t-1>}\) to produce the next token \(y^{<t>}\).

To summarize:

  • The encoder hidden states are the “context information.”

  • The attention weights decide how much each hidden state contributes at step t.

It’s like saying:
“To predict word \(y^{<t>}\), I’ll mostly look at encoder hidden state 5 (weight 0.7), a bit at state 6 (weight 0.2), and slightly at state 2 (weight 0.1).”



How attention weights are computed

The following describes a decoder that differs from the earlier one: at each time step it does not use the prediction from the previous time step. The post-attention LSTM at time t only takes the hidden state 𝑠⟨𝑡⟩ and cell state 𝑐⟨𝑡⟩ as input. The model is designed this way because, unlike language generation (where adjacent characters are highly correlated), there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date, which is the output format in this example.

  • "e" is called the "energies" variable.

  • 𝑠⟨𝑡−1⟩ is the hidden state of the post-attention LSTM.

  • 𝑎⟨𝑡′⟩ is the hidden state of the pre-attention LSTM.

  • 𝑠⟨𝑡−1⟩ and 𝑎⟨𝑡′⟩ are fed into a simple neural network, which learns the function to output 𝑒⟨𝑡,𝑡′⟩.

  • 𝑒⟨𝑡,𝑡′⟩ is then used when computing the attention 𝛼⟨𝑡,𝑡′⟩ that 𝑦⟨𝑡⟩ should pay to 𝑎⟨𝑡′⟩.

In additive attention, the scoring function is essentially a one-layer feed-forward neural network that outputs a scalar score for each encoder hidden state.

At decoder step t:

  1. You have:

    • Decoder hidden state \(s_{t-1}\)

    • Each encoder hidden state \(h_i\)

  2. Compute a score (“energy”) for each \(h_i\): \(e_{t,i} = v_a^\top \tanh(W_a [s_{t-1}; h_i])\)

    • Here \(W_a\) and \(v_a\) are learnable parameters.

    • This is basically a 1-layer MLP producing a scalar score.

  3. Normalize with SoftMax to get attention weights:

    \[\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}\]

  4. Compute the context vector:

    \[c_t = \sum_i \alpha_{t,i} h_i\]
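The four steps translate almost line for line into NumPy. The sketch below assumes the concatenation form \([s_{t-1}; h_i]\) used above, with randomly initialized (untrained) \(W_a\) and \(v_a\); the function name `additive_attention` and all dimensions are hypothetical.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, v_a):
    """One attention step.

    s_prev: (n_s,)            previous decoder hidden state s_{t-1}
    H:      (T_x, n_h)        encoder hidden states h_1 ... h_{T_x}
    W_a:    (n_e, n_s + n_h)  learnable projection
    v_a:    (n_e,)            learnable scoring vector
    """
    T_x = H.shape[0]
    # Build [s_{t-1}; h_i] for every i by repeating s_prev along the time axis.
    concat = np.concatenate([np.tile(s_prev, (T_x, 1)), H], axis=1)
    e = np.tanh(concat @ W_a.T) @ v_a        # energies e_{t,i}, shape (T_x,)
    e = e - e.max()                          # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()      # softmax: attention weights alpha_{t,i}
    c = alpha @ H                            # context vector c_t, shape (n_h,)
    return alpha, c

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T_x, n_s, n_h, n_e = 6, 4, 5, 8
s_prev = rng.normal(size=n_s)
H = rng.normal(size=(T_x, n_h))
W_a = rng.normal(size=(n_e, n_s + n_h))
v_a = rng.normal(size=n_e)

alpha, c = additive_attention(s_prev, H, W_a, v_a)
print(alpha.round(3), alpha.sum())
print(c.shape)
```

In a trained model \(W_a\) and \(v_a\) are learned jointly with the encoder and decoder, so the weights come to reflect which input positions matter for each output step.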

Audio Data Speech Recognition

Speech recognition maps an audio clip \(x\) to its transcript \(y\).

CTC cost for speech recognition

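CTC stands for connectionist temporal classification.

Basic rule: 1) collapse repeated characters that are not separated by the blank character, then 2) remove the blank characters. The blank is a special symbol, distinct from the space character.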
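The collapsing rule is easy to express in code. In this sketch the underscore `_` stands for the CTC blank and the function name `ctc_collapse` is hypothetical. It only shows how a frame-level output string maps to a transcript, not how the CTC cost itself is computed.

```python
from itertools import groupby

def ctc_collapse(path, blank="_"):
    """Collapse a frame-level CTC output into a transcript."""
    merged = [ch for ch, _ in groupby(path)]              # step 1: merge runs of repeated characters
    return "".join(ch for ch in merged if ch != blank)    # step 2: drop the blanks

print(ctc_collapse("ttt_h_eee___ ___qqq__"))
print(ctc_collapse("hh_e_ll_lo"))
```

The second call shows why the blank is needed: the blank between the two runs of "l" keeps them from merging, so a genuine double letter survives the collapse.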

