This article gives an overview of DeepMind's paper "Accelerating Large Language Model Decoding with Speculative Sampling".
Introduction
In Transformer models, sampling is often constrained by memory bandwidth, resulting in the time to generate a tok...