Musings on ML video compression

I recently watched Two Minute Papers' video on MuZero for video compression. The video left out a detail that I think is essential: how is the distance between videos computed? The linked paper answers this: with PSNR.
PSNR stands for Peak Signal-to-Noise Ratio, and is defined as follows in the paper:
$$\operatorname{PSNR} := 10 \log_{10} \left(\frac{255^2}{\operatorname{MSE}(\text{original}, \text{encoded})}\right)$$
where MSE stands for Mean Squared Error, computed over all the (sub)pixels. What the part of the paper that defines PSNR does not make clear is how this is computed for a whole video: is the MSE taken per frame and then aggregated, or over the whole video at once?
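To make the two readings concrete, here is a minimal sketch in Python (assuming 8-bit subpixels stored as NumPy arrays; the toy data and function names are mine, not the paper's):

```python
import numpy as np

def psnr(original: np.ndarray, encoded: np.ndarray) -> float:
    """PSNR for 8-bit intensities (peak value 255), per the formula above."""
    mse = np.mean((original.astype(np.float64) - encoded.astype(np.float64)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)

# Toy video: (frames, height, width, channels) arrays of 8-bit subpixels.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(24, 64, 64, 3), dtype=np.uint8)
encoded = np.clip(original.astype(np.int16) + rng.integers(-3, 4, size=original.shape),
                  0, 255).astype(np.uint8)

# The two readings of "MSE over the video":
whole_video_psnr = psnr(original, encoded)                                   # MSE over every subpixel at once
mean_frame_psnr = np.mean([psnr(o, e) for o, e in zip(original, encoded)])   # average of per-frame PSNRs
```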
In any case, the associated distance metric is very low level: it does not take into account any semantics associated with the pixels beyond intensity. It operates on the representation of the frame/video as a tuple of intensities. It does not even take into account pixel proximity.
The metric is key
This type of metric makes sense for traditional video compression, as it is hard to engineer a higher-level metric that is still robust. However, it seems out of place when used with modern machine learning. Using machine learning itself, one ought to be able to define a better metric.
Once a better metric is bootstrapped using ML, the only reason I see to use PSNR would be to show that an ML approach can beat traditional video compression algorithms even on their own turf. Indeed, using PSNR should favor traditional algorithms, since they have been fine-tuned for it, and it leaves much less room for ML to shine.
It seems to me that, for more advanced, statistics-based video compression, having a better, more semantic distance metric is key. I expect much better ML compression to be possible with such a metric. Instead of a 6.28% reduction in size over state-of-the-art compression, I think that, with a proper distance metric, something like a 10x reduction in video size is achievable in the relatively near future.
Below is a very rough draft of how a better distance metric could be constructed. Disclaimer: I have not looked further into the literature. It could well be that all this is known (and more), and that better metrics are actively being worked on. It is also just a thought. I am aware that getting such a metric to work is no easy feat, and that I probably missed major difficulties.
Better metric proposal outline
The metric below is tailored to videos recorded from the real world. However, the general approach could be adapted to other kinds of videos. It could also be used as part of a more general metric.
Semantic model
Train a model to reconstruct the evolving 3D (parsed) scene from a video, complete with camera (including its orientation, zoom level, …).
Such a model could be (partially) trained on virtual scenes. Parsing can also be learned from real videos by observing which parts of a scene can move independently.
It could go even further by annotating the scenes with estimated physical properties such as forces, accelerations, speeds, deformations, etc.
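As a very rough illustration of what the semantic model's output could look like, here is a hypothetical data-structure sketch; every class, field, and representation choice (meshes, rotation matrices, per-vertex colors) is an assumption on my part:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Camera:
    position: np.ndarray      # (3,) world-space position
    orientation: np.ndarray   # (3, 3) rotation matrix
    zoom: float               # focal length / field-of-view proxy

@dataclass
class SceneObject:
    mesh: np.ndarray          # (V, 3) vertex positions, one plausible geometry representation
    color: np.ndarray         # (V, 3) per-vertex color
    velocity: np.ndarray      # (3,) example of an estimated physical annotation

@dataclass
class SceneFrame:
    camera: Camera
    objects: list[SceneObject]

# An "evolving scene" is then just the per-frame parse of the video.
EvolvingScene = list[SceneFrame]
```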
Scene mapping
Let a scene mapping be a structure-preserving morphism from one evolving scene to another.
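Concretely, one way such a morphism could be represented (purely hypothetical, building on the scene sketch above) is a per-frame pairing of objects, each with the transform relating the pair, plus a warp of the time axis:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectCorrespondence:
    source_index: int        # object index in the source scene's frame
    target_index: int        # object index in the target scene's frame
    transform: np.ndarray    # (4, 4) affine transform carrying the source geometry onto the target

@dataclass
class SceneMapping:
    # One list of correspondences per frame; "structure-preserving" is meant to say that
    # objects keep their identity and relations across frames under the mapping.
    correspondences: list[list[ObjectCorrespondence]]
    time_warp: np.ndarray    # (frames,) source frame index -> target frame time
```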
Scene mapping distortion
Define a distortion function that, given a scene mapping, returns the amount of distortion introduced by the mapping. A distortion function would take into account object distortions, change of relative object placements, color changes, timing changes, etc.
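A toy distortion function in that spirit, operating on the hypothetical structures sketched above: one placeholder term per kind of change, combined with arbitrary weights, and assuming corresponding objects share a vertex count:

```python
import numpy as np

def mapping_distortion(source_scene, target_scene, mapping,
                       w_geom=1.0, w_color=1.0, w_place=1.0, w_time=1.0) -> float:
    geom = color = place = 0.0
    for frame_idx, frame_pairs in enumerate(mapping.correspondences):
        src_objects = source_scene[frame_idx].objects
        tgt_objects = target_scene[frame_idx].objects
        for pair in frame_pairs:
            s, t = src_objects[pair.source_index], tgt_objects[pair.target_index]
            # Object distortion: how far the mapped source mesh lands from the target mesh.
            mapped = s.mesh @ pair.transform[:3, :3].T + pair.transform[:3, 3]
            geom += np.mean(np.linalg.norm(mapped - t.mesh, axis=1))
            # Color change, averaged over vertices.
            color += np.mean(np.abs(s.color - t.color))
            # Placement change: drift of the object centroid between the two scenes.
            place += np.linalg.norm(s.mesh.mean(axis=0) - t.mesh.mean(axis=0))
    # Timing change: deviation of the time warp from the identity.
    timing = np.mean(np.abs(mapping.time_warp - np.arange(len(mapping.time_warp))))
    return w_geom * geom + w_color * color + w_place * place + w_time * timing
```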
Distance as minimal distortion
Train a second model to find scene mappings that minimize distortion. Once that model is trained, it can be used to measure the distance between two evolving scenes: apply it to the two scenes and compute the distortion of the mapping it outputs. That distortion is the distance between the two evolving scenes.
Together with the semantic model, it provides a way to define a metric between two videos. Finally, this metric can be used to train a compression model.
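Putting the pieces together, the video-to-video distance would then look roughly like this, where semantic_model and mapping_model stand for the two trained models (hypothetical names) and mapping_distortion is the sketch from above:

```python
def video_distance(video_a, video_b, semantic_model, mapping_model) -> float:
    # Parse both videos into evolving scenes (camera + objects per frame).
    scene_a = semantic_model(video_a)
    scene_b = semantic_model(video_b)
    # Let the second model propose a (near-)minimal-distortion mapping between the scenes.
    mapping = mapping_model(scene_a, scene_b)
    # The distance between the videos is the distortion of that mapping.
    return mapping_distortion(scene_a, scene_b, mapping)
```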
Adding realism back in
A missing piece in the procedure described so far is that the compressed video, while close to the original, might have some other issues: it may look uncanny, or off somehow. Perhaps some important properties of real videos would not be quite satisfied by the compressed video (e.g. proportions, timing relations).
In order to mitigate such issues, the compression model should additionally be kept in check by an adversarial discriminator, which tries to determine which of the two videos is the original (or it could also operate on the generated scenes).
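As a sketch of how that check could enter training, the compression model's loss might combine the semantic distance with an adversarial term (plus a rate term, since the point is compression); the function, weights, and the discriminator's interface are all placeholders:

```python
import numpy as np

def compression_loss(original_video, reconstructed_video, bitrate,
                     distance_fn, discriminator, w_adv=0.1, w_rate=0.01) -> float:
    # Semantic term: keep the reconstruction close to the original under the proposed metric.
    semantic = distance_fn(original_video, reconstructed_video)
    # Adversarial term: penalize reconstructions the discriminator can tell apart from real video.
    # discriminator(...) is assumed to return the probability that its input is an original video.
    adversarial = -np.log(discriminator(reconstructed_video) + 1e-8)
    # Rate term: reward smaller encodings.
    return semantic + w_adv * adversarial + w_rate * bitrate
```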
Further possibilities and notes
Instead of (only) comparing scenes directly, higher-level semantics could also be compared, if a model is available to further interpret the scene. Such a model could be obtained from a different ML task about scene comprehension.
The semantic model could also return distributions over scenes (e.g. by being generative), rather than a single guess. The rest of the approach would have to be adjusted to work with scene distributions.