BBC

Research & Development

Posted by Issa Khalifeh and Marc Górriz Blanch

The BBC has always pushed boundaries to achieve better video quality for both streaming and broadcasting - one example is the BBC's contribution to the Ultra High Definition (UHD) standard. Many TVs now display broadcasts at 100Hz or more, yet broadcast content is generally recorded at a lower frame rate. Frame interpolation algorithms are deployed in new TVs to ensure that such content is played at the frame rate required.

Read more about this work in our conference paper: I. Khalifeh, M. G. Blanch, E. Izquierdo and M. Mrak, ‘Multi-encoder Network for Parameter Reduction of a Kernel-based Interpolation Architecture’, in NTIRE, CVPR 2022.

Frame interpolation involves the generation of new frames from the existing input frames. The generated frames are then inserted back into the original video to produce a sequence that can be played at a higher frame rate, such as 100fps, or used for slow-motion playback. For example, if a video consists of 100 frames played at 25fps, interpolating by a factor of 4 produces around 400 frames (predicting three new frames between each pair of adjacent input frames). If the interpolated video is played at 25fps, a slow-motion output is obtained; if played at 100fps, the output has the same duration as the original. Interpolation can even be used to replace defective frames in archive footage, much of which suffers from random noise and damage to the film. Interpolating such sequences gives the viewer a better experience and lets us use this type of content in documentaries and other kinds of programmes.
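To make the arithmetic concrete, here is a short Python sketch (the function name and figures are chosen purely for illustration) that computes the number of output frames and the two playback durations for a 4x interpolation of a 100-frame, 25fps clip:

def interpolation_summary(num_frames, source_fps, factor):
    """Illustrative arithmetic for temporal interpolation by an integer factor.

    For a factor of 4, three new frames are predicted between each pair of
    adjacent input frames, so a 100-frame, 25fps clip becomes roughly 400 frames.
    """
    new_per_gap = factor - 1                      # frames inserted between neighbours
    gaps = num_frames - 1
    out_frames = num_frames + gaps * new_per_gap  # ~ num_frames * factor

    return {
        "output_frames": out_frames,
        # Played back at the original rate, the clip becomes slow motion:
        "slow_motion_seconds": out_frames / source_fps,
        # Played back at factor * source_fps, the duration is unchanged:
        "same_duration_seconds": out_frames / (factor * source_fps),
    }

print(interpolation_summary(num_frames=100, source_fps=25, factor=4))
# {'output_frames': 397, 'slow_motion_seconds': 15.88, 'same_duration_seconds': 3.97}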

One problem with many algorithms used by some TVs is that a lot of motion blur is produced in the interpolated frames. When these are integrated into the original sequence, the motion blur detracts from the programme and user experience suffers. Traditionally, the interpolated frame is generated by computing the motion between frames and using this information to warp the input frames. Whilst this approach has worked very well in the past, it struggles with large motion, changes in brightness and occlusions (where a pixel appears in one frame but not the other). Artificial intelligence (AI) interpolation algorithms mitigate these problems because they generalise better: traditional algorithms are rule-based and thus invariant to the scene being interpolated, whereas AI algorithms learn spatial priors that help them handle these cases more gracefully.
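As a rough illustration of the traditional pipeline described above, the Python sketch below uses OpenCV to estimate dense optical flow between two frames and warp the first frame halfway along the flow to approximate the midpoint frame. It is a simplified stand-in for rule-based interpolation in general, not the algorithm used in any particular TV or in our work, and the flow parameters are just typical defaults:

import cv2
import numpy as np

def naive_midpoint_frame(frame0, frame1):
    """Very simplified 'traditional' interpolation: estimate dense optical
    flow from frame0 to frame1, then backward-warp frame0 half-way along
    the flow. Occlusions and brightness changes are ignored, which is
    exactly where this style of method tends to produce artefacts."""
    g0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)

    # Farneback dense optical flow (parameter values are common defaults).
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)

    h, w = g0.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Sample frame0 at positions displaced by half the motion vector
    # (an approximation: the flow is defined on frame0's grid, not the midpoint's).
    map_x = (grid_x - 0.5 * flow[..., 0]).astype(np.float32)
    map_y = (grid_y - 0.5 * flow[..., 1]).astype(np.float32)
    return cv2.remap(frame0, map_x, map_y, interpolation=cv2.INTER_LINEAR)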

Our approach

Although AI algorithms have led to state-of-the-art performance in the field, computational complexity and memory usage remain an issue. We propose a simplification of a popular interpolation algorithm that uses the power of multiple encoders. Instead of using a single deep encoder and decoder, as is usual, we use multiple shallow encoders and a single shallow decoder. We call this network the Parameter Reduced Network for Video Frame Interpolation (PRNet). The architecture of the network is shown below.

Diagram of the network architecture.
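The PyTorch sketch below conveys the multi-encoder idea only; it is not the PRNet implementation. The real network follows the kernel-based AdaCoF formulation (predicting per-pixel kernels, offsets and an occlusion map rather than regressing the frame directly), and all of the layer sizes and names here are illustrative assumptions:

import torch
import torch.nn as nn

class ShallowEncoder(nn.Module):
    """One of several lightweight encoders; each sees the same input pair
    but is free to specialise on different features."""
    def __init__(self, in_ch=6, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class MultiEncoderInterp(nn.Module):
    """Toy multi-encoder / single shallow decoder interpolation network.
    The decoder here regresses the middle frame directly; the kernel-based
    approach instead predicts per-pixel kernels, offsets and an occlusion
    map used to synthesise the frame from the two inputs."""
    def __init__(self, num_encoders=4, width=32):
        super().__init__()
        self.encoders = nn.ModuleList(
            [ShallowEncoder(width=width) for _ in range(num_encoders)])
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(num_encoders * width, width, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width, 3, 4, stride=2, padding=1),
        )

    def forward(self, frame0, frame1):
        x = torch.cat([frame0, frame1], dim=1)              # (B, 6, H, W)
        feats = torch.cat([enc(x) for enc in self.encoders], dim=1)
        return self.decoder(feats)                          # predicted middle frame

# Shape check with a dummy pair of 256x256 RGB frames.
model = MultiEncoderInterp(num_encoders=4)
f0, f1 = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
print(model(f0, f1).shape)  # torch.Size([1, 3, 256, 256])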

The rationale behind this lies in the fact that many convolutional neural networks (CNNs) are over-parameterised, as can be seen from the many works in other fields on pruning, quantisation and distillation, which all aim to reduce model complexity. The proposed network removes the parameter-heavy layers of the baseline network. Removing these layers without replacement loses the intricate details that deep layers produce and causes a significant drop in performance. By using multiple encoders, each encoder can focus on different features within the input images, mitigating the loss of these features. The image below shows how, for a four-encoder model, each encoder learns different features from the inputs. Our approach is orthogonal to other parameter-reduction techniques, which can be applied in conjunction with it.

Encoder comparison images.

The figure above shows a visualisation of the occlusion map for different PRNet architectures, with the subscript denoting how many encoders are used. PRNet4* is the four-encoder PRNet architecture with rotations, whose architecture is shown in the diagram above. Occlusion map generation varies depending on the architecture used; however, PRNet3, PRNet4 and PRNet4* all detect occlusions in the top left and right corners as well as on the pavement. These were not detected by the AdaCoFNet model (the baseline neural network for kernel-based frame interpolation). Better occlusion reasoning can contribute to improved performance.
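For readers unfamiliar with occlusion maps, the snippet below shows how such a map is typically used in kernel- and flow-based interpolation: two candidate middle frames, one synthesised from each input, are blended per pixel, with the map deciding which source to trust where content is hidden in one of the inputs. The variable names are illustrative, not taken from our implementation:

import numpy as np

def blend_with_occlusion(pred_from_frame0, pred_from_frame1, occlusion_map):
    """Per-pixel blend of two candidate middle frames, each of shape (H, W, 3).

    occlusion_map has shape (H, W) with values in [0, 1]: values near 1 take
    the pixel from the frame-0 prediction (e.g. because it is occluded in
    frame 1), values near 0 take it from the frame-1 prediction.
    """
    m = occlusion_map[..., None]  # broadcast over the colour channels
    return m * pred_from_frame0 + (1.0 - m) * pred_from_frame1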

Parameter-reduced frame interpolation using machine learning

You can see the algorithm in action and how it compares to the original, deeper AdaCoFNet model in this video. The sequence is challenging to interpolate due to the presence of brightness changes and jitter. On the left is the original sequence, in the middle the PRNet4* output, and on the right the output of the AdaCoFNet model; the two interpolated outputs contain the additional generated frames. As can be seen, the PRNet4* output is visually very similar to that of the AdaCoFNet model whilst using significantly fewer parameters. This indicates that challenging sequences can be handled well by our proposed model.
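To give a feel for where the parameter saving comes from, the short calculation below counts the weights in plain 3x3 convolution layers and compares a single deep, wide encoder with four shallow, narrow ones. The channel widths are invented for illustration and are not the actual PRNet or AdaCoFNet configurations:

def conv_params(in_ch, out_ch, k=3):
    """Weights (plus biases) in a single k x k convolution layer."""
    return k * k * in_ch * out_ch + out_ch

def encoder_params(channel_plan):
    """Total parameters of a plain convolutional encoder described by a list
    of channel widths, e.g. [6, 32, 64] means a 6->32 conv then a 32->64 conv."""
    return sum(conv_params(c_in, c_out)
               for c_in, c_out in zip(channel_plan[:-1], channel_plan[1:]))

# One deep, wide encoder (channel widths are illustrative only).
deep = encoder_params([6, 64, 128, 256, 512])

# Four shallow, narrow encoders working in parallel.
shallow = 4 * encoder_params([6, 32, 64])

print(f"deep encoder:       {deep:,} parameters")     # 1,552,704
print(f"4 shallow encoders: {shallow:,} parameters")  # 81,024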

What’s next

We will be investigating further optimisations of the network and the transferability of the techniques we used. As the field is rapidly developing, integrating new developments such as transformers into our work could be an exciting new approach.
