
Video compression: Chroma intra-prediction using machine learning

Video compression is an essential part of high quality video streaming. We're exploring how to apply machine learning to the task.

Published: 9 September 2020
  • Marc Górriz Blanch, Graduate R&D Engineer
  • Saverio Blasi, Lead Research Engineer
  • Marta Mrak, Lead R&D Engineer

Video compression has become an essential component of multimedia streaming. The growth of digital entertainment has prompted the development of advanced video coding technologies capable of meeting the increasing demand for higher-quality video content.

At BBC Research & Development, we are exploring how to apply machine learning (ML) to improve video compression techniques, and how to interpret Convolutional Neural Networks (CNNs) to derive simplified and efficient implementations.

The perception of colour is important in many different circumstances. For example, our recent work on automatic colourisation using artificial intelligence (AI) has direct applications for the restoration of archived content. Video coding can also benefit from colour prediction (estimation): better compression rates can be achieved by exploiting the correlations between the luma (brightness of the light) and chroma (colour information) components of video frames.

Interactive Presentation - BBC R&D Showcase 2020

Choose what you would like our Visual Data Analytics team to explain in an interactive experience on trained neural networks and interpretable AI.

What we are doing

Our objective is to improve chroma intra-prediction using ML. Intra-prediction exploits redundancies within a video frame by predicting the content of specific areas using the neighbouring pixels. The size of the bitstream (stream of data) can be reduced, and better compression rates achieved, by transmitting the differences between the prediction and the original frame rather than the original frames in full. A colour frame is usually represented by three components: the luma and two chroma channels. Typically, video coding schemes first process the luma component, then use the compressed luma information plus the neighbouring chroma pixels to compress the chroma components of the desired area.
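As a toy sketch of the residual idea (all pixel values are hypothetical, and a simple DC-style fill stands in for a codec's real prediction modes):

```python
# Neighbouring reconstructed pixels above a 4x4 block (hypothetical values).
above = [60, 62, 64, 66]

# DC-style intra prediction: fill the block with the mean of the neighbours.
dc = sum(above) // len(above)          # = 63
prediction = [[dc] * 4 for _ in range(4)]

# The original block the encoder wants to transmit (hypothetical values).
original = [
    [52, 55, 61, 66],
    [70, 61, 64, 73],
    [63, 59, 55, 90],
    [67, 61, 68, 104],
]

# Only the residual (prediction error) needs to be encoded, which is typically
# much cheaper than the raw pixels.
residual = [[o - p for o, p in zip(orow, prow)]
            for orow, prow in zip(original, prediction)]

# The decoder forms the same prediction and adds the residual back.
reconstructed = [[p + r for p, r in zip(prow, rrow)]
                 for prow, rrow in zip(prediction, residual)]
assert reconstructed == original
```

In a real codec the residual is further transformed and quantised, but the prediction-plus-residual structure is the same.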

Recently, researchers introduced the Cross-Component Linear Model (CCLM), which applies linear regression to predict the chroma from the luma. However, better predictions can be obtained by using more sophisticated ML techniques. Existing models based on CNNs provided significant improvements but with two main drawbacks: the increase of system complexity and the lack of control on which neighbouring pixels are needed to predict a single chroma sample.
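The core of CCLM can be sketched in a few lines: fit a linear model chroma ≈ a·luma + b on the boundary samples by least squares, then apply it to the co-located luma inside the block. The sample values below are hypothetical and chosen to be exactly linear for clarity:

```python
# Hypothetical reconstructed boundary samples (luma/chroma pairs).
boundary_luma   = [50.0, 60.0, 70.0, 80.0]
boundary_chroma = [105.0, 110.0, 115.0, 120.0]

# Least-squares fit of chroma = a * luma + b on the boundary.
n = len(boundary_luma)
mean_l = sum(boundary_luma) / n
mean_c = sum(boundary_chroma) / n
cov = sum((l - mean_l) * (c - mean_c)
          for l, c in zip(boundary_luma, boundary_chroma))
var = sum((l - mean_l) ** 2 for l in boundary_luma)
a = cov / var
b = mean_c - a * mean_l

# Predict chroma inside the block from the co-located luma samples.
block_luma = [55.0, 65.0, 75.0]
predicted_chroma = [a * l + b for l in block_luma]
```

The real CCLM in codecs uses integer arithmetic and picks specific boundary positions, but the luma-to-chroma linear mapping is the same idea.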

We improved the existing approaches by introducing a novel methodology based on attention models, and by simplifying the most complex parts of the prediction network. These mechanisms are trained to decide which neighbouring samples are the best to contribute to the prediction of each chroma pixel. So for each prediction position, our model learns (from all the possible surrounding pixels) to attend to, or focus on, the most informative ones. For example, as shown in the video below, the grey/blue samples in the boundary have more weight on the bottom left area whilst the brown samples contribute more to the top right area.
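A minimal sketch of the attention idea, with hand-picked scores standing in for what the trained network would produce:

```python
import math

# Each prediction position scores every boundary sample; a softmax turns the
# scores into weights that sum to 1, and the prediction is the weighted sum.
def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

boundary_values = [100.0, 120.0, 140.0]   # hypothetical reference samples
scores_for_position = [2.0, 0.5, 0.1]     # this position favours sample 0

weights = softmax(scores_for_position)
prediction = sum(w * v for w, v in zip(weights, boundary_values))

assert abs(sum(weights) - 1.0) < 1e-9
assert weights[0] > weights[1] > weights[2]   # most weight on the most informative sample
```

In the real model the scores themselves are computed by the network from the boundary and luma features, so the weighting adapts to the content.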

Although attention models like these have helped us to gain control over the reference samples and better understand the prediction process, the underlying neural networks are still very complex. So our work also focussed on the simplification of the network architecture to obtain a more compact and explainable model which requires less computational resources (fewer operations) to arrive at the predictions.

This open-source software is now available via the . You can read more about this work in our conference paper (M. G. Blanch, S. Blasi, A. Smeaton, N. E. O’Connor and M. Mrak, "", in Proc. of ICIP 2020) and journal paper (M. G. Blanch, S. Blasi, A. F. Smeaton, N. E. O’Connor and M. Mrak, "", in IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 2, pp. 366-377, Feb. 2021).

Our approach

Going deeper, our attention model is integrated into a hybrid neural network with three processing branches, as shown in the video below. The first branch (cross-component boundary branch) is a fully-connected network (FCN) that processes and encodes the colours on the boundary. We reduce the size of the FCN by using an autoencoder, a well-known deep learning technique that allows efficient data coding and compacts the input information.
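To illustrate only the bottleneck idea (not our trained network), here is a toy encoder/decoder with hand-chosen weights, picked so that a boundary made of two alternating colours survives the round trip exactly; a real autoencoder learns such weights by training:

```python
# Apply a weight matrix W (list of rows) to a vector x.
def apply(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W_enc = [[1.0, 0.0, 0.0, 0.0],    # encoder: 4 boundary values -> 2-value code
         [0.0, 1.0, 0.0, 0.0]]
W_dec = [[1.0, 0.0],              # decoder: 2-value code -> 4 boundary values
         [0.0, 1.0],
         [1.0, 0.0],
         [0.0, 1.0]]

boundary = [5.0, 7.0, 5.0, 7.0]   # hypothetical: two alternating boundary colours
code = apply(W_enc, boundary)      # compact 2-dimensional representation
reconstruction = apply(W_dec, code)

assert len(code) == 2              # the bottleneck halves the data
assert reconstruction == boundary  # exact here because the input is redundant
```

The point is the shape of the computation: redundant boundary information passes through a much smaller code, which is what lets us shrink the fully-connected branch.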

In parallel, the second branch (luma convolutional branch) analyses the spatial patterns in the luma component, aiming to recognise portions of objects contained within the area we wish to compress. The attention model then fuses the information from both branches, transferring the encoded boundary colours produced by the first branch onto the luma patterns extracted by the second. Finally, the combined features are transformed into actual colours by a third convolutional branch (the prediction head). Similar to our approach to interpreting CNNs for video coding, the CNNs in the second and third branches can be simplified by removing the non-linear elements (which transform the outputs of each network layer). This allows us to work out how to compute the output of both branches without performing the numerous convolutions defined by the CNN layers, which significantly reduces the number of parameters of the original network and accelerates prediction by reducing the number of operations.
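The layer-merging idea behind this simplification can be sketched as follows: once the non-linearities are removed, two successive linear layers W2·(W1·x) collapse into a single layer (W2·W1)·x whose weights can be precomputed offline (the weight values here are hypothetical):

```python
# Multiply two matrices (lists of rows).
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Apply a weight matrix to a vector.
def apply(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W1 = [[1.0, 2.0], [0.0, 1.0]]     # hypothetical first-layer weights
W2 = [[1.0, -1.0], [2.0, 0.0]]    # hypothetical second-layer weights
x  = [3.0, 4.0]                   # hypothetical input features

two_layers = apply(W2, apply(W1, x))   # original: run both layers
W_combined = matmul(W2, W1)            # precomputed once, offline
one_layer  = apply(W_combined, x)      # simplified: one pass, same result
assert two_layers == one_layer
```

The same reasoning extends to stacks of convolutions, which is why removing the non-linear elements lets the whole branch be replaced by a much cheaper equivalent computation.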

Update, October 2021:

In addition, we also propose two schemes for spatial information refinement to improve the quality of the predictions: adding a down-sampling branch and adding location maps. A down-sampling filter is learnt, in order to select the most suitable down-sampling luma features for chroma prediction. Moreover, in order to allow the network to predict pixels with different importance levels, we use the position information of the current block and the boundary information to construct a feature map, called location map, which further guides the prediction process.
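One minimal sketch of what a location map could look like (our simplifying assumption for illustration; the actual feature construction is described in the paper) is a per-pixel grid of normalised coordinates, letting the network distinguish pixels near the boundary from pixels far from it:

```python
# Build a per-pixel feature of normalised (row, column) positions in a block.
def location_map(height, width):
    return [[(y / (height - 1), x / (width - 1)) for x in range(width)]
            for y in range(height)]

lmap = location_map(4, 4)

assert lmap[0][0] == (0.0, 0.0)   # top-left: adjacent to both reference boundaries
assert lmap[3][3] == (1.0, 1.0)   # bottom-right: furthest from the boundaries
```

Feeding such positional features alongside the luma and boundary features is what allows the network to weight predictions differently by distance from the reference samples.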

You can read more about the refinement methodology in our conference paper with our partners from the and : Z. Chengyi, S. Wan, T. Ji, M. Mrak, M. G. Blanch, and L. Herranz, "Spatial Information Refinement for Chroma Intra Prediction in Video Coding", in Proc. of APSIPA 2021.

Our research aims to explain what ML is doing so that we can deploy it more reliably and reduce complexity. However, we also need to ensure that the simplifications do not impact the efficiency of the compression. So we have evaluated this along with the encoding and decoding time of both the original hybrid CNN and our simplified approach.

Our tests reveal that our attention model improves compression performance, and that the simplification significantly reduces processing time while retaining the coding benefits of the original attention mechanism. While our work addressed complexity reduction by modifying the network architecture, further simplifications can be obtained during deployment. We aim to look at hardware-aware implementations to integrate our system into future video codec solutions.

This work was carried out within the in collaboration with the and the .
