The original goal of Overmix was to reduce compression noise by averaging frames, and this works quite well. However, video compression is quite complex, and simply averaging pixel values fails to take that into consideration. But since it really is quite complex, I will as a start consider the case where each frame is simply compressed using JPEG. (There actually happens to be a video format called Motion JPEG which does just that.) Real video is usually encoded using motion compensation to reduce redundancy between frames, so this is a major simplification. I am also reusing the synthetic data from the de-mosaic post (which does not contain sub-pixel movement), compressing each frame to JPEG at quality 25.
Before we begin, we need to understand how the images were compressed. There are many good articles about JPEG compression written in an easy-to-understand manner; see for example Wikipedia’s example or the article on WhyDoMath.
As a quick overview: JPEG segments the image into 8×8 blocks which are then processed individually. Each block is transformed into the frequency domain using the Discrete cosine transform (DCT). Informally, this separates details (high frequencies) from the basic structure (low frequencies). The quality is then reduced using a quantisation table, which determines how precisely each coefficient (think of it as a pixel) of the DCT is preserved. For lower quality settings, the quantisation table degrades the high frequencies more. The same table is used for all 8×8 blocks. After the quantisation step, the values are rearranged into a specific pattern and losslessly compressed.
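The DCT-plus-quantisation step can be sketched in a few lines of Python with NumPy. Note that the quantisation table `Q` below is made up purely for illustration; real JPEG encoders use standardised tables scaled by the quality setting. It only mimics the key property: coarser steps (larger values) for higher frequencies.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix: row k holds frequency k.
    m = np.zeros((n, n))
    for k in range(n):
        for x in range(n):
            m[k, x] = np.cos(np.pi * k * (2 * x + 1) / (2 * n))
    m[0, :] *= np.sqrt(1.0 / n)
    m[1:, :] *= np.sqrt(2.0 / n)
    return m

D = dct_matrix()

def dct2(block):
    # 2D DCT: apply the 1D transform to rows and columns.
    return D @ block @ D.T

def idct2(coeffs):
    # Inverse 2D DCT (D is orthonormal, so its inverse is its transpose).
    return D.T @ coeffs @ D

# Toy quantisation table: step size grows toward the high frequencies
# (bottom-right corner), as in real low-quality JPEG tables.
Q = 16 + 8 * np.add.outer(np.arange(8), np.arange(8))

block = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float)
coeffs = dct2(block - 128)            # JPEG level-shifts samples by 128
quantised = np.round(coeffs / Q)      # the lossy step: integers stored in the file
decoded = idct2(quantised * Q) + 128  # what a decoder reconstructs
```

The information lost is exactly the rounding in `quantised`: with a step of 128 for the highest frequency, a whole range of sharp-edge details collapses to the same stored integer.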
Line-art tends to suffer from this compression, since the sharp edge of a line requires precise high-frequency information to reproduce.
The idea for minimizing the JPEG compression artifacts is to estimate an image which best explains our observed compressed JPEG frames. Each frame is offset relative to the others, so the 8×8 blocks will cover different areas of the scene in each frame most of the time.
To make the initial estimate we use the average of each pixel value, like the original ignorant method. Then we try to improve the estimate iteratively. This is done by going through all our JPEG-compressed frames and trying to recover the details lost in quantisation. For each 8×8 block in a compressed frame, we apply the DCT to the equivalent 8×8 area in the estimated image. Then, for each coefficient, we check whether it quantises to the same value as the one stored in the compressed frame. If so, we use the non-quantised coefficient from the estimate to replace the one in the compressed frame. The idea is to slowly rebuild the compressed frames to their former glory, using the accumulated information from the other frames to estimate more precise coefficients which, when compressed with the same settings, would still produce the original compressed frame.
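The per-block update above can be sketched in the coefficient domain. This is Python/NumPy for illustration only, not Overmix's actual code (`refine_block` is a hypothetical helper); it assumes the frame's quantised coefficients and its quantisation table `Q` are already available:

```python
import numpy as np

def refine_block(est_coeffs, frame_quantised, Q):
    """Replace a compressed block's coefficients with the estimate's
    coefficients wherever the two agree after quantisation.

    est_coeffs:      8x8 DCT coefficients of the estimated image area
    frame_quantised: 8x8 quantised coefficients stored in the JPEG frame
    Q:               8x8 quantisation table used by that frame
    """
    # Quantise the estimate's coefficients with the frame's table.
    est_quantised = np.round(est_coeffs / Q)
    # Where the estimate quantises to the same integer the frame stored,
    # the estimate is consistent with the observation, so we keep its
    # full precision.  Elsewhere we fall back to the dequantised frame
    # value, keeping the result faithful to the compressed input.
    consistent = est_quantised == frame_quantised
    return np.where(consistent, est_coeffs, frame_quantised * Q)
```

One iteration of the algorithm applies this to every 8×8 block of every frame, decodes the refined frames, and re-averages them into the next estimate.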
The figure shows a comparison at 2x magnification. From left to right: a single compressed frame, the average of all frames pixel by pixel, the JPEG estimation method just explained after 300 iterations, and the original image the compressed frames were generated from.
The most significant improvement comes from the averaging: most of the obnoxious artifacts are gone and the image appears quite smooth. However, it is quite blurry compared to the original and has a lot of ringing, which is especially noticeable around the hand (second row).
The estimation improves the result a little: the edges are slightly sharper and it brings out some of the details, for example the button in the first row and the black stuff in the third row. However, it still has ringing and, even though that is reduced, the result is a bit noisier, making it less pleasing visually.
The problem here is that the quantisation is quite severe, so there are many possible images which could explain our compressed frames. Our frames simply do not contain enough redundancy to narrow it down. The only way to improve the estimation further is to have prior knowledge about our image.
My quick and dirty approach was to use the noise reduction feature of waifu2x, a learning-based tool which has been specially trained on anime/line-art images. Each compressed frame was put through waifu2x on High noise reduction, and the result was used as the initial estimate for that frame. Afterwards the algorithm was run as before, and the result can be seen below:
From left to right: Single frame after being processed by waifu2x, average of the frames processed by waifu2x, the estimation without using waifu2x (unchanged from the previous comparison), and finally the estimation when using waifu2x.
The sharpness of the new estimate is about the same, but both the ringing and especially the noise have been reduced. It has a bit more noise than the averaged image, but notice that the details in the black stuff (third row) of the averaged image are even worse than in the old average (see the first comparison). This is likely because it is not line-art, so waifu2x does not handle it well. Nevertheless, the estimation still managed to bring those details back: no matter how bad our initial estimate is, the algorithm still forces the resulting image to conform to the input frames.
But two aspects make this a “quick and dirty” addition. Firstly, the prior knowledge is only applied to the initial estimate and ignored in the following iterations, which might be why the result is slightly more noisy. It worked here, but there is no guarantee it will in general. Secondly, waifu2x is trained using presets and ignores how the image was actually compressed.