Using Saliency in progressive JPEG XL images

Tuesday, September 7, 2021

At Google, we are working towards improving the web experience for users. Getting images delivered fast is a crucial part of that experience, and progressive images can help by delivering the salient parts, as detected by machine learning, first. When you look at an image, you don’t take in the entire image at once, but tend to gaze at the most interesting, or “salient”, parts first. When delivering images over the web, it is now possible to organize the data in such a way that the most salient parts arrive first. Ideally you don’t even notice that some less salient parts have not yet arrived, because by the time you look at those parts they have already arrived and been rendered.

We will explain how this works with the new open source image format JPEG XL, but we’ll start by taking a step back and describing how images are currently delivered and rendered on the web.

How partial images are displayed on the web

It’s important that websites that include images load quickly, because waiting for images to load causes frustration. Two techniques in particular are used to make images appear fast. One is showing an approximation of the image before all bytes of the image are transmitted, often known as “progressive image loading.” The other is making the byte size of the image smaller by using strong image compression.

What is progressive image loading?

Some image formats are implemented in a way that does not allow any kind of progressive image loading; all the bytes of the image have to be received before rendering can begin. The next simplest type of image loading is sometimes called “sequential image loading.” For these images, the data is organized so that pixels arrive in a particular order, typically row by row from top to bottom.

Formats with this kind of image loading include PNG, WebP, and JPEG. The JPEG format also allows more sophisticated forms of progressive loading: the data can be organized so that it arrives in multiple scans, with each scan showing more detail than the previous one.

For example, even when only approximately 15% of the data for an image has loaded, a progressive JPEG often already gives a decent result. The following images compare no progressive loading, sequential loading, and progressive JPEG at that point:

100% of bytes loaded, original image

15% of bytes loaded, no progressive image loading

15% of bytes loaded, sequential image loading

15% of bytes loaded, progressive JPEG

In the first scan, the progressive JPEG only has a small amount of information available for the image (e.g. only the average color of each 8x8 block). This is known as the DC-only scan, because the average color of each 8x8 block is called the DC component in the discrete cosine transform (DCT), which is the basis of JPEG image compression. Check out this Computerphile video on JPEG DCT for a basic introduction. Instead of displaying an image that consists of 8x8 blocks, JPEG rendering in Chrome and Firefox chooses to render the preview with some smoothing, to provide a less distracting experience.
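To make the DC-only idea concrete, here is a minimal sketch in plain Python that computes the average of each 8x8 block of a grayscale image and expands it back into the unsmoothed, blocky preview. It is an illustration only and not how any JPEG decoder is implemented: the function names block_averages and blocky_preview are hypothetical, a real DC coefficient is only proportional to the block mean, and real JPEG data is split into separate color channels.

```python
def block_averages(pixels, block=8):
    """Return the mean of each block x block tile of a 2-D grayscale image.

    `pixels` is a list of equal-length rows of numbers. Tiles on the right
    and bottom edge are averaged over the pixels that actually exist.
    """
    height, width = len(pixels), len(pixels[0])
    averages = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            tile = [
                pixels[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            row.append(sum(tile) / len(tile))
        averages.append(row)
    return averages


def blocky_preview(averages, height, width, block=8):
    """Expand block averages back to full size without any smoothing."""
    return [
        [averages[y // block][x // block] for x in range(width)]
        for y in range(height)
    ]


# Toy input: a 16x16 horizontal gradient with values 0..15 in every row.
image = [[x for x in range(16)] for _ in range(16)]
averages = block_averages(image)
preview = blocky_preview(averages, 16, 16)
```

For this toy gradient, averages holds one value per block (3.5 for the left blocks and 11.5 for the right ones), and preview repeats each value over its whole 8x8 block. A browser would smooth this preview rather than show the block edges.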

Progressive JPEG XLs

While the quality (and therefore byte size) of the individual scans in a progressive JPEG image can be controlled, the order within a scan is still top to bottom, as in a sequential JPEG. JPEG XL goes beyond that by making it possible to send the data necessary to display all details of the most salient parts first, followed by the less salient parts. For example, in a portrait, we can decide to first send the bytes for the face, and only then the bytes for the out-of-focus background.

In general, progressive JPEG XL works in the following way:
  • There is always a version of the image available that is downsampled by a factor of 8 in each dimension (similar to a DC-only scan in a progressive JPEG). The decoder can display it with a nice upsampling, which gives the impression of a smoothed version of the image.
  • The image is divided into square groups (typically of size 256 x 256) and it is possible to specify the order of these groups during encoding. In particular, we can order the groups by saliency and choose an order that anticipates where the viewer might look first, without being distracting.
While the format allows for a very flexible order of the groups, our current encoder chooses a starting group and then grows concentric squares around that group. This is because we expect that this will be less distracting to the user. To make successive updates even less noticeable, we smooth the boundary between groups for which all the data has arrived and those that still contain an incomplete approximation.

One requirement of this technique is a good way of identifying where the salient parts of an image are, which is needed when encoding an image. This information is typically represented by a saliency map, which can be visualized as a heatmap image where the more salient parts are redder.

Original image (left) next to its saliency map (right).

Smooth DC-image (left) next to the image with group order (right).
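The "starting group, then concentric squares" ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the encoder's actual code: group_grid and concentric_group_order are hypothetical names, groups are identified by (column, row) coordinates, and the order of groups within one square ring (row-major here) is an arbitrary choice.

```python
def group_grid(width, height, group_size=256):
    """Number of group columns and rows needed to cover the image."""
    cols = (width + group_size - 1) // group_size
    rows = (height + group_size - 1) // group_size
    return cols, rows


def concentric_group_order(cols, rows, start_col, start_row):
    """List all (col, row) groups, growing squares around the start group.

    The ring index of a group is its Chebyshev distance to the start group:
    0 for the start group itself, 1 for its up to 8 neighbours, and so on.
    """
    def ring(group):
        col, row = group
        return max(abs(col - start_col), abs(row - start_row))

    groups = [(col, row) for row in range(rows) for col in range(cols)]
    # sorted() is stable, so groups within one ring stay in row-major order.
    return sorted(groups, key=ring)


# A 1000 x 600 image needs 4 x 3 groups of 256 x 256 pixels.
cols, rows = group_grid(1000, 600)
order = concentric_group_order(cols, rows, start_col=1, start_row=1)
```

In this example the list starts with the group (1, 1), continues with its eight neighbours, and ends with the three groups in the rightmost column, which form the second ring.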

Putting it all together, this is how the loading of the progressive JPEG XL will look: 

(a) JPEG XL image only

(b) JPEG XL image compared to sequential JPEG

(c) JPEG XL image compared with progressive JPEG (with 3 scans)

(d) JPEG XL image compared to no progression (grey image until the end)

How to find good saliency maps for images

Saliency prediction models aim to predict which regions in an image will attract human attention. To predict saliency effectively, our model leverages the power of deep neural networks to consider both high-level semantic signals, such as faces, objects, and shapes, and low- or medium-level signals, such as color, intensity, and texture. The model is trained on a large-scale public gaze/saliency data set, to make sure the predicted saliency best mimics human gaze/fixation behaviour on each image. The model takes an image as input and outputs a saliency map, which can serve as a visual importance map and hence help determine the decoding order for each region in the image. Example images and their predicted saliency are as follows:

Example images and their predicted saliency
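One simple way to turn a saliency map into a starting group for the encoder is to add up the saliency inside each group and pick the group with the largest total. The sketch below assumes exactly that; the post does not state how the encoder derives its starting group from the map, so the rule, the function name most_salient_group, and the representation of the map as a 2-D list of non-negative numbers at image resolution are all assumptions made for illustration.

```python
def most_salient_group(saliency, group_size=256):
    """Return (col, row) of the group with the largest summed saliency.

    `saliency` is a list of equal-length rows, one value per pixel.
    Ties go to the first group in row-major order. Partial groups at the
    right and bottom edge cover fewer pixels, so they tend to score lower.
    """
    height, width = len(saliency), len(saliency[0])
    best_group, best_total = (0, 0), float("-inf")
    for top in range(0, height, group_size):
        for left in range(0, width, group_size):
            total = sum(
                saliency[y][x]
                for y in range(top, min(top + group_size, height))
                for x in range(left, min(left + group_size, width))
            )
            if total > best_total:
                best_group = (left // group_size, top // group_size)
                best_total = total
    return best_group


# Toy 4 x 4 map with 2 x 2 groups; the only salient pixel is at x=3, y=1.
toy_map = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
start = most_salient_group(toy_map, group_size=2)
```

Here start is (1, 0), the top-right group, because it contains the only non-zero saliency value. Other rules, such as using the peak value or the center of mass of the map, would be equally plausible.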

At the time of writing (July 2021), Chrome and Firefox did not yet support decoding JPEG XL images progressively in the way we describe, but the spec does allow encoding arbitrary group orders.

Different users have different experiences when it comes to looking at images loading on the web. We hope that this way of progressively delivering images will improve the user experience, especially on lower-bandwidth connections.

By Moritz Firsching and Junfeng He – Google Research