Predicting visible flicker in temporally changing images

Novel display algorithms such as low-persistence displays, black frame insertion, and temporal resolution multiplexing introduce temporal change into images at 40-180 Hz, on the boundary of the temporal integration of the visual system. This can lead to flicker, a highly objectionable artifact known to induce viewer discomfort. The critical flicker frequency (CFF) alone does not model this phenomenon well, as flicker sensitivity varies with contrast and spatial frequency; a content-aware model is required. In this paper, we introduce a visual model for predicting flicker visibility in temporally changing images. The model performs a multi-scale analysis on the difference between consecutive frames, normalizing values with the spatio-temporal contrast sensitivity function as approximated by the pyramid of visibility. The output of the model is a 2D detection probability map. We ran a subjective flicker marking experiment to fit the model parameters, then analyzed the difference between two display algorithms, black frame insertion and temporal resolution multiplexing, to demonstrate the application of our model.


Introduction
Displays are known to produce temporally changing signals, relying on the limited integration time of the human visual system, which then fuses these signals into a steady image. Well-established technologies, such as digital micromirror devices and projector color wheels, operate at such high frequencies that the change is indeed imperceptible. However, novel algorithms introduce temporal changes at 40-180 Hz, on the boundary of temporal integration of the visual system. This can lead to highly objectionable flicker.
The visibility of flicker is commonly associated with the critical flicker frequency (CFF). However, many other factors, such as the adaptation luminance and the spatial frequency of the stimulus, have a significant impact. Therefore, a content-dependent model is needed to produce accurate predictions. Conventional video quality metrics are designed for lower temporal frequencies, and hence do not model high-frequency flicker detection accurately.
In this paper, we introduce a visual model for predicting flicker visibility in temporally changing images. The model performs a multi-scale analysis of the difference image between consecutive frames of the animation, normalizing contrast values with the spatio-temporal contrast sensitivity function from the pyramid of visibility [1]. To calibrate the model parameters, we conduct a psychophysical experiment, collecting user markings for a wide range of content across different refresh rates.
The rest of the paper is structured as follows: first, we review the relevant publications on flicker sensitivity, temporal display algorithms and quality metrics. Then, we introduce our flicker model, and describe the psychophysical experiment and the parameter fitting process. Finally, we demonstrate the application of our model by comparing flicker in two display algorithms.

Flicker sensitivity
Most artificial light sources do not produce a temporally stable amount of light; their output is known to vary with time [2]. The minimum frequency at which a light is perfectly fused and perceived as steady is the flicker fusion threshold, also known as the critical flicker frequency (CFF). CFF depends on multiple factors: it is known to increase with log-luminance (Ferry-Porter law) and angular size, and to vary with spatial frequency and eccentricity, the most sensitive region being the parafovea [3,2].
CFF is typically measured for stimuli with full-on, full-off cycles; however, the amplitude of luminance change in display algorithms is often small. Hence, flicker visibility is better captured by the spatio-temporal contrast sensitivity function (CSF). The CSF is usually characterized as a function of spatial frequency, temporal frequency, and background luminance. Studies suggest that contrast sensitivity across the retina is homogeneous when the stimulus size increases with the cortical magnification factor [2,4]. On the other hand, Peli et al. argue that the contrast threshold increases exponentially with eccentricity, and that the CSF should be scaled accordingly. In this paper, we assume a worst-case scenario: a free-viewing set-up with flicker attended by the most sensitive part of the retina. Sensitivity values can be interpreted as the inverse of the contrast threshold at which flicker is observed with 50% probability. We use a slightly different definition of sensitivity: the inverse of the contrast threshold at which flicker is observed by 50% of the population. Contrast in this context is commonly defined as Michelson contrast, $c = (L_{max} - L_{min}) / (L_{max} + L_{min})$, where $L_{min}$ and $L_{max}$ are the minimum and maximum luminance values of a periodically changing signal. The pyramid of visibility [1] offers an approximate model of the CSF, matching previous foveal measurements for medium-to-high spatio-temporal frequencies. However, model parameters were not fully consistent when fitted to different datasets. Furthermore, the pyramid of visibility can explain flicker visibility for single spatial frequencies, but not necessarily for complex images. We apply this model in a novel way to multiple spatial frequency bands, refitting the parameters to our proposed application.
Flicker artifacts discussed in this paper are around the detection threshold, and as such, we focus on their detectability. We have not conducted extensive studies on long-term fatigue, and do not consider photosensitive epilepsy [5].

Display algorithms
There are a number of computer graphics and display algorithms that introduce temporal change into images [6,7,8,9,10], also referred to as temporal multiplexing algorithms. Commonly, these modify a few consecutive frames in order to increase perceived resolution or to reduce the computational or transmission cost. Chen et al. [6] adjust pairs of frames to reduce motion blur on a 120 Hz display. Didyk et al. [8] modify up to three frames to boost the apparent resolution of moving images. The latter introduces low-contrast flicker at 40 Hz, well below the CFF; the authors mitigate this with a multi-scale CFF predictor. While our model is conceptually similar, we rely on a multi-scale CSF model, with psychophysical calibration for robustness.
Other algorithms introduce temporal change to reduce the computational cost of image generation. In virtual reality, Asynchronous SpaceWarp (ASW) re-projects every other frame based on image-space motion information [9]; black frame insertion (BFI) assumes that every other frame is black, boosting display luminance to match the target content. More recently, temporal resolution multiplexing (TRM) produces pairs of blurred and sharpened frames that are fused on the retina [10]. In that paper, we argued that TRM introduces less flicker than BFI. We now demonstrate how our flicker model can be used to validate this claim.

Image and video metrics
Image and video metrics can be broadly categorized as quality metrics and visibility metrics. Full-reference quality metrics, such as PSNR, output a single quality value for the entire image, whereas visibility metrics, such as VDP and HDR-VDP [11,12], produce a distortion map, providing spatial information on the probability of detecting artifacts. Some metrics, such as SSIM, produce a distortion map, but only the mean value over the image has been shown to correlate with subjective scores [13,14]; SSIM distortion maps are not calibrated to psychophysical data. Our flicker prediction model is inspired by the visual difference predictors in the sense that it also outputs psychophysically calibrated detection probability maps rather than a single quality value.
Video quality is often evaluated by executing an image quality metric frame-by-frame and pooling quality values over time. Some video quality metrics, such as VQM and MOVIE [15], analyze temporal changes in videos. However, these metrics are designed for film content with frame rates typically below 50 Hz, and lack a robust flicker model for >50 Hz content. Furthermore, many novel display algorithms that we target are designed specifically for real-time computer-generated (CG) content. Video quality metrics are not immediately applicable to CG, as the subjective perception of CG content is unaffected by well-established conventions of the film industry, such as the soap opera effect [16].

Flicker model
Temporal multiplexing algorithms often manipulate pairs of frames. Let us denote two consecutive frames as $F_i$ and $F_{i+1}$; these could be, for instance, the reduced-resolution and sharpened frames of TRM, or a black frame and a luminance-boosted frame of BFI. The proposed visual model predicts whether displaying $F_i$ and $F_{i+1}$ alternately at refresh rate $R$ would result in perceivable flicker. The output is a $P_{det}(x, y)$ map corresponding to the percentage of the population detecting flicker at a pixel $(x, y)$.
Our flicker predictor utilizes a spatio-temporal CSF in a multi-scale model with probability summation along the spatial frequency bands. For an overview of the pipeline, please see Figure 1. The model does not distinguish orientation-sensitive bands, often found in masking models [17]. The model also assumes that eye movements have already been accounted for in image space; i.e., the same pixels on $F_i$ and $F_{i+1}$ will correspond to roughly the same photoreceptors on the retina.
As the source of most flicker is the change in luminance, we do not consider chromaticity here. First, we compute the respective luminance values ($Y_i$ and $Y_{i+1}$) of consecutive frames ($F_i$ and $F_{i+1}$) based on a calibrated display model. Then, to find contrast, we compute the difference image $\Delta(x, y) = |Y_i(x, y) - Y_{i+1}(x, y)|$, where $x$ and $y$ describe pixel location, and $Y_i(x, y)$ is the luminance of pixel $(x, y)$ of the frame $F_i$. The summed luminance of the consecutive frames can be similarly defined as $Y(x, y) = Y_i(x, y) + Y_{i+1}(x, y)$. As flicker sensitivity varies with spatial frequency, the difference image $\Delta(x, y)$ is decomposed into a Laplacian pyramid. Each layer of the pyramid captures half the spatial frequency of the one above, with the bottom layer capturing 2 cycles per visual degree (cpd) or just below; e.g., for a 52 pixel-per-degree image, the mid-points of the spatial frequency bands are $S_l$ = {26, 13, 6.5, 3.25, 1.625} cpd. We use an undecimated pyramid, in which each band has the same resolution.
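The band mid-points quoted above follow from repeated halving. The snippet below is a minimal sketch of that arithmetic only, assuming (as in the 52 pixel-per-degree example) that the top band sits at half the angular resolution and that halving stops at the first band at or below 2 cpd; the function name `band_frequencies` is hypothetical.

```python
def band_frequencies(ppd, lowest_cpd=2.0):
    """Mid-point spatial frequencies (in cpd) of the pyramid bands, highest first.

    Assumption: the top band is taken at ppd / 2 and each following band
    halves the frequency, stopping at the first band at or below lowest_cpd.
    """
    freqs = []
    f = ppd / 2.0
    while True:
        freqs.append(f)
        if f <= lowest_cpd:
            break
        f /= 2.0
    return freqs


# Reproduces the example from the text for a 52 pixel-per-degree image.
print(band_frequencies(52))  # [26.0, 13.0, 6.5, 3.25, 1.625]
```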
In each layer, the Michelson contrast $C_l(x, y)$ can then be computed from the band-limited difference and the summed luminance, where $l$ is the Laplacian pyramid layer. To account for contrast sensitivity, we normalize the contrast at each layer by the spatio-temporal CSF ($\rho$). We use the pyramid of visibility, as it is parametric and has been shown to provide a good fit to previous CSF measurements.
In this model, log sensitivity is a linear function of its arguments, $\log \rho(W, F, Y) = c_0 + c_W W + c_F F + c_L \log Y$, where $W$ and $F$ are the temporal and spatial frequencies as in the original paper, $Y$ is the adapting luminance, and $(c_0, c_W, c_F, c_L)$ are parameters that we keep as free variables in our model. We assume that the mean fused image $0.5\,Y(x, y)$ provides a good estimate of the local adapting luminance. The normalized contrast is then $C'_l(x, y) = C_l(x, y)\,\rho(R/2, S_l, 0.5\,Y(x, y))$, where $S_l$ is the spatial frequency of the layer, and $R$ is the display refresh rate. We use $R/2$ to sample the temporal dimension of the CSF because, when modifying pairs of frames, this is the highest temporal frequency according to the Nyquist limit.
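The normalization step can be illustrated with a short sketch. It assumes the linear log-sensitivity form of the pyramid of visibility with a base-10 logarithm, and treats the per-layer contrast as an already computed input; the function names and the parameter values in the usage lines are hypothetical placeholders, not the fitted values from Table 1.

```python
import math


def pov_sensitivity(temporal_hz, spatial_cpd, luminance, c0, cW, cF, cL):
    """Sensitivity from a linear model of log sensitivity.

    Assumption: log10(sensitivity) = c0 + cW*W + cF*F + cL*log10(luminance),
    with W the temporal frequency (Hz) and F the spatial frequency (cpd).
    """
    log_s = c0 + cW * temporal_hz + cF * spatial_cpd + cL * math.log10(luminance)
    return 10.0 ** log_s


def normalized_contrast(contrast, spatial_cpd, refresh_rate, summed_luminance, params):
    """Scale a layer contrast by the sensitivity sampled at R/2 and 0.5*Y."""
    adapting_luminance = 0.5 * summed_luminance  # mean fused image
    sensitivity = pov_sensitivity(refresh_rate / 2.0, spatial_cpd,
                                  adapting_luminance, *params)
    return contrast * sensitivity


# Placeholder parameters (c0, cW, cF, cL), chosen only for illustration.
params = (2.0, -0.01, -0.05, 0.5)
c_norm = normalized_contrast(0.02, spatial_cpd=6.5, refresh_rate=120.0,
                             summed_luminance=200.0, params=params)
```

A normalized contrast of 1 then corresponds to the detection threshold, so values above 1 indicate flicker that is likely to be seen.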
Next, to transform the normalized contrast into probabilities of detection, we use a Weibull psychometric function, $P_l(x, y) = 1 - \exp(\ln(0.5)\, C'_l(x, y)^{\beta})$, where $\beta$ controls the slope of the psychometric function and is a free parameter in our model. In order to pool the probabilities across all layers, we use probability summation: $P(x, y) = 1 - \prod_l (1 - P_l(x, y))$. Finally, to account for spatial pooling, we convolve the probability map with a small Gaussian filter: $P_{det}(x, y) = (P * G_{\sigma_{sp}})(x, y)$, where $*$ denotes convolution, and $G_{\sigma_{sp}}$ is a Gaussian kernel, with $\sigma_{sp}$ being a free parameter in our model.
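The last two per-pixel steps can be sketched as follows. This is an illustration under assumptions rather than the exact implementation: the Weibull function is written so that a normalized contrast of 1 maps to a detection probability of 0.5, matching the 50% threshold definition of sensitivity used earlier, and the Gaussian spatial pooling is left out for brevity. The function names are hypothetical.

```python
import math


def weibull_detection(c_norm, beta=2.0):
    """Detection probability for a normalized contrast.

    Assumption: P = 1 - exp(ln(0.5) * c^beta), so that c = 1 gives P = 0.5.
    """
    return 1.0 - math.exp(math.log(0.5) * c_norm ** beta)


def probability_summation(p_layers):
    """Probability that flicker is detected in at least one frequency band."""
    p_none = 1.0
    for p in p_layers:
        p_none *= (1.0 - p)
    return 1.0 - p_none


# One pixel with normalized contrasts in five bands (illustrative values).
contrasts = [0.2, 0.6, 1.0, 0.4, 0.1]
p_pixel = probability_summation(weibull_detection(c) for c in contrasts)
```

Because of the summation, a pixel at threshold in a single band already yields a pooled probability of at least 0.5, and sub-threshold bands can only raise it further.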

Flicker marking experiment
To tune the model parameters, ground-truth data is required on flicker perception in complex images. As flicker is often perceived in multiple parts of the image, and location information is crucial for $P_{det}(x, y)$, we designed a marking experiment. Such experiments have been utilized to calibrate similar metrics for image difference predictors [18,19].

Participants
Nineteen participants aged 18-40 with normal or corrected-to-normal vision took part in the experiment. As the frequency of flicker was above 3 Hz, we ensured that no participant reported a history of photosensitive epilepsy. Informed consent was acquired before the beginning of the experiment, which involved briefing participants on the aim, the procedure, and potential risks of the experiment both verbally and in writing. Participants were offered a small financial compensation in the form of gift cards.

Setup
Participants were shown a 512×512 pixel flickering photograph in the center of a G-Sync capable ASUS ROG Swift PG279Q 28" monitor. The viewing distance was fixed at 65 cm, yielding an angular resolution of 52 pixels per degree (ppd). Images hence had a field of view of 9.85°; the rest of the monitor was filled with a gray background of 36 cd/m². Accurate refresh rates were achieved with custom C++/OpenGL software and G-Sync.
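As a quick check of the stated geometry, the field of view follows from dividing the image size by the angular resolution. The helpers below are a minimal sketch that assumes a constant pixels-per-degree value across the image, which is a reasonable approximation for a stimulus this small; the function names are hypothetical.

```python
def field_of_view_deg(size_px, ppd):
    """Angular size of an image side, assuming constant pixels per degree."""
    return size_px / ppd


def degrees_to_pixels(deg, ppd):
    """Convert a visual angle to pixels under the same assumption."""
    return deg * ppd


print(round(field_of_view_deg(512, 52), 2))  # 9.85
```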

Stimuli
Eighteen stimuli were created by flickering twelve color photographs (see Figure 2). The photographs provided a range of content, from primitive zebra stripes to photographs of birds, buildings and people. For each trial, a spatially band-limited flicker was introduced at temporal frequency $R$. This was achieved by displaying a pair of images ($F_i$ and $F_{i+1}$) alternating at $R$ Hz such that $F_i(x, y) = (F_{ref} * g_\sigma)(x, y)$, where $F_{ref}$ denotes the original reference image in gamma-compressed RGB color space (Rec. 709 primaries), $*$ stands for 2D convolution, and $g_\sigma(x, y)$ is an isotropic Gaussian blur kernel with a standard deviation of $\sigma$. Inspired by the temporal resolution multiplexing algorithm [10], $F_{i+1}$ was computed as the sharpened counterpart of $F_i$ using $\xi(\cdot)$, a gain-offset-gamma display model [20] transforming gamma-compressed RGB to linear RGB values. For stimulus generation we used a crude approximation of the display model of the ASUS monitor, $r'(x, y) = 0.99917\, r(x, y)^{2.15} + 0.000825$, where $r$ is the red channel and $r'$ its linearized value. The same formula was applied to all color channels. For an example image pair, see Figure 3. Note that we do not use an accurate display model here, nor do we account for the limited dynamic range and color gamut of the display. This might introduce some spatial artifacts, but this experiment did not attempt to establish overall visual quality, and such artifacts did not impede flicker detection. For a summary of the stimulus images, $\sigma$ and $R$ values, refer to Figure 4. For model calibration, specifically for the first step of conversion to luminance, we measured a more accurate display model with a Specbos 1211 spectroradiometer.
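The crude display model can be written out per channel value as below. The forward function uses the constants quoted in the text. The inverse and the `sharpened_value` helper are assumptions added for illustration: the latter follows the general idea of temporal resolution multiplexing, choosing the second frame so that the linear-light average of the pair matches the reference, and is not necessarily the exact formula used for the stimuli.

```python
GAIN, GAMMA, OFFSET = 0.99917, 2.15, 0.000825  # constants quoted in the text


def gog_forward(v):
    """Gamma-compressed value in [0, 1] -> linear value (crude display model)."""
    return GAIN * v ** GAMMA + OFFSET


def gog_inverse(lin):
    """Linear value -> gamma-compressed value (inverse of gog_forward)."""
    return (max(lin - OFFSET, 0.0) / GAIN) ** (1.0 / GAMMA)


def sharpened_value(ref, blurred):
    """Hypothetical compensation: the linear mean of (blurred, result) equals ref.

    The target is clamped to the range the model can display, so the
    compensation is only approximate near black and white.
    """
    target = 2.0 * gog_forward(ref) - gog_forward(blurred)
    target = min(max(target, gog_forward(0.0)), gog_forward(1.0))
    return gog_inverse(target)


# A pixel that the blur made darker is brightened in the second frame.
second = sharpened_value(ref=0.6, blurred=0.5)
```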

Task
Participants were asked to "mark (or paint) any part of the image where flicker is visible" (quoted from the briefing form). Flickering areas could be marked by holding down the left mouse button and moving the pointer around. Previous markings could be deleted with the right mouse button in a similar fashion. A circular mouse pointer was used, with the diameter adjustable from 0.15° (8 pixels) to 2° (104 pixels) using the mouse wheel. Any marked area was highlighted and immediately stopped flickering. At the beginning of each trial, the mask was cleared. Participants were specifically asked to mark (and hence remove) the strongest flicker first to minimize the effect of masking. During briefing and training it was highlighted that flicker might be more visible in the parafoveal area of vision, and hence looking at objects slightly off-center might reveal more flicker. This was to ensure that all participants utilized the free-viewing setup equally.
Each participant created a marking map for each stimulus three times, yielding 54 trials in total. The order of the trials was randomized.

Results
Figure 4 shows the flicker marking maps averaged over nineteen observers and three repetitions. As expected, flicker visibility decreases with increasing refresh rate and increases with the blur $\sigma$. The markings, however, cannot be considered ground-truth data for two reasons: (1) participants might make mistakes, producing mis-markings, and (2) the finite size of the brush allows for limited precision.
Following the analysis in [18], markings can be considered the output of a stochastic process, where observers attend to a distortion with probability $P_{att}$, and mis-mark a pixel with probability $P_{mis} = 0.01$. Due to the small image size (9.85°×9.85°) and the characteristics of temporal sensitivity, we assumed $P_{att} = 1$ for this experiment.
For each image, each $(x, y)$ pixel in each of the 57 marking maps takes a binary {0, 1} value depending on whether the participant marked it with the mouse. Assuming a detection probability $P_{det}(x, y)$, the data can be modeled as a binomial distribution. Accounting for the mis-markings, the likelihood of observing the collected data given a model follows from this binomial distribution, where $n = 57$ is the number of all collected markings for an image, $k$ is the number of trials in which the pixel is marked as flickering, and $P_{det}(x, y)$ is the predicted detection probability of the model.
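A per-pixel version of this likelihood can be sketched as follows. The binomial form follows the description above, but the way the mis-marking probability enters is an assumption made for illustration: a pixel is taken to be marked if flicker is detected or, failing that, if it is mis-marked. The function name `marking_log_likelihood` is hypothetical.

```python
import math


def marking_log_likelihood(k, n, p_det, p_mis=0.01):
    """Log-likelihood of k markings in n trials for one pixel.

    Assumption: p_mark = p_det + (1 - p_det) * p_mis, with P_att = 1.
    """
    p_mark = p_det + (1.0 - p_det) * p_mis
    p_mark = min(max(p_mark, 1e-12), 1.0 - 1e-12)  # keep the logs finite
    log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                 - math.lgamma(n - k + 1))
    return log_binom + k * math.log(p_mark) + (n - k) * math.log(1.0 - p_mark)


# A pixel marked in 40 of 57 trials is better explained by P_det = 0.7
# than by P_det = 0.1.
better = marking_log_likelihood(40, 57, 0.7)
worse = marking_log_likelihood(40, 57, 0.1)
```

Averaging such values over all pixels and images would give a mean log-likelihood of the kind used as the fitting objective.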

Parameter fitting
We posed the task of finding the best model parameters as a non-linear optimization problem, maximizing the average log-likelihood over the images. However, we observed that the effects of spatial pooling were masked by the finite paint brush size; therefore, we decided to fix this parameter to a value comparable to the brush sizes ($\sigma_{sp}$ = 0.36°). Variable slope values in the psychometric function were also expected to create a range of local minima, hence we selected a single likely candidate, $\beta = 2$. The remaining parameters are those of the pyramid of visibility, which we restricted to physically sensible ranges ($c_W < 0$ for decreasing sensitivity with temporal frequency; $c_F < 0$ for decreasing sensitivity with spatial frequency; $c_L > 0$ for increasing sensitivity with background luminance).
Our results are summarized in Table 1. When refitting the parameters from the original pyramid of visibility, the mean log-likelihood increases, as expected. Parameters show some deviation from the values fitted to the Robson measurements. While $c_0$ is comparable, increasing temporal frequencies attenuate sensitivity faster (lower $c_W$ values), increasing spatial frequencies attenuate sensitivity slower, and luminance amplifies sensitivity faster. Such deviations are to be expected due to the significantly more complex nature of the task presented in the marking experiment.
To analyze the possibility of over-fitting the model parameters to our dataset, we also executed a 3-fold cross-validation. For this, the dataset was randomly split into three 6-element groups. The model was fit to each pair of groups (training), then performance was evaluated on the third group (test). Results in Table 1 indicate that the training and test likelihoods were comparable, and the optimum parameters did not differ significantly from the scenario in which all 18 images were included in the training dataset.
Qualitatively, we observed that model predictions for the experiment stimuli capture the flickering details well. The user-produced and the predicted markings are shown in Figure 4.
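The splitting scheme described above can be sketched in a few lines. This only illustrates how 18 stimuli can be partitioned into three train/test splits; the seed and the function name are hypothetical, and the model fitting itself is not shown.

```python
import random


def three_fold_splits(n_items=18, n_folds=3, seed=0):
    """Randomly split item indices into folds; return (train, test) pairs."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    fold_size = n_items // n_folds
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    splits = []
    for t in range(n_folds):
        test = folds[t]
        # Training set: all folds except the held-out one.
        train = [i for j, fold in enumerate(folds) if j != t for i in fold]
        splits.append((train, test))
    return splits


for train, test in three_fold_splits():
    print(len(train), len(test))  # 12 6 for each of the three splits
```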

Application
To demonstrate the utility of our model, we analyze the amount of flicker introduced by two state-of-the-art display algorithms: temporal resolution multiplexing (TRM) and black frame insertion (BFI). For TRM we assume the worst-case scenario and ignore the residual buffer (for full details, see [10]). For black frame insertion we assume that $F_i$ is completely black, while $F_{i+1}$ is boosted to double the luminance. For representative content we selected three computer-generated images.
Our model can be used to establish the minimum refresh rate at which flicker is no longer perceivable when using either temporal technique. As shown in Table 2, TRM generally requires lower refresh rates; it is unlikely to be perceived as flickery at 90 Hz, while BFI causes minor distortions even at 120 Hz. This is consistent with previous observations [10].
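One way to turn the predictor into a minimum refresh rate is a simple search over candidate rates. The sketch below is hypothetical: it assumes a callable that returns the peak $P_{det}$ of a frame pair at a given refresh rate, and a visibility threshold of 0.5, neither of which is specified in the text. The toy predictor is a stand-in for the full model.

```python
def minimum_flicker_free_rate(max_p_det_at, candidate_rates, threshold=0.5):
    """Lowest candidate rate from which all higher candidates stay below threshold.

    max_p_det_at: callable mapping a refresh rate (Hz) to the peak
    predicted detection probability over the image (assumed interface).
    """
    best = None
    for rate in sorted(candidate_rates, reverse=True):
        if max_p_det_at(rate) < threshold:
            best = rate
        else:
            break
    return best


def toy_predictor(rate):
    """Illustrative stand-in: flicker is visible below 90 Hz only."""
    return 0.9 if rate < 90 else 0.2


print(minimum_flicker_free_rate(toy_predictor, [60, 72, 90, 120, 144]))  # 90
```

Scanning from the highest rate downwards avoids reporting a low rate that happens to pass while a higher one fails.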

Conclusion
We presented a multi-scale visual model for predicting flicker, one of the most objectionable artifacts in display algorithms that introduce temporal change. Specifically, we assumed that such algorithms produce consecutive frames that are fused on the retina. Our model takes spatio-temporal sensitivity into account to predict whether perfect fusion is possible, and outputs a detection probability map showing a 2D image of flicker visibility. Model parameters were fitted to the results of a subjective marking experiment. Our model provided a good fit to the observed data. We demonstrated how the model can be utilized to analyze existing multiplexing algorithms: temporal resolution multiplexing and black frame insertion. The main limitation of the current model is the lack of masking effects, including cross-channel spatial sensitivity masking and motion masking. In future work we wish to collect further data to address these issues.

Figure 1: Overview of the flicker predictor model. The input is a pair of color frames; the result of the model is a 2D probability of detection ($P_{det}$) map.

Figure 2: Pool of reference images used for the flicker marking experiment with a range of content.

Figure 3: Example stimulus pair. The reference frame is low-pass filtered ($F_i$) and sharpened ($F_{i+1}$) to produce band-limited flicker.

Figure 4: Flicker markings (Data) and model predictions ($P_{det}$) overlaid on the 18 reference images. Blue indicates no flicker ($P_{det} \approx 0$), green to orange indicates strong perceivable flicker ($P_{det} \to 1$). Sub-captions state the standard deviation of the Gaussian kernel (in visual degrees), and the refresh rate at which $F_i$ and $F_{i+1}$ were alternated (in Hz).

Table 1: Parameters, training and test fitness (measured as mean log-likelihood). Pyramid of visibility (first row) uses the parameters from the Robson fit from [1]. When fitting to our dataset, the log-likelihood increases (Fit to all). Rows labeled Cross # indicate the results of the 3-fold cross-validation.