As part of my research into optimizing 3D content delivery for dynamic virtual worlds, I needed to compare two screenshots of a rendering of a scene and come up with an
objective measure for how different the images are. For example, here are two images that I want to compare:
As you can see, the image on the left is missing some objects, and its terrain texture is lower resolution. A human can spot these differences at a glance, but I needed a numerical value that quantifies how different the two images are.
This is a problem that has been well studied in the vision and graphics research communities. The most common algorithm is the Structural Similarity (SSIM) index, introduced in the 2004 paper Image Quality Assessment: From Error Visibility to Structural Similarity by Wang et al. The most common earlier metrics were PSNR and RMSE, but those don't take the perceptual difference between the two images into account. Another perceptually based metric comes from the 2001 paper Spatiotemporal Sensitivity and Visual Attention for Efficient Rendering of Dynamic Environments by Yee et al.
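As a quick reference, here is how these metrics are defined for a reference image x and a test image y with N pixels (SSIM is computed over local windows and then averaged across the image):

```latex
\mathrm{RMSE}(x, y) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - y_i)^2}
\qquad
\mathrm{PSNR}(x, y) = 20 \log_{10}\!\left(\frac{\mathrm{MAX}_I}{\mathrm{RMSE}(x, y)}\right)
```

```latex
\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}
                           {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
```

Here MAX_I is the maximum pixel value (255 for 8-bit images), the μ, σ², and σ_xy terms are local means, variances, and covariance, and C1, C2 are small constants that stabilize the division. RMSE and PSNR only look at per-pixel differences, which is why they miss the kinds of structural changes SSIM is designed to capture.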
I wanted to compare a few of these metrics to see how they differ. Zhou Wang, one of the authors of the 2004 SSIM paper, maintains a nice site with information about SSIM and comparisons to other metrics. I decided to start with the distorted images of Einstein he created, labeled Meanshift, Contrast, Impulse, Blur, and JPG:
The comparison programs I chose were pyssim for SSIM, perceptualdiff for the 2001 Yee metric, and the compare command from ImageMagick for RMSE and PSNR. Here are the results I came up with for those five images of Einstein:
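To give a sense of how numbers like these can be collected, here is a minimal sketch that drives the three tools from Python. The file names are hypothetical, and the output parsing (as well as the assumption that pyssim's command-line entry point prints a bare SSIM value) may need adjusting for your versions; the exact invocations I used are in the repository linked at the end of the post.

```python
import re
import subprocess

REF = "einstein.png"  # hypothetical ground-truth file name

def imagemagick_metric(metric, ref, img):
    """Run ImageMagick's `compare -metric RMSE|PSNR`; the value is printed on stderr."""
    proc = subprocess.run(["compare", "-metric", metric, ref, img, "null:"],
                          capture_output=True, text=True)
    # Output looks like "1234.5 (0.0188)" for RMSE or "19.56" for PSNR.
    return float(proc.stderr.split()[0])

def perceptualdiff_pixels(ref, img):
    """Run perceptualdiff and pull out the reported count of differing pixels."""
    proc = subprocess.run(["perceptualdiff", ref, img, "-verbose"],
                          capture_output=True, text=True)
    match = re.search(r"(\d+) pixels are different", proc.stdout + proc.stderr)
    return int(match.group(1)) if match else 0

def ssim_error(ref, img):
    """Run pyssim and invert the score so that, like the other metrics, lower is better."""
    proc = subprocess.run(["pyssim", ref, img], capture_output=True, text=True)
    return 1.0 - float(proc.stdout.split()[0])

for name in ("meanshift", "contrast", "impulse", "blur", "jpg"):
    img = "einstein_%s.png" % name  # hypothetical distorted-image file names
    print(name,
          imagemagick_metric("RMSE", REF, img),
          imagemagick_metric("PSNR", REF, img),
          perceptualdiff_pixels(REF, img),
          ssim_error(REF, img))
```

The inversion in ssim_error mirrors what the graphs below do with SSIM, so that lower always means less error.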
As you can see, PSNR and RMSE are pretty much useless at comparing these images: they treat them all roughly equally. Note that the SSIM values are inverted so that, in both the SSIM and perceptualdiff graphs, lower values mean a lower error rate. The SSIM and perceptualdiff metrics agree on Meanshift and JPG, but rank Contrast, Impulse, and Blur in the reverse order.
To get a better idea of how these algorithms would work for my dataset, I took an example run of my scene rendering. This is a series of screenshots over time, with both the ground-truth (correct) image and an image I want to compare against it. I plotted the error values for all four metrics over time:
Here we see a different story. RMSE and SSIM are almost identical (relative to their scales), and PSNR (inverted) and perceptualdiff also show similar shapes. All of the metrics perform about equally well in this case. You can find the code for generating these two graphs in my image-comparison-comparison repository on GitHub.
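The plotting side of that is not much more than a loop over per-frame values; a minimal sketch with matplotlib, assuming a dict mapping each metric name to its list of per-frame errors (for example, built with the helpers above), might look like this. The actual scripts live in the repository.

```python
import matplotlib.pyplot as plt

def plot_errors_over_time(metrics, outfile="errors_over_time.png"):
    """metrics: dict mapping metric name -> list of error values, one per frame."""
    fig, ax = plt.subplots()
    for name, values in metrics.items():
        # Normalize each series to [0, 1] so differently scaled metrics
        # (pixel counts vs. dB vs. inverted SSIM) can be compared by shape.
        top = max(values) or 1.0
        ax.plot(range(len(values)), [v / top for v in values], label=name)
    ax.set_xlabel("Frame")
    ax.set_ylabel("Normalized error (lower is better)")
    ax.legend()
    fig.savefig(outfile)
```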
I think the moral of the story is to make sure you explore multiple options for your dataset. Your domain might be different from the areas where these algorithms were originally applied. I ended up using perceptualdiff for two reasons: its output value is intuitive (the number of pixels that are perceptually different between the two images), and it supports MPI, so it runs really fast on machines with 16 cores.
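That pixel count also makes it easy to turn the metric into a simple pass/fail check; here is a hypothetical sketch that reuses the perceptualdiff_pixels helper from the earlier snippet with an arbitrary pixel budget:

```python
def frame_is_acceptable(ref, img, budget=500):
    """Hypothetical acceptance test: allow at most `budget` perceptually different pixels."""
    return perceptualdiff_pixels(ref, img) <= budget
```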