As you can clearly see, the image on the left is missing some objects, and the texture for the terrain is a lower resolution. A human could look at this and notice the difference, but I needed a numerical value to measure how different the two images are.
This is a problem that has been well-studied in the vision and graphics research communities. The most common algorithm is the Structural Similarity Image Metric (SSIM), outlined in the 2004 paper Image Quality Assessment: From Error Visibility to Structural Similarity by Wang, et al. The most common previous methods were PNSR and RMSE, but those metrics didn't take into account the perceptual difference between the two images. Another metric that takes into account perception is the 2001 paper Spatiotemporal Sensitivity and Visual Attention for Efﬁcient Rendering of Dynamic Environments by Yee, et al.
I wanted to compare a few of these metrics to try and see how they differ. Zhou Wang from the 2004 SSIM paper maintains a nice site with some information about SSIM and comparisons to other metrics. I decided to start by taking the images of Einstein he created, labeled as Meanshift, Contrast, Impulse, Blur, and JPG:
The comparison programs I chose were pyssim for SSIM, perceptualdiff for the 2001 paper, and the compare command from ImageMagick for RMSE and PNSR. The results I came up with for those five images of Einstein:
As you can see, PNSR and RMSE are pretty much useless at comparing these images. It treats them all equally. Note that the SSIM values are actually inverted so that lower values mean a lower error rate for both the SSIM and perceptualdiff graphs. The SSIM and perceptualdiff metrics seem to agree on Meanshift and JPG, but have a reverse ordering of the error rates for Contrast, Impulse, and Blur.
To get a better idea of how these algorithms would work for my dataset, I took an example run of my scene rendering. This is a series of screenshots over time, with both the ground truth, correct image, and an image I want to compare to it. I plotted the error values for all of the four metrics over time:
Here we see a different story. RMSE and SSIM seem to be almost identical (with respect to their scale) and PNSR (inverted) and perceptualdiff also show similar shapes. All of the metrics seem to compare about equally in this case. You can find the code for generating these two graphs in my repository image-comparison-comparison on GitHub.
I think the moral of the story is make sure you explore multiple options for your dataset. Your domain might be different from other areas where the relevant algorithms were applied. I actually ended up using perceptualdiff because it gives a nice, intuitive output value: the number of pixels perceptually different in the two images, and because it supports MPI, so it runs really fast on machines with 16 cores.