Inception score (IS) and Fréchet inception distance (FID) explained

The images here were produced by a generative model, specifically a denoising diffusion probabilistic model.
How do we measure the quality of images generated by it?
Both Inception score (IS) and Fréchet inception distance (FID) are metrics used to assess the quality of images produced by generative models such as GANs and diffusion models.
Inception Score
A pretrained image classification model, Inception v3, is used to calculate this score. Inception v3 is pretrained on the ImageNet dataset and predicts 1000 classes/labels.
The images produced by the generative model are passed through the Inception v3 network, which outputs, for each image, a probability for every class/label.
What is this metric trying to capture?
- Images generated by our generative model should contain clear, sharp and distinct objects.
- The generative model should output a diverse set of images covering all the different classes in the ImageNet dataset.
In the formula for the score, y is the label predicted by the image classification model and x is an image generated by the generative model.
p(y|x) is the conditional distribution of labels given an image, and p̂(y) is the marginal distribution, computed by averaging the conditional distributions of all generated images.
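For reference, the Inception score is commonly written in terms of these symbols as follows. The number of generated images N and the generator distribution p_g are notation introduced here for illustration:

```latex
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}\big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, \hat{p}(y) \big) \big] \Big),
\qquad
\hat{p}(y) = \frac{1}{N} \sum_{i=1}^{N} p(y \mid x_i)
```

In words: take the KL divergence between each image's label distribution and the marginal, average it over all generated images, and exponentiate.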
When is the inception score maximized?
- Low entropy of the conditional distribution, i.e. the distribution of labels given a generated image. We want it to be close to a one-hot vector. If the image classification model confidently predicts a single label for each image, the image is likely sharp and distinct.
- High entropy of the marginal distribution, i.e. the predictions of the classification model should be spread evenly across all labels. We want the marginal distribution to be close to uniform. This implies that our generative model is producing a diverse set of images rather than many similar ones.
A higher IS indicates better image quality and diversity.
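To make the two conditions concrete, here is a minimal sketch of the score computed from class probabilities, using only the Python standard library. It assumes the per-image probabilities p(y|x) have already been obtained from the classifier; the function name `inception_score` and the toy 4-class inputs are hypothetical, chosen only for illustration.

```python
import math


def inception_score(probs):
    """Compute IS from a list of rows, where each row is p(y|x) for one image."""
    n = len(probs)
    num_classes = len(probs[0])

    # Marginal distribution: average the conditional distributions over all images.
    marginal = [sum(row[k] for row in probs) / n for k in range(num_classes)]

    # Sum of KL(p(y|x) || marginal) over images; terms with p = 0 contribute 0.
    total_kl = 0.0
    for row in probs:
        for p, m in zip(row, marginal):
            if p > 0:
                total_kl += p * math.log(p / m)

    # Exponentiate the mean KL divergence.
    return math.exp(total_kl / n)


# Toy example with 4 classes instead of 1000 (an assumption for readability).
sharp_and_diverse = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
mode_collapsed = [[1, 0, 0, 0]] * 4      # confident, but always the same class
blurry = [[0.25, 0.25, 0.25, 0.25]] * 4  # classifier is unsure about every image

print(inception_score(sharp_and_diverse))  # close to 4, the number of classes
print(inception_score(mode_collapsed))     # close to 1, the minimum
print(inception_score(blurry))             # close to 1, the minimum
```

Only the first case satisfies both conditions at once (one-hot conditionals and a uniform marginal), which is why it reaches the maximum possible score, equal to the number of classes.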
Fréchet inception distance (FID)
Similar to IS, FID also uses a pretrained Inception v3 network.
FID passes not only the generated images but also real images (the ground truth) through the Inception v3 network.
FID fits a Gaussian distribution to the features of the generated images and another to the features of the real images. It then computes the Fréchet distance between these two multivariate Gaussians.
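For two multivariate Gaussians, the Fréchet distance has a closed form. The subscripts r (real) and g (generated) are notation introduced here for illustration:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
+ \operatorname{Tr}\!\Big( \Sigma_r + \Sigma_g - 2\,\big( \Sigma_r \Sigma_g \big)^{1/2} \Big)
```

Here μ and Σ are the mean vector and covariance matrix of the network features for each set of images. The distance is zero when both Gaussians are identical.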
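In practice, FID measures the distance between the two distributions by comparing the mean and covariance of the activations from Inception v3's final pooling layer (a 2048-dimensional feature vector per image). A lower FID indicates that the generated images are statistically closer to the real ones.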
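The computation can be sketched with NumPy as below. This is a minimal illustration, not a reference implementation: it assumes the feature vectors have already been extracted from the network, the function names `frechet_distance` and `fid_from_features` are hypothetical, and the 3-dimensional random features stand in for real 2048-dimensional activations. The trace of the matrix square root is obtained from eigenvalues, whereas common implementations typically compute the matrix square root directly.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of the square roots of the eigenvalues
    # of S1 S2, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    # Drop tiny imaginary parts and negative values caused by rounding.
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


def fid_from_features(real_feats, gen_feats):
    """FID from two feature arrays of shape (num_images, feature_dim)."""
    mu_r = real_feats.mean(axis=0)
    mu_g = gen_feats.mean(axis=0)
    # rowvar=False: each column is a feature dimension, each row an image.
    sigma_r = np.atleast_2d(np.cov(real_feats, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(gen_feats, rowvar=False))
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)


# Toy stand-ins for network features (assumption: 3 dims instead of 2048).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
shifted = real + 2.0  # same spread, mean moved by 2 in each dimension

print(fid_from_features(real, real))     # approximately 0
print(fid_from_features(real, shifted))  # approximately 12 (3 dims * 2^2)
```

In the shifted case the covariances match, so the whole distance comes from the squared difference of the means.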
References