Explainable Photorealistic Image Synthesis Using Diffusion Model With Multi-Scale Feature Fusion
Keywords:
Generative Models, Diffusion Models, Photorealistic Image Generation, Grad-CAM, U-Net Architecture
Abstract
Generative models, which synthesize images from scratch, have recently drawn considerable attention. Diffusion models are especially popular for their stable training process and strong noise-modelling capabilities. However, text-conditional image synthesis remains challenging. In the proposed system, the base U-Net architecture is redesigned with attention and residual blocks to capture complex image features. In addition, a multi-scale feature fusion technique helps handle images of different resolutions. A Diffusion Transformer and cross-modal attention mechanisms enhance the realism and coherence of the generated images. The model generates the photorealistic image from the latent representation using a VAE decoder. Furthermore, Grad-CAM makes the model more interpretable by showing which regions of the image the model attends to during generation. Compared with both a basic diffusion model and GANs, the proposed model shows significantly improved photorealism and detail sharpness, along with better scores on quantitative metrics such as FID and PSNR.
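As a brief illustration of one of the reported metrics, PSNR can be computed from the mean squared error between a reference image and a generated image; a higher PSNR indicates a closer match. The following is a minimal sketch, not the evaluation code used in this work. It assumes 8-bit pixel intensities passed as flat sequences, and the function name `psnr` and the parameter `max_value` are illustrative choices. FID is not shown because it requires features from a pretrained Inception network.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images.

    Both images are given as flat sequences of pixel intensities
    (an assumption made for this sketch); max_value is the largest
    possible intensity, 255 for 8-bit images.
    """
    if len(reference) != len(generated):
        raise ValueError("images must contain the same number of pixels")
    if len(reference) == 0:
        raise ValueError("images must not be empty")

    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

For example, `psnr([10, 10], [11, 9])` has a mean squared error of 1 and evaluates to roughly 48.13 dB.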




