Explainable Photorealistic Image Synthesis Using Diffusion Model With Multi-Scale Feature Fusion

Authors

  • Sonal Fatangare, Department of Computer Engineering, Vishwakarma Institute of Technology (Savitribai Phule Pune University), Pune, Maharashtra, India.
  • Premanand Ghadekar, Department of CSE-AIML, Vishwakarma Institute of Technology (Savitribai Phule Pune University), Pune, Maharashtra, India.

Keywords:

Generative Models, Diffusion Models, Photorealistic Image Generation, Grad-CAM, U-Net Architecture

Abstract

Generative models, which synthesize images from scratch, have recently drawn considerable attention. Diffusion models are especially popular for their training process and their excellent noise-modelling capabilities. However, conditional text-to-image synthesis remains challenging. In the proposed system, the base U-Net architecture is redesigned with attention and residual blocks to capture complex image features. In addition, a multi-scale feature fusion technique helps the model handle images of different resolutions. A Diffusion Transformer and cross-modal attention mechanisms enhance the realism and coherence of the generated images. The model generates the photorealistic image from the latent representation using a VAE decoder. Furthermore, Grad-CAM makes the model more interpretable by showing which portions of the image the model attends to during the generation process. Compared with both a basic diffusion model and GANs, the developed diffusion model improves significantly in photorealism, detail sharpness, and quantitative metrics such as FID and PSNR.
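
The abstract reports PSNR but gives neither the formula nor the evaluation setup. As a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), the function below may help make the metric concrete. The function name `psnr`, the flat-list image representation, and the 8-bit default `max_value=255.0` are assumptions made for illustration and are not taken from the paper.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are flat sequences of pixel intensities. This representation
    is an assumption chosen to keep the sketch dependency-free.
    """
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    if len(reference) == 0:
        raise ValueError("images must not be empty")

    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)

    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded

    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ref = [100, 100, 100, 100]
    gen = [110, 110, 110, 110]
    print(f"PSNR: {psnr(ref, gen):.2f} dB")
```

The two metrics named in the abstract run in opposite directions: a higher PSNR indicates closer agreement with a reference image, whereas a lower FID indicates a closer match between the generated and real image distributions.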

Published

2026-06-01

How to Cite

Fatangare, S., & Ghadekar, P. (2026). Explainable Photorealistic Image Synthesis Using Diffusion Model With Multi-Scale Feature Fusion. International Journal of Artificial Intelligence and Machine Learning, 6(4s), 626–634. Retrieved from https://www.svedbergopen.com/index.php/ijaiml/article/view/494