Multi-Modal Deep Learning Architectures for Integrating Text, Image, and Sensor Data in Intelligent Systems

Authors

  • Neeraj Gupta Department of Computer Engineering & Applications, GLA University, Mathura.
  • V Anantha Lakshmi Assistant Professor, Department of CSE (Artificial Intelligence & Machine Learning), Pragati Engineering College, ADB Road, Surampalem, Near Peddapuram, Kakinada District, Andhra Pradesh, India - 533437.
  • Jeevajothi R Assistant Professor, Department of Management Studies, Meenakshi College of Arts and Science, Meenakshi Academy of Higher Education and Research.
  • Nirmal Keshari Swain Assistant Professor, Department of Information Technology, Vardhaman College of Engineering, Shamshabad, Hyderabad, India - 501 218.
  • Dr. Ravi Thangjam Professor, School of Business, Aditya University, Surampalem, Andhra Pradesh, Pin 533437.
  • Ganesh Korwar Associate Professor, Mechanical Engineering, Vishwakarma Institute of Technology, Pune, Maharashtra, 411037.
  • Mahi Singh School of Sciences, Noida International University, Uttar Pradesh 203201, India.

Keywords:

Multi-modal deep learning, Intelligent systems, Sensor fusion, CNN, Transformer networks, LSTM, Artificial intelligence, Smart systems

Abstract

Rapid advances in artificial intelligence and deep learning have accelerated the development of intelligent systems that can process heterogeneous data across multiple modalities. Traditional learning strategies that rely on a single modality tend to have limited contextual understanding and predictive accuracy because they cannot exploit complementary information from different data sources. This paper proposes a multi-modal deep learning model that integrates text, image, and sensor data in intelligent systems. The proposed framework combines transformer-based learning of textual representations, convolutional neural network (CNN) extraction of visual features, and long short-term memory (LSTM) analysis of temporal sensor data in a unified fusion framework. A hybrid attention-based fusion mechanism is introduced to improve cross-modal representation learning and make fuller use of context during decision-making. Experimental evaluation used benchmark multimodal datasets comprising textual descriptions, image samples, and real-time sensor measurements. The proposed architecture outperformed unimodal and traditional multimodal methods, achieving an accuracy of 96.8, a precision of 96.2, a recall of 95.9, and an F1-score of 96.0. Its generalization capability and robustness across repeated experiments were verified statistically using 10-fold cross-validation. Moreover, the framework showed lower inference latency and computational performance adequate for real-time operation. The proposed system has considerable potential in healthcare monitoring, intelligent surveillance, industrial automation, autonomous systems, and human-machine interaction, where robust integration of heterogeneous data is required.
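
The abstract does not specify how the attention-based fusion step is computed, so the following is only a minimal sketch of one common approach: scaled dot-product attention over per-modality embeddings. It assumes each encoder (transformer, CNN, LSTM) has already produced a fixed-length vector of the same dimension. The function name `attention_fuse`, the `query` vector (standing in for a learned parameter), and all numeric values are hypothetical illustrations, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse modality embeddings with scaled dot-product attention.

    embeddings: dict mapping a modality name to a list of floats.
    query: list of floats of the same length (hypothetical stand-in
           for a learned query vector).
    Returns (weights per modality, fused vector).
    """
    dim = len(query)
    for name, vec in embeddings.items():
        if len(vec) != dim:
            raise ValueError(f"{name} embedding has length {len(vec)}, expected {dim}")

    names = list(embeddings)
    # One relevance score per modality: (query . embedding) / sqrt(dim)
    scores = [
        sum(q * x for q, x in zip(query, embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the modality embeddings.
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy embeddings standing in for encoder outputs (values are made up).
embeddings = {
    "text": [0.2, 0.1, 0.4, 0.3],
    "image": [0.5, 0.0, 0.1, 0.2],
    "sensor": [0.1, 0.3, 0.3, 0.1],
}
query = [1.0, 0.0, 0.5, 0.0]

weights, fused = attention_fuse(embeddings, query)
print(weights)
print(fused)
```

Because the weights come from a softmax, they sum to one, and the fused vector is a convex combination of the modality embeddings. In a trained model the query and the encoders would be learned jointly, and the fused vector would feed a classification head.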

Published

2026-05-12

How to Cite

Gupta, N., Lakshmi, V. A., R, J., Swain, N. K., Thangjam, D. R., Korwar, G., & Singh, M. (2026). Multi-Modal Deep Learning Architectures for Integrating Text, Image, and Sensor Data in Intelligent Systems. International Journal of Artificial Intelligence and Machine Learning, 6(2s), 719–727. Retrieved from https://www.svedbergopen.com/index.php/ijaiml/article/view/253
