Applications of Large Language Models in Multimodal Learning

Peiyang Yu; Xiaochuan Xu; Jiani Wang

doi:10.5281/zenodo.14001455

Authors

Peiyang Yu Carnegie Mellon University
Xiaochuan Xu Carnegie Mellon University
Jiani Wang Stanford University

DOI:

https://doi.org/10.5281/zenodo.14001455

ARK:

https://n2t.net/ark:/40704/JCTAM.v1n4a13

Disciplines:

Computer Sciences

Subjects:

Large Language Models

References:

25

Keywords:

Large Language Models (LLMs), Multimodal Learning, Cross-modal Tasks, Few-shot Learning, Cross-modal Generation

Abstract

This paper provides a systematic review of emerging applications of Large Language Models (LLMs) in multimodal learning, focusing on how these methods improve task performance by integrating modalities such as images, text, and audio. Multimodal learning combines different types of data so that models can learn from multiple sources and generate meaningful outputs; it is widely applied in image captioning, cross-modal retrieval, sentiment analysis, and speech recognition. The paper reviews the main multimodal learning approaches, including feature extraction, modality alignment, and fusion strategies (early fusion, late fusion, and hybrid fusion), as well as the performance of LLMs in cross-modal tasks. It highlights current technical challenges, in particular computational resource consumption, model complexity, and insufficient multimodal fusion. Finally, it offers suggestions for future work on better integrating modalities and on few-shot learning in cross-modal generation models, and discusses ways to make multimodal machine translation systems run faster with fewer distributed computing resources.
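As a rough illustration of the fusion strategies named in the abstract, the following minimal Python sketch contrasts early fusion (concatenating per-modality features before a single model sees them) with late fusion (combining per-modality predictions afterwards). It is not taken from the paper: the function names `early_fusion` and `late_fusion`, the toy feature values, and the equal weights are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def early_fusion(features: Sequence[Vector]) -> Vector:
    """Feature-level fusion: concatenate per-modality vectors into one joint vector."""
    fused: Vector = []
    for vec in features:
        fused.extend(vec)
    return fused


def late_fusion(scores: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Decision-level fusion: weighted average of per-modality class scores."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per modality")
    total = sum(weights)
    n_classes = len(scores[0])
    return [
        sum(w * s[c] for s, w in zip(scores, weights)) / total
        for c in range(n_classes)
    ]


# Hypothetical, made-up feature vectors for a single sample.
image_feat = [0.2, 0.7]
text_feat = [0.9]
audio_feat = [0.1, 0.4]

# Early fusion: a downstream model would consume this single joint vector.
joint = early_fusion([image_feat, text_feat, audio_feat])

# Hypothetical per-modality class scores over two classes.
image_scores = [0.6, 0.4]
text_scores = [0.2, 0.8]

# Late fusion: each modality is scored separately, then the scores are merged.
combined = late_fusion([image_scores, text_scores], weights=[1.0, 1.0])
```

A hybrid scheme would mix the two, for example by concatenating some modalities early and merging the resulting predictions with those of a separately modeled modality.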

Author Biographies

Peiyang Yu, Carnegie Mellon University

Information Networking Institute, Carnegie Mellon University, Pittsburgh, PA, 15213, peiyangy@alumni.cmu.edu.

Xiaochuan Xu, Carnegie Mellon University

Information Networking Institute, Carnegie Mellon University, Pittsburgh, PA, 15213, xiaochux@alumni.cmu.edu.

Jiani Wang, Stanford University

Department of Computer Science, Stanford University, Stanford, CA 94305, jwang.tech@outlook.com.

References

Liu, W., Zhou, L., Zeng, D., Xiao, Y., Cheng, S., Zhang, C., ... & Chen, W. (2024). Beyond Single-Event Extraction: Towards Efficient Document-Level Multi-Event Argument Extraction. arXiv preprint arXiv:2405.01884.

Zhang, Y., Wang, F., Huang, X., Li, X., Liu, S., & Zhang, H. (2024). Optimization and Application of Cloud-based Deep Learning Architecture for Multi-Source Data Prediction. arXiv preprint arXiv:2410.12642.

Gandhi, A., Adhvaryu, K., Poria, S., Cambria, E., & Hussain, A. (2023). Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion, 91, 424-444.

Liang, P. P., Zadeh, A., & Morency, L. P. (2024). Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10), 1-42.

Gandhi, A., Adhvaryu, K., Poria, S., Cambria, E., & Hussain, A. (2023). Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion, 91, 424-444.

Bačić, B., Feng, C., & Li, W. (2024). JY61 IMU sensor external validity: A framework for advanced pedometer algorithm personalisation. ISBS Proceedings Archive, 42(1), 60.

Liu, W., Cheng, S., Zeng, D., & Hong, Q. (2023, July). Enhancing Document-level Event Argument Extraction with Contextual Clues and Role Relevance. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 12908-12922).

Leong, H. Y., Gao, Y. F., Shuai, J., Zhang, Y., & Pamuksuz, U. (2024). Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation. arXiv preprint arXiv:2409.09324.

Rashkin, H., Nikolaev, V., Lamm, M., Aroyo, L., Collins, M., Das, D., ... & Reitter, D. (2023). Measuring attribution in natural language generation models. Computational Linguistics, 49(4), 777-840.

Miah, M. S. U., Kabir, M. M., Sarwar, T. B., Safran, M., Alfarhood, S., & Mridha, M. F. (2024). A multimodal approach to cross-lingual sentiment analysis with ensemble of transformer and LLM. Scientific Reports, 14(1), 9603.

Scherrer, N., Shi, C., Feder, A., & Blei, D. (2024). Evaluating the moral beliefs encoded in LLMs. Advances in Neural Information Processing Systems, 36.

Asaithambi, S. P. R., Venkatraman, R., & Venkatraman, S. (2023). A thematic travel recommendation system using an augmented big data analytical model. Technologies, 11(1), 28.

Ataallah, K., Shen, X., Abdelrahman, E., Sleiman, E., Zhu, D., Ding, J., & Elhoseiny, M. (2024). MiniGPT4-Video: Advancing multimodal LLMs for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413.

Marchisio, K., Ko, W. Y., Bérard, A., Dehaze, T., & Ruder, S. (2024). Understanding and mitigating language confusion in LLMs. arXiv preprint arXiv:2406.20052.

Atrey, K., Singh, B. K., & Bodhey, N. K. (2024). Multimodal classification of breast cancer using feature level fusion of mammogram and ultrasound images in machine learning paradigm. Multimedia Tools and Applications, 83(7), 21347-21368.

Quiles Pérez, M., Martínez Beltrán, E. T., López Bernal, S., Horna Prat, E., Montesano Del Campo, L., Fernández Maimó, L., & Huertas Celdran, A. (2024). Data fusion in neuromarketing: Multimodal analysis of biosignals, lifecycle stages, current advances, datasets, trends, and challenges. Information Fusion, 105, 102231.

Atz, K., Cotos, L., Isert, C., Håkansson, M., Focht, D., Hilleke, M., ... & Schneider, G. (2024). Prospective de novo drug design with deep interactome learning. Nature Communications, 15(1),

Wang, D. (Ed.). (2016). Information Science and Electronic Engineering: Proceedings of the 3rd International Conference of Electronic Engineering and Information Science (ICEEIS 2016), January 4-5, 2016, Harbin, China. CRC Press.

Khemani, B., Patil, S., Kotecha, K., & Tanwar, S. (2024). A review of graph neural networks: concepts, architectures, techniques, challenges, datasets, applications, and future directions. Journal of Big Data, 11(1), 18.

Akkem, Y., Biswas, S. K., & Varanasi, A. (2024). A comprehensive review of synthetic data generation in smart farming by using variational autoencoder and generative adversarial network. Engineering Applications of Artificial Intelligence, 131, 107881.

Ghassemiazghandi, M. (2024). An Evaluation of ChatGPT's Translation Accuracy Using BLEU Score. Theory and Practice in Language Studies, 14(4), 985-994.

Li, X., & Liu, S. (2024). Predicting 30-Day Hospital Readmission in Medicare Patients: Insights from an LSTM Deep Learning Model. medRxiv. doi:10.1101/2024.09.08.24313212

Lu, J. (2024). Optimizing E-Commerce with Multi-Objective Recommendations Using Ensemble Learning.

Liu, T., Wu, Y., Ye, A., Cao, L., & Cao, Y. (2024). Two-stage sparse multi-objective evolutionary algorithm for channel selection optimization in BCIs. Frontiers in Human Neuroscience, 18, 1400077.

Yu, P., Cui, V. Y., & Guan, J. (2021, March). Text classification by using natural language processing. In Journal of Physics: Conference Series (Vol. 1802, No. 4, p. 042010). IOP Publishing.