Cross-Task Multi-Branch Vision Transformer for Facial Expression and Mask Wearing Classification

Armando Zhu; Keqin Li; Tong Wu; Peng Zhao; Bo Hong

doi:10.5281/zenodo.11083875

Authors

Armando Zhu, Carnegie Mellon University
Keqin Li, AMA University
Tong Wu, University of Washington
Peng Zhao, Microsoft
Bo Hong, Northern Arizona University

DOI:

https://doi.org/10.5281/zenodo.11083875

References:

52

Keywords:

Vision Transformer, Facial Expression Recognition, Facial Mask Wearing Classification, Deep Learning

Abstract

With mask wearing becoming a new cultural norm, facial expression recognition (FER) that takes masks into account has become a significant challenge. In this paper, we propose a unified multi-branch vision transformer for the facial expression recognition and mask wearing classification tasks. Our approach extracts shared features for both tasks using a dual-branch architecture that obtains multi-scale feature representations. Furthermore, we propose a cross-task fusion phase that processes the tokens for each task in separate branches while exchanging information through a cross-attention module. Through this simple yet effective cross-task fusion phase, our proposed framework reduces the overall complexity compared with using a separate network for each task. Extensive experiments demonstrate that our proposed model performs better than or on par with state-of-the-art methods on both the facial expression recognition and facial mask wearing classification tasks.
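
To make the cross-task fusion idea concrete, here is a minimal pure-Python sketch of how two task branches could exchange information through cross attention. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify which tokens act as queries, so the sketch assumes that the first (class) token of each branch attends over the tokens of the other branch and is updated residually. All names (`cross_attention`, `cross_task_fusion`, `expr_from_mask`, `mask_from_expr`), the token counts, and the random weights are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context, w_q, w_k, w_v):
    """Scaled dot-product attention: queries from one branch, keys/values from the other."""
    q, k, v = matmul(queries, w_q), matmul(context, w_k), matmul(context, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys to (dim, num_context)
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights


def cross_task_fusion(expr_tokens, mask_tokens, params):
    """Assumed fusion step: each branch's class token (index 0) queries the other branch."""
    expr_update, _ = cross_attention(expr_tokens[:1], mask_tokens, *params["expr_from_mask"])
    mask_update, _ = cross_attention(mask_tokens[:1], expr_tokens, *params["mask_from_expr"])
    # Residual update of the class tokens; the remaining tokens pass through unchanged.
    expr_cls = [a + b for a, b in zip(expr_tokens[0], expr_update[0])]
    mask_cls = [a + b for a, b in zip(mask_tokens[0], mask_update[0])]
    return [expr_cls] + expr_tokens[1:], [mask_cls] + mask_tokens[1:]


def random_matrix(rows, cols, rng, scale=1.0):
    """Hypothetical stand-in for learned weights or extracted token features."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, num_tokens = 4, 5  # illustrative sizes only
expr_tokens = random_matrix(num_tokens, dim, rng)  # expression-branch tokens
mask_tokens = random_matrix(num_tokens, dim, rng)  # mask-branch tokens
params = {
    name: tuple(random_matrix(dim, dim, rng, 0.1) for _ in range(3))
    for name in ("expr_from_mask", "mask_from_expr")
}
expr_out, mask_out = cross_task_fusion(expr_tokens, mask_tokens, params)
print(len(expr_out), len(expr_out[0]))
```

In this sketch each branch keeps its own tokens and only the class token receives information from the other task. That is one way a fusion step can stay cheaper than running two full networks, though the paper's actual module may differ.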

Author Biographies

Armando Zhu, Carnegie Mellon University

Armando Zhu obtained his Master of Science degree in Software Engineering from Carnegie Mellon University. His research interests include computer vision and machine learning.

Keqin Li, AMA University

Affiliation: AMA University, Philippines.

Tong Wu, University of Washington

Affiliation: University of Washington.

Peng Zhao, Microsoft

Affiliation: Microsoft, China.

Bo Hong, Northern Arizona University

Affiliation: Northern Arizona University.

