JOURNAL ARTICLE

Multi-level Multi-modal Feature Fusion for Action Recognition in Videos

Abstract

Several multi-modal feature fusion approaches have been proposed in recent years to improve action recognition in videos. However, these approaches do not take full advantage of the multi-modal information in the videos, since they are either biased towards a single modality or treat modalities separately. To address this problem, we propose Multi-Level Multi-modal feature Fusion (MLMF) for action recognition in videos. MLMF projects each modality into a shared feature space and a modality-specific feature space. Based on the similarity between the two modalities' shared features, we augment the features in the specific feature spaces. As a result, the fused features not only incorporate the unique characteristics of each modality, but also explicitly emphasize the similarities between modalities. Moreover, because a video's action segments differ in length, the model must ensemble features at different levels for fine-grained action recognition. The optimal multi-level unified action feature representation is obtained by aggregating features across levels. Our approach is evaluated on the EPIC-KITCHENS-100 dataset and achieves encouraging action recognition results.
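The shared/specific fusion idea in the abstract can be sketched in plain Python. This is only an illustrative reconstruction, not the paper's method: the projection dimensions, the random linear projections, and the `(1 + similarity)` gating of the specific features are all assumptions introduced here for illustration.

```python
import math
import random

random.seed(0)

def linear(x, W):
    # Project vector x with weight matrix W (one row per output dim).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def rand_mat(rows, cols):
    # Stand-in for learned projection weights (assumption: random here).
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def mlmf_fuse(feat_a, feat_b, dim=4):
    """Hypothetical sketch of shared/specific fusion for two modalities."""
    # Shared projections: both modalities mapped into a common space.
    shared_a = linear(feat_a, rand_mat(dim, len(feat_a)))
    shared_b = linear(feat_b, rand_mat(dim, len(feat_b)))
    # Specific projections: modality-private spaces.
    spec_a = linear(feat_a, rand_mat(dim, len(feat_a)))
    spec_b = linear(feat_b, rand_mat(dim, len(feat_b)))
    # Similarity of the shared features gates (augments) the specific features.
    sim = cosine(shared_a, shared_b)
    fused = [s * (1 + sim) for s in spec_a] + [s * (1 + sim) for s in spec_b]
    return fused, sim

# Toy features for two modalities (e.g., RGB and audio) of different sizes.
fused, sim = mlmf_fuse([0.2, 0.5, 0.1], [0.7, 0.3, 0.9, 0.4])
print(len(fused))  # 8: concatenation of the two gated 4-dim specific features
```

In a real model the projections would be learned layers and the gating applied per level before the multi-level aggregation the abstract describes; the sketch only shows the data flow.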

Keywords:
Multi-modality, Feature fusion, Action recognition, Pattern recognition, Feature vector, Feature representation, Computer science, Artificial intelligence

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.12
Refs: 18
Citation Normalized Percentile: 0.39


Topics

Human Pose and Action Recognition
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Gait Recognition and Analysis
Physical Sciences → Engineering → Biomedical Engineering
Diabetic Foot Ulcer Assessment and Management
Health Sciences → Medicine → Endocrinology, Diabetes and Metabolism

Related Documents

BOOK-CHAPTER

Multi-level Fusion for Multi-modal Human Action Recognition

Ziliang Gan, Lei Jin, Xiaojuan Wang

Book series: Lecture Notes in Electrical Engineering · Year: 2025 · Pages: 132-142
JOURNAL ARTICLE

Human Action Recognition Based On Multi-level Feature Fusion

Yueshen Xu, Guang-can Xiao, Xiaofen Tang

Journal: Advances in Computer Science Research · Year: 2015
JOURNAL ARTICLE

Enhanced multi-modal emotion recognition using the feature level fusion

Aziguli Wulamu, Yuheng Wu, Xin Liu, Yao Zhang, Jinghan Xu, Yang Zhang

Journal: Engineering Applications of Artificial Intelligence · Year: 2025 · Vol: 162 · Pages: 112447
JOURNAL ARTICLE

Semantic2Graph: graph-based multi-modal feature fusion for action segmentation in videos

Junbin Zhang, Pei-Hsuan Tsai, Meng-Hsun Tsai

Journal: Applied Intelligence · Year: 2024 · Vol: 54 (2) · Pages: 2084-2099