JOURNAL ARTICLE

6-DoF Grasp Detection Method Based on Vision Language Guidance

Xixing Li, Jiahao Chen, Rui Wu, Tao Liu

Year: 2025   Journal: Processes, Vol. 13(5), Pages: 1598   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Interactive robotic grasping allows a robot to grasp a specific object selected by the user. Most deep-learning-based interactive grasping methods combine a vision-language model with a grasp detection model. However, in existing methods the vision-language model is hard to train and generalizes poorly, and the robot copes badly with small target objects. This paper therefore proposes a vision-language-guided 6-DoF grasp detection method that takes a text instruction and an RGB-D image of the scene as input and outputs the 6-DoF grasp pose of the object referred to by the instruction. To improve the trainability and feature-extraction ability of the vision-language model, a multi-head attention mechanism combined with hybrid normalization is designed. A local attention mechanism is also introduced into the grasp detection model to strengthen the interaction between global and local information in the point cloud data, thereby improving the grasp detection model's ability to grasp small target objects. The proposed method first uses the improved vision-language model to predict the 2D image position of the target object, then uses the improved grasp detection model to predict all graspable poses in the scene, and finally uses the 2D position information to filter out the grasp poses belonging to the target object. The proposed vision-language model and grasp detection model achieve strong performance across scenarios on public datasets while retaining generalization ability. In addition, real-world grasping experiments were conducted, in which the proposed vision-language-guided 6-DoF grasp detection method achieved a grasp success rate of 95%.
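The final filtering step described in the abstract — keeping only the scene-wide grasp candidates that belong to the object localized by the vision-language model — can be sketched as projecting each grasp's 3D center into the image and testing it against the predicted 2D box. This is a minimal illustration, not the paper's implementation; the function names, the box format (x1, y1, x2, y2), and the pinhole-intrinsics matrix K are all assumptions for the sketch.

```python
import numpy as np

def project_points(points_3d, K):
    """Project Nx3 camera-frame points to pixel coordinates with intrinsics K.
    Hypothetical helper: assumes a simple pinhole model, z > 0, no distortion."""
    uvw = points_3d @ K.T          # homogeneous pixel coords (u*z, v*z, z)
    return uvw[:, :2] / uvw[:, 2:3]

def filter_grasps_by_bbox(grasp_centers_3d, bbox, K):
    """Return a boolean mask selecting grasps whose projected center lies
    inside the 2D box (x1, y1, x2, y2) predicted by the vision-language model."""
    uv = project_points(grasp_centers_3d, K)
    x1, y1, x2, y2 = bbox
    return (
        (uv[:, 0] >= x1) & (uv[:, 0] <= x2)
        & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    )

# Example: two grasp candidates, one on the target object, one elsewhere.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
centers = np.array([[0.0, 0.0, 1.0],    # projects to (320, 240) — inside box
                    [0.5, 0.0, 1.0]])   # projects to (570, 240) — outside box
mask = filter_grasps_by_bbox(centers, (300, 220, 340, 260), K)
```

In the full pipeline, the surviving grasp poses (6-DoF position plus orientation) would then be ranked by their detection scores and the best one executed.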

Keywords:
Grasp; Computer vision; Computer science; Artificial intelligence; Human–computer interaction; Programming language

Metrics

Cited By: 0
FWCI (Field-Weighted Citation Impact): 0.00
References: 38
Citation Normalized Percentile: 0.17

Topics

Robot Manipulation and Learning (Physical Sciences → Engineering → Control and Systems Engineering)
Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Hand Gesture Recognition Systems (Physical Sciences → Computer Science → Human-Computer Interaction)