JOURNAL ARTICLE

Grounded Instruction Understanding with Large Language Models: Toward Trustworthy Human-Robot Interaction

Abstract

Understanding natural language as a representational bridge between perception and action is critical for deploying autonomous robots in complex, high-risk environments. This work investigates how large language models (LLMs) can support this bridge by interpreting unconstrained human instructions in urban disaster response scenarios. Leveraging the SCOUT corpus, a multimodal dataset capturing human-robot dialogue through Wizard-of-Oz experiments, we construct SCOUT++, aligning over 11,000 visual frames with language commands and robot actions. We evaluate three instruction classification approaches: a neural network trained on tokenized text, GPT-4 using text alone, and GPT-4 with synchronized visual input. Results show that while GPT-4 (text-only) outperforms traditional models in accuracy, its multimodal variant exhibits degraded performance, often producing vague or hallucinated outputs. These findings expose the challenges of reliably grounding language in visual context and raise questions about the trustworthiness of foundation models in safety-critical settings. We contribute SCOUT++, a reproducible multimodal pipeline, and benchmark results that shed light on the capabilities and current limitations of vision-language models for risk-sensitive human-robot interaction.
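For readers who want a concrete picture of the comparison described above, the following is a minimal, hypothetical sketch of how the text-only and multimodal GPT-4 conditions might be queried through the OpenAI Python SDK (>=1.0). The label set, prompt wording, model identifiers, and file paths are illustrative assumptions, not the SCOUT++ schema or the authors' actual prompts.

    # Hypothetical sketch: comparing text-only vs. multimodal instruction
    # classification with an LLM API. Labels, prompts, and paths are
    # placeholders, not the SCOUT++ specification.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Placeholder label set for instruction classification (hypothetical).
    LABELS = ["move", "turn", "stop", "describe-scene", "other"]
    SYSTEM_PROMPT = (
        f"Classify the robot instruction into one of: {', '.join(LABELS)}. "
        "Answer with the label only."
    )

    def classify_text_only(instruction: str) -> str:
        """Classify an operator instruction from the text alone."""
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": instruction},
            ],
        )
        return resp.choices[0].message.content.strip()

    def classify_with_frame(instruction: str, frame_path: str) -> str:
        """Classify the same instruction with a synchronized camera frame attached."""
        with open(frame_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        resp = client.chat.completions.create(
            model="gpt-4o",  # any vision-capable GPT-4 variant (assumed)
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": [
                    {"type": "text", "text": instruction},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ]},
            ],
        )
        return resp.choices[0].message.content.strip()

Under this sketch, both variants return a single label per instruction, so their outputs can be scored against the same gold annotations to quantify any degradation in the multimodal condition.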

Keywords:
Hallucination, Bridge, Context, Perception, Natural language, Construct, Trustworthiness, Action, Robot

Topics

Multimodal Machine Learning Applications
Social Robot Interaction and HRI
Domain Adaptation and Few-Shot Learning