Language grounding in deep reinforcement learning for dynamic goal-oriented robotics

Röder, Frank

doi:https://doi.org/10.15480/882.17098

Citation Link: https://doi.org/10.15480/882.17098

Publication Type

Doctoral Thesis

Date Issued

2026

Language

English

Author(s)

Röder, Frank

Advisor

Ay, Nihat

Referee

Murena, Pierre-Alexandre

Title Granting Institution

Technische Universität Hamburg

Place of Title Granting Institution

Hamburg

Examination Date

2026-02-10

Institute

Data Science Foundations E-21

TORE-DOI

10.15480/882.17098

TORE-URI

https://hdl.handle.net/11420/63067

Citation

Technische Universität Hamburg (2026)

Researchers have long attempted to teach robots and other embodied artificial agents to follow instructions, approaching language as the primary medium for communication, knowledge transfer, and cognition. While toddlers excel at acquiring language and using it for problem-solving, robots and voice-based assistants struggle to achieve a grounded and robust understanding of natural language due to conversational noise, such as disfluencies and polysemy. This thesis investigates the limitations in language grounding that currently hinder intelligent agents from comprehending and executing linguistic goals, and from revising misinterpretations that arise from underspecified or ambiguous instructions. We use a sparse-reward, language-conditioned reinforcement learning setup and draw on insights from cognitive science and developmental psychology, organized into two pillars.
The first pillar explores linguistic feedback and egocentric speech as mechanisms for learning from unsuccessful outcomes. For linguistic feedback, we implement a synthetic caretaker that provides feedback when the agent deviates from the expected course of actions. Such unintended deviations can still be useful: the outcome may satisfy a different objective and can therefore serve as an alternative goal specification. For instance, a robot assigned to prepare a cup of tea might brew coffee instead, thereby accomplishing a different, unintended goal. For egocentric speech, our research focuses on developing a multimodal translation model designed to generate appropriate goal specifications from observed behaviors. The model retrospectively predicts goal commands that match the observed actions, and these commands are then used for learning in hindsight. Both approaches, linguistic feedback and egocentric speech, aim to emulate aspects of language development in young children and significantly enhance sample efficiency in robotic reinforcement learning.
The second pillar addresses the challenge of action correction, specifically targeting erroneous behaviors that stem from misinterpretations of goal specifications. We identify three distinct categories of misunderstanding: ambiguities arising from underspecified statements, unintentional miscommunications (e.g., erroneously conveyed intentions), and discrepancies in common ground between the instructor and the robotic agent. Instead of learning with a different goal specification in hindsight, as in the first pillar, we aim to correct the misunderstanding through further verbal input from the operator. This poses an additional challenge for the agent, which needs to reconsider the original language goal given the new context and the returned action correction. By implementing a novel approach that incorporates uncertainty about the actual goal and builds on our methods from the first pillar, we demonstrate that egocentric speech significantly improves learning by generating action corrections in hindsight. We highlight this context-sensitive hindsight approach as the first in this domain to enhance the resolution of misunderstandings.