TUHH Open Research

Surgical instrument-tissue interaction recognition with multi-task-attention video transformer

Citation Link: https://doi.org/10.15480/882.16144
Publication Type
Journal Article
Date Issued
2025-11-11
Language
English
Author(s)
Maack, Lennart  
Medizintechnische und Intelligente Systeme E-1  
Cam, Berk
Latus, Sarah
Medizintechnische und Intelligente Systeme E-1  
Maurer, Tobias  
Schlaefer, Alexander  
Medizintechnische und Intelligente Systeme E-1  
TORE-DOI
10.15480/882.16144
TORE-URI
https://hdl.handle.net/11420/58746
Journal
International Journal of Computer Assisted Radiology and Surgery
Citation
International Journal of Computer Assisted Radiology and Surgery (in press), 2025
Publisher DOI
10.1007/s11548-025-03546-3
Scopus ID
2-s2.0-105021418856
Publisher
Springer Science and Business Media LLC
Purpose The recognition of surgical instrument-tissue interactions can enhance surgical workflow analysis, improve automated safety systems, and enable skill assessment in minimally invasive surgery. However, current deep learning methods for surgical instrument-tissue interaction recognition often rely on static images or coarse temporal sampling, limiting their ability to capture rapid surgical dynamics. Therefore, this study systematically investigates the impact of incorporating fine-grained temporal context into deep learning models for interaction recognition.
Methods We conduct extensive experiments on multiple curated video-based datasets to investigate the influence of fine-grained temporal context on the task of instrument-tissue interaction recognition, using a video transformer with spatio-temporal feature extraction capabilities. Additionally, we propose a multi-task-attention module (MTAM) that utilizes cross-attention and a gating mechanism to improve communication between the subtasks of identifying the surgical instrument, atomic action, and anatomical target.
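To make the cross-attention-with-gating idea more concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation; the module name, tensor shapes, and fusion details are illustrative assumptions. One subtask's tokens query the concatenated tokens of the other subtasks, and a learned gate controls how much of the attended cross-task information is mixed back into the query features.

import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    """Hypothetical sketch of cross-attention with a gating mechanism between subtasks."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention: queries come from one subtask, keys/values from the other subtasks.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate decides, per feature channel, how much attended cross-task information to mix in.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (B, Nq, dim) spatio-temporal tokens of one subtask (e.g. instrument)
        # context_feats: (B, Nc, dim) concatenated tokens of the other subtasks (e.g. action, target)
        attended, _ = self.cross_attn(query_feats, context_feats, context_feats)
        g = self.gate(torch.cat([query_feats, attended], dim=-1))
        return self.norm(query_feats + g * attended)

# Usage: three subtask token sets (instrument, action, target) from a video backbone.
B, N, D = 2, 16, 256
instr, action, target = (torch.randn(B, N, D) for _ in range(3))
module = CrossTaskAttention(dim=D)
instr_refined = module(instr, torch.cat([action, target], dim=1))
print(instr_refined.shape)  # torch.Size([2, 16, 256])

In this sketch the instrument tokens attend to the action and target tokens; the sigmoid gate then weights, per channel, how much of that cross-task context is added back to the instrument representation before layer normalization.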
Results Our study demonstrates the benefit of utilizing fine-grained temporal context for the recognition of instrument-tissue interactions, with an optimal sampling rate of 6-8 Hz identified for the examined datasets. Furthermore, our proposed MTAM significantly outperforms a state-of-the-art multi-task video transformer on the CholecT45-Vid and GraSP-Vid datasets, achieving relative increases of 4.8% and 5.9% in surgical instrument-tissue interaction recognition, respectively.
Conclusions In this work, we demonstrate the benefits of using a fine-grained temporal context rather than static images or coarse temporal context for the task of surgical instrument-tissue interaction recognition. We also show that leveraging cross-attention with spatio-temporal features from various subtasks leads to improved surgical instrument-tissue interaction recognition performance. The project is available at: https://lennart-maack.github.io/InstrTissRec-MTAM
Subjects
Deep learning
Video transformer
Surgical triplet recognition
Surgical activity recognition
DDC Class
617: Surgery, Regional Medicine, Dentistry, Ophthalmology, Otology, Audiology
006: Special computer methods
Funding(s)
Centre of Excellence of AI for Sustainable Living and Working
Projekt DEAL  
License
https://creativecommons.org/licenses/by/4.0/
Publication version
publishedVersion
Name
s11548-025-03546-3.pdf
Type
Main Article
Size
5.02 MB
Format
Adobe PDF