Maack, Lennart; Cam, Berk; Latus, Sarah; Maurer, Tobias; Schlaefer, Alexander (2025). International Journal of Computer Assisted Radiology and Surgery (in press). https://hdl.handle.net/11420/58746

Purpose: The recognition of surgical instrument-tissue interactions can enhance surgical workflow analysis, improve automated safety systems, and enable skill assessment in minimally invasive surgery. However, current deep learning methods for surgical instrument-tissue interaction recognition often rely on static images or coarse temporal sampling, limiting their ability to capture rapid surgical dynamics. This study therefore systematically investigates the impact of incorporating fine-grained temporal context into deep learning models for interaction recognition.

Methods: We conduct extensive experiments on multiple curated video-based datasets to investigate the influence of fine-grained temporal context on instrument-tissue interaction recognition, using video transformers with spatio-temporal feature extraction capabilities. Additionally, we propose a multi-task-attention module (MTAM) that uses cross-attention and a gating mechanism to improve communication between the subtasks of identifying the surgical instrument, atomic action, and anatomical target.

Results: Our study demonstrates the benefit of fine-grained temporal context for the recognition of instrument-tissue interactions, with an optimal sampling rate of 6-8 Hz identified for the examined datasets. Furthermore, our proposed MTAM significantly outperforms state-of-the-art multi-task video transformers on the CholecT45-Vid and GraSP-Vid datasets, achieving relative increases of 4.8% and 5.9% in surgical instrument-tissue interaction recognition, respectively.
Conclusions: In this work, we demonstrate the benefits of using fine-grained temporal context rather than static images or coarse temporal context for surgical instrument-tissue interaction recognition. We also show that leveraging cross-attention over spatio-temporal features from the individual subtasks improves recognition performance. The project is available at: https://lennart-maack.github.io/InstrTissRec-MTAM

Title: Surgical instrument-tissue interaction recognition with multi-task-attention video transformer
Type: Journal Article
Journal: International Journal of Computer Assisted Radiology and Surgery, Springer Science and Business Media LLC, ISSN 1861-6429, 2025
License: https://creativecommons.org/licenses/by/4.0/
Keywords: Deep learning; Video transformer; Surgical triplet recognition; Surgical activity recognition
Classification: Technology::617: Surgery, Regional Medicine, Dentistry, Ophthalmology, Otology, Audiology; Computer Science, Information and General Works::006: Special computer methods
DOI: https://doi.org/10.15480/882.16144; 10.1007/s11548-025-03546-3
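The abstract describes the MTAM only at a high level: each subtask branch (instrument, atomic action, anatomical target) exchanges information with the others via cross-attention, modulated by a gating mechanism. The following PyTorch sketch illustrates one plausible realization of that idea; the class name, dimensions, and the exact placement of the gate are assumptions for illustration, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class MultiTaskAttentionModule(nn.Module):
    """Hypothetical sketch of a cross-attention + gating block for one
    subtask branch. The branch's tokens query the tokens of the other
    subtask branches; a learned sigmoid gate controls how much of the
    resulting cross-task context is mixed back into the branch."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Cross-attention: queries come from this branch, keys/values
        # from the concatenated features of the other branches.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate computed from the branch features and the attended context.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, task_feat: torch.Tensor, other_feats: torch.Tensor) -> torch.Tensor:
        # task_feat:   (B, T, D) spatio-temporal tokens of this subtask
        # other_feats: (B, T', D) tokens of the remaining subtasks
        ctx, _ = self.cross_attn(task_feat, other_feats, other_feats)
        g = self.gate(torch.cat([task_feat, ctx], dim=-1))  # values in [0, 1]
        # Gated residual update keeps the branch's own features dominant
        # when the cross-task context is uninformative.
        return self.norm(task_feat + g * ctx)

# Example: the instrument branch attends to action + target features.
mtam = MultiTaskAttentionModule(dim=32, num_heads=4)
instrument = torch.randn(2, 8, 32)          # (batch, tokens, dim)
action_and_target = torch.randn(2, 16, 32)  # other branches, concatenated
fused = mtam(instrument, action_and_target)  # shape: (2, 8, 32)
```

One such block per subtask branch would let the three recognition heads communicate while keeping their predictions separate, which is consistent with the multi-task framing in the abstract.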