Squeeze and multi-context attention for polyp segmentation

Artificial Intelligence-based Computer Aided Diagnostics (AI-CADx) have been proposed to help physicians reduce the misdetection of polyps during colonoscopy examination. The heterogeneity of a polyp's appearance makes detection challenging for both physicians and AI-CADx. Towards building better AI-CADx, we propose an attention module called Squeeze and Multi-Context Attention (SMCA) that re-calibrates a feature map by providing channel and spatial attention, taking into consideration highly activated features and the context of the features at multiple receptive fields simultaneously. We test the effectiveness of SMCA by incorporating it into the encoder of five popular segmentation models. We use five public datasets and construct intra-dataset and inter-dataset test sets to evaluate the generalizing capability of models with SMCA. Our intra-dataset evaluation shows that U-Net with SMCA and without SMCA has a precision of 0.86 ± 0.01 and 0.76 ± 0.02 respectively on CVC-ClinicDB. Our inter-dataset evaluation reveals that U-Net with SMCA and without SMCA has a precision of 0.62 ± 0.01 and 0.55 ± 0.09 respectively when trained on Kvasir-SEG and tested on CVC-ColonDB. Similar results are observed using other segmentation models and other public datasets.


| INTRODUCTION
Colon cancer can be fatal if not detected early and, as such, poses a huge risk to public health. It is the third most common cause of cancer in the US. 1 One of the earliest signs of colon cancer is the emergence of polyps in the colon and rectum. Early detection and removal of polyps can increase the survival rate to 90%. 2 To this end, colonoscopy is performed to detect the presence of colorectal polyps. The problem with manual inspection is that polyps can be misdetected because they have heterogeneous morphological characteristics. Hence, there is an ongoing effort to develop Computer Aided Diagnosis Systems (CADx) that limit the number of misdetections. 3 Artificial Intelligence (AI) based polyp segmentation is a paradigm of AI-CADx where an AI model is tasked with classifying the pixels that belong to polyps in images. Specifically, deep learning-based AI methods show promising results. 4 It is believed that AI-CADx will reduce the burden on physicians and lead to better patient care. It is also argued that CADx solutions could potentially be an alternative to manual screening. Therefore, it is of paramount importance that the accuracy and precision of deep learning-based AI-CADx are improved.
As argued by Jha et al., 5 robustness and generalizability are two key aspects that need to be handled if we want CADx systems in clinical practice. Robustness is the ability of the CADx to perform reliably within an accepted error margin for all kinds of colonoscopic images. Generalization is the ability of the CADx to segment polyps reliably and accurately from images belonging to a wide range of image distributions. Solving these two aspects is key to making reliable AI-CADx for polyp segmentation. Figure 1 shows the variations in appearance and morphological features of polyps across different datasets.
Towards learning robust and generalizing features for polyp segmentation, we propose a module called "Squeeze and Multi-Context Attention" (SMCA), an attention module that re-calibrates feature maps based on attention weights computed from the aggregated polyp and context features at multiple receptive fields. In doing so, we leverage the global context and the local context at multiple receptive fields to provide spatial and channel attention. In comparison, the Squeeze and Excite (SE) 10 module extracts only global context through global average pooling to provide channel attention. Attention gates (AG) 11 provide spatial attention by calculating attention weights from coarser signals for each feature in a feature map. Our module combines the channel attention mechanism from SE and the spatial attention mechanism from AG to compute attention weights that provide attention in both the channel and spatial dimensions. Additionally, we perform the channel and spatial attention at multiple receptive fields. A point to note is that SMCA is a self-attention module whereas AG is an attention module. We evaluate the effectiveness of our module by incorporating it into multiple deep learning-based segmentation models, namely: U-Net, 12 Attention U-Net, 11 R2U-Net, 13 R2AU-Net 14 and ResUNet++. 15 In ResUNet++, we replace the SE module with the SMCA module. Towards robustness, we evaluate the five models with and without SMCA on four public datasets. Towards generalization, we construct inter-dataset test sets and evaluate the segmentation models with and without SMCA on them. Finally, we compare the attention maps of the convolution kernels of U-Net with and without our SMCA module using Grad-CAM++ 16 to qualitatively illustrate the differences in the feature representation.
In summary, our contributions are as follows:
• We propose an attention module called SMCA that takes global and local context at multiple receptive fields to re-calibrate the feature maps.
F I G U R E 1 Sample images from Kvasir-SEG, 6 ETIS-Larib, 7 CVC-ColonDB 8 and CVC-ClinicDB 9 illustrating the variations in appearance and morphological features of polyps (shown with red circles).
• We check the performance changes due to SMCA by extensively evaluating five models with and without SMCA through five-fold cross validation on four public datasets, that is, Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB and Kvasir-Sessile.
• We check the generalizing ability through extensive inter-dataset evaluation, that is, we train our models with and without SMCA on Kvasir-SEG and CVC-ClinicDB. We then evaluate these models on four public datasets, which the models have not seen before.
• We plot the attention maps of the convolution kernels of a U-Net with and without our SMCA. The comparison of the attention maps at multiple hierarchies highlights the differences in the feature representation of the two models.

| RELATED WORK
Previously, hand-crafted features were used to detect and segment polyps. In Reference 8, the authors proposed a three-stage process for polyp segmentation: region segmentation, followed by region description and finally region classification. The authors of Reference 17 used shape as a discriminatory feature instead of texture, reasoning that small polyps have predominantly elliptical shapes. In Reference 18, the authors proposed a dictionary learning approach that extracts hue histogram features and used a support vector machine to classify normal and polyp images. However, the limitation of hand-crafted features is that they do not generalize well to unseen images. Furthermore, the complexity of these proposed solutions greatly limits their applicability in real-world scenarios. The limitations posed by hand-crafted feature extraction have been circumvented by using Convolutional Neural Networks (CNN). CNNs have shown great success in the polyp segmentation task. In the MICCAI polyp segmentation challenge, most of the proposed models were based on CNNs and the winning model was also a CNN. 19 Since U-Net 12 came into existence, it and its variants have been commonly used in medical image segmentation. 20 From the literature, it can be observed that the modifications proposed by authors have mostly been in convolution operations, attention blocks and feature aggregation blocks. With respect to changes in convolution operations, Alam et al. 21 replaced the encoder of U-Net with ResNet-50 and Sun et al. 22 extracted better features by using dilated convolution. Towards the use of attention blocks, one of the earliest architectures was the Attention U-Net 11 which incorporated attention gates to improve segmentation of abdominal regions from CT images. Rundo et al. 23 introduced SE modules into U-Net to improve prostate zonal segmentation. Along similar lines, Jha et al. 15 created a variant of ResUNet 24 for polyp segmentation by introducing the SE module and attention gates. In Reference 25, the authors introduced a spatial attention layer to a U-Net for the task of polyp segmentation. In Reference 26, the authors introduced an attention module called "Focus Gate" that uses spatial and channel attention to calculate the attention weights. The authors demonstrated that their dual attention-gated U-Net called "Focus Net" outperformed state-of-the-art models. With respect to feature aggregation blocks, Mahmud et al. proposed PolypSegNet 27 where sequential depth dilated inception (DDI) blocks were used to aggregate features from different receptive fields.
From the aforementioned works, we observed that using channel and spatial attention blocks and aggregating features from multiple receptive fields were beneficial for segmentation. The SMCA module was constructed with these two ideas in mind. Specifically, our SMCA module captures information at multiple receptive fields of a feature map by using average and max pooling of varying kernel sizes. The extracted information from the multiple receptive fields is passed through convolutional blocks to calculate spatial and channel attention weights per receptive field. The channel and spatial attention weights from multiple receptive fields are combined to calculate the final attention weights, which are used to re-calibrate the original feature map.
The literature also revealed that the majority of existing works evaluated their models on test sets derived from the same datasets. [28][29][30][31] An exception to this trend was the recently published work of Jha et al. 5 They performed inter-dataset evaluation to prove the generalizing capability of their proposed model. However, they performed one-fold cross validation for all their experiments. We take this a step further and perform five-fold cross validation experiments to demonstrate the advantages of incorporating SMCA into models to increase their generalizing ability.

| Network architecture
In this sub-section, we briefly describe the various models we considered for this study and illustrate through diagrams where we placed the SMCA module in the models.

| U-Net architecture
The architecture of the proposed U-Net with SMCA is shown in Figure 2. It consists of an encoder, a decoder and the SMCA module. The encoder extracts features through a series of encoding blocks. As information passes down the encoder, low-level features are converted to high-level features. An encoder block is a series of two convolution operations followed by SMCA and max pooling. Before information is passed to the next encoder block, SMCA enhances the extracted features. In the baseline U-Net, the SMCA module is not present. The number of kernels increases in subsequent encoder blocks as follows: 32, 64, 128, 256 and 512. The decoder block is similar to the encoder with the additional operation being that it concatenates features from the encoder with the upsampled features from the previous decoder block. The decoder kernels decrease in every subsequent decoder block as follows: 256, 128, 64 and 32.

F I G U R E 2 The five segmentation models used in our work. The original models do not have the SMCA module. In ResUNet++, we replaced the SE layer with SMCA. /1 and /2 represent stride 1 and stride 2. 2 × 2 and 3 × 3 denote kernel sizes. ×2 next to Upsample denotes the scale of upsampling. All upsampling operations are bilinear interpolation.

| Attention U-Net architecture
The architecture of the proposed Attention U-Net with SMCA is shown in Figure 2. It consists of an encoder, a decoder, the SMCA module and an additional attention gate. 32 Similar to U-Net, SMCA enhances the features extracted from an encoder block before they are passed to the next block. The SMCA module is absent in the baseline Attention U-Net. The number of kernels increases after every encoder block as follows: 32, 64, 128, 256 and 512. On the other hand, the decoder kernels decrease as follows: 256, 128, 64 and 32.

| R2U-Net architecture
In this variation of U-Net, Recurrent Residual Convolutional Neural Networks (RRCNN) are introduced. The authors propose the inclusion of these two modules primarily for two reasons. First, the inclusion of residual units helps in training deep architectures as it minimises the occurrence of vanishing and exploding gradients. Second, recurrent units ensure better feature representations arising from the accumulation of feature maps. This network achieved state-of-the-art results in skin lesion segmentation. 33 The encoder and decoder structures are shown in Figure 2. The number of kernels increases after every encoder block as follows: 64, 128, 256, 512 and 1024. On the decoder side, the kernels reduce with every decoder block as follows: 512, 256, 128 and 64.

| R2AU-Net architecture
In this variation of U-Net, attention gates introduced in Attention U-Nets are used in R2U-Net. Inclusion of attention gate further strengthens the feature representation of the network. The encoder and decoder structures are shown in Figure 2. The number of kernels used in the encoder and decoder blocks is the same as R2U-Net.

| ResUNet++ architecture
ResUNet++ is a segmentation model constructed to improve polyp segmentation performance. This model is built upon ResUNet. 24 This architecture has feature enhancement modules such as Residual Blocks, the SE module, Attention Gates and Atrous Spatial Pyramid Pooling (ASPP). 34 The SE layers are included after every residual block in the encoder. Additional skip connections are introduced to propagate information from the encoder blocks to attention gates. The filters in the encoder section increase as follows: 32, 64, 128, 256 and 512. In the decoder section, the filters decrease with each decoder block as follows: 512, 256, 128, 64 and 32. Altogether, ResUNet++ has one stem block (See Figure 2), three encoder blocks and three decoder blocks. The final decoder block has an ASPP layer and a 1 × 1 convolution for channel reduction.

| Feature enhancement modules
We consider feature enhancement modules to encompass modules that manipulate feature maps through convolution operations or recalibrate the feature maps by computing attention weights. In this section, we describe the various feature enhancement modules used in the five segmentation models.

| Attention gates
Attention gates were first proposed by Chen et al. 35 Since their introduction, several segmentation models have used them. In our work, three of the models (Attention U-Net, R2AU-Net, ResUNet++) use attention gates. The reason for using the attention mechanism is that it highlights the relevant information in a feature map while suppressing the irrelevant information. In doing so, the feature representation of the segmentation model is strengthened and therefore, semantic information is preserved as information flows through the network.
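As an illustration, an additive attention gate of this kind can be sketched in PyTorch as follows. This is a minimal sketch in the spirit of the gates described above, not the exact implementation used by the models; the class and parameter names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate: a coarser gating signal g is used to
    compute per-pixel attention weights that re-weight the skip features x."""
    def __init__(self, in_ch, gate_ch, inter_ch):
        super().__init__()
        self.wx = nn.Conv2d(in_ch, inter_ch, kernel_size=1)    # project skip features
        self.wg = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)  # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)       # collapse to one weight map

    def forward(self, x, g):
        # upsample the coarser gating signal to the skip feature resolution
        g = F.interpolate(g, size=x.shape[2:], mode='bilinear', align_corners=False)
        a = torch.sigmoid(self.psi(F.relu(self.wx(x) + self.wg(g))))
        return x * a  # irrelevant spatial locations are suppressed
```

The sigmoid weight map has a single channel, so the gate acts purely spatially, which is the behaviour contrasted with SE-style channel attention elsewhere in this paper.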

| ResNet block
As more layers are added to a network, gradients may either vanish or explode. 36 This can result in the network not converging during training. To alleviate this problem, residual blocks have been introduced. Residual blocks create a short connection from the input that is added to the output. With this simple trick, the gradients flow properly during backpropagation and vanishing and exploding gradients are prevented. Altogether, residual units are a combination of two convolution layers, Batch Normalization (BN), ReLU and a short connection. Residual blocks have been used in ResUNet++. A diagram of the ResNet block is shown in the bottom right corner of Figure 2.
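A minimal PyTorch sketch of such a residual unit, assuming equal input and output channels (not the exact block used in ResUNet++):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two conv-BN layers with an identity short connection added to the output."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # the identity shortcut lets gradients flow directly to earlier layers
        return F.relu(self.body(x) + x)
```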

| Residual recurrent block (RR Block)
While residual blocks typically short the input to the output after two consecutive convolution layers, RR blocks create a short connection between input and output after every convolution layer. A diagram of the residual recurrent block is shown in the bottom right corner of Figure 2. In this diagram, two recurrent blocks are connected sequentially.
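The RR block can be sketched as follows; the recurrence depth `t` and the exact placement of the short connections are our assumptions based on the description above, so treat this as an illustration rather than the reference implementation:

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """A convolution applied repeatedly; the input is added back before
    each repetition, accumulating feature maps over t steps (assumption)."""
    def __init__(self, channels, t=2):
        super().__init__()
        self.t = t
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h = self.conv(x)
        for _ in range(self.t - 1):
            h = self.conv(x + h)  # short connection after every convolution
        return h

class RRBlock(nn.Module):
    """Two recurrent blocks in sequence, with the block input shorted to the output."""
    def __init__(self, channels, t=2):
        super().__init__()
        self.body = nn.Sequential(
            RecurrentBlock(channels, t),
            RecurrentBlock(channels, t),
        )

    def forward(self, x):
        return self.body(x) + x
```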

| Squeeze and excite
Formally, the SE layer is described as follows:

X = F(x) ⊗ w_se, (1)

where F(·) is a residual block parameterized by two convolutional layers φ_1 and φ_2, x ∈ ℝ^(C×H×W) and X ∈ ℝ^(C×H′×W′) are the input and output feature maps, and w_se are the excitation weights, computed as follows:

w_se = σ(W_2 · ReLU(W_1 · GAP(x))), (2)

where ReLU is the ReLU activation, GAP is the global average pooling operation and σ denotes the sigmoid activation. The SE module re-weights the features across the channel dimension by applying GAP on the individual channels of the feature map x. GAP reduces each channel of the feature map to a scalar. The vector produced by GAP is fed to two consecutive linear layers parameterized by W_1 and W_2. The final sigmoid layer is used to compute the 'excitation' weights. The excitation weights are used to re-weight the channels of the features as shown in Equation (1).
The SE module provides channel attention by encoding the global context. Essentially, GAP reduces the feature map to a vector of scalars, one per channel, which represents the encoding of the global context. This vector is passed through a Fully Connected (FC) network with one hidden layer. The hidden layer, which is of lower dimension than the channel dimension, in conjunction with the sigmoid activation function captures non-linear dependencies that exist across the channel dimension of the feature map. Through this process, features which are more important are scaled higher than features which contribute less to the segmentation task. The features are scaled along the channel dimension through the global context encoding. However, the SE module uses only one receptive field, dependent on the height and width of the feature map, to provide channel attention. It has been observed that combining different receptive fields boosts semantic segmentation performance, suggesting that both local and global context are beneficial for semantic segmentation. 37,38 Therefore, we argue that capturing only the global context to re-weight the features along the channel dimension is insufficient. We propose using global and local context at multiple receptive fields to re-weight feature maps. To this end, we propose SMCA, which we discuss in the next section.
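The SE computation described above can be sketched in PyTorch as follows (a minimal sketch; `reduction` plays the role of the bottleneck ratio of the hidden layer):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention: GAP -> FC bottleneck -> sigmoid -> channel re-weighting."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # hidden layer below channel dim
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # excitation weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # GAP: one scalar per channel
        w = self.fc(w).view(b, c, 1, 1)
        return x * w                      # re-weight channels by global context
```

Because the excitation weights lie in (0, 1), the module can only scale channels down relative to the input, never amplify them.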

| Squeeze and multi-context attention
We propose a module that uses global and local context for re-weighting the feature maps. SMCA encodes global context using the SE module and encodes local context at multiple receptive fields using Average and Max Pooling operations. Average and Max Pooling of various strides and kernel sizes capture the local context at various receptive fields. They are inexpensive as they do not have any learnable parameters. In our experiments, we use strides of 2, 4 and 8 and kernels of size 2, 4 and 8, respectively, to capture the local context at increasing receptive fields. The average and max pooling operations are followed by a squeeze operation through 1 × 1 convolutions and convolution operations through 3 × 3 kernels that capture relevant channel and spatial information. The outputs of the 'Conv Squeeze Block' and 'Conv Normal Block' (See Figure 3) are added. The channel interdependencies are captured by the 'Conv Squeeze Block' and the relevant spatial information is preserved by the 'Conv Normal Block', thus providing channel and spatial attention respectively. Formally, we can define the SMCA module as follows:

X = x ⊗ w_smca,

where x is the input feature map, X is the output feature map and w_smca is the multi-context attention weight used to re-calibrate the input map. w_smca is composed of three spatial and channel attention weights at different receptive fields:

w_n = F_sq(AP(x, n), r) + F(MP(x, n)), n ∈ {2, 4, 8}.

For the sake of brevity, we remove the parameterization notations for the residual block F(·) and the 'squeeze' block F_sq(·, r). F_sq(·, r) is a special convolutional block where a bottleneck is introduced in the channel dimension by reducing the channel dimension by a factor of r using 1 × 1 convolution. AP(·, n) and MP(·, n) represent the Average and Max Pooling operations where n denotes the stride n and kernel size n × n. Finally, the multi-context weights are upsampled bilinearly by the corresponding factor to match the input feature map dimensions:

w_smca = Σ_{n ∈ {2, 4, 8}} φ(w_n, n),

where φ(·, k) denotes the upsampling operation by a factor k.
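Putting the pieces together, the SMCA module can be sketched as below. The exact routing of the pooled features into the two branches is our assumption (here the average-pooled context feeds the 1 × 1 'Conv Squeeze' branch and the max-pooled context feeds the 3 × 3 'Conv Normal' branch), so this is an illustration of the mechanism rather than the definitive implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMCA(nn.Module):
    """Multi-context attention: per receptive field n in {2, 4, 8}, pooled
    context is turned into channel ('Conv Squeeze') and spatial ('Conv
    Normal') attention, summed, upsampled and used to re-calibrate x."""
    def __init__(self, channels, reduction=2, scales=(2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.conv_squeeze = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels // reduction, kernel_size=1),  # bottleneck
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
            ) for _ in scales
        ])
        self.conv_normal = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in scales
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        weights = torch.zeros_like(x)
        for i, n in enumerate(self.scales):
            ap = F.avg_pool2d(x, kernel_size=n, stride=n)   # average context
            mp = F.max_pool2d(x, kernel_size=n, stride=n)   # highly activated features
            w_n = self.conv_squeeze[i](ap) + self.conv_normal[i](mp)
            weights = weights + F.interpolate(
                w_n, size=(h, w), mode='bilinear', align_corners=False)
        return x * torch.sigmoid(weights)  # re-calibrate the input feature map
```

The pooling branches are parameter-free; all learnable parameters sit in the small convolutional blocks, which keeps the module cheap enough to insert after every encoder block.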

| Dataset details
We have used the following datasets for training and evaluating our models (Table 1).
• KVASIR-SEG contains 1000 images annotated by endoscopists from Oslo University. Each image contains at least one polyp.
• CVC-ColonDB consists of 380 images from 15 colonoscopy videos. Each image shows at least one polyp.

| Implementation details
For our intra-dataset experiments, from each dataset, 10% of the images were randomly selected to construct the test set. The remaining images in the dataset were split into five portions of equal size. A leave-one-fold-out strategy was then used to construct the training and cross-validation sets. We evaluated U-Net, Attention U-Net, R2U-Net, R2AU-Net and ResUNet++ with and without SMCA.
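The split construction described above can be sketched as follows (the function name and seed are illustrative):

```python
import random

def make_splits(image_ids, n_folds=5, test_frac=0.1, seed=0):
    """Hold out test_frac of the images as a fixed test set, divide the rest
    into n_folds equal folds and build leave-one-fold-out train/val splits."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    test, rest = ids[:n_test], ids[n_test:]
    fold_size = len(rest) // n_folds
    folds = [rest[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val = folds[k]                                            # held-out fold
        train = [x for i, f in enumerate(folds) if i != k for x in f]
        splits.append((train, val))
    return test, splits
```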

| Loss function
The choice of loss function is particularly crucial in polyp segmentation as we have an imbalance between the number of positive class samples (polyp pixels) and negative class samples (background pixels). If the class imbalance is not considered in the loss function, the model may converge to a sub-optimal solution. Additionally, in medical applications, reducing false negative predictions typically takes precedence over reducing false positive predictions. Concretely, segmenting polyp pixels is more important than falsely segmenting non-polyp pixels as polyps. Therefore, there have been several works that tackled the class imbalance problem. Yeung et al. 39 propose a unified asymmetric focal loss that prevents suppression of gradients of classes that occur infrequently. Additionally, Ma et al. 40 perform a thorough analysis of the contribution of 20 loss functions on 4 segmentation tasks. The literature reveals that Tversky loss 41 can weigh the influence of false negative class predictions over false positive class predictions when computing the gradients for model training. Therefore, we use Tversky loss for our experiments. It is an asymmetric similarity measure between the predicted segmentation map and the ground truth map, and a generalization of the Dice similarity coefficient (DSC) and the Jaccard index. The Tversky loss is calculated from the Tversky index (TI), which is computed as follows:

TI_i = TP_i / (TP_i + α FP_i + β FN_i),

where i denotes the ith pair of predicted and ground truth segmentation maps, and TP, FN and FP are the true positive, false negative and false positive counts. α and β are the weights associated with the false positive and false negative counts; β > α forces the model to improve recall more than precision and vice versa. We set α = 0.4 and β = 0.6 based on grid search and use these values for all our experiments. Finally, the Tversky loss over a mini-batch of size B can be defined as follows:

L_Tversky = (1/B) Σ_{i=1}^{B} (1 − TI_i).
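As a concrete sketch, the Tversky loss above can be written as follows, assuming `pred` holds per-pixel polyp probabilities and `target` holds binary ground truth masks (a soft formulation computed per image and averaged over the batch):

```python
import torch

def tversky_loss(pred, target, alpha=0.4, beta=0.6, eps=1e-6):
    """Mean (1 - Tversky index) over the mini-batch.
    alpha weights false positives, beta weights false negatives;
    beta > alpha pushes the model towards higher recall."""
    pred = pred.flatten(1)      # (B, H*W)
    target = target.flatten(1)
    tp = (pred * target).sum(dim=1)
    fp = (pred * (1 - target)).sum(dim=1)
    fn = ((1 - pred) * target).sum(dim=1)
    ti = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return (1 - ti).mean()
```

With alpha = beta = 0.5 this reduces to the soft Dice loss, illustrating why the Tversky index is a generalization of the DSC.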

| Evaluation metrics
The models are evaluated using DSC, mean intersection over union (mIoU), precision and recall. The metrics are computed as follows:

DSC = 2TP / (2TP + FP + FN),
IoU = TP / (TP + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),

where TP (True Positive) is the total number of pixels predicted as polyp pixels that are actually polyp pixels, FP (False Positive) is the total number of pixels predicted as polyp pixels that actually belong to the background, FN (False Negative) is the total number of polyp pixels predicted as background pixels and TN (True Negative) is the total number of background pixels predicted as background pixels.
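The four metrics can be computed from binarized masks as in the short NumPy sketch below (our helper, for illustration):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Pixel-wise DSC, IoU, precision and recall for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # polyp predicted as polyp
    fp = np.logical_and(pred, ~gt).sum()      # background predicted as polyp
    fn = np.logical_and(~pred, gt).sum()      # polyp predicted as background
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }
```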

| RESULTS
In this section, we report the findings of our intra-dataset and inter-dataset experiments. First, we report the segmentation metrics of each model separately. To this end, we report each model with and without SMCA and report the performance differences. Next, we report the results of our inter-dataset experiments by taking all the models together.
The qualitative comparison of our intra-dataset experiments is shown in Figure 4. The qualitative comparison of our inter-dataset experiments with training sets Kvasir-SEG and CVC-ClinicDB are shown in Figures 5 and 6, respectively.

| Evaluation of U-Net
The results of our intra-dataset experiments on U-Net are presented in Table 2. We observe that SMCA improves all the metrics for Kvasir-SEG, CVC-ClinicDB and Kvasir-Sessile. Notably, the DSC improves by 5.1%, mIoU by 8.8%, precision by 7.8% and recall by 2.3% for CVC-ClinicDB. SMCA also brings improvement on the Kvasir-Sessile dataset, which contains images of small polyps (less than 10 mm) that are hard to segment. Specifically, the DSC improves by 2.2%, mIoU by 3%, precision by 9.5% and recall by 9%.

| Evaluation of attention U-Net
The results of our intra-dataset experiments on Attention U-Net are presented in Table 3. We report that SMCA shows improvement in all metrics for all four datasets. The largest improvement is shown on CVC-ColonDB with increases of 65% for DSC, 100% for mIoU, 86.4% for precision and 5% for recall. Similar to U-Net, we observe that SMCA improves the performance on the Kvasir-Sessile dataset. Another observation is that Attention U-Net (with and without SMCA) performs worse relative to U-Net on Kvasir-SEG, CVC-ColonDB, CVC-ClinicDB and Kvasir-Sessile.

| Evaluation of R2AU-Net

Table 5 shows the results of our intra-dataset evaluation of R2AU-Net. We report that SMCA shows a general improvement in segmentation metrics on all the datasets. Similar to R2U-Net, the model trained on Kvasir-SEG and CVC-ColonDB shows notable improvements due to SMCA. We find that the DSC, mIoU, precision and recall improve by 17.3%, 25.8%, 27.9% and 6.4% on Kvasir-SEG. We also observe that the recall of R2AU-Net with SMCA is almost on par with R2AU-Net without SMCA. Furthermore, the performance improvements on Kvasir-Sessile are negligible in comparison to the other three datasets.

| Inter-dataset evaluation
In this section, we report the results of our inter-dataset experiments. The purpose of the inter-dataset evaluation is to further test the generalizability of models with our SMCA module when the test set is not derived from the same dataset. We use images of Kvasir-SEG and CVC-ClinicDB to construct our training sets, similar to Jha et al. 5 The images in these datasets are recorded with different imaging apparatus and have imaging artifacts such as illumination changes, motion blurring, gastrointestinal artifacts, and so forth. Furthermore, the shape and appearance of the polyps vary from dataset to dataset. Therefore, it is expected that there will be a drop in segmentation performance. Table 8 shows the inter-dataset evaluation of the five models with and without SMCA trained on CVC-ClinicDB. Altogether, we report improvements in most of our models. For example, U-Net with SMCA shows 35%, 50%, 41% and 3% increases in DSC, mIoU, precision and recall when tested on CVC-ColonDB compared to the baseline U-Net. An observation that can be drawn is that the inter-dataset performance of models trained on Kvasir-SEG is better than that of models trained on CVC-ClinicDB. We believe this is the case because the images of Kvasir-SEG have higher contrast than CVC-ClinicDB and also, the polyps in Kvasir-SEG are more diverse in size, shape, color and appearance. We conjecture that these attributes of the training set play a role.

| Choice of channel compression ratio
Choosing the correct channel compression ratio is important as it is mainly responsible for re-weighting the information across the channel dimension. Therefore, we performed experiments to find the ideal channel compression ratio r for our SMCA module. We chose U-Net as our baseline architecture and used Kvasir-SEG dataset to perform a five-fold cross validation experiment.
Observing the results in Table 9, we chose a channel compression ratio of 2 for all our intra-dataset and inter-dataset experiments.

| Summary of results
Looking at the quantitative results of the intra-dataset experiments (See Tables 2-6), we can draw the following observations: (i) SMCA improves the performance when incorporated into five popular segmentation models; (ii) SMCA has a greater impact on larger models than on smaller models (see Tables 4 and 5 vs. Table 2); (iii) On average, all the models perform best on Kvasir-SEG, followed by CVC-ClinicDB, CVC-ColonDB and Kvasir-Sessile; (iv) SMCA, when incorporated into ResUNet++, performs better than the baseline. Our results indicate that SMCA is a better attention module compared to the SE module. When observing the results of the inter-dataset experiments, we can draw the following observations: (i) Models with SMCA perform better than models without SMCA; (ii) Models generalize better when trained on Kvasir-SEG than on CVC-ClinicDB; (iii) Models with fewer trainable parameters perform better than models with more parameters.

| Discussion on intra-dataset evaluation
From the intra-dataset experiments, we conclude that models with SMCA show improvements in segmentation metrics. This demonstrates that our module is versatile and can act as a plug-in module for various deep learning architectures. We see that lightweight models such as U-Net perform better on all the datasets compared to models with more parameters (ResUNet++, Attention U-Net, R2U-Net and R2AU-Net). We believe this to be the case because our training dataset is small due to the five-fold cross validation experiments. As such, the chances of overfitting larger and deeper models are higher than for shallow models 42 when training on small datasets. The authors of ResUNet++ 15 use augmentation schemes such as center crop, random crop, horizontal flip, vertical flip, scale augmentation, random rotation, cutout, brightness augmentation, and so forth. In our case, we use only random vertical and horizontal flips. Thus, we argue that using more augmentation methods will improve the performance of the larger models. Additionally, we observe that the boost in metrics due to SMCA on larger models is greater than the boost on U-Net (See Table 4 vs. Table 2). We conjecture that SMCA is able to counter the overfitting tendency by introducing a regularizing effect. The regularizing effect is more prominent in larger models and therefore, the improvement in segmentation performance due to SMCA is greater in larger models than in smaller models.

| Discussion on inter-dataset evaluation
The inter-dataset evaluation of models is an important and necessary technique to test the generalizing capabilities of the models. Our work builds on the cross-dataset experiments of Jha et al. 5 We believe that inter-dataset evaluation of models is important if we want to realise AI-CADx in clinical settings. Deep learning models perform poorly when the test set and training set have diverging image distributions. We think divergence between training and test set image distributions will be a common problem in the polyp segmentation domain, primarily because the images are recorded under different conditions (e.g., with different recording devices, different light sources, etc.). Furthermore, as mentioned previously, polyps appear in various sizes, shapes and appearances. Additionally, the experience of the physician performing the colonoscopy will also affect the quality of the images. As such, performing inter-dataset evaluation should become a standard criterion to demonstrate the generalizing capabilities of AI-CADx for polyp segmentation. Our work is a step forward in this direction. In our work, we perform inter-dataset evaluation to demonstrate the improvements in the generalizing capability of baseline models due to the SMCA module. From Tables 7 and 8, we see that SMCA improves the generalizing capabilities of the five segmentation models. Our results indicate that the feature re-calibration of SMCA is beneficial towards learning robust features for polyp segmentation. We believe that SMCA learns robust features for multiple reasons. First, the use of max and average pooling with different kernel sizes allows the model to extract highly activated features and the average activation of features simultaneously over varying receptive fields. The highly activated features are primarily due to polyp pixels and relevant background information necessary for polyp segmentation.
Thus, the max pooling passes the highest activated polyp and background features, and the average pooling passes an average of highly and lowly activated features, which can be considered the context information of the polyp. Max and average pooling with large kernels help to pass large context information around the polyps. Similarly, max and average pooling with small kernels help to pass small context information around the polyp. We can draw an analogy between the working of SMCA and a physician inspecting the area around a suspected polyp lesion to demarcate a polyp mass from the non-polyp surrounding tissue. The use of receptive fields of different sizes in SMCA is analogous to a physician inspecting a large and a small area around a suspected polyp lesion. A large area of inspection provides more context of a polyp's position in relation to the colorectal surface, whereas a small area of inspection provides the necessary detail to distinguish a polyp lesion from the non-polyp colorectal surface. Second, the resulting feature maps from the multiple average and max pooling operations are passed through the "Conv Squeeze Block" and "Conv Normal Block". These blocks provide channel and spatial attention, respectively. Third, the attention weights of the "Conv Squeeze Block" and "Conv Normal Block" computed from multiple receptive fields are added together to compute the final attention weights. This, in effect, allows small and large-sized polyp features and their corresponding contexts to contribute towards the recalibration of the original feature map.
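The recalibration described above can be sketched in a few lines. This is a minimal NumPy illustration under heavy simplifications, not the actual SMCA implementation: the kernel sizes, the global-squeeze stand-in for the "Conv Squeeze Block", and the channel-mean stand-in for the "Conv Normal Block" are all assumptions made for brevity (the real blocks are learned convolutions):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pool2d(x, k, mode):
    """Stride-1 max/average pooling with zero 'same' padding; x: (C, H, W)."""
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    win = sliding_window_view(xp, (k, k), axis=(1, 2))  # (C, H, W, k, k)
    return win.max(axis=(-1, -2)) if mode == "max" else win.mean(axis=(-1, -2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smca_recalibrate(x, kernel_sizes=(3, 7)):
    """Sketch of SMCA: max and average pooling at multiple receptive fields
    feed channel and spatial attention; per-branch weights are summed
    (here averaged) and applied to the original feature map."""
    C, H, W = x.shape
    ch_att = np.zeros((C, 1, 1))
    sp_att = np.zeros((1, H, W))
    for k in kernel_sizes:
        for mode in ("max", "avg"):
            f = pool2d(x, k, mode)
            # channel attention stand-in: global squeeze -> per-channel weight
            ch_att += sigmoid(f.mean(axis=(1, 2), keepdims=True))
            # spatial attention stand-in: channel mean -> per-pixel weight
            sp_att += sigmoid(f.mean(axis=0, keepdims=True))
    n = len(kernel_sizes) * 2
    return x * (ch_att / n) * (sp_att / n)
```

The sketch preserves the structure of the argument: each receptive field contributes both a highly-activated (max) and a context (average) view, and the summed attention weights recalibrate the input without changing its shape.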
We also observe that models trained on Kvasir-SEG show better inter-dataset performance than models trained on CVC-ClinicDB. We conjecture that the higher contrast and larger variation (in terms of size, shape and appearance) of polyp images in Kvasir-SEG compared to CVC-ClinicDB enabled the models to learn better generalizing features. Despite the differences in the training dataset, baseline models with SMCA show improvements in segmentation metrics, suggesting that robust representations are learned due to SMCA.
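The precision figures reported in our evaluation reduce to pixel-wise true-positive and false-positive counts over binary masks; a minimal sketch (the helper name and the epsilon guard are our own):

```python
import numpy as np

def pixel_precision(pred, target, eps=1e-7):
    """Pixel-wise precision TP / (TP + FP) for binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()   # predicted polyp, truly polyp
    fp = np.logical_and(pred, ~target).sum()  # predicted polyp, background
    return float(tp) / (tp + fp + eps)
```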

| Visualizing the effectiveness of our SMCA module
The ultimate objective of proposing AI-CADx in polyp segmentation is to improve clinical decision-making by using human intelligence capabilities in conjunction with AI-CADx. Further, with the latest advancements in human-in-the-loop annotation tools, generating annotated datasets has become easier. 43 Therefore, the combined capabilities of AI and human intelligence may lead to efficient AI workflows and clinical workflows. While generating annotated datasets has become more efficient, AI models are still considered a black box. 44 As argued by Rundo et al., 45 one of the many challenges in installing AI-CADx in clinical practice is the lack of interpretability and explainability. Therefore, it is of utmost importance to develop AI-CADx that are interpretable. Only then can these systems gain trust amongst physicians and patients alike. Interpretability is mostly ignored in many works dealing with polyp segmentation. We believe that the risks posed by AI-CADx deployed in the healthcare industry are far greater than in other industries. The risk of a model making a false prediction can have life-threatening consequences. Therefore, it is of utmost importance to understand the decision-making process of machine learning models. This will help in understanding the pitfalls of deep learning models and in finding techniques to redress them. We believe this will enable research into the design and development of network architectures that are more reliable and have better generalizing capabilities. Our work is a step forward in this direction. Visualizing the feature representation of a model with and without SMCA can offer better insight than simply reporting the segmentation metrics. To this end, we use M3D-CAM's 46 implementation of Grad-CAM++ to visualize the gradient-weighted attention maps of the two convolution kernels at each "Conv Block" (see Figure 2) in the encoder and decoder of U-Net.
In Figure 7, we present a side-by-side comparison of the attention maps from the "Conv Blocks" of the baseline U-Net and U-Net with SMCA. The qualitative comparison of attention maps shows the difference in the learned representations of the two networks. We visualize the attention maps of the convolution kernels in the "Conv Block" because each "Conv Block" from the second layer onwards receives the re-calibrated feature maps of the SMCA. Furthermore, the "Conv Block" is present in both the baseline U-Net and U-Net with SMCA. Therefore, it serves as a good entity for a fair comparison.
One of the many challenges in segmentation is effectively retaining important semantic information along with high-level concepts as information propagates through the network. Cascades of max pooling operations in CNNs result in the learning of high-level concepts at the expense of granular information such as edges and color. However, preserving the important low-level features alongside the high-level concepts can improve the precision and accuracy of segmentation maps. 47 Our qualitative analysis indicates that SMCA enables the CNN to preserve low-level features relevant for semantic segmentation in the deeper layers. The re-calibration of the encoder features using max and average pooling operations with varying kernel sizes helps in the extraction of relevant polyp and context features at multiple scales. Additionally, computing spatial and channel attention weights from these extracted features helps in preserving important low-level semantic features while allowing the formation of high-level concepts. Our results indicate that this allows the models with SMCA to make more precise and accurate segmentation maps. Looking at the activation map of the "Conv Block" at Conv Layer 3 of Figure 7 for U-Net with SMCA, we observe that the convolution kernels that receive re-calibrated feature maps activate low-level semantic concepts occurring throughout the image. This indicates that SMCA re-calibrates the feature map to preserve important low-level semantic concepts as information propagates through the network. In comparison, the attention maps of the baseline U-Net in Conv Layer 3 show limited activation, implying loss of relevant semantic information in the deeper layers. Similarly, activation maps at Conv Layer 4 of U-Net with SMCA show more activity than the activation maps of the baseline U-Net. Observing the activation maps at Conv Layer 5, we see that both U-Net and U-Net with SMCA learn high-level concepts.
However, U-Net with SMCA does so while preserving the low-level semantic information in the preceding layers. On the decoder side (see Conv Layer 6, Conv Layer 7, Conv Layer 8 and Conv Layer 9), we see that the activation maps are more prominent for U-Net with SMCA, and they start resembling the final segmentation map from Conv Layer 7 onwards. We conjecture that the closer resemblance of the activation maps to the predicted segmentation map for U-Net with SMCA is because the decoder is able to use the preserved low-level semantic features passed to it through skip connections. This, in effect, leads to the prediction of more accurate segmentation maps.

| Limitations
The models do not generalize well to unseen image distributions. All models perform better when the test set and the training set are from the same dataset. Although our module redresses this problem to an extent, there is more progress to be made in generalization. Self-supervision is an emerging area of research that makes models generalize better to unseen distributions. 48 We think there are significant advantages to this learning paradigm which can expedite the implementation of AI-CADx in clinical settings. Furthermore, our work is a retrospective study, which is very different from a prospective clinical application. The images in the datasets are selected by expert gastroenterologists. A prospective clinical use case would involve testing the models on colonoscopy videos. Furthermore, our training set consists of images in which a polyp is always present. Our models are not trained on endoscopic images without any polyps.

| CONCLUSION
In this paper, we present a novel module called SMCA. We incorporated SMCA into five segmentation models: U-Net, Attention U-Net, R2U-Net, R2AU-Net and ResUNet++. We extensively evaluated the performance of the mentioned models with and without SMCA on four public polyp segmentation datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, Kvasir-Sessile). We report that models with SMCA perform better than baseline models. To further test the generalizing ability, we perform rigorous inter-dataset experiments. In the first inter-dataset experiment, we train all the models with and without SMCA on Kvasir-SEG and test them on CVC-ColonDB, CVC-ClinicDB and ETIS-Larib Polyp DB. In the second experiment, we train all the models on CVC-ClinicDB and test them on CVC-ColonDB, Kvasir-SEG and ETIS-Larib Polyp DB. Finally, to better understand the impact of SMCA on the features learned by the models, we render the attention maps from the convolution kernels of U-Net with and without SMCA using Grad-CAM++.

[FIGURE 7 caption: Visualization of the attention maps at each convolution kernel in the bottlenecks. There are two attention maps at each layer because there are two convolution operations at each "Conv Block" (see Figure 2). The attention maps with pink borders are from the U-Net with SMCA; the attention maps without borders are from the U-Net without SMCA.]
The qualitative comparison further illustrates that models with SMCA learn features that preserve important semantic cues throughout the depth of the network. This partially suggests why the models with SMCA predict more accurate segmentation maps.
In summary, SMCA recalibrates the feature maps through simultaneous spatial and channel attention. The spatial and channel attention weights are computed through the extraction of relevant edge and context features at multiple scales. Our results suggest that models with SMCA can segment large and small polyps better than their baseline counterparts. Additionally, we report that SMCA-based models generalize better. This is demonstrated through our extensive intra-dataset and inter-dataset experiments. We think that SMCA will improve segmentation performance for tasks where the objects to segment appear in different sizes, such as brain tumor segmentation, 49 where tumors appear in multiple sizes. In non-medical applications, SMCA may be beneficial in segmenting regions of interest in urban scenes, such as cars, traffic lights and pedestrians, 50 which also appear in different sizes. As future work, we want to incorporate SMCA into the decoder network and analyze the changes in performance. Additionally, we want to analyze the performance changes from pre-training the SMCA-based model through self-supervision.