The key innovation is to have a Transformer decoder predict a set of binary masks and their corresponding classes in parallel. This idea was then improved upon in the MaskFormer paper, which showed that the "binary mask classification" paradigm also works very well for semantic segmentation.
Mask2Former extends this to instance segmentation by further improving the neural network architecture. Its key component is masked attention, which extracts localized features by constraining cross-attention to the predicted mask regions. As a result, we've evolved from separate, task-specific architectures to what researchers now call "universal image segmentation" architectures, capable of solving any image segmentation task. Interestingly, these universal models all adopt the "mask classification" paradigm, discarding the "per-pixel classification" paradigm entirely.
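The masked attention mentioned above can be sketched in a few lines: instead of letting each query attend to the whole feature map, attention logits are set to minus infinity wherever that query's previously predicted mask is background. This is a minimal numpy sketch of the idea (single head, no learned projections; the empty-mask fallback is an assumption for numerical safety, the paper handles this case similarly):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(queries, keys, values, mask_probs, threshold=0.5):
    """Cross-attention restricted to predicted mask regions.

    queries:    (num_queries, d)      query embeddings
    keys:       (num_pixels, d)       pixel features
    values:     (num_pixels, d)       pixel features
    mask_probs: (num_queries, num_pixels) per-query mask predictions in [0, 1]
    Computes softmax(Q K^T / sqrt(d) + M) V, where M is 0 on predicted
    foreground pixels and -inf elsewhere.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (num_queries, num_pixels)
    fg = mask_probs >= threshold
    # if a query's predicted mask is empty, fall back to attending everywhere
    empty = ~fg.any(axis=-1, keepdims=True)
    bias = np.where(fg | empty, 0.0, -np.inf)
    weights = softmax(scores + bias, axis=-1)        # zero outside the mask
    return weights @ values
```

A query whose mask covers a single pixel simply returns that pixel's value; queries with broader masks pool features only from within their mask.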
From the recommended Hugging Face blog post (all rights and credits with them):
https://huggingface.co/blog/mask2former
"We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K). "
From https://huggingface.co/docs/transformers/main/model_doc/mask2former
arXiv preprint (all rights with the authors):
Masked-attention Mask Transformer for Universal Image Segmentation
Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar
https://arxiv.org/abs/2112.01527
#ai
#imagesegmentation
#transformers