29th IEEE International Conference on Image Processing, ICIP 2022, Bordeaux, France, 16 - 19 October 2022, pp. 4233-4237
The costly process of obtaining semantic segmentation labels has driven research toward weakly supervised semantic segmentation (WSSS) methods, in which only image-level labels are available for training. The lack of a dense semantic scene representation forces methods to grow in complexity to extract additional semantic information (i.e., object/stuff extent and boundaries) about the scene, typically through increased model complexity and sophisticated multi-stage training/refinement procedures. However, the absence of 3D geometric structure in a single image fundamentally limits such efforts beyond a certain point. In this work, we propose to harness (inverse) depth maps, estimated from a single image via a monocular depth estimation model, to integrate the 3D geometric structure of the scene into the segmentation model. Building on this proposal, we develop an end-to-end segmentation network and a self-supervised training process that learns semantic masks from only image-level annotations in a single stage. Our experiments show that our one-stage method achieves segmentation performance (val: 64.32, test: 64.91) on Pascal VOC comparable to that of significantly more complex pipelines, and outperforms state-of-the-art single-stage methods.
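The abstract does not detail how depth is integrated into the segmentation model, but one common way to exploit the intuition (pixels at similar depth tend to belong to the same object) is to refine a coarse soft mask with a depth-guided local affinity. The sketch below is a hypothetical illustration of that idea, not the paper's actual architecture; the function name, window radius, and Gaussian-style affinity on inverse depth are all assumptions for demonstration.

```python
import numpy as np

def depth_guided_refine(mask, inv_depth, radius=1, sigma=0.05):
    """Refine a coarse soft mask using inverse-depth similarity.

    Hypothetical sketch: each pixel's score is re-estimated as a
    weighted average over a local window, where the weight of a
    neighbor decays with the difference in inverse depth. Pixels on
    the same surface (similar depth) therefore reinforce each other,
    while contributions across depth discontinuities are suppressed.
    """
    refined = np.zeros_like(mask, dtype=np.float64)
    weight = np.zeros_like(mask, dtype=np.float64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Shift depth and mask so each pixel sees its (dy, dx) neighbor.
            shifted_d = np.roll(np.roll(inv_depth, dy, axis=0), dx, axis=1)
            shifted_m = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
            # Affinity is high only when inverse depths are similar.
            a = np.exp(-np.abs(inv_depth - shifted_d) / sigma)
            refined += a * shifted_m
            weight += a
    return refined / weight

# Toy scene: left half near (inverse depth 0.8), right half far (0.2).
inv_depth = np.full((8, 8), 0.2)
inv_depth[:, :4] = 0.8
mask = (inv_depth > 0.5).astype(np.float64)
mask[4, 2] = 0.0          # a noisy hole inside the near object
refined = depth_guided_refine(mask, inv_depth)
# The hole is filled in by depth-similar neighbors (8 of 9 vote "object"),
# while the far region stays untouched because cross-depth affinity ~ 0.
```

In a full pipeline, `mask` would come from the network's class activation scores and the refinement could serve as a self-supervision target, but that wiring is beyond what the abstract specifies.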