Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction


Creative Commons License

Ozkan S., AKAR G.

16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22 - 29 October 2017, pp.3094-3100

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI Number: 10.1109/iccvw.2017.366
  • City of Publication: Venice
  • Country of Publication: Italy
  • Page Numbers: pp.3094-3100
  • Affiliated with Middle East Technical University: Yes

Abstract

Frame-level visual features are generally aggregated in time with techniques such as LSTM, Fisher Vectors, and NetVLAD to produce a robust video-level representation. Here we introduce a learnable aggregation technique whose primary objective is to retain the short-time temporal structure between frame-level features and their spatial interdependencies in the representation. It can also be easily adapted to cases where training samples are very scarce. We evaluate the method on a real-fake expression prediction dataset to demonstrate its superiority. Our method obtains a score of 65% on the test dataset in the official MAP evaluation, only one misclassified decision short of the best reported result in the ChaLearn Challenge (i.e., 66.7%). Lastly, we believe that this method can be extended to other problems such as action/event recognition in the future.