1. Visual Turing Challenge
    This line of research focuses on building machines that answer questions about an image's content, as well as exploring different ways of benchmarking such machines on this complex and subjective task. Another goal is to distill and understand the main challenges behind the Visual Question Answering task. In this project, we introduce a dataset for real-world visual question answering, a set of automatic performance measures, and propose and examine two architectures: a symbolic one and a neural one. The symbolic approach, based on a semantic parser, explicitly uses a chain of perception, knowledge representation, and a formal deduction system to retrieve the answer. The alternative neural approach is an end-to-end, jointly trained, and scalable architecture that builds upon a CNN and an LSTM and generates a multi-word answer from an image and a question. More recent architectures such as Relation Networks are built for relational reasoning and achieve human-level performance on CLEVR. Part of this project was mentioned in Bloomberg Business, Science, and MIT Technology Review.
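The neural approach above can be sketched as follows: image features condition an LSTM, the question words are fed in, and answer words are then decoded greedily until an end token. This is a minimal toy sketch, not the trained model; all dimensions, parameter names, and the random initialisation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_EMB, D_HID, VOCAB = 8, 6, 10, 5  # toy sizes (illustrative)

# Randomly initialised parameters stand in for trained weights.
W_img = rng.normal(size=(D_HID, D_IMG)) * 0.1          # CNN features -> initial hidden state
E = rng.normal(size=(VOCAB, D_EMB)) * 0.1              # word embeddings
W = rng.normal(size=(4 * D_HID, D_EMB + D_HID)) * 0.1  # LSTM gate weights (i, f, o, g)
b = np.zeros(4 * D_HID)
W_out = rng.normal(size=(VOCAB, D_HID)) * 0.1          # hidden state -> word scores

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    z = W @ np.concatenate([x, h]) + b
    i, f, o = (sigmoid(z[k * D_HID:(k + 1) * D_HID]) for k in range(3))
    g = np.tanh(z[3 * D_HID:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def answer(image_feat, question_ids, max_words=3, end_id=0):
    # Condition the LSTM on the image, then consume the question words.
    h, c = np.tanh(W_img @ image_feat), np.zeros(D_HID)
    for t in question_ids:
        h, c = lstm_step(E[t], h, c)
    # Greedily decode a multi-word answer until the end token.
    out, prev = [], question_ids[-1]
    for _ in range(max_words):
        h, c = lstm_step(E[prev], h, c)
        prev = int(np.argmax(W_out @ h))
        if prev == end_id:
            break
        out.append(prev)
    return out

print(answer(np.ones(D_IMG), [1, 3, 2]))
```

A real system would replace the random matrices with a trained CNN encoder and learned embeddings, and train all parts jointly end to end.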

  2. Learning Spatial Relations
    Despite strong progress on object recognition and image-to-text retrieval techniques, surprisingly little has been done on incorporating spatial representations into the inference process. In this work, we propose a pooling interpretation of spatial relations and show how it improves the image-to-text retrieval task. We improve over previous work on two datasets and provide additional insights on a new dataset with an explicit focus on spatial relations.
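The pooling interpretation can be illustrated with a toy example: a spatial relation such as "above" selects a region of one object's detection score map relative to the other object's location, and the relation score is obtained by pooling over that region. The grids, scores, and region definitions below are hypothetical stand-ins for real detector outputs, not the paper's implementation.

```python
import numpy as np

def relation_score(map_a, map_b, relation="above"):
    """Score 'A <relation> B' by pooling over object score maps.

    map_a, map_b: 2-D arrays of per-cell detection scores for the two
    objects on a coarse spatial grid (toy stand-ins for real detectors).
    The relation determines which region of map_a is max-pooled relative
    to B's strongest location -- one simple instance of interpreting a
    spatial relation as a pooling operation.
    """
    by, bx = np.unravel_index(np.argmax(map_b), map_b.shape)
    if relation == "above":
        region = map_a[:by, :]       # rows above B's peak
    elif relation == "below":
        region = map_a[by + 1:, :]
    elif relation == "left_of":
        region = map_a[:, :bx]
    else:                            # "right_of"
        region = map_a[:, bx + 1:]
    return float(region.max()) if region.size else 0.0

# Toy 4x4 grids: object A peaks in the top row, object B in the bottom row.
A = np.zeros((4, 4)); A[0, 2] = 0.9
B = np.zeros((4, 4)); B[3, 1] = 0.8
print(relation_score(A, B, "above"))  # -> 0.9
print(relation_score(A, B, "below"))  # -> 0.0
```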

  3. Learning Smooth Pooling Regions for Visual Recognition
    This line of research argues for joint, discriminative training of the last two stages of multi-stage recognition architectures, namely pooling and classification. Here, we introduce and examine a learnable variant of the pooling stage, which couples a classifier with a novel aggregation operator. The experimental evaluation shows that our approach significantly improves over similar recognition architectures with a hand-designed pooling stage.
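The core idea can be sketched as a weighted aggregation whose spatial weights are parameters learned jointly with the classifier, with a smoothness regulariser encouraging coherent pooling regions. The sizes, weight values, and grid adjacency below are toy assumptions for illustration only.

```python
import numpy as np

def smooth_pool(features, weights):
    """Learnable pooling: weighted aggregation of local features.

    features: (P, D) local descriptors at P spatial positions.
    weights:  (R, P) one spatial weight map per pooling region; in the
              full method these are trained jointly with the classifier
              (here they are fixed toy values).
    Returns (R, D) pooled features.
    """
    return weights @ features

def smoothness_penalty(weights, neighbors):
    """Regulariser pulling neighbouring positions toward equal weights,
    which encourages smooth, spatially coherent pooling regions."""
    return float(sum((weights[:, i] - weights[:, j]) ** 2
                     for i, j in neighbors).sum())

P, D, R = 4, 3, 2                       # toy sizes: 4 positions on a 2x2 grid
feats = np.arange(P * D, dtype=float).reshape(P, D)
w = np.array([[0.5, 0.5, 0.0, 0.0],     # region 1 pools the top row
              [0.0, 0.0, 0.5, 0.5]])    # region 2 pools the bottom row
grid_neighbors = [(0, 1), (2, 3), (0, 2), (1, 3)]  # 2x2 grid adjacency

pooled = smooth_pool(feats, w)
print(pooled)                                   # -> [[1.5 2.5 3.5], [7.5 8.5 9.5]]
print(smoothness_penalty(w, grid_neighbors))    # -> 1.0
```

In training, the classification loss plus this smoothness penalty would be minimised jointly over the classifier parameters and the weight maps, replacing a hand-designed (e.g. fixed grid) pooling stage.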