
This line of research focuses on building machines that answer questions about an image's content, and on exploring different ways of benchmarking such machines on this complex and partly subjective task. A further goal is to distill and understand the main challenges behind the Visual Question Answering task. In this project, we introduce a dataset for the real-world visual question answering task, a set of automatic performance measures, and two architectures that we propose and examine: one symbolic and one neural. The symbolic approach, built around a semantic parser, explicitly chains perception, knowledge representation, and a formal deduction system to retrieve the answer. The alternative neural approach is an end-to-end, jointly trained, and scalable architecture that builds upon a CNN and an LSTM and generates multi-word answers from an image and a question. More recent architectures such as Relation Networks are designed for relational reasoning and achieve human-level performance on CLEVR. Parts of this project were mentioned in Bloomberg Business, Science, and MIT Technology Review.
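To make the symbolic pipeline concrete, here is a minimal, hypothetical sketch of its three stages: perception output is stored as a small fact base (knowledge representation), a templated question is parsed into a logical query, and a deduction step retrieves the answer. All names, the toy grammar, and the triple format are illustrative assumptions, not the project's actual parser or knowledge base.

```python
# Toy sketch of a perception -> knowledge representation -> deduction chain.
# Everything here is a simplified stand-in for the real system.

# "Perception": assume a detector produced object/relation/object triples
# for one image; these triples are our knowledge representation.
facts = {
    ("mug", "on", "table"),
    ("book", "on", "table"),
    ("lamp", "on", "desk"),
}

def parse(question):
    """Map a templated question to a logical query (subject, relation, object),
    with None marking the variable to solve for. Hypothetical mini-grammar."""
    words = question.lower().strip("?").split()
    # e.g. "What is on the table?" -> (None, "on", "table")
    if words[:3] == ["what", "is", "on"]:
        return (None, "on", words[-1])
    raise ValueError("question outside the toy grammar")

def deduce(query, kb):
    """Formal deduction step: return every binding of the query variable
    that is consistent with the fact base."""
    s, r, o = query
    return sorted(subj for (subj, rel, obj) in kb
                  if rel == r and obj == o and (s is None or s == subj))

answers = deduce(parse("What is on the table?"), facts)
# answers == ["book", "mug"]
```

The real system replaces each stage with a learned or hand-engineered component (a visual detector, a richer logical form, a theorem-prover-style inference engine), but the control flow is the same: the question never touches pixels directly; it is answered against the symbolic scene representation.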