## Introduction

Semantic understanding of an image goes well beyond object detection: By looking at a picture, we can infer relationships between objects, understand what happened prior to this image and what will happen next. Current computer vision algorithms still struggle at this task.

To try and challenge this problem, the authors provide four contributions as follows:

• They formalize the task at hand, naming it Visual Commonsense Reasoning
• They introduce a new dataset, VCR, tailored for this task
• They introduce Adversarial Matching, a way to transform natural language annotations into questions
• They provide a model for this task, called the Recognition to Cognition Network

## Visual Commonsense Reasoning (VCR)

To properly understand the content of an image, its global context, a model must be able to answer questions based on the image’s meaning. Moreover, as a human can, it should be able to reason as to why it gave the answer. This ensures that the model chose the answer for the right reasons. Questions from the dataset are written in a mix of rich natural language as well as detection tags (such as [person2]) to help provide an unambiguous link between the textual description of an object and its location. Above and below are categories of questions and examples of questions/answers/reasoning triplets

## Dataset

The dataset proposed consists of clips from the Youtube channel MovieClips 1 as well as the Large Scale Movie Description Challenge dataset2. Mask-RCNN was then used to detect objects and filter “uninteresting scenes”. That is, scenes where less than two persons and too few objects were present. They then used Amazon’s Mechanical Turk to write up to three question/answer/rationale triplets per selected frame.

It was too demanding and error-prone to ask the workers to also write false responses and rationales to the provided questions. Instead, the authors propose Adversarial Matching, a way to turn natural language into multiple choice tests. The counterfactuals must be as relevant to the context while not being too similar to the correct answer. To do so, they used two models (BERT and ESIM+ELMo) and fed them questions $$q_i$$ and responses $$r_i$$ and evaluated their relevance $$P_{rel}$$ and similarity $$P_{sim}$$. They then performed maximum-weight bipartite matching on a weight matrix given by