Understanding Vision with Language

Visual understanding consists of inferring scene properties from images. Some of these properties are mid-level scene attributes, such as motion or depth, while others are high-level, such as semantic segmentation.

In this part, we will focus on the semantics of visual processing; that is, the association between visual stimuli and meaning. The goal is to infer the meaning of what we see from the visual input. Because meaning is commonly expressed in language, there is a strong connection between semantic visual processing and natural language processing.

Outline