Understanding Vision with Language
Visual understanding consists of inferring scene properties from images. Some of these properties are mid-level scene attributes, such as motion or depth, while others are high-level descriptions, such as semantic segmentation.
In this part, we focus on the semantics of visual processing; that is, the association between visual stimuli and meaning. The goal is to infer the significance of what we see from visual input. Because meaning is most naturally expressed in words, there is a strong connection between semantic visual processing and natural language processing.
Outline
- Chapter 50 Object Recognition describes how to learn to recognize and localize objects in images and to assign words to them.
- Chapter 51.3 Learning Visual Representations from Language Supervision explores the role of language as a representation of the visual world and its connection with vision systems.