What Do We Really See When We Look at an Image? A New Theory of Perspective Perception

In his article “Toward a Theory of Perspective Perception in Pictures” (2024), Aaron Hertzmann offers a radically fresh take on how we perceive three-dimensional space in pictures. Traditional theories assume a coherent, linear perspective from a single point of view. Hertzmann argues instead that we do not experience an image through any single perspective, and he proposes a two-stage theory of pictorial perception grounded in eye movement and localized perception.

What’s wrong with linear perspective?

Hertzmann begins by questioning a long-held assumption: that linear perspective is the most accurate way to represent the world. As he writes, “Few classical paintings strictly obey linear perspective, nor do the best computational photography techniques to avoid distortion.”

The proposal: a local and dynamic vision

Hertzmann’s key hypothesis is that 3D shape perception happens first on a local scale, at the point of gaze. Each time we fixate on a spot in an image, we perceive the shape of a small region around it. As our eyes move, our brain stitches these fragments into a broader (but often unstable) interpretation. In his words: “The global interpretation of a picture may be fragmented, ambiguous, and change over time.”

Why does this matter?

It disrupts the myth of a total, static view: We don’t see the whole picture at once. We build a mental image from moving glimpses.
It validates artistic distortion: Multiple projections, local inconsistencies, and spatial breakdowns are not mistakes—they’re legitimate visual strategies.
It aligns with how modern visual technologies work: Tools like content-aware projections and panoramic stitching may come closer to how we actually see than standard cameras do.

Key quote:

“Each fixation has its own projection, but due to the nature of foveal and peripheral vision, shape information is primarily gathered from a small region around the fixation.”

Why is this relevant now?

In the age of generative AI, where images are mass-produced by algorithms trained on millions of photos and renders, Hertzmann’s theory asks us to rethink the underlying visual ideology: Why do these models still replicate a single, fixed perspective? Could we imagine new kinds of images built on movement, fragmentation, or multiplicity?

Hertzmann gives us a hint: there is no single correct perspective. There are as many perspectives as there are gazes in motion.