Remarks on Spatial Localization in VLMs

Sun, 17 Dec 2023 00:00:00 -0800

Prelude

This all started when I came across this tweet from Timothee Darcet (co-first author on DINOv2):

https://x.com/TimDarcet/status/1726320282028360131?s=20

This was in response to people overreacting to the claim that the final problem in computer vision was for AI to tell the difference between a blueberry muffin and a chihuahua, which, to be fair, is a rather funny joke. It turns out that AI models can do this quite well, though, and have been able to ever since CLIP came out! So what’s the big deal?
