Mirage: Multimodal Reasoning in VLMs Without Rendering Images

While vision-language models (VLMs) are strong at understanding both text and images, they often rely solely on text when reasoning, which limits their ability to solve tasks that require visual thinking, such as spatial puzzles. People naturally visualize…
