UniME: A Two-Stage Framework for Enhancing Multimodal Representation Learning with MLLMs

The CLIP framework has become foundational in multimodal representation learning, particularly for tasks such as image-text retrieval. However, it faces several limitations: a strict 77-token cap on text input, a dual-encoder design…
