How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov*, Oğuzhan Fatih Kar*, Amir Zamir*
ICLR, 2026
We benchmark top multimodal models like GPT-4o and Gemini on standard vision tasks using a prompt-based framework. While these models are strong generalists, especially on semantic tasks, they still trail specialized vision models, particularly in geometry.
On Evaluation of Vision Datasets and Models using Human Competency Frameworks
Rahul Ramachandran, Tejal Kulkarni, Charchit Sharma, Deepak Vijaykeerthy, Vineeth N Balasubramanian
DMLR@ICML, 2024
We use Item Response Theory (IRT) to assess model calibration, select informative data subsets, and demonstrate the usefulness of various latent parameters for analyzing and comparing models and datasets in computer vision.