Breaking Common Sense: WHOOPS!
A Vision-and-Language Benchmark of Synthetic and Compositional Images

Ben Gurion University of the Negev, The Hebrew University of Jerusalem,
Allen Institute for Artificial Intelligence, University of Washington
*Equal Contribution





[Figures: Tasks Evaluation; Explanation of Violation Evaluation; ICCV poster; NeurIPS poster]



What makes these images weird?


Albert Einstein is holding a smartphone

Albert Einstein died in 1955, decades before the first smartphone was invented (1994)


A lit candle is sitting inside a tightly sealed glass jar

A candle needs a constant supply of oxygen to burn, which a tightly sealed jar cannot provide, so a candle burning inside a sealed jar is unlikely


A wolf howls at the sun

Wolves are known to howl at the full moon, so a wolf howling at the sun in the middle of the day would be unusual

To investigate the main challenge in WHOOPS!, we collect two control sets: normal images (synthetic, not weird) and natural images (non-synthetic, not weird). The BLIP2 model performs well on both non-weird sets but struggles on the weird images, indicating that weirdness, not synthesis, is the primary challenge.
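The reasoning behind this control comparison can be illustrated with a minimal sketch. It assumes each model prediction has been reduced to a pair of image group and correctness flag. The function name `accuracy_by_group` and the records below are hypothetical placeholders for illustration, not the paper's code or results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: fraction correct} for an iterable of (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


# Made-up placeholder records, NOT results from the paper:
# (image group, did the model answer correctly?)
toy_records = [
    ("weird", False), ("weird", True), ("weird", False), ("weird", False),
    ("normal", True), ("normal", True), ("normal", False), ("normal", True),
    ("natural", True), ("natural", True), ("natural", True), ("natural", False),
]

scores = accuracy_by_group(toy_records)

# natural vs. normal differ only in synthesis; normal vs. weird differ only in weirdness.
synthesis_gap = scores["natural"] - scores["normal"]
weirdness_gap = scores["normal"] - scores["weird"]
print(scores, synthesis_gap, weirdness_gap)
```

If the gap between natural and normal images is small while the gap between normal and weird images is large, the drop is attributable to weirdness rather than to the images being synthetic. That is the pattern the paragraph above describes.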


Poster Presentations


        doi = {10.48550/ARXIV.2303.07274},
        url = {},
        author = {Bitton-Guetta, Nitzan and Bitton, Yonatan and Hessel, Jack and Schmidt, Ludwig and Elovici, Yuval and Stanovsky, Gabriel and Schwartz, Roy},
        keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences},
        title = {Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images},
        publisher = {arXiv},
        year = {2023},
        copyright = { perpetual, non-exclusive license}