Breaking Common Sense: WHOOPS!
A Vision-and-Language Benchmark of Synthetic and Compositional Images

Ben Gurion University of the Negev, The Hebrew University of Jerusalem,
Allen Institute for Artificial Intelligence, University of Washington
*Equal Contribution

What makes these images weird?

Albert Einstein is holding a smartphone

Albert Einstein died in 1955, decades before the first smartphone was invented (1994)

A lit candle is sitting inside a tightly sealed glass jar

A candle needs a constant supply of oxygen to burn, which a tightly sealed jar cannot provide, so a candle is unlikely to stay lit inside one

A wolf howls at the sun

Wolves are known for howling at the full moon, so a wolf howling at the sun in the middle of the day is unusual

To investigate the main challenge in WHOOPS!, we collect normal (synthetic, not weird) and natural (non-synthetic, not weird) images. The BLIP2 model performs well on the non-weird cases but struggles on the weird ones, indicating that weirdness, not synthesis, is the primary challenge.

BibTeX

@misc{https://doi.org/10.48550/arxiv.2303.07274,
        doi = {10.48550/ARXIV.2303.07274},
        url = {https://arxiv.org/abs/2303.07274},
        author = {Bitton-Guetta, Nitzan and Bitton, Yonatan and Hessel, Jack and Schmidt, Ludwig and Elovici, Yuval and Stanovsky, Gabriel and Schwartz, Roy},
        keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences},
        title = {Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images},
        publisher = {arXiv},
        year = {2023},
        copyright = {arXiv.org perpetual, non-exclusive license}
}