Breaking Common Sense: WHOOPS!
A Vision-and-Language Benchmark of Synthetic and Compositional Images

Ben Gurion University of the Negev,   The Hebrew University of Jerusalem,
Allen Institute for Artificial Intelligence,   University of Washington
*Equal Contribution
arXiv | 🤗 Dataset | 🤗 Explorer | Evaluation Colab

What makes these images weird?

Albert Einstein is holding a smartphone

Albert Einstein died in 1955, decades before the first smartphone was invented (1994)

A lit candle is sitting inside a tightly sealed glass jar

A candle needs a constant supply of oxygen to burn, which a tightly sealed jar does not provide, so a candle is unlikely to stay lit inside one

A wolf howls at the sun

Wolves are known to howl at the full moon, so a wolf howling at the sun in the middle of the day is unusual

We collect normal (synthetic, not weird) and natural (non-synthetic, not weird) images to investigate what makes WHOOPS! challenging. The BLIP2 model performs well on the non-weird cases but struggles on the weird ones, indicating that weirdness, not synthesis, is the primary challenge.
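
The comparison above amounts to averaging a per-image score within each image group and checking where the drop occurs. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the function name `mean_score_by_group`, the group labels, and the toy scores are all assumptions made for illustration, and the numbers are placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_group(records):
    """Average a per-image score within each image group.

    `records` is an iterable of (group, score) pairs. The group labels used
    here ("weird", "normal", "natural") are hypothetical names for the three
    image sets described in the text.
    """
    scores = defaultdict(list)
    for group, score in records:
        scores[group].append(score)
    return {group: mean(values) for group, values in scores.items()}


# Placeholder scores for illustration only; these are NOT results from the paper.
toy_records = [
    ("weird", 0.0), ("weird", 1.0),      # synthetic and weird
    ("normal", 1.0), ("normal", 1.0),    # synthetic, not weird
    ("natural", 1.0), ("natural", 1.0),  # non-synthetic, not weird
]
toy_means = mean_score_by_group(toy_records)
```

If the "normal" and "natural" averages are close to each other while the "weird" average is clearly lower, the gap tracks weirdness rather than whether the image is synthetic, which is the pattern the text reports for BLIP2.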

Paper

Leaderboard

To submit your results to the leaderboard, please send an email to: yonatanbitton1@gmail.com.

The WHOOPS! benchmark presents four tasks: Explanation-of-violation, Image Captioning, Image-text Matching and Visual Question Answering (VQA).
The evaluation Colab implements three of them: Image Captioning, Image-text Matching and VQA.
The Explanation-of-violation task currently relies on human evaluation. If you want to compute your results on this task, please send us an email.

Image Captioning
BLIP2 FlanT5-XXL FT: 177
BLIP2 FlanT5-XL FT: 174
BLIP2 FlanT5-XXL: 120

Image-Text Matching
BLIP2 FlanT5-XXL Text-only FT: 94
BLIP2 FlanT5-XXL FT: 84
BLIP2 FlanT5-XL FT: 81

Visual Question Answering
BLIP2 FlanT5-XXL FT: 57
BLIP2 FlanT5-XL FT: 55
BLIP2 FlanT5-XXL: 55

Explanation of Violation
Ground-truth Caption → GPT3 (Oracle): 94
Predicted Caption → GPT3: 33
BLIP2 FlanT5-XXL FT: 27

BibTeX

@misc{https://doi.org/10.48550/arxiv.2303.07274,
  doi = {10.48550/ARXIV.2303.07274},
  url = {https://arxiv.org/abs/2303.07274},
  author = {Bitton-Guetta, Nitzan and Bitton, Yonatan and Hessel, Jack and Schmidt, Ludwig and Elovici, Yuval and Stanovsky, Gabriel and Schwartz, Roy},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images},
  publisher = {arXiv},
  year = {2023},
  copyright = {arXiv.org perpetual, non-exclusive license}
}