Example explanations of commonsense violations from the dataset:
- Albert Einstein died in 1955, decades before the first smartphone was invented (1994).
- A candle needs a constant supply of oxygen to burn, which a sealed bottle cannot provide, so a burning candle inside a sealed bottle is unlikely.
- Wolves are known to howl at a full moon, so a wolf howling in the middle of the day is unexpected.
Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense.
For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, playfully violating our expectation that their competition should take place on the football field.
Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers using publicly available image-generation tools such as Midjourney.
We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual.
Our results show that state-of-the-art models such as GPT-3 and BLIP-2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities.
WHOOPS! is a dataset of 500 synthetic images and 10,874 annotations designed to challenge AI models' ability to reason about commonsense and compositionality. To construct WHOOPS!, we collaborate with designers who use text-to-image models such as Midjourney and DALL-E to generate images that would be challenging (or even impossible) to collect otherwise.
WHOOPS! contains images that defy commonsense for a wide range of reasons, including deviations from expected social norms and everyday knowledge.
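Assuming the dataset is hosted on the Hugging Face Hub, the minimal sketch below loads it with the `datasets` library; the repository id `nlphuji/whoops`, the split name, and the printed fields are assumptions that may differ from the released dataset card.

```python
# Minimal loading sketch for WHOOPS! using the Hugging Face `datasets` library.
# ASSUMPTIONS: the repo id "nlphuji/whoops" and the split name may differ from
# the official release; consult the dataset card for the exact identifiers.
from datasets import load_dataset

whoops = load_dataset("nlphuji/whoops", split="test")

print(len(whoops))          # expected to be on the order of 500 images
print(whoops.column_names)  # inspect the available annotation fields
```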
The WHOOPS! benchmark includes four tasks: explanation-of-violation (generating an explanation of why an image defies commonsense), image captioning, cross-modal image-text matching, and visual question answering.
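As an illustration of the explanation-of-violation task, the sketch below prompts a zero-shot BLIP-2 checkpoint from the `transformers` library to explain what is unusual about an image. The prompt wording, the checkpoint choice, and the image path are our own assumptions for illustration, not the exact setup evaluated in the paper.

```python
# Illustrative zero-shot explanation-of-violation with BLIP-2 (not the paper's
# exact setup); the prompt, checkpoint, and image path are assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

checkpoint = "Salesforce/blip2-flan-t5-xl"  # smaller sibling of FlanT5-XXL
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint).to(device)

image = Image.open("weird_image.jpg")  # hypothetical WHOOPS!-style image
prompt = "Question: What is unusual about this image? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=60)
explanation = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(explanation.strip())
```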
Models significantly lag behind human performance. For example, on the identification task, the best end-to-end fine-tuned BLIP-2 FlanT5-XXL model achieves at most 73%. For explanation, even an oracle model (which is given access to a ground-truth, human-authored description of the image) reaches only 68%, falling substantially short of human performance (95%). We also report automatic-evaluation results that correlate with the human evaluation. These results indicate that our dataset provides a challenging benchmark for the development of next-generation vision-and-language models.
The zero-shot results highlight the strengths and weaknesses of each model. Zero-shot BLIP-2 demonstrates a substantial improvement over the other models, but even the supervised models have significant room for improvement, especially on VQA (a maximum BEM score of 57%) and image captioning.
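For the cross-modal image-text matching task, a simple zero-shot baseline can be sketched with CLIP: score the image against the correct commonsense-defying caption and a "normalized" distractor, and keep the higher-scoring caption. The checkpoint, captions, and image path below are illustrative assumptions rather than the paper's evaluation code.

```python
# Illustrative zero-shot image-text matching baseline with CLIP; the captions
# and image path are made up, and this is not the paper's evaluation code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("messi_ronaldo_chess.jpg")  # hypothetical image file
captions = [
    "Lionel Messi and Cristiano Ronaldo playing chess.",     # matching caption
    "Lionel Messi and Cristiano Ronaldo playing football.",  # distractor
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, 2)

best = logits_per_image.argmax(dim=-1).item()
print("CLIP prefers:", captions[best])
```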
To cite WHOOPS!:

@misc{https://doi.org/10.48550/arxiv.2303.07274,
doi = {10.48550/ARXIV.2303.07274},
url = {https://arxiv.org/abs/2303.07274},
author = {Bitton-Guetta, Nitzan and Bitton, Yonatan and Hessel, Jack and Schmidt, Ludwig and Elovici, Yuval and Stanovsky, Gabriel and Schwartz, Roy},
keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences},
title = {Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images},
publisher = {arXiv},
year = {2023},
copyright = {arXiv.org perpetual, non-exclusive license}
}