Title: CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Authors: Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna
Word Count: Approximately 7,600 words
Estimated Read Time: 25-30 minutes
Summary:
The paper introduces CREPE, a benchmark for evaluating the compositional reasoning abilities of vision-language foundation models. Compositionality refers to the ability to understand and generate complex visual scenes or statements by combining simpler parts. The benchmark covers two important aspects of compositionality: systematicity and productivity.
Systematicity evaluates a model’s ability to systematically recombine known visual concepts into unseen combinations. CREPE defines three systematicity splits according to what appears in a model’s pretraining data: both the individual concepts and their combinations (“Seen Compounds”), the individual concepts but not their combinations (“Unseen Compounds”), or not even the individual concepts (“Unseen Atoms”). CREPE finds that most models’ performance decreases when they are evaluated on unseen combinations of concepts, especially for models trained on the larger LAION-400M dataset.
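The split assignment can be pictured as a membership check against the model’s pretraining data. The sketch below is a hypothetical illustration; the function and variable names are invented for this example and are not CREPE’s actual code.

```python
# Hypothetical sketch of assigning a retrieval example to a systematicity split.

def assign_split(example_atoms, example_compounds, seen_atoms, seen_compounds):
    """example_atoms     -- set of atomic concepts (objects, attributes, relations)
    example_compounds -- set of concept combinations appearing in the caption
    seen_atoms        -- atoms occurring in the model's pretraining captions
    seen_compounds    -- compounds occurring in the pretraining captions"""
    if not example_atoms <= seen_atoms:
        return "Unseen Atoms"       # at least one atom never appears in training
    if not example_compounds <= seen_compounds:
        return "Unseen Compounds"   # all atoms seen, but some combination is new
    return "Seen Compounds"         # both atoms and their combinations were seen


# Example: "a red apple on a wooden table"
atoms = {"apple", "red", "table", "wooden", "on"}
compounds = {("red", "apple"), ("wooden", "table"), ("apple", "on", "table")}
print(assign_split(atoms, compounds, seen_atoms=atoms, seen_compounds={("red", "apple")}))
# -> "Unseen Compounds"
```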
Productivity evaluates a model’s ability to comprehend visual concepts of increasing compositional complexity. CREPE includes captions containing between 4 and 12 visual concepts. It finds that most models struggle on captions of higher compositional complexity, with retrieval performance approaching random chance as complexity grows.
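One way such a productivity analysis might look is sketched below: image-to-text retrieval results are bucketed by caption complexity and Recall@1 is compared against random chance over the candidate set. The function name, candidate count, and example ranks are purely illustrative.

```python
# Illustrative sketch: Recall@1 by caption complexity vs. random chance.
from collections import defaultdict

def recall_at_1_by_complexity(results, num_candidates=6):
    """results: list of (complexity, rank_of_ground_truth) pairs,
    where rank 1 means the correct caption was retrieved first."""
    buckets = defaultdict(list)
    for complexity, rank in results:
        buckets[complexity].append(rank == 1)

    chance = 1.0 / num_candidates  # random guessing over the candidate captions
    for complexity in sorted(buckets):
        hits = buckets[complexity]
        r1 = sum(hits) / len(hits)
        print(f"{complexity:2d} atoms: Recall@1 = {r1:.3f} (chance = {chance:.3f})")

# Example call with made-up ranks for captions of 4 and 12 atoms:
recall_at_1_by_complexity([(4, 1), (4, 2), (12, 3), (12, 5)])
```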
Overall, CREPE finds that vision-language foundation models trained with a contrastive loss struggle at compositional reasoning, even when trained on large datasets and with large model architectures. CREPE aims to provide a comprehensive benchmark for tracking the emergence of compositionality as vision-language models improve.
CREPE provides large-scale datasets with ground-truth image-caption pairs and hard negative captions for evaluating both systematicity and productivity. The evaluation task is image-to-text retrieval: the model must rank the ground-truth caption above hard negatives that differ from it in only minimal ways, which isolates specific model failure modes.
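A hedged sketch of this retrieval setup using the Hugging Face CLIP API is shown below. The image path and the example captions, including the swap and negation foils, are invented for illustration and are not drawn from CREPE itself.

```python
# Sketch: rank a ground-truth caption against minimally different hard negatives.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
captions = [
    "a red apple on a wooden table",       # ground truth
    "a wooden apple on a red table",       # swap foil: attributes exchanged
    "a red apple that is not on a table",  # negation foil
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # image-to-caption similarities

ranked = sorted(zip(captions, logits.tolist()), key=lambda x: -x[1])
for caption, score in ranked:
    print(f"{score:6.2f}  {caption}")
# A compositional model should place the ground-truth caption first.
```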
Researchers can use CREPE to identify gaps in current foundation models relating to compositionality. Improving compositionality could make models more controllable and robust. CREPE’s hard negative generation approach could also be used to improve the training of compositional models.
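One speculative way to reuse the generated foils during training is to append them to the text candidates of a CLIP-style contrastive loss, as sketched below. This is an assumption about how such training could look, not the paper’s recipe; the function and argument names are hypothetical.

```python
# Speculative sketch: contrastive loss with per-image hard negative captions.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(image_emb, text_emb, hard_neg_emb, temperature=0.07):
    """image_emb:    (B, D) image embeddings
    text_emb:     (B, D) matching caption embeddings
    hard_neg_emb: (B, K, D) K generated hard-negative captions per image"""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    hard_neg_emb = F.normalize(hard_neg_emb, dim=-1)

    # In-batch similarities (B, B) plus each image's own hard negatives (B, K).
    in_batch = image_emb @ text_emb.t()
    hard = torch.einsum("bd,bkd->bk", image_emb, hard_neg_emb)
    logits = torch.cat([in_batch, hard], dim=1) / temperature

    # The positive caption for image i is still column i.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return F.cross_entropy(logits, targets)
```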
The benchmark has limitations: CREPE relies on a scene graph representation to define compositional language; the automatically generated hard negatives can be noisy, especially the swap and negation foils; and the productivity evaluation relies on automatically generated captions.
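For concreteness, the following sketch shows one possible (hypothetical) encoding of a scene graph and how atoms and compounds could be read off of it; the class and field names are illustrative, not CREPE’s implementation.

```python
# Hypothetical scene-graph encoding: atoms are objects, attributes, and
# relation predicates; compounds are their combinations.
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: list                                      # e.g. ["apple", "table"]
    attributes: dict = field(default_factory=dict)     # object -> [attributes]
    relations: list = field(default_factory=list)      # (subject, predicate, object)

    def atoms(self):
        attrs = [a for vals in self.attributes.values() for a in vals]
        preds = [p for _, p, _ in self.relations]
        return set(self.objects) | set(attrs) | set(preds)

    def compounds(self):
        attr_pairs = {(a, o) for o, vals in self.attributes.items() for a in vals}
        return attr_pairs | set(self.relations)

g = SceneGraph(
    objects=["apple", "table"],
    attributes={"apple": ["red"], "table": ["wooden"]},
    relations=[("apple", "on", "table")],
)
print(g.atoms())      # {'apple', 'table', 'red', 'wooden', 'on'}
print(g.compounds())  # {('red', 'apple'), ('wooden', 'table'), ('apple', 'on', 'table')}
```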