Intel will not provide or guarantee development of or support for this project, including but not limited to, maintenance, bug fixes, new releases or updates.
Patches to this project are no longer accepted by Intel.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the community, please create your own fork of the project.
Our paper, "Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation," was accepted at ECCV 2024. If you find the code useful, please cite the following paper:
@article{xiongtextual,
title={Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation},
author={Xiong, Peixi and Kozuch, Michael and Jain, Nilesh}
}

- Overview
- Results
- Dataset
- Evaluation Metrics
Text-to-image generation plays a pivotal role in computer vision and natural language processing by translating textual descriptions into visual representations. However, understanding complex relations in detailed text prompts filled with rich relational content remains a significant challenge.
To address this, we introduce a novel task: Logic-Rich Text-to-Image generation. Unlike conventional image generation tasks that rely on short and structurally simple natural language inputs, our task focuses on intricate text inputs abundant in relational information. To tackle these complexities, we collect the Textual-Visual Logic dataset, designed to evaluate the performance of text-to-image generation models across diverse and complex scenarios.
To better evaluate understanding and reasoning in the text-to-image generation task, we have compiled a novel dataset comprising 15,213 samples. Each sample includes a long, content-rich text prompt and its corresponding images. To assess the degree of reasoning required, we have established six categories for the logic-rich text-to-image generation (LRT2I) task.
For task evaluation, specific metrics, rather than pixel-level measurements, are required to effectively assess a model's ability to generate images that reflect the structural information in the text input. Given the ill-posed nature of the problem (illustrated in the figure below), these metrics should concentrate on aligning the entities present, and the relations among them, between the ground-truth and generated images. Consequently, the objectives of the task should also prioritize these aspects.
We adapted evaluation metrics from previous work that emphasizes relational information similar to our study, making its methodology applicable to our evaluation. We adopt two main metrics:
- Object Presence Matches: Evaluates the model's accuracy in identifying and generating objects from the text prompt by comparing the objects in the generated images with those in the ground truth. Metrics include average precision (AP), average recall (AR), and the F1 score for each scene.
- Object Position Relation Matches: Assesses spatial accuracy by comparing object positions in the generated images with those in the ground truth, indicating the model's understanding of the spatial dynamics described in the text. The relational similarity (RSIM) measures object arrangement; a sketch of its formulation is given after this list.
For RSIM, recall denotes the ratio of objects detected in the generated image to those present in the ground truth.
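The RSIM formula itself appears to have been rendered as an image in the original README and is not reproduced here. A formulation consistent with the description above, assumed purely for illustration, scales object recall by the fraction of ground-truth pairwise spatial relations preserved in the generated image:

$$\mathrm{RSIM} = \text{recall} \times \frac{\lvert E_{\text{gen}} \cap E_{\text{gt}} \rvert}{\lvert E_{\text{gt}} \rvert}$$

where $E_{\text{gt}}$ and $E_{\text{gen}}$ denote the sets of pairwise spatial relations (e.g., left-of, above) among objects in the ground-truth and generated images, respectively.

The sketch below shows how both metric families could be computed per scene under this assumption; the function names and data layout are illustrative and are not the repository's actual API.

```python
# Illustrative sketch only: per-scene object presence and relation metrics.
# Assumes objects are label strings and relations are (subject, predicate, object) triples.

def presence_scores(gt_objects, gen_objects):
    """Precision, recall, and F1 over object labels for a single scene."""
    gt, gen = set(gt_objects), set(gen_objects)
    matched = gt & gen
    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def relational_similarity(gt_relations, gen_relations, recall):
    """RSIM under the assumed formulation: object recall scaled by the fraction
    of ground-truth spatial relations reproduced in the generated image."""
    gt, gen = set(gt_relations), set(gen_relations)
    if not gt:
        return recall
    return recall * len(gt & gen) / len(gt)

# Example for one scene (hypothetical labels and relations):
precision, recall, f1 = presence_scores(["dog", "tree"], ["dog", "tree", "car"])
rsim = relational_similarity({("dog", "left_of", "tree")},
                             {("dog", "left_of", "tree")},
                             recall)
```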
Under review by Intel Governance, Risk, and Compliance Department.


