Zhiyuan Yan¹*, Bohan Hou¹, Li Yuan¹,²
¹Peking University, Shenzhen Graduate School, ²Peng Cheng Laboratory, ³Rabbitpre AI
*Equal Contribution.
💡 We also have other image-editing projects that may interest you ✨.
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, et al.
ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, covering both novel and complex single-turn edits as well as challenging multi-turn tasks.
To ensure data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, and a segmentation model, together with task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality.
Using ImgEdit, we train ImgEdit-E1, an editing model that uses a vision-language model to process the reference image and the editing prompt. It outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and of the model design.
For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image-editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic test suite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1.
- [2025.06.24] We use GPT-4.1 to re-evaluate the results on ImgEdit-Bench, including BAGEL and UniWorld-V1. More details can be found in #14 and #16.
- [2025.06.03] We have open-sourced UniWorld-V1, which inherits the powerful editing capabilities of ImgEdit-E1. It is trained on a 700K subset of ImgEdit. For more details, please refer to https://github.com/PKU-YuanGroup/UniWorld-V1.
- [2025.05.26] We have finished uploading the ImgEdit datasets together with the original dataset.
- Release ImgEdit datasets.
- Release ImgEdit original dataset (with dense captions, object-level bounding boxes, and object-level segmentation masks).
- Release data curation pipelines.
- Release benchmark datasets.
We comprehensively categorize single-turn image editing into 10 task types and multi-turn image editing into 3 task types.
We provide some example cases from our dataset.
- Data preparation & filtering (using the Laion-aes dataset; only samples with an aesthetic score greater than 4.75 are retained, then Qwen2.5-VL-7B generates a dense caption and GPT-4o generates a short caption).
- Bounding-box and segmentation-mask generation (using YOLO-World and SAM2), followed by CLIP-based region filtering (a minimal sketch of this filter appears after this list).
- Diverse edit-prompt generation using GPT-4o.
- Task-based editing pipelines built on ComfyUI.
- Data-quality filtering using GPT-4o. We also release the dataset after segmentation, since it may be a useful dataset for training other models (e.g., VLMs).
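For illustration, here is a minimal sketch (not the official pipeline code) of the per-region CLIP filter: each detected box is cropped and scored against its predicted class name, and low-scoring regions are dropped. The model checkpoint, threshold, and helper name are placeholders, and the exact scoring behind the clip_score field in the preprocess json may differ.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice; any CLIP model works for this sketch.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_region_score(image: Image.Image, bbox, class_name: str) -> float:
    # bbox is in xyxy pixel coordinates, as in the preprocess json ("box_format": "xyxy").
    x0, y0, x1, y1 = map(int, bbox)
    crop = image.crop((x0, y0, x1, y1))
    inputs = clip_processor(text=[f"a photo of a {class_name}"], images=crop,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).item())

# Keep a detection only if its crop matches the predicted class well enough.
# The threshold here is a placeholder, not the value used to build ImgEdit.
KEEP_THRESHOLD = 0.25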
Preprocess Data Json Format
See the rle_to_mask function to convert the string-encoded mask into an image; a minimal decoding sketch is also given after the example below.
{
"path": "00095/00019/000197953.jpg", # image path
"cap": [
"The image depicts a couple dressed in wedding attire standing on a rocky cliff overlooking ..."
], # dense caption provided by Qwen2.5-VL-7B
"resolution": {
"height": 1333,
"width": 2000
}, # resolution of the image
"aes": 6.132755279541016, # aesthetic score of the image
"border": [
176
], # unused field
"tags": {
"background": [
"ocean",
"sky",
...
], # background nouns, extracted by GPT-4o from the dense caption
"object": [
"bride",
"bow tie"
...
], # object nouns, extracted by GPT-4o from the dense caption
"summary": "A couple in wedding attire poses on a rocky cliff overlooking a scenic ocean, creating a romantic coastal setting." # short caption summarized by GPT-4o from the dense caption
},
"segmentation": {
"background": [
{
"class_name": "sky",
"bbox": [
0.505828857421875,
1.1066421270370483,
2000.0,
738.3450317382812
], # YOLO-World bounding box (xyxy, pixel coordinates)
"mask": "xxx..", # segmentation mask encoded as a string (see rle_to_mask)
"score": 0.9921875, # YOLO-World detection confidence
"clip_score": 0.9597620368003845, # CLIP score for the corresponding area
"aes_score": 4.34375 # aesthetic score of the corresponding area
},
{
"class_name": "ocean",
...
},
...
],
"object": [
{
"class_name": "bride",
...
},
{
"class_name": "bow tie",
...
},
{
"class_name": "bow tie",
...
},
...
],
"box_format": "xyxy"
},
"bg_count": {
"ocean": 1,
"sky": 1,
...
}, # number of occurrences of each background name (obj_count below does the same for object names)
"obj_count": {
"bow tie": 2,
"bride": 1,
...
}
}
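For reference, a minimal decoding sketch is shown below. It assumes the string-encoded mask is COCO-style RLE and decodes it with pycocotools; the repository's rle_to_mask utility is authoritative, and the helper name and exact encoding here are assumptions.

import numpy as np
from pycocotools import mask as mask_utils

def rle_string_to_mask(mask_str: str, height: int, width: int) -> np.ndarray:
    # Assumes COCO-style compressed RLE stored as a string; adapt this if the
    # official rle_to_mask uses a different encoding.
    rle = {"size": [height, width], "counts": mask_str.encode("utf-8")}
    return mask_utils.decode(rle)  # H x W uint8 binary mask

# Illustrative usage: decode the "sky" mask from the json entry above.
# sky = entry["segmentation"]["background"][0]
# mask = rle_string_to_mask(sky["mask"], entry["resolution"]["height"], entry["resolution"]["width"])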
ImgEdit Dataset Format
The final ImgEdit dataset is stored in Parquet files; the corresponding input and output images are packed in the .tar files.
from datasets import load_dataset
ds = load_dataset("parquet", data_files="./remove_part5.parquet")
print(ds['train'][0])
# {'input_images': ['results_remove_laion_part5/00094_00030_000304748/original.png'], 'output_images': ['results_remove_laion_part5/00094_00030_000304748/result.png'], 'prompt': 'Remove the group of people snowshoeing in winter clothing located in the far-right upper-middle of the image.'}
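The image paths stored in the parquet are relative. Assuming the matching single-turn .tar has been merged and extracted into a local directory (the layout is inferred from the paths above, not guaranteed), a record can be turned into an image pair roughly as follows:

import os
from PIL import Image

record = ds["train"][0]
root = "."  # directory into which the merged .tar was extracted (assumption)

input_image = Image.open(os.path.join(root, record["input_images"][0]))
output_image = Image.open(os.path.join(root, record["output_images"][0]))
instruction = record["prompt"]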
For multi-turn dataset:
from datasets import load_dataset
ds = load_dataset("parquet", data_files="./version_backtracking_part0.parquet")
print(ds['train'][0])
# {'data': [{'input_images': ['results_version_backtracking_part0/00031_00020_000209651/origin_0.png'], 'output_images': ['results_version_backtracking_part0/00031_00020_000209651/result_0.png'], 'prompt': 'add a green vest in the middle-right area of the image, covering a torso sized approximately from mid-waist to chest'}, {'input_images': ['results_version_backtracking_part0/00031_00020_000209651/origin_1.png'], 'output_images': ['results_version_backtracking_part0/00031_00020_000209651/result_1.png'], 'prompt': 'replace the green vest with a brown leather jacket'}, {'input_images': ['results_version_backtracking_part0/00031_00020_000209651/origin_2.png'], 'output_images': ['results_version_backtracking_part0/00031_00020_000209651/result_2.png'], 'prompt': 'withdraw the previous round of modifications, adjust the green vest in round1 to have a brighter shade of green and a subtle quilted texture'}]}
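Each multi-turn record keeps its rounds as a list under the data key, and every round carries the same input_images / output_images / prompt fields as a single-turn record. A short sketch for walking through the rounds:

sample = ds["train"][0]
for turn, round_data in enumerate(sample["data"], start=1):
    print(f"round {turn}: {round_data['prompt']}")
    # round_data["input_images"] / round_data["output_images"] hold the
    # per-round image paths, resolved the same way as in the single-turn case.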
Preprocess Dataset
- ImgEdit_recap_mask/
- laion-aes/ # tar files with filtered laion-aes image
- 00000.tar
- ...
- jsons/ # jsons with caption, bbox, and mask
- part0.tar
- ...
ImgEdit Dataset
- ImgEdit/
- Multiturn/ # multi turn image data
- results_content_memory_part2.tar.split.000
- ...
- Singleturn/ # single turn image data
- action_part1.tar.split.000
- ...
- Parquet/ # prompts and image paths for all tasks
- add_part0.parquet
- ...
- ImgEdit_judge/ # model checkpoint in Qwen2.5-VL format
- config.json
- model-00001-of-00004.safetensors
- ...
- all_dataset_gpt_score.json # all post-processing scores
- Benchmark.tar # dataset for benchmark
We release both the preprocess dataset and the ImgEdit dataset. The datasets are available on Hugging Face, or you can download them with the following commands. Some samples can be found in our paper and on GitHub.
huggingface-cli download --repo-type dataset \
sysuyy/ImgEdit \
--local-dir ...
huggingface-cli download --repo-type dataset \
sysuyy/ImgEdit_recap_mask \
--local-dir ...
The tar packages are split into pieces; merge them with cat a.tar.split.* > a.tar (a Python equivalent is sketched below).
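If cat is unavailable (e.g., on Windows), the following Python sketch performs the same merge; the file name is illustrative, taken from the directory listing above.

import glob
import shutil

# Merge split tar pieces (equivalent to `cat a.tar.split.* > a.tar`).
parts = sorted(glob.glob("action_part1.tar.split.*"))  # illustrative file name
with open("action_part1.tar", "wb") as merged:
    for part in parts:
        with open(part, "rb") as piece:
            shutil.copyfileobj(piece, merged)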
WIP
ImgEdit-Bench consists of three key components: a basic editing suite that evaluates instruction adherence, editing quality, and detail preservation across a diverse range of tasks; an Understanding-Grounding-Editing (UGE) suite, which increases task complexity through challenging instructions (e.g., spatial reasoning and multi-object targets) and complex scenes such as multi-instance layouts or camouflaged objects; and a multi-turn editing suite, designed to assess content understanding, content memory, and version backtracking.
Basic-Bench: See Basic_bench for details.
Understanding-Grounding-Editing(UGE)-Bench: See UGE_bench for details.
Multi-Turn-Bench: See Multiturn_bench for details and more cases.
- Set up your environment following Qwen2.5-VL.
- Download the ImgEdit_Judge checkpoint from Hugging Face.
- We give demo code below; replace the prompt with the task-specific prompt in prompts.json to get the best performance.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
prompt = ""
# Load the processor
# Load the model with recommended configurations
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
# "/mnt/workspace/yangye/Qwen2.5-VL-7B-Instruct",
"ImgEdit_Judge/checkpoint/path",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
local_files_only=True,
)
min_pixels = 1016064  # we trained the judge with these pixel limits
max_pixels = 1354752  # we trained the judge with these pixel limits

# Load the processor with the same resolution settings
processor = AutoProcessor.from_pretrained("ImgEdit_Judge/checkpoint/path", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": prompt.replace("<edit_prompt>", edit_prompt)},
{"type": "image", "image": original_path},
{"type": "image", "image": result_path},
],
}
]
# Prepare for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])  # the judge's textual evaluation
This project wouldn't be possible without the following open-source repositories: Open-Sora Plan, Grounded-SAM-2, improved-aesthetic-predictor, Qwen2.5-VL, YOLO-World, Laion-dataset, ComfyUI, Stable Diffusion, and Flux.
If you find our paper and code useful in your research, please consider giving us a star ⭐ and a citation.
@article{ye2025imgedit,
title={ImgEdit: A Unified Image Editing Dataset and Benchmark},
author={Ye, Yang and He, Xianyi and Li, Zongjian and Lin, Bin and Yuan, Shenghai and Yan, Zhiyuan and Hou, Bohan and Yuan, Li},
journal={arXiv preprint arXiv:2505.20275},
year={2025}
}
@article{lin2025uniworld,
title={UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation},
author={Lin, Bin and Li, Zongjian and Cheng, Xinhua and Niu, Yuwei and Ye, Yang and He, Xianyi and Yuan, Shenghai and Yu, Wangbo and Wang, Shaodong and Ge, Yunyang and others},
journal={arXiv preprint arXiv:2506.03147},
year={2025}
}