FOFPred is a diffusion-based model that predicts future optical flow from a single image guided by natural language instructions. Given an input image and a text prompt describing a desired action (e.g., "Moving the water bottle from right to left"), FOFPred generates optical flow predictions that visualize how objects would move to accomplish that action.
```bash
pip install diffusers==0.34.0
```

```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image
pipeline = DiffusionPipeline.from_pretrained(
    "Salesforce/FOFPred",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")
input_image = Image.open("/UPDATE/IMAGE/PATH")
generator = torch.Generator(device="cuda").manual_seed(42)
results = pipeline(
    prompt="UPDATE/PROMPT",
    input_images=[input_image],
    width=256,
    height=256,
    max_input_image_side_length=512,
    max_pixels=65536,
    num_inference_steps=1,
    max_sequence_length=1024,
    text_guidance_scale=5.0,
    image_guidance_scale=2.0,
    negative_prompt="",
    generator=generator,
    output_type="pt",
    frame_count=4,
)
output_tensor = results.images[0]  # [F, C, H, W]
```

Key features:

- Language-Guided Flow Prediction — Control motion predictions using natural language descriptions
- Single-Image Input — Predict future motion from just one frame
- Multi-Frame Flow Output — Generates 4 sequential flow frames showing temporal progression
- Interactive Visualization — CoTracker-style arrow overlays for intuitive flow visualization
- Efficient Inference — Single-step inference capability
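The quick-start call above returns the predicted frames as a single tensor (`output_type="pt"`, shape `[F, C, H, W]`). Below is a minimal sketch for saving each frame to disk; it assumes the frames are already RGB-encoded flow renderings with values in the `[0, 1]` range, which is not guaranteed by the pipeline documentation.

```python
from torchvision.utils import save_image

# output_tensor comes from the quick-start snippet above, shape [F, C, H, W].
# Assumption: each frame is an RGB flow rendering with values in [0, 1].
for i, frame in enumerate(output_tensor):
    save_image(frame.float().clamp(0, 1), f"flow_frame_{i:02d}.png")
```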
FOFPred combines several components built on top of the OmniGen2 project:
| Component | Model | Description |
|---|---|---|
| V-LLM | Qwen2.5-VL-3B-Instruct | Multimodal understanding of images and text |
| DiT | OmniGen2Transformer3DModel | Modification of OmniGen2Transformer to generate frame sequences |
| VAE | black-forest-labs/FLUX.1-dev | VAE (AutoencoderKL model) |
| Scheduler | FlowMatchEulerDiscreteScheduler | Efficient flow-matching sampler used in OmniGen2 |
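With the pipeline loaded as in the quick start, the registered sub-models can be listed through the generic diffusers interface. This is only a sketch: `DiffusionPipeline.components` is a standard diffusers property, but the exact component names FOFPred registers are an assumption here.

```python
# List the sub-models (V-LLM, DiT, VAE, scheduler, ...) registered on the pipeline.
# Component names are whatever FOFPred registers; they are not documented here.
for name, module in pipeline.components.items():
    print(f"{name}: {type(module).__name__}")
```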
If you wish to create your own env for training, use the following.
```bash
conda create -n fofpred python=3.11
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.4.post1 --no-build-isolation
```

Optionally install ffmpeg in case your system does not have it (it is used by the torchcodec library):

```bash
conda install ffmpeg
```

Launch the Gradio web interface:
```bash
export PYTHONPATH=$PYTHONPATH:$PWD
python app.py
```

Then open http://localhost:7860 in your browser.
📊 Output Visualization
FOFPred provides three visualization modes in the demo:
- Arrow Visualization — CoTracker-style sparse grid arrows showing motion direction
- Raw Flow Output — HSV-encoded optical flow (color = direction, saturation = magnitude)
- Alpha Blend — Flow overlaid on input image for context
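As a rough illustration of the Raw Flow Output mode, the sketch below converts a dense displacement field into an HSV-encoded image (hue encodes direction, saturation encodes magnitude). It assumes the flow is available as a NumPy array of shape `[2, H, W]` holding per-pixel (dx, dy) displacements; the demo's actual decoding of the model output may differ.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def flow_to_rgb(flow: np.ndarray) -> np.ndarray:
    """Encode a [2, H, W] (dx, dy) flow field as an RGB image in [0, 1]."""
    dx, dy = flow[0], flow[1]
    magnitude = np.sqrt(dx**2 + dy**2)
    angle = np.arctan2(dy, dx)  # direction in radians, range [-pi, pi]
    hsv = np.zeros((*dx.shape, 3), dtype=np.float32)
    hsv[..., 0] = (angle + np.pi) / (2 * np.pi)                        # hue: direction
    hsv[..., 1] = np.clip(magnitude / (magnitude.max() + 1e-8), 0, 1)  # saturation: magnitude
    hsv[..., 2] = 1.0                                                  # full value for visibility
    return hsv_to_rgb(hsv)
```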
Optional Arguments:
| Argument | Description | Default |
|---|---|---|
| `--share` | Create a public Gradio link | False |
| `--port` | Port for the web server | 7860 |
| `--enable_model_cpu_offload` | Offload model to CPU (saves VRAM) | False |
| `--enable_sequential_cpu_offload` | Sequential CPU offload (minimal VRAM) | False |
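For example, to launch a publicly shareable demo while offloading the model to CPU to save VRAM, the flags from the table above can be combined as `python app.py --share --enable_model_cpu_offload`.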
```python
import torch
from fofpred.pipelines.fofpred.pipeline_fofpred import FOFPredPipeline
from fofpred.schedulers.scheduling_flow_match_euler_discrete import FlowMatchEulerDiscreteScheduler
from PIL import Image
# Load the pipeline
pipeline = FOFPredPipeline.from_pretrained(
    "path/to/pretrained_models/hf_upload",
    torch_dtype=torch.bfloat16,
).to("cuda")
# Load input image
input_image = Image.open("example_images/small_office.jpeg")
# Set scheduler
pipeline.scheduler = FlowMatchEulerDiscreteScheduler()
# Generate optical flow prediction
results = pipeline(
    prompt="Moving the water bottle from right to left.",
    input_images=[input_image],
    width=256,
    height=256,
    num_inference_steps=1,
    num_images_per_prompt=4,
    frame_count=4,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type="pt",
)
# Access generated flow frames: shape [B, F, C, H, W]
flow_frames = results.images
```

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the Apache License 2.0. See LICENSE.txt for details.
Copyright (c) 2025 Salesforce, Inc.
We thank the authors of the following projects for their codebases and model checkpoints.
