SalesforceAIResearch/FOFPred

FOFPred: Language-Driven Future Optical Flow Prediction

FOFPred Overview

FOFPred is a diffusion-based model that predicts future optical flow from a single image guided by natural language instructions. Given an input image and a text prompt describing a desired action (e.g., "Moving the water bottle from right to left"), FOFPred generates optical flow predictions that visualize how objects would move to accomplish that action.


🚀 Quick Start

Install the pinned diffusers release:

pip install diffusers==0.34.0

Then run the pipeline from Python:

import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipeline = DiffusionPipeline.from_pretrained(
    "Salesforce/FOFPred",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

input_image = Image.open("/UPDATE/IMAGE/PATH")

generator = torch.Generator(device="cuda").manual_seed(42)
results = pipeline(
    prompt="UPDATE/PROMPT",
    input_images=[input_image],
    width=256,
    height=256,
    max_input_image_side_length=512,
    max_pixels=65536,
    num_inference_steps=1,
    max_sequence_length=1024,
    text_guidance_scale=5.0,
    image_guidance_scale=2.0,
    negative_prompt="",
    generator=generator,
    output_type="pt",
    frame_count=4,
)

output_tensor = results.images[0]  # [F, C, H, W]

✨ Features

  • Language-Guided Flow Prediction — Control motion predictions using natural language descriptions
  • Single-Image Input — Predict future motion from just one frame
  • Multi-Frame Flow Output — Generates 4 sequential flow frames showing temporal progression
  • Interactive Visualization — CoTracker-style arrow overlays for intuitive flow visualization
  • Efficient Inference — Single-step inference capability

🏗️ Architecture

FOFPred builds on the OmniGen2 project and combines the following components:

| Component | Model | Description |
| --- | --- | --- |
| V-LLM | Qwen2.5-VL-3B-Instruct | Multimodal understanding of images and text |
| DiT | OmniGen2Transformer3DModel | Modification of OmniGen2Transformer to generate frame sequences |
| VAE | black-forest-labs/FLUX.1-dev | VAE (AutoencoderKL model) |
| Scheduler | FlowMatchEulerDiscreteScheduler | Efficient flow-matching sampler used in OmniGen2 |
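
To illustrate why a flow-matching sampler can run with num_inference_steps=1, here is a minimal sketch of Euler integration on a toy problem. It is not the FlowMatchEulerDiscreteScheduler implementation: the uniform sigma schedule and the names `euler_flow_matching_sample` and `toy_velocity` are assumptions made for illustration, and in the real pipeline the velocity would come from the DiT.

```python
def euler_flow_matching_sample(velocity_fn, x, num_steps):
    """Integrate dx/dsigma = v(x, sigma) from sigma=1 (noise) to sigma=0 (data).

    Each Euler step applies x <- x + (sigma_next - sigma) * v(x, sigma).
    A uniform sigma schedule is assumed here for simplicity.
    """
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy setup (hypothetical values): a straight path
#   x_sigma = (1 - sigma) * target + sigma * noise
# has the constant velocity noise - target, so one Euler step is exact.
target = [0.5, -0.25]
noise = [1.5, 0.75]


def toy_velocity(x, sigma):
    """Constant velocity of the straight path from target to noise."""
    return [n - t for n, t in zip(noise, target)]


one_step = euler_flow_matching_sample(toy_velocity, noise, num_steps=1)
four_steps = euler_flow_matching_sample(toy_velocity, noise, num_steps=4)
```

With a constant velocity field, one step and four steps land on the same point. When the learned velocity varies along the path, a single step is only an approximation of the full integration.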

📦 Installation

To create your own environment for training, run the following.

conda create -n fofpred python=3.11
conda activate fofpred
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.4.post1 --no-build-isolation

If your system does not already have ffmpeg, install it (the torchcodec library uses it).

conda install ffmpeg

🏃 Inference

Interactive Demo

Launch the Gradio web interface:

export PYTHONPATH=$PYTHONPATH:$PWD
python app.py

Then open http://localhost:7860 in your browser.

📊 Output Visualization

FOFPred provides three visualization modes in the demo:

  1. Arrow Visualization — CoTracker-style sparse grid arrows showing motion direction
  2. Raw Flow Output — HSV-encoded optical flow (color = direction, saturation = magnitude)
  3. Alpha Blend — Flow overlaid on input image for context
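
The three modes can be sketched in plain Python as follows. This is an illustrative sketch only, not the demo's actual code: the function names, the `max_magnitude` normalization constant, the fixed value channel of 1.0, and the grid `step` are all assumptions, since the exact encoding the demo uses is not documented here.

```python
import colorsys
import math


def sparse_grid_arrows(flow, step):
    """Mode 1: sample a dense flow field (rows of (dx, dy)) on a sparse grid.

    Returns arrow segments as ((x0, y0), (x1, y1)) pairs.
    """
    arrows = []
    for y in range(step // 2, len(flow), step):
        for x in range(step // 2, len(flow[0]), step):
            dx, dy = flow[y][x]
            arrows.append(((x, y), (x + dx, y + dy)))
    return arrows


def flow_to_rgb(dx, dy, max_magnitude):
    """Mode 2: encode one flow vector as RGB (hue = direction, saturation = magnitude)."""
    angle = math.atan2(dy, dx) % (2 * math.pi)      # direction in [0, 2*pi]
    hue = angle / (2 * math.pi)                     # normalized to [0, 1]
    saturation = min(math.hypot(dx, dy) / max_magnitude, 1.0)
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)


def rgb_to_flow(r, g, b, max_magnitude):
    """Invert flow_to_rgb (exact only for vectors not clipped by max_magnitude)."""
    hue, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
    angle = hue * 2 * math.pi
    magnitude = saturation * max_magnitude
    return magnitude * math.cos(angle), magnitude * math.sin(angle)


def alpha_blend(flow_rgb, image_rgb, alpha=0.5):
    """Mode 3: blend a flow color over an image pixel, channel by channel."""
    return tuple(alpha * f + (1 - alpha) * i for f, i in zip(flow_rgb, image_rgb))


# Usage on a tiny 4x4 field in which every pixel moves one unit to the right.
flow = [[(1.0, 0.0)] * 4 for _ in range(4)]
arrows = sparse_grid_arrows(flow, step=2)
right_color = flow_to_rgb(1.0, 0.0, max_magnitude=1.0)
recovered = rgb_to_flow(*flow_to_rgb(3.0, 4.0, max_magnitude=10.0), max_magnitude=10.0)
blended = alpha_blend(right_color, (0.0, 0.0, 0.0), alpha=0.5)
```

In this encoding a zero-motion pixel has zero saturation and therefore renders white, and vectors longer than `max_magnitude` are clipped, so decoding them recovers the direction but not the true length.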

Optional arguments for app.py:

| Argument | Description | Default |
| --- | --- | --- |
| --share | Create a public Gradio link | False |
| --port | Port for the web server | 7860 |
| --enable_model_cpu_offload | Offload model to CPU (saves VRAM) | False |
| --enable_sequential_cpu_offload | Sequential CPU offload (minimal VRAM) | False |
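
For readers who want to see how these flags fit together, the sketch below mirrors the table with `argparse`. It is a hypothetical reconstruction, not the contents of app.py: the `build_parser` name and the help strings are illustrative, and only the flag names and defaults are taken from the table.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented demo flags and defaults."""
    parser = argparse.ArgumentParser(description="FOFPred demo flags (illustrative)")
    parser.add_argument("--share", action="store_true",
                        help="Create a public Gradio link")
    parser.add_argument("--port", type=int, default=7860,
                        help="Port for the web server")
    parser.add_argument("--enable_model_cpu_offload", action="store_true",
                        help="Offload model to CPU (saves VRAM)")
    parser.add_argument("--enable_sequential_cpu_offload", action="store_true",
                        help="Sequential CPU offload (minimal VRAM)")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["--port", "8080", "--enable_model_cpu_offload"])
defaults = build_parser().parse_args([])
```

The equivalent command line would be `python app.py --port 8080 --enable_model_cpu_offload`. The two offload flags presumably correspond to the diffusers pipeline methods of the same names, though that mapping is an assumption.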

Python API

import torch
from fofpred.pipelines.fofpred.pipeline_fofpred import FOFPredPipeline
from fofpred.schedulers.scheduling_flow_match_euler_discrete import FlowMatchEulerDiscreteScheduler
from PIL import Image

# Load the pipeline
pipeline = FOFPredPipeline.from_pretrained(
    "path/to/pretrained_models/hf_upload",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load input image
input_image = Image.open("example_images/small_office.jpeg")

# Set scheduler
pipeline.scheduler = FlowMatchEulerDiscreteScheduler()

# Generate optical flow prediction
results = pipeline(
    prompt="Moving the water bottle from right to left.",
    input_images=[input_image],
    width=256,
    height=256,
    num_inference_steps=1,
    num_images_per_prompt=4,
    frame_count=4,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type="pt",
)

# Access generated flow frames: shape [B, F, C, H, W]
flow_frames = results.images

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.


📄 License

This project is licensed under the Apache License 2.0. See LICENSE.txt for details.

Copyright (c) 2025 Salesforce, Inc.


🔗 Acknowledgement

We thank the authors of the following projects for their codebases and model checkpoints.
