Support decoder-native transforms #526

Open
scotts opened this issue Feb 27, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@scotts
Contributor

scotts commented Feb 27, 2025

🚀 The feature

Decoders can usually do more than just decode: they often also implement transformations on the frames they decode. For example, see the list of FFmpeg filters ("filter" is the FFmpeg term for a transformation). Several users have requested a way to apply these decoder-native transformations.

Motivation, pitch

The main motivation is performance. We have benchmarks which show, for example, that it is much faster for FFmpeg to resize a frame while decoding than it is to pass a decoded frame to a transformation that lives outside of the decoder.

The counter-pitch is that exposing such functionality can be dangerous for users. For example, there are many different ways to resize a video frame, and it's important that users know which one they are using. We have observed model performance problems with resize specifically: users apply a different resize algorithm during inference than the one they used for training, causing problems that are difficult to detect. Any design for this feature should include mitigations for this problem.

@donthomasitos

Sounds great!

@scotts
Contributor Author

scotts commented Feb 27, 2025

Design ideas for this feature, following the principles that:

  1. Users specify what they want, and we translate that to what FFmpeg understands. We never just pass a string directly from users to FFmpeg.
  2. We can use testing to enforce and quantify how close our transforms are to those already in the PyTorch ecosystem.

API

We create the module torchcodec.decoders.transform_specs. Let's use cropping and resizing as an illustration of what could be in this module:

from dataclasses import dataclass

@dataclass
class Crop:
    width: int
    height: int
    x: int
    y: int

    def _ffmpeg_str(self) -> str:
        """Logic which knows how to correctly turn the
           user spec into the exact string FFmpeg needs.
        """
        # FFmpeg's crop filter syntax is crop=out_w:out_h:x:y.
        return f"crop={self.width}:{self.height}:{self.x}:{self.y}"

@dataclass
class Resize:
    width: int
    height: int

    def _ffmpeg_str(self) -> str:
        # FFmpeg's scale filter syntax is scale=w:h.
        return f"scale={self.width}:{self.height}"

The actual specifications that FFmpeg accepts are much richer than this; each filter in fact has its own expression language. See, for example, the documentation for the crop filter. We would only expose subsets of this functionality as needed.

Users could then pass a sequence of transformations to a decoder at initialization:

decoder = VideoDecoder(
    "vid.mp4",
    transforms=[
        Resize(width=640, height=480),
        Crop(width=32, height=32, x=0, y=0),
    ]
)

In the actual implementation, we would call the _ffmpeg_str() method on each transformation when setting up FFmpeg.
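
For illustration, a minimal sketch of that setup step, assuming the specs above implement _ffmpeg_str() and that the filters are chained with FFmpeg's comma-separated filterchain syntax:

transforms = [
    Resize(width=640, height=480),
    Crop(width=32, height=32, x=0, y=0),
]
# FFmpeg chains filters with commas, e.g. "scale=640:480,crop=32:32:0:0".
filtergraph = ",".join(t._ffmpeg_str() for t in transforms)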

The mitigation

The surface area of even a single FFmpeg transformation is large; see the links above. The first mitigation is that we control how much of that surface area we expose through our specs.

The second mitigation is that we only expose transformations that we can translate to a corresponding transform in torchvision.transforms.v2. Crop above would become the FFmpeg crop filter and map to the TorchVision crop transform. Resize above would become FFmpeg's scale filter and TorchVision's Resize.

We would then have tests ensuring that the decoder's output with the FFmpeg filters applied is within a certain tolerance of what we get when unfiltered decoded frames are transformed by the corresponding TorchVision transforms.

I'm not sure of the best way to maintain that mapping. We could have a similar method on the spec object, something like _get_tv2_spec() -> Callable, but that would require TorchCodec to take an explicit dependency on TorchVision. Alternatively, we could maintain the mapping only in testing. Either way, we would want the mapping to be part of our documentation so that we can guide users.
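
As a rough sketch of the testing-only option (the helpers tv2_equivalent and assert_close_to_tv2 and the tolerance are hypothetical, and it assumes the Crop and Resize specs above):

import torch
from torchvision.transforms import v2

def tv2_equivalent(spec):
    # Test-only mapping: given one of our specs, return the equivalent
    # TorchVision v2 callable. Keeping this in the test suite keeps
    # torchvision out of TorchCodec's runtime dependencies.
    if isinstance(spec, Resize):
        return v2.Resize(size=(spec.height, spec.width), antialias=True)
    if isinstance(spec, Crop):
        return lambda frames: v2.functional.crop(
            frames, top=spec.y, left=spec.x, height=spec.height, width=spec.width
        )
    raise ValueError(f"No TorchVision equivalent registered for {spec}")

def assert_close_to_tv2(filtered_frames, plain_frames, spec, atol=2):
    # The decoder-native output should match TorchVision applied to the
    # unfiltered frames within a small tolerance (values are uint8).
    reference = tv2_equivalent(spec)(plain_frames)
    torch.testing.assert_close(filtered_frames, reference, atol=atol, rtol=0)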

This does mean that the cost of each new decoder-native transform we support is high.

@NicolasHug
Member

Thanks for the proposal @scotts.

I think we should explore the alternative of accepting native torchvision.transforms.v2 transforms as a parameter, i.e. something like:

decoder = VideoDecoder(
    "vid.mp4",
    transforms=[
        torchvision.transforms.v2.Resize(...),
        torchvision.transforms.v2.RandomCrop(...),
    ]
)

This way, users can rely on the existing torchvision API which they're familiar with, and on our side we don't need to invent something new.

And importantly, this would allow us to enable a lot of critical transformation features for users, in particular random parameter sampling and random application of transforms. Let me explain using the suggested API from #526 (comment):

Crop(width=32, height=32, x=0, y=0),

Passing x and y for a crop is usually not what users want. Users want to pass width and height, but they expect the crop to be random, i.e. they want x and y to be randomly sampled. Similarly, horizontal / vertical flips (which I suspect we'll want to support) are only useful when they are randomly applied with a given probability. Another example is Resize, where users sometimes don't want to specify the output H and W; instead they want to specify only H (or only W) and preserve the aspect ratio.

All of these scenarios are natively supported by torchvision, and they provide a lot of value to users, so I think we should try to support them in TorchCodec. But I definitely don't think TorchCodec should reimplement this logic itself. If we accept native torchvision transforms as parameters, I think we should be able to rely on the make_params() methods of the v2 transforms, where most (all?) of the logic is implemented. See e.g. the random sampling logic for x and y of crop: https://github.com/pytorch/vision/blob/dcd1e4213394883b688d75df110839cede1193f4/torchvision/transforms/v2/_geometry.py#L879-L888
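
To illustrate, a rough sketch of relying on make_params() (the dummy-frame trick and shapes are illustrative; the method was named _get_params() in older torchvision releases):

import torch
from torchvision.transforms import v2

transform = v2.RandomCrop(size=(32, 32))

# A placeholder frame with the decoder's output height/width; only the
# shape matters for parameter sampling.
dummy_frame = torch.empty(3, 480, 640, dtype=torch.uint8)

# make_params() runs the transform's own random sampling logic, so TorchCodec
# would not have to reimplement it. For RandomCrop the returned dict contains
# e.g. "top", "left", "height" and "width".
params = transform.make_params([dummy_frame])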

Re dependency:

  • for us as devs, it doesn't change much: torchvision is already a dev/test dependency
  • for users, torchvision would become an optional dependency. But I guess we could argue that torchvision is already a somewhat non-optional dependency, since it is currently the only way of running transforms.

@scotts
Contributor Author

scotts commented Feb 28, 2025

@NicolasHug, I think that's a great idea! I hadn't understood the TorchVision transforms well enough: the objects themselves are already the "spec" that I want. Going with my approach would basically mean duplicating those interfaces in TorchCodec.

Roughly how I think this could work:

  • Users create the TorchVision transform object and pass it to a VideoDecoder.
  • At startup, the VideoDecoder calls make_params() on each TorchVision transform.
  • We then map that TorchVision transform to an FFmpeg filter, and map the params to the filter params (a rough sketch is below). I'm not sure exactly how this would work; we'll have to figure out if we want all of this logic on the TorchCodec side, or maybe make some changes on the TorchVision side to make this easier. This is also where all error checking would happen, as there will probably be combinations of TorchVision transform params that can't be expressed as an FFmpeg filter.
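
A rough sketch of that mapping step (the function name _to_ffmpeg_filter, the parameter handling, and the error checking are all illustrative, not an actual API):

from torchvision.transforms import v2

def _to_ffmpeg_filter(transform, params) -> str:
    # Map one TorchVision v2 transform plus its sampled params to an FFmpeg
    # filter string; unsupported combinations raise.
    if isinstance(transform, v2.RandomCrop):
        # FFmpeg's crop filter syntax is crop=out_w:out_h:x:y.
        return f"crop={params['width']}:{params['height']}:{params['left']}:{params['top']}"
    if isinstance(transform, v2.Resize):
        # FFmpeg's scale filter; a real implementation would also map the
        # TorchVision interpolation mode onto scale's flags option.
        height, width = transform.size  # assumes size was given as (H, W)
        return f"scale={width}:{height}"
    raise ValueError(f"No decoder-native equivalent for {type(transform).__name__}")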

@scotts added the enhancement (New feature or request) label on Mar 8, 2025
@scotts
Contributor Author

scotts commented Mar 22, 2025

Making a note that we've gotten a request to be able to apply FFmpeg's fps filter, which allows users to change the frame rate of the video.
