Support decoder-native transforms #526

Open
scotts opened this issue Feb 27, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@scotts
Contributor

scotts commented Feb 27, 2025

🚀 The feature

Decoders can usually do more than just decode: they often also implement transformations on the frames they decode. For example, see the list of FFmpeg filters ("filter" is the FFmpeg term for a transformation). Several users have requested a way to apply these decoder-native transformations.

Motivation, pitch

The main motivation is performance. We have benchmarks which show, for example, that it is much faster for FFmpeg to resize a frame while decoding than it is to pass a decoded frame to a transformation that lives outside of the decoder.

The counter-pitch is that exposing such functionality can be dangerous for users. For example, there are many different ways to resize a video frame, and it's important that users know which one they are using. We have observed model performance problems with resize specifically: users apply a different resize algorithm during inference than the one they used for training, causing problems that are difficult to detect. Any design for this feature should include mitigations for this problem.

@donthomasitos

Sounds great!

@scotts
Contributor Author

scotts commented Feb 27, 2025

Design ideas for this feature, following the principles that:

  1. Users specify what they want, and we translate that to what FFmpeg understands. We never just pass a string directly from users to FFmpeg.
  2. We can use testing to enforce and quantify how close our transforms are to those already in the PyTorch ecosystem.

API

We create the module torchcodec.decoders.transform_specs. Let's use cropping and resizing as an illustration of what could be in this module:

from dataclasses import dataclass

@dataclass
class Crop:
    width: int
    height: int
    x: int
    y: int

    def _ffmpeg_str(self) -> str:
        """Logic which knows how to correctly turn the
           user spec into the exact string FFmpeg needs.
        """
        # FFmpeg's crop filter syntax is crop=out_w:out_h:x:y.
        return f"crop={self.width}:{self.height}:{self.x}:{self.y}"

@dataclass
class Resize:
    width: int
    height: int

    def _ffmpeg_str(self) -> str:
        # FFmpeg's scale filter syntax is scale=w:h.
        return f"scale={self.width}:{self.height}"

The actual specifications that FFmpeg accepts are much richer than this; each filter in fact has its own expression language. See, for example, the documentation for the crop filter. We would only expose subsets of this functionality as needed.

Users could then pass a sequence of transformations to a decoder at initialization:

decoder = VideoDecoder(
    "vid.mp4",
    transforms=[
        Resize(width=640, height=480),
        Crop(width=32, height=32, x=0, y=0),
    ]
)

In the actual implementation, we would call the _ffmpeg_str() method on each transformation when setting up FFmpeg.
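
For illustration, a minimal sketch of that setup step, assuming the specs above implement _ffmpeg_str() and that the filters are chained with FFmpeg's comma-separated filterchain syntax:

transforms = [
    Resize(width=640, height=480),
    Crop(width=32, height=32, x=0, y=0),
]
# FFmpeg chains filters with commas, e.g. "scale=640:480,crop=32:32:0:0".
filtergraph = ",".join(t._ffmpeg_str() for t in transforms)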

The mitigation

The surface area of even a single FFmpeg transformation is large; see the links above. The first mitigation is that we control how much of that surface area we expose through our specs.

The second mitigation is that we only expose transformations that we can translate to a corresponding transform in torchvision.transforms.v2. Crop above would become the FFmpeg crop filter and map to the TorchVision crop transform. Resize above would become FFmpeg's scale filter and TorchVision's Resize.

We would then have tests ensuring that the decoder's output with the FFmpeg filters applied is within a certain tolerance of what we get when unfiltered decoded frames are transformed by the corresponding TorchVision transforms.

I'm not sure of the best way to maintain that mapping. We could have a similar method on the spec object, something like _get_tv2_spec() -> Callable, but that would require TorchCodec to take an explicit dependency on TorchVision. Alternatively, we could maintain the mapping only in testing. Either way, we would want the mapping to be part of our documentation so that we can guide users.
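
As a rough sketch of the testing-only option (the helpers tv2_equivalent and assert_close_to_tv2 and the tolerance are hypothetical, and it assumes the Crop and Resize specs above):

import torch
from torchvision.transforms import v2

def tv2_equivalent(spec):
    # Test-only mapping: given one of our specs, return the equivalent
    # TorchVision v2 callable. Keeping this in the test suite keeps
    # torchvision out of TorchCodec's runtime dependencies.
    if isinstance(spec, Resize):
        return v2.Resize(size=(spec.height, spec.width), antialias=True)
    if isinstance(spec, Crop):
        return lambda frames: v2.functional.crop(
            frames, top=spec.y, left=spec.x, height=spec.height, width=spec.width
        )
    raise ValueError(f"No TorchVision equivalent registered for {spec}")

def assert_close_to_tv2(filtered_frames, plain_frames, spec, atol=2):
    # The decoder-native output should match TorchVision applied to the
    # unfiltered frames within a small tolerance (values are uint8).
    reference = tv2_equivalent(spec)(plain_frames)
    torch.testing.assert_close(filtered_frames, reference, atol=atol, rtol=0)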

This does mean that the cost of each new decoder-native transform we support is high.

@NicolasHug
Member

Thanks for the proposal @scotts.

I think we should explore the alternative of accepting native torchvision.transforms.v2 transforms as a parameter, i.e. something like:

decoder = VideoDecoder(
    "vid.mp4",
    transforms=[
        torchvision.transforms.v2.Resize(...),
        torchvision.transforms.v2.RandomCrop(...),
    ]
)

This way, users can rely on the existing torchvision API which they're familiar with, and on our side we don't need to invent something new.

And importantly, this would allow us to enable a lot of critical transformation features for users, in particular random parameter sampling and random application of transforms. Let me explain using the suggested API from #526 (comment):

Crop(width=32, height=32, x=0, y=0),

Passing x and y for a crop is usually not what users want. Users want to pass width and height, but they expect the crop to be random, i.e. they want x and y to be randomly sampled. Similarly, horizontal / vertical flips (which I suspect we'll want to support) are only useful when they are randomly applied with a given probability. Another example is Resize, where users sometimes don't want to specify the output H and W; instead they want to specify only H (or only W) and preserve the aspect ratio.

All of these scenarios are natively supported by torchvision, and they provide a lot of value to users, so I think we should try to support them in TorchCodec. But I definitely don't think TorchCodec should reimplement this logic itself. If we accept native torchvision transforms as parameters, I think we should be able to rely on the make_params() methods of the v2 transforms, where most (all?) of the logic is implemented. See e.g. the random sampling logic for x and y of crop: https://github.com/pytorch/vision/blob/dcd1e4213394883b688d75df110839cede1193f4/torchvision/transforms/v2/_geometry.py#L879-L888
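
To illustrate, a rough sketch of relying on make_params() (the dummy-frame trick and shapes are illustrative; the method was named _get_params() in older torchvision releases):

import torch
from torchvision.transforms import v2

transform = v2.RandomCrop(size=(32, 32))

# A placeholder frame with the decoder's output height/width; only the
# shape matters for parameter sampling.
dummy_frame = torch.empty(3, 480, 640, dtype=torch.uint8)

# make_params() runs the transform's own random sampling logic, so TorchCodec
# would not have to reimplement it. For RandomCrop the returned dict contains
# e.g. "top", "left", "height" and "width".
params = transform.make_params([dummy_frame])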

Re dependency:

  • for us as devs, it doesn't change much: torchvision is already a dev/test dependency
  • for users, torchvision would become an optional dependency. But I guess we could argue that torchvision is already a somewhat non-optional dependency, since it is currently the only way of running transforms.

@scotts
Contributor Author

scotts commented Feb 28, 2025

@NicolasHug, I think that's a great idea! I hadn't understood the TorchVision transforms well enough: the objects themselves are already the "spec" that I want. Going with my approach would basically mean duplicating those interfaces in TorchCodec.

Roughly how I think this could work:

  • Users create the TorchVision transform object and pass it to a VideoDecoder.
  • At startup, the VideoDecoder calls make_params() on each TorchVision transform.
  • We then map that TorchVision transform to an FFmpeg filter, and map the params to the filter params (a rough sketch is below). I'm not sure exactly how this would work; we'll have to figure out if we want all of this logic on the TorchCodec side, or maybe make some changes on the TorchVision side to make this easier. This is also where all error checking would happen, as there will probably be combinations of TorchVision transform params that can't be expressed as an FFmpeg filter.
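
A rough sketch of that mapping step (the function name _to_ffmpeg_filter, the parameter handling, and the error checking are all illustrative, not an actual API):

from torchvision.transforms import v2

def _to_ffmpeg_filter(transform, params) -> str:
    # Map one TorchVision v2 transform plus its sampled params to an FFmpeg
    # filter string; unsupported combinations raise.
    if isinstance(transform, v2.RandomCrop):
        # FFmpeg's crop filter syntax is crop=out_w:out_h:x:y.
        return f"crop={params['width']}:{params['height']}:{params['left']}:{params['top']}"
    if isinstance(transform, v2.Resize):
        # FFmpeg's scale filter; a real implementation would also map the
        # TorchVision interpolation mode onto scale's flags option.
        height, width = transform.size  # assumes size was given as (H, W)
        return f"scale={width}:{height}"
    raise ValueError(f"No decoder-native equivalent for {type(transform).__name__}")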

@scotts added the enhancement (New feature or request) label on Mar 8, 2025
@scotts
Contributor Author

scotts commented Mar 22, 2025

Making a note that we've gotten a request to be able to apply FFmpeg's fps filter, which allows users to change the frame rate of the video.
