Renderer not utilizing CPU cores even with concurrency 100% #4300
Comments
All the […] Empirically, we often see that setting a high concurrency value leads to diminishing or even worse results, probably because of the reason above. One thing that you might consider trying is to make multiple renders with separate browser instances by using the […] Then you have two separate Chrome instances and also two separate Chrome DevTools Protocol channels, which are used for communication. If you find out more, we'd be happy to incorporate your experiences into the docs.
Continued my investigation; I wanted to rule out Docker preventing Chrome from accessing the resources. I tried running on a bare-metal Debian bookworm server and got the same issue: only a small % of the system resources is used. Has anyone managed to get good multi-core performance using the latest version of Remotion?
@tzvc Building your own distributed renderer is challenging and not recommended for most. Maybe you are rendering an OffthreadVideo with an expensive embedded video? Extracting the frames from a video is a process that cannot be well parallelized. We could verify this theory by checking whether you get better utilization when you render something else (like images). I found some threading options in FFmpeg that we need to explore. If this is the bottleneck, we should try tweaking these params: https://stackoverflow.com/a/74309843/986552
@JonnyBurger My production compositions are heavy on OffthreadVideos. But I also tried rendering compositions with only simple images, and even though the rendering is faster, I observe the same behavior: only a few % of the system resources are used. This is a snapshot of my system when rendering a composition comprised of images and text (concurrency set to 100% on a 32-core system): Render logs on startup:
For reference, rendering the same image-and-text-only composition on my 8-core MacBook yields better performance than on my 32-core server (previous message): 60 fps on average vs 30 fps.
This is somewhat understandable for me. If you run a Node.js server on 64 cores, 63 of them will do exactly nothing by default.
I get that the main Node.js process is single-threaded, but the load on this main process should be fairly light if its only role is to orchestrate the other processes (Chrome tabs and the compositor) responsible for the heavy lifting, right? I'm digging into the renderer code now. I think I understand better how everything works (very cool btw!) and what the bottlenecks could be. From my testing, here are the bottlenecks I could identify:

Chrome resource allocation: This is an example showing render time for a 1-minute composition of only 1 image at multiple concurrency values on a 64-core CPU:

Potential solution: Since each browser instance seems to have a cap on its resource allocation, a simple solution, as @JonnyBurger suggested, is to multiply the number of concurrent browser instances. Simply chunk the composition into X ranges of frames, render each chunk on a separate browser instance concurrently (using openBrowser), then stitch the resulting videos back together using FFmpeg. With this method, I was able to get more out of my system resources. Here are the results for the same 1-minute video, with the concurrency set to 4 for each browser instance (same 64-core CPU):

Again, we see a logarithmic decay of the performance as the number of instances increases, so there is still room for improvement. My guess is that at this level of concurrency, the bottleneck is somewhere else, which leads me to:

Frame extraction for OffthreadVideo

Potential solution: Optimistically extract and cache frames in batches: when a frame is requested at time X on a video, we could extract this frame, return it, and optimistically extract and cache the next 10 in a single operation, as they are likely to be requested later. Or even, if the system allows, offer an option to pre-cache in batches all the frames that will be required for the composition. @JonnyBurger what do you think about this? I'd like to experiment with optimistic caching of the offthread video frames.
Can you think of a way I could hack something together to test this theory without having to touch the renderer's code? Is there a way to start the offthread video server externally?
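The chunking step of the multi-browser approach described above can be sketched as follows. This is a minimal illustration, not code from the thread or from Remotion: `splitFrameRange` and the `FrameRange` type are hypothetical names, and the sketch assumes zero-based, inclusive frame indices.

```typescript
// Hypothetical helper (not part of Remotion): split the frames
// 0..totalFrames-1 into contiguous, inclusive ranges of near-equal size.
type FrameRange = [start: number, end: number];

function splitFrameRange(totalFrames: number, chunks: number): FrameRange[] {
  if (!Number.isInteger(totalFrames) || totalFrames <= 0) {
    throw new RangeError("totalFrames must be a positive integer");
  }
  if (!Number.isInteger(chunks) || chunks <= 0) {
    throw new RangeError("chunks must be a positive integer");
  }

  // Never create more chunks than there are frames.
  const count = Math.min(chunks, totalFrames);
  const base = Math.floor(totalFrames / count);
  const remainder = totalFrames % count;

  const ranges: FrameRange[] = [];
  let start = 0;
  for (let i = 0; i < count; i++) {
    // The first `remainder` chunks take one extra frame each.
    const size = base + (i < remainder ? 1 : 0);
    ranges.push([start, start + size - 1]);
    start += size;
  }
  return ranges;
}

// Example: a 1-minute composition at 30 fps split across 4 browser instances.
console.log(splitFrameRange(1800, 4));
```

Each range would then be rendered by its own browser instance and the partial videos concatenated with FFmpeg, as the comment proposes. How the ranges are passed to the renderer is left out here on purpose, since the thread does not show that code.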
Here's the result of testing a combination of tab concurrency and instance concurrency when rendering offthread videos: What's interesting, though, is that when rendering compositions with no OffthreadVideos, there is no performance degradation when running multiple instances: whether I render 1 or 8 videos at the same time, the FPS stays stable. This would indicate that the bottleneck for OffthreadVideo is the compositor not being able to serve frames fast enough. When I look at the running processes, I see only 1 compositor process, even if I run 10 renderFrames() in parallel. @JonnyBurger is that by design? Is there a way to start multiple compositor processes?
@tzvc I'm looking at the same thing, using

const chromiumOptions: ChromiumOptions = {
  disableWebSecurity: true,
  enableMultiProcessOnLinux: true,
  gl: "angle",
  userAgent: RENDERER_USER_AGENT,
};

with

const availableCpus = Math.min(os.cpus().length, 4);
export const OFFTHREAD_CACHE_SIZE_IN_BYTES = 2 * 1024 * 1024 * 1024; // 2GB

How did you test frame parallelization, and what was the batch size?
I tried the FFmpeg threading options I was talking about, but I could not find that they significantly changed the outcome. I'm open to refactoring the concurrency system in November to allow specifying the tabs + browser instances instead of just tabs if you say this works, although it doesn't sound conclusive.
The chart looks realistic: extracting frames from a video is a linear process, meaning a frame can only be extracted after the previous frame has been extracted. Hence multithreading possibilities are limited. Remotion will open the video multiple times if the requested frames are 15 seconds apart, because then a single stream would not suffice. I think opening multiple identical video streams with little time difference will lead to a lot of duplicate work. I can't think of an obvious solution for this.
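To make the optimistic batching idea from earlier in the thread concrete, here is a minimal sketch. It is not how Remotion's compositor works: `PrefetchingFrameCache` and `ExtractBatch` are hypothetical names, the extractor is an injected function standing in for a sequential decode of consecutive frames, and the cache is unbounded for brevity.

```typescript
// Hypothetical extractor signature: decode `count` consecutive frames,
// starting at `startFrame`, in one sequential pass.
type ExtractBatch<T> = (startFrame: number, count: number) => Promise<T[]>;

class PrefetchingFrameCache<T> {
  // Maps a frame index to the (possibly still pending) extraction result.
  private readonly frames = new Map<number, Promise<T>>();
  private readonly extractBatch: ExtractBatch<T>;
  private readonly batchSize: number;

  constructor(extractBatch: ExtractBatch<T>, batchSize = 10) {
    this.extractBatch = extractBatch;
    this.batchSize = batchSize;
  }

  getFrame(frame: number): Promise<T> {
    const existing = this.frames.get(frame);
    if (existing) {
      return existing; // Cached, or already covered by an in-flight batch.
    }

    // One extraction covers the requested frame and the frames after it.
    const batch = this.extractBatch(frame, this.batchSize);
    for (let i = 0; i < this.batchSize; i++) {
      const index = frame + i;
      if (!this.frames.has(index)) {
        // Assumes the extractor returns exactly `batchSize` items; a real
        // version would need to handle the end of the video and eviction.
        this.frames.set(index, batch.then((items) => items[i]));
      }
    }
    return this.frames.get(frame)!;
  }
}
```

With a batch size of 10, requesting frames 0 through 9 in any order triggers a single extraction call. Whether this helps in practice depends on how sequential the requests are, which the thread leaves open.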
@JonnyBurger is it possible for us to test this on our side as well? Thank you very much! 👍🏾
@tzvc thanks for the cool research! I experimented with renting a beefy VPS months ago and got stuck at the same problem as you: I noticed it's just Chrome throttling the tabs. I decided not to investigate it further, thinking it was a hard problem to spend a lot of time on (thinking about how to get around Chrome limitations). We're heavily using Three.js in our compositions, so I had to find perfect concurrency for faster rendering and not overload the GPU. The results of my experiments were close to yours – sometimes decreasing concurrency helped a lot, but not always. Your solution with multiple browsers is quite interesting, it could work I think. It feels like we'll never know if it's gonna work until we build a basic prototype for this.
Hey there,
I'm trying to optimize my rendering service for speed by running it on a beefy VPS (48 cores, 350 GB RAM). The problem is that the renderer does not seem to utilize the available resources. Even worse, the render becomes slower as I increase the concurrency in the renderMedia() call.
Here's what my resource consumption looks like when running a render with concurrency: 1:
Most cores are sitting idle and the render is slow, as expected.
Now if I bump to concurrency: "100%", I expect the render to spread across all 48 cores and the RAM. But instead, this is what I get:
Again, most of the cores are sitting idle, even more than with concurrency: 1, leading to an even slower render.
What is weird is that if I run the same code (with concurrency: "100%") directly on my local machine (MacBook Air M2), the renderer seems to utilize my 8 cores and the render is fast, as expected:
I'm using the latest version of Remotion (4.0.211).
Here's the config I pass to renderMedia()
Here's my Dockerfile
Here's my system specs:
Has anyone encountered similar issue? What am I missing here?