-
Notifications
You must be signed in to change notification settings - Fork 57
Description
While analyzing the performance differences between the original dav1d code and the new Rust rav1d code I have observed that the heap usage between the two code bases is significantly different. This difference could be accounting for a significant different in the performance difference between the two code bases.
Analysis
To compare the heap usage between the two programs I have used the dhat tool from valgrind.
It is probably important to already note that the allocation time is often not visible in perf captures (flamegraphs) and these allocations can also be related to #1358.
dav1d output
$ valgrind --tool=dhat ./build/tools/dav1d -q -i ../Chimera-AV1-8bit-480x270-552kbps.ivf -o /dev/null --threads 1
==59905== DHAT, a dynamic heap analysis tool
==59905== Copyright (C) 2010-2018, and GNU GPL'd, by Mozilla Foundation
==59905== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==59905== Command: ./build/tools/dav1d -q -i ../Chimera-AV1-8bit-480x270-552kbps.ivf -o /dev/null --threads 1
==59905==
==59905==
==59905== Total: 29,432,246 bytes in 9,005 blocks
==59905== At t-gmax: 3,501,230 bytes in 74 blocks
==59905== At t-end: 0 bytes in 0 blocks
==59905== Reads: 42,247,201,877 bytes
==59905== Writes: 23,480,953,050 bytes
==59905==
rav1d output
valgrind --tool=dhat ./target/release/dav1d -q -i ../Chimera-AV1-8bit-480x270-552kbps.ivf -o /dev/null --threads 1
==59910== DHAT, a dynamic heap analysis tool
==59910== Copyright (C) 2010-2018, and GNU GPL'd, by Mozilla Foundation
==59910== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==59910== Command: ./target/release/dav1d -q -i ../Chimera-AV1-8bit-480x270-552kbps.ivf -o /dev/null --threads 1
==59910==
==59910==
==59910== Total: 351,912,041 bytes in 227,488 blocks
==59910== At t-gmax: 3,916,039 bytes in 9,018 blocks
==59910== At t-end: 428,862 bytes in 8,939 blocks
==59910== Reads: 46,744,808,495 bytes
==59910== Writes: 23,700,216,659 bytes
==59910==
Possible source code changes
These are the insights I gathered from the analysis of the dhat output using the dhat viewer tool.
Replace the Arc's
The most obvious solution to this problem would be to replace the Arc's by types which do not allocate.
Many of these allocation seem to come from the usage of Arc in the Rust objects which are newly allocated on each iteration.
Line 2307 in 1be76ea
state.seq_hdr = Some(Arc::new(DRav1d::from_rav1d(seq_hdr))); // TODO(kkysen) fallible allocation |
Also usages of mem::take and the fact that the default is doing many allocations seem to be a source of performance problems:
Line 2511 in 1be76ea
Rav1dITUTT35::to_immut(mem::take(&mut state.itut_t35)), |
Another place where many allocations are done is the Rav1dITUTT35::to_immut method:
rav1d/include/dav1d/headers.rs
Line 822 in 1be76ea
pub fn to_immut( |
Split out the DRav1d type
Maybe not directly related to performance issue but the DRav1d type is making the analysis difficult and should probably not really exist:
rav1d/include/dav1d/headers.rs
Line 21 in 1be76ea
pub struct DRav1d<R, D> { |
In general the conversion to dav1d should be done where they are needed instead of kept in parallel throughout the code. The other thing is the usage of deref which is only recommended if the type really represents a smart pointer:
rav1d/include/dav1d/headers.rs
Line 39 in 1be76ea
impl<R, D> Deref for DRav1d<R, D> { |
I have also done this in the past for usage ergonomics but it tends to end up being not such a great idea because it typically represents an incorrect software architectural decision.
I would be happy to contribute with some actual changes but as I am not familiar with the codebase and I don't know how much you want to deviate from the C code I thought opening an issue for discussing was the easiest way to move forward.