-
Notifications
You must be signed in to change notification settings - Fork 57
Description
This hasn't really yielded any concrete insights yet but I figured I'd share the results as they've just been sitting for a while.
All work was done single thread on Chimera 8 bit.
NOTE: wherever I say "frame" below, I really mean "single call to {r,d}avid_decode_frame
. I'm only just realizing the Chimera file has 8929 frames and my analysis shows "frame" (call) 9187 is the slowest. Did I mess something up for is that function called more than once per frame?
I wanted to get more granular performance indicators than just sampling / counters for the whole program, so I inserted some hacky code to measure per-frame metrics in both rav1d and dav1d for each call to {r,d}av1d_decode_frame
. Then comparing the two gives an idea of which frames are interesting and potentially causing a slowdown.
The simplest metric is wall clock time per frame. Initially I thought the ratio of rav1d / dav1d
was interesting because that gives the ratio of slowdown, but actually I think the difference rav1d - dav1d
is more interesting because we'd rather target a slow frame that takes 15ms in rav1d compared to 10ms in dav1d -- diff = 5ms, ratio = 1.5 -- rather than a slow frame in rav1d that takes 2ms vs 1ms -- diff = 1ms, ratio = 2. These results are pasted below, with some hypothetical scenarios of speeding up the slowest X% of frames by Y%. The slowest X% is given first by absolute slowest frames, then by the slowest compared to dav1d in time difference. Then it shows the slowest 1% frames compared to dav1d in time difference. Note these times are averaged over 10 runs. The interesting takeaway in the hypothetical speedups is improving the slowest 10% in time difference by 20% would reduce overall runtime by 4% points.
Frame time (output of `analyzeframetimes.py`)
--- rav1d ---
n=9831
mean 8215199.18ns
medi 10060897.00ns
p90 14742105.30ns
p95 15724595.45ns
p99 17899872.77ns
average std timing variation 261486.01
--- dav1d ---
n=9831
mean 7516056.93ns
medi 9230588.80ns
p90 13610381.10ns
p95 14609830.10ns
p99 16788979.33ns
average std timing variation 135148.00
--- ratio ---
n=9831
mean 1.09
medi 1.10
p90 1.13
p95 1.14
p99 1.16
-- Absolute speedup slowest frames are rav1d slowest frames --
improving the slowest 50% by 1% would save 627464976.13 (99.22%)
improving the slowest 50% by 5% would save 3137324880.65 (96.12%)
improving the slowest 50% by 10% would save 6274649761.30 (92.23%)
improving the slowest 50% by 20% would save 12549299522.60 (84.46%)
improving the slowest 25% by 1% would save 352693168.56 (99.56%)
improving the slowest 25% by 5% would save 1763465842.79 (97.82%)
improving the slowest 25% by 10% would save 3526931685.57 (95.63%)
improving the slowest 25% by 20% would save 7053863371.14 (91.27%)
improving the slowest 10% by 1% would save 159179283.66 (99.80%)
improving the slowest 10% by 5% would save 795896418.31 (99.01%)
improving the slowest 10% by 10% would save 1591792836.62 (98.03%)
improving the slowest 10% by 20% would save 3183585673.24 (96.06%)
improving the slowest 5% by 1% would save 84497686.67 (99.90%)
improving the slowest 5% by 5% would save 422488433.35 (99.48%)
improving the slowest 5% by 10% would save 844976866.69 (98.95%)
improving the slowest 5% by 20% would save 1689953733.38 (97.91%)
improving the slowest 1% by 1% would save 19710769.30 (99.98%)
improving the slowest 1% by 5% would save 98553846.48 (99.88%)
improving the slowest 1% by 10% would save 197107692.95 (99.76%)
improving the slowest 1% by 20% would save 394215385.90 (99.51%)
-- Difference speedup slowest frames are rav1d - dav1d --
improving the slowest 25% by 1% would save 30686228.72 (99.55%)
improving the slowest 25% by 5% would save 153431143.57 (97.77%)
improving the slowest 25% by 10% would save 306862287.15 (95.54%)
improving the slowest 25% by 20% would save 613724574.30 (91.07%)
improving the slowest 25% by 50% would save 1534311435.75 (77.68%)
improving the slowest 10% by 1% would save 14029421.62 (99.80%)
improving the slowest 10% by 5% would save 70147108.11 (98.98%)
improving the slowest 10% by 10% would save 140294216.22 (97.96%)
improving the slowest 10% by 20% would save 280588432.44 (95.92%)
improving the slowest 10% by 50% would save 701471081.10 (89.79%)
improving the slowest 5% by 1% would save 7609317.89 (99.89%)
improving the slowest 5% by 5% would save 38046589.43 (99.45%)
improving the slowest 5% by 10% would save 76093178.85 (98.89%)
improving the slowest 5% by 20% would save 152186357.70 (97.79%)
improving the slowest 5% by 50% would save 380465894.25 (94.46%)
improving the slowest 1% by 1% would save 1758226.73 (99.97%)
improving the slowest 1% by 5% would save 8791133.66 (99.87%)
improving the slowest 1% by 10% would save 17582267.33 (99.74%)
improving the slowest 1% by 20% would save 35164534.66 (99.49%)
improving the slowest 1% by 50% would save 87911336.65 (98.72%)
n frames in lowest 1.00% 98
slowest time diff frames rav1d
9187: rav1d= 16381141.8 dav1d= 14249326.5 ratio= 1.15, diff= 2131815.30 n_samples@100khz=1638
9177: rav1d= 16750423.1 dav1d= 14688030.8 ratio= 1.14, diff= 2062392.30 n_samples@100khz=1675
9186: rav1d= 16635372.1 dav1d= 14628069.6 ratio= 1.14, diff= 2007302.50 n_samples@100khz=1663
9176: rav1d= 16074932.2 dav1d= 14098569.2 ratio= 1.14, diff= 1976363.00 n_samples@100khz=1607
9195: rav1d= 16519818.2 dav1d= 14557388.9 ratio= 1.13, diff= 1962429.30 n_samples@100khz=1651
9183: rav1d= 16585604.4 dav1d= 14623517.8 ratio= 1.13, diff= 1962086.60 n_samples@100khz=1658
9196: rav1d= 15704063.8 dav1d= 13743497.9 ratio= 1.14, diff= 1960565.90 n_samples@100khz=1570
9178: rav1d= 16267450.5 dav1d= 14307055.7 ratio= 1.14, diff= 1960394.80 n_samples@100khz=1626
6865: rav1d= 18500798.6 dav1d= 16552558.0 ratio= 1.12, diff= 1948240.60 n_samples@100khz=1850
6285: rav1d= 15867283.1 dav1d= 13927933.1 ratio= 1.14, diff= 1939350.00 n_samples@100khz=1586
9174: rav1d= 15939393.7 dav1d= 14011077.2 ratio= 1.14, diff= 1928316.50 n_samples@100khz=1593
9193: rav1d= 16047787.1 dav1d= 14123608.6 ratio= 1.14, diff= 1924178.50 n_samples@100khz=1604
9201: rav1d= 16220579.4 dav1d= 14309487.5 ratio= 1.13, diff= 1911091.90 n_samples@100khz=1622
6866: rav1d= 15307081.8 dav1d= 13415563.3 ratio= 1.14, diff= 1891518.50 n_samples@100khz=1530
9194: rav1d= 16024488.5 dav1d= 14140632.3 ratio= 1.13, diff= 1883856.20 n_samples@100khz=1602
9185: rav1d= 16397133.0 dav1d= 14516647.0 ratio= 1.13, diff= 1880486.00 n_samples@100khz=1639
9192: rav1d= 16220258.7 dav1d= 14347078.0 ratio= 1.13, diff= 1873180.70 n_samples@100khz=1622
6484: rav1d= 17064982.4 dav1d= 15206699.6 ratio= 1.12, diff= 1858282.80 n_samples@100khz=1706
5183: rav1d= 11342964.7 dav1d= 9485317.0 ratio= 1.20, diff= 1857647.70 n_samples@100khz=1134
9184: rav1d= 15810618.0 dav1d= 13975471.3 ratio= 1.13, diff= 1835146.70 n_samples@100khz=1581
6476: rav1d= 14224412.7 dav1d= 12389343.1 ratio= 1.15, diff= 1835069.60 n_samples@100khz=1422
9204: rav1d= 16245757.7 dav1d= 14414383.3 ratio= 1.13, diff= 1831374.40 n_samples@100khz=1624
6859: rav1d= 11903174.4 dav1d= 10071824.1 ratio= 1.18, diff= 1831350.30 n_samples@100khz=1190
9169: rav1d= 15610712.9 dav1d= 13779669.9 ratio= 1.13, diff= 1831043.00 n_samples@100khz=1561
6475: rav1d= 17105298.0 dav1d= 15277185.7 ratio= 1.12, diff= 1828112.30 n_samples@100khz=1710
9264: rav1d= 16424634.7 dav1d= 14599886.3 ratio= 1.12, diff= 1824748.40 n_samples@100khz=1642
5543: rav1d= 16431777.2 dav1d= 14610115.7 ratio= 1.12, diff= 1821661.50 n_samples@100khz=1643
9168: rav1d= 16085764.4 dav1d= 14264471.9 ratio= 1.13, diff= 1821292.50 n_samples@100khz=1608
6326: rav1d= 14173021.1 dav1d= 12351892.9 ratio= 1.15, diff= 1821128.20 n_samples@100khz=1417
1738: rav1d= 15070580.8 dav1d= 13253411.8 ratio= 1.14, diff= 1817169.00 n_samples@100khz=1507
9175: rav1d= 15720511.8 dav1d= 13904022.2 ratio= 1.13, diff= 1816489.60 n_samples@100khz=1572
9237: rav1d= 16234274.3 dav1d= 14420144.3 ratio= 1.13, diff= 1814130.00 n_samples@100khz=1623
6874: rav1d= 17035022.3 dav1d= 15224567.3 ratio= 1.12, diff= 1810455.00 n_samples@100khz=1703
9200: rav1d= 15457874.8 dav1d= 13648365.4 ratio= 1.13, diff= 1809509.40 n_samples@100khz=1545
1920: rav1d= 15225706.2 dav1d= 13423026.3 ratio= 1.13, diff= 1802679.90 n_samples@100khz=1522
9181: rav1d= 18296001.4 dav1d= 16494613.2 ratio= 1.11, diff= 1801388.20 n_samples@100khz=1829
6740: rav1d= 16754291.3 dav1d= 14969318.7 ratio= 1.12, diff= 1784972.60 n_samples@100khz=1675
9308: rav1d= 16393914.0 dav1d= 14609544.5 ratio= 1.12, diff= 1784369.50 n_samples@100khz=1639
6862: rav1d= 12294920.7 dav1d= 10513683.5 ratio= 1.17, diff= 1781237.20 n_samples@100khz=1229
9167: rav1d= 15510178.1 dav1d= 13737766.3 ratio= 1.13, diff= 1772411.80 n_samples@100khz=1551
6870: rav1d= 11900472.3 dav1d= 10136742.6 ratio= 1.17, diff= 1763729.70 n_samples@100khz=1190
6480: rav1d= 12547323.6 dav1d= 10784544.8 ratio= 1.16, diff= 1762778.80 n_samples@100khz=1254
9191: rav1d= 15629694.5 dav1d= 13867475.7 ratio= 1.13, diff= 1762218.80 n_samples@100khz=1562
1751: rav1d= 14887199.4 dav1d= 13127492.6 ratio= 1.13, diff= 1759706.80 n_samples@100khz=1488
9317: rav1d= 16231630.4 dav1d= 14473621.4 ratio= 1.12, diff= 1758009.00 n_samples@100khz=1623
9199: rav1d= 17742668.4 dav1d= 15985142.7 ratio= 1.11, diff= 1757525.70 n_samples@100khz=1774
9320: rav1d= 16519434.2 dav1d= 14762244.4 ratio= 1.12, diff= 1757189.80 n_samples@100khz=1651
6284: rav1d= 18506325.2 dav1d= 16749223.2 ratio= 1.10, diff= 1757102.00 n_samples@100khz=1850
6867: rav1d= 12062536.4 dav1d= 10306431.2 ratio= 1.17, diff= 1756105.20 n_samples@100khz=1206
6861: rav1d= 12333730.6 dav1d= 10580966.6 ratio= 1.17, diff= 1752764.00 n_samples@100khz=1233
1569: rav1d= 15048133.7 dav1d= 13298266.5 ratio= 1.13, diff= 1749867.20 n_samples@100khz=1504
9321: rav1d= 15842388.7 dav1d= 14097614.8 ratio= 1.12, diff= 1744773.90 n_samples@100khz=1584
9152: rav1d= 15164263.8 dav1d= 13420003.8 ratio= 1.13, diff= 1744260.00 n_samples@100khz=1516
6561: rav1d= 15694681.4 dav1d= 13953587.1 ratio= 1.12, diff= 1741094.30 n_samples@100khz=1569
6597: rav1d= 15810137.1 dav1d= 14069158.1 ratio= 1.12, diff= 1740979.00 n_samples@100khz=1581
6736: rav1d= 12336682.9 dav1d= 10602805.7 ratio= 1.16, diff= 1733877.20 n_samples@100khz=1233
6869: rav1d= 11539205.0 dav1d= 9810908.2 ratio= 1.18, diff= 1728296.80 n_samples@100khz=1153
9202: rav1d= 15542952.5 dav1d= 13816646.3 ratio= 1.12, diff= 1726306.20 n_samples@100khz=1554
9235: rav1d= 17664942.8 dav1d= 15939932.7 ratio= 1.11, diff= 1725010.10 n_samples@100khz=1766
1660: rav1d= 15183188.3 dav1d= 13460438.7 ratio= 1.13, diff= 1722749.60 n_samples@100khz=1518
9190: rav1d= 18024098.9 dav1d= 16302028.7 ratio= 1.11, diff= 1722070.20 n_samples@100khz=1802
9173: rav1d= 15478870.3 dav1d= 13758195.7 ratio= 1.13, diff= 1720674.60 n_samples@100khz=1547
9268: rav1d= 15699709.9 dav1d= 13983821.0 ratio= 1.12, diff= 1715888.90 n_samples@100khz=1569
6601: rav1d= 13394198.3 dav1d= 11678414.6 ratio= 1.15, diff= 1715783.70 n_samples@100khz=1339
9258: rav1d= 16777138.2 dav1d= 15062249.6 ratio= 1.11, diff= 1714888.60 n_samples@100khz=1677
6773: rav1d= 12501258.9 dav1d= 10790214.3 ratio= 1.16, diff= 1711044.60 n_samples@100khz=1250
9182: rav1d= 15618810.0 dav1d= 13910518.5 ratio= 1.12, diff= 1708291.50 n_samples@100khz=1561
6875: rav1d= 13711879.3 dav1d= 12003729.1 ratio= 1.14, diff= 1708150.20 n_samples@100khz=1371
1894: rav1d= 14820272.2 dav1d= 13112558.7 ratio= 1.13, diff= 1707713.50 n_samples@100khz=1482
6382: rav1d= 21241665.4 dav1d= 19534148.6 ratio= 1.09, diff= 1707516.80 n_samples@100khz=2124
9220: rav1d= 15694992.1 dav1d= 13988952.7 ratio= 1.12, diff= 1706039.40 n_samples@100khz=1569
6741: rav1d= 13998722.6 dav1d= 12293335.9 ratio= 1.14, diff= 1705386.70 n_samples@100khz=1399
1582: rav1d= 14790423.1 dav1d= 13085234.3 ratio= 1.13, diff= 1705188.80 n_samples@100khz=1479
6863: rav1d= 11558143.7 dav1d= 9853388.8 ratio= 1.17, diff= 1704754.90 n_samples@100khz=1155
6856: rav1d= 17887305.3 dav1d= 16185632.4 ratio= 1.11, diff= 1701672.90 n_samples@100khz=1788
6478: rav1d= 12012756.0 dav1d= 10312156.7 ratio= 1.16, diff= 1700599.30 n_samples@100khz=1201
6771: rav1d= 12950217.7 dav1d= 11256301.3 ratio= 1.15, diff= 1693916.40 n_samples@100khz=1295
6682: rav1d= 13404503.6 dav1d= 11710972.6 ratio= 1.14, diff= 1693531.00 n_samples@100khz=1340
1244: rav1d= 15337421.8 dav1d= 13645752.5 ratio= 1.12, diff= 1691669.30 n_samples@100khz=1533
9221: rav1d= 15824731.5 dav1d= 14134069.1 ratio= 1.12, diff= 1690662.40 n_samples@100khz=1582
9318: rav1d= 15718866.8 dav1d= 14032765.0 ratio= 1.12, diff= 1686101.80 n_samples@100khz=1571
9205: rav1d= 15487603.5 dav1d= 13803014.7 ratio= 1.12, diff= 1684588.80 n_samples@100khz=1548
6868: rav1d= 11611749.2 dav1d= 9928509.6 ratio= 1.17, diff= 1683239.60 n_samples@100khz=1161
9228: rav1d= 15775589.1 dav1d= 14093263.5 ratio= 1.12, diff= 1682325.60 n_samples@100khz=1577
9214: rav1d= 15145493.5 dav1d= 13464096.5 ratio= 1.12, diff= 1681397.00 n_samples@100khz=1514
2128: rav1d= 15430232.0 dav1d= 13749780.9 ratio= 1.12, diff= 1680451.10 n_samples@100khz=1543
9326: rav1d= 15879450.0 dav1d= 14199993.8 ratio= 1.12, diff= 1679456.20 n_samples@100khz=1587
6745: rav1d= 12257767.0 dav1d= 10578696.4 ratio= 1.16, diff= 1679070.60 n_samples@100khz=1225
6560: rav1d= 18572310.0 dav1d= 16896700.0 ratio= 1.10, diff= 1675610.00 n_samples@100khz=1857
1699: rav1d= 15775983.9 dav1d= 14101787.6 ratio= 1.12, diff= 1674196.30 n_samples@100khz=1577
9265: rav1d= 15795137.0 dav1d= 14121877.4 ratio= 1.12, diff= 1673259.60 n_samples@100khz=1579
9347: rav1d= 16168011.8 dav1d= 14503087.7 ratio= 1.11, diff= 1664924.10 n_samples@100khz=1616
1764: rav1d= 14548468.9 dav1d= 12883636.9 ratio= 1.13, diff= 1664832.00 n_samples@100khz=1454
6602: rav1d= 12718329.3 dav1d= 11056896.8 ratio= 1.15, diff= 1661432.50 n_samples@100khz=1271
9319: rav1d= 15850951.6 dav1d= 14192264.5 ratio= 1.12, diff= 1658687.10 n_samples@100khz=1585
9236: rav1d= 15517922.5 dav1d= 13859411.6 ratio= 1.12, diff= 1658510.90 n_samples@100khz=1551
9203: rav1d= 15475966.7 dav1d= 13818568.1 ratio= 1.12, diff= 1657398.60 n_samples@100khz=1547
6738: rav1d= 11829454.9 dav1d= 10172663.3 ratio= 1.16, diff= 1656791.60 n_samples@100khz=1182
[1244,1569,1582,1660,1699,1738,1751,1764,1894,1920,2128,5183,5543,6284,6285,6326,6382,6475,6476,6478,6480,6484,6560,6561,6597,6601,6602,6682,6736,6738,6740,6741,6745,6771,6773,6856,6859,6861,6862,6863,6865,6866,6867,6868,6869,6870,6874,6875,9152,9167,9168,9169,9173,9174,9175,9176,9177,9178,9181,9182,9183,9184,9185,9186,9187,9190,9191,9192,9193,9194,9195,9196,9199,9200,9201,9202,9203,9204,9205,9214,9220,9221,9228,9235,9236,9237,9258,9264,9265,9268,9308,9317,9318,9319,9320,9321,9326,9347]
would get 150186 samples
And a graph of these times:
I then also recorded some CPU counters. The counters were: instructions, branch instructions, branch mispredictions, L1 dcm, L2 dcm, L2 icm, TLB dm, TLB im, branch not taken, and l1 dca. dcm/icm is data/instruction cache miss, dm/im is data/instruction miss. I then tried to look at how well those counters correlate with the time differences. I have to record groups of 4 counters at a time because of hardware limits. The output is pasted below for each group, and also calculates the correlation for the slowest X% of frames.
Frame counter output of `analyzeframestats.py`
============== ins_vec_br_miss ==========
------- rav1d ---------
frame elapsed PAPI_TOT_INS PAPI_VEC_INS PAPI_BR_INS PAPI_BR_MSP
count 9831.000000 9.831000e+03 9.831000e+03 9.831000e+03 9.831000e+03 9.831000e+03
mean 4915.000000 7.815855e+06 8.211349e+07 4.891480e+04 6.525285e+06 3.097026e+05
std 2838.109582 5.000538e+06 5.426068e+07 1.139955e+05 4.116764e+06 2.094043e+05
min 0.000000 2.206650e+05 1.272693e+06 0.000000e+00 1.491260e+05 6.100000e+02
25% 2457.500000 2.937480e+06 2.695083e+07 0.000000e+00 2.396874e+06 8.983950e+04
50% 4915.000000 9.506489e+06 1.018283e+08 0.000000e+00 7.800958e+06 3.766300e+05
75% 7372.500000 1.155520e+07 1.215945e+08 4.488000e+04 9.497878e+06 4.648815e+05
max 9830.000000 4.207895e+07 4.768892e+08 1.713824e+06 3.753090e+07 1.962799e+06
elapsed total time 76.837667799s
------- dav1d ---------
frame elapsed PAPI_TOT_INS PAPI_VEC_INS PAPI_BR_INS PAPI_BR_MSP
count 9831.000000 9.831000e+03 9.831000e+03 9.831000e+03 9.831000e+03 9.831000e+03
mean 4915.000000 7.496412e+06 8.211351e+07 4.891480e+04 6.525283e+06 3.069317e+05
std 2838.109582 4.838445e+06 5.426066e+07 1.139955e+05 4.116763e+06 2.077236e+05
min 0.000000 2.165070e+05 1.272790e+06 0.000000e+00 1.491260e+05 6.190000e+02
25% 2457.500000 2.591232e+06 2.695084e+07 0.000000e+00 2.396878e+06 8.819150e+04
50% 4915.000000 9.183322e+06 1.018283e+08 0.000000e+00 7.800956e+06 3.731340e+05
75% 7372.500000 1.108462e+07 1.215945e+08 4.488000e+04 9.497868e+06 4.605740e+05
max 9830.000000 4.170451e+07 4.768893e+08 1.713824e+06 3.753090e+07 1.969427e+06
------- diff ---------
frame elapsed PAPI_TOT_INS PAPI_VEC_INS PAPI_BR_INS PAPI_BR_MSP
count 9831.0 9.831000e+03 9831.000000 9831.0 9831.000000 9831.00000
mean 0.0 3.194431e+05 -13.404232 0.0 1.628013 2770.94914
std 0.0 4.765613e+05 42.183340 0.0 10.234954 6877.72657
min 0.0 -3.618311e+06 -503.000000 0.0 -103.000000 -56295.00000
25% 0.0 1.286550e+04 -27.000000 0.0 0.000000 -104.50000
50% 0.0 1.625270e+05 0.000000 0.0 0.000000 711.00000
75% 0.0 5.367530e+05 1.000000 0.0 1.000000 3976.50000
max 0.0 4.064331e+06 407.000000 0.0 123.000000 80797.00000
---- corr of everything ---
Correlation of elapsed to PAPI_TOT_INS = 0.15
Correlation of elapsed to PAPI_BR_INS = 0.21
Correlation of elapsed to PAPI_BR_MSP = 0.76
---- corr of slowest 20% ---
Correlation of elapsed to PAPI_TOT_INS = 0.18
Correlation of elapsed to PAPI_BR_INS = 0.22
Correlation of elapsed to PAPI_BR_MSP = 0.75
---- corr of slowest 10% ---
Correlation of elapsed to PAPI_TOT_INS = 0.19
Correlation of elapsed to PAPI_BR_INS = 0.22
Correlation of elapsed to PAPI_BR_MSP = 0.76
---- corr of slowest 5% ---
Correlation of elapsed to PAPI_TOT_INS = 0.18
Correlation of elapsed to PAPI_BR_INS = 0.22
Correlation of elapsed to PAPI_BR_MSP = 0.76
---- corr of slowest 1% ---
Correlation of elapsed to PAPI_TOT_INS = 0.16
Correlation of elapsed to PAPI_BR_INS = 0.21
Correlation of elapsed to PAPI_BR_MSP = 0.76
============== l1d_l2d_l2i_tlbd ==========
------- rav1d ---------
frame elapsed PAPI_L1_DCM PAPI_L2_DCM PAPI_L2_ICM PAPI_TLB_DM
count 9831.000000 9.831000e+03 9.831000e+03 9831.000000 9831.000000 9831.000000
mean 4915.000000 8.192587e+06 1.198464e+06 261203.978639 239563.264266 376528.263656
std 2838.109582 5.265061e+06 5.204459e+05 82804.093406 119197.818100 198517.593662
min 0.000000 2.335580e+05 2.049120e+05 58358.000000 10761.000000 35975.000000
25% 2457.500000 2.869926e+06 8.012055e+05 193242.500000 146015.000000 199737.000000
50% 4915.000000 1.000468e+07 1.344840e+06 278218.000000 288323.000000 429628.000000
75% 7372.500000 1.212662e+07 1.577246e+06 327860.500000 324695.000000 517036.500000
max 9830.000000 4.294083e+07 2.501230e+06 448502.000000 637931.000000 882615.000000
elapsed total time 80.54131886s
------- dav1d ---------
frame elapsed PAPI_L1_DCM PAPI_L2_DCM PAPI_L2_ICM PAPI_TLB_DM
count 9831.000000 9.831000e+03 9.831000e+03 9831.000000 9831.000000 9831.000000
mean 4915.000000 7.522485e+06 1.073697e+06 244132.600651 220122.325806 334866.711016
std 2838.109582 4.862919e+06 4.498037e+05 73069.677012 107827.584164 174699.292661
min 0.000000 2.157950e+05 1.968110e+05 60735.000000 10441.000000 34220.000000
25% 2457.500000 2.613770e+06 7.526455e+05 185243.000000 141525.500000 181827.500000
50% 4915.000000 9.203498e+06 1.202949e+06 260521.000000 256773.000000 378850.000000
75% 7372.500000 1.112855e+07 1.396412e+06 302753.500000 295655.500000 458210.500000
max 9830.000000 4.187731e+07 2.196017e+06 410406.000000 599332.000000 790612.000000
------- diff ---------
frame elapsed PAPI_L1_DCM PAPI_L2_DCM PAPI_L2_ICM PAPI_TLB_DM
count 9831.0 9.831000e+03 9831.000000 9831.000000 9831.000000 9831.000000
mean 0.0 6.701012e+05 124766.617740 17071.377988 19440.938460 41661.552640
std 0.0 4.978306e+05 75619.540246 16177.914195 14917.998467 25547.405588
min 0.0 -3.063915e+06 -25812.000000 -56229.000000 -45866.000000 -3116.000000
25% 0.0 2.464290e+05 47928.500000 4041.000000 6144.000000 15841.000000
50% 0.0 6.814990e+05 146434.000000 15519.000000 21926.000000 48846.000000
75% 0.0 1.032891e+06 181846.000000 28336.500000 31236.000000 61612.000000
max 0.0 3.724840e+06 355201.000000 83870.000000 89851.000000 135583.000000
---- corr of everything ---
Correlation of elapsed to PAPI_L1_DCM = 0.85
Correlation of elapsed to PAPI_L2_DCM = 0.58
Correlation of elapsed to PAPI_L2_ICM = 0.72
Correlation of elapsed to PAPI_TLB_DM = 0.85
---- corr of slowest 20% ---
Correlation of elapsed to PAPI_L1_DCM = 0.79
Correlation of elapsed to PAPI_L2_DCM = 0.47
Correlation of elapsed to PAPI_L2_ICM = 0.56
Correlation of elapsed to PAPI_TLB_DM = 0.80
---- corr of slowest 10% ---
Correlation of elapsed to PAPI_L1_DCM = 0.84
Correlation of elapsed to PAPI_L2_DCM = 0.54
Correlation of elapsed to PAPI_L2_ICM = 0.66
Correlation of elapsed to PAPI_TLB_DM = 0.85
---- corr of slowest 5% ---
Correlation of elapsed to PAPI_L1_DCM = 0.86
Correlation of elapsed to PAPI_L2_DCM = 0.57
Correlation of elapsed to PAPI_L2_ICM = 0.70
Correlation of elapsed to PAPI_TLB_DM = 0.87
---- corr of slowest 1% ---
Correlation of elapsed to PAPI_L1_DCM = 0.88
Correlation of elapsed to PAPI_L2_DCM = 0.58
Correlation of elapsed to PAPI_L2_ICM = 0.72
Correlation of elapsed to PAPI_TLB_DM = 0.88
============== tlbi_brntk_l1a ==========
------- rav1d ---------
frame elapsed PAPI_TLB_IM PAPI_BR_NTK PAPI_L1_DCA
count 9831.000000 9.831000e+03 9831.000000 9.831000e+03 9.831000e+03
mean 4915.000000 8.169632e+06 4020.198250 4.579056e+06 3.970638e+07
std 2838.109582 5.193572e+06 2927.850272 2.892591e+06 2.632451e+07
min 0.000000 2.361530e+05 46.000000 7.088500e+04 7.944690e+05
25% 2457.500000 2.977706e+06 998.000000 1.656762e+06 1.306605e+07
50% 4915.000000 1.002562e+07 4964.000000 5.498826e+06 4.877599e+07
75% 7372.500000 1.208258e+07 6264.000000 6.725635e+06 5.924107e+07
max 9830.000000 4.114268e+07 31393.000000 2.329288e+07 2.209375e+08
elapsed total time 80.315647572s
------- dav1d ---------
frame elapsed PAPI_TLB_IM PAPI_BR_NTK PAPI_L1_DCA
count 9831.000000 9.831000e+03 9831.000000 9.831000e+03 9.831000e+03
mean 4915.000000 7.550308e+06 686.844980 2.162614e+06 3.529227e+07
std 2838.109582 4.877032e+06 410.031248 1.283278e+06 2.396374e+07
min 0.000000 2.156850e+05 28.000000 3.507100e+04 6.919480e+05
25% 2457.500000 2.628189e+06 342.000000 8.737255e+05 1.085096e+07
50% 4915.000000 9.289040e+06 784.000000 2.532062e+06 4.338576e+07
75% 7372.500000 1.115961e+07 940.000000 3.071350e+06 5.282891e+07
max 9830.000000 4.153261e+07 4872.000000 1.165186e+07 2.152608e+08
------- diff ---------
frame elapsed PAPI_TLB_IM PAPI_BR_NTK PAPI_L1_DCA
count 9831.0 9.831000e+03 9831.000000 9.831000e+03 9.831000e+03
mean 0.0 6.193239e+05 3333.353270 2.416442e+06 4.414108e+06
std 0.0 4.756888e+05 2600.994883 1.621798e+06 2.474237e+06
min 0.0 -6.300309e+06 -452.000000 3.581400e+04 9.934700e+04
25% 0.0 2.691650e+05 634.500000 7.863975e+05 2.236618e+06
50% 0.0 6.108070e+05 3971.000000 3.021118e+06 5.278153e+06
75% 0.0 9.334125e+05 5317.000000 3.656908e+06 6.333852e+06
max 0.0 6.330616e+06 30386.000000 1.164102e+07 1.324842e+07
---- corr of everything ---
Correlation of elapsed to PAPI_TLB_IM = 0.69
Correlation of elapsed to PAPI_BR_NTK = 0.65
Correlation of elapsed to PAPI_L1_DCA = 0.72
---- corr of slowest 20% ---
Correlation of elapsed to PAPI_TLB_IM = 0.62
Correlation of elapsed to PAPI_BR_NTK = 0.57
Correlation of elapsed to PAPI_L1_DCA = 0.67
---- corr of slowest 10% ---
Correlation of elapsed to PAPI_TLB_IM = 0.70
Correlation of elapsed to PAPI_BR_NTK = 0.66
Correlation of elapsed to PAPI_L1_DCA = 0.75
---- corr of slowest 5% ---
Correlation of elapsed to PAPI_TLB_IM = 0.73
Correlation of elapsed to PAPI_BR_NTK = 0.70
Correlation of elapsed to PAPI_L1_DCA = 0.78
---- corr of slowest 1% ---
Correlation of elapsed to PAPI_TLB_IM = 0.75
Correlation of elapsed to PAPI_BR_NTK = 0.72
Correlation of elapsed to PAPI_L1_DCA = 0.79
This counter work was done in this rav1d branch and this dav1d branch
So then I wanted to see why these frames were slower, so I hacked in a way to enable+disable perf record for each of the slow frames in this branch. It gets used like this:
mkfifo fifo
perf record -g --delay=-1 --control=fifo:fifo target/release-with-debug/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
and the perf record can be adjusted to record whatever you want. This outputs a single perf.data with only samples from the slow frames (hardcoded into the code).
This still isn't granular enough though so I've been thinking of more targeted analysis that could better pinpoint exactly which points are slow. I have a few rough ideas and most revolve around being able to repeatedly run a specific frame (or call) with slightly different perf options (either via perf record out of process or directly with perf_event_open in process). Is there a (hopefully easy) way to hack in a flag that could repeatedly decode a single frame, or perhaps also the N previous frames to allow for realistic warmup of caches? Otherwise I can use the --skip
flag which I just discovered.
Ok I think that's it, let me know if I forgot to include something somewhere