Skip to content

Analyzing per frame performance #1437

@aconz2

Description

@aconz2

This hasn't really yielded any concrete insights yet but I figured I'd share the results as they've just been sitting for a while.

All work was done single thread on Chimera 8 bit.

NOTE: wherever I say "frame" below, I really mean "single call to {r,d}avid_decode_frame. I'm only just realizing the Chimera file has 8929 frames and my analysis shows "frame" (call) 9187 is the slowest. Did I mess something up for is that function called more than once per frame?

I wanted to get more granular performance indicators than just sampling / counters for the whole program, so I inserted some hacky code to measure per-frame metrics in both rav1d and dav1d for each call to {r,d}av1d_decode_frame. Then comparing the two gives an idea of which frames are interesting and potentially causing a slowdown.

The simplest metric is wall clock time per frame. Initially I thought the ratio of rav1d / dav1d was interesting because that gives the ratio of slowdown, but actually I think the difference rav1d - dav1d is more interesting because we'd rather target a slow frame that takes 15ms in rav1d compared to 10ms in dav1d -- diff = 5ms, ratio = 1.5 -- rather than a slow frame in rav1d that takes 2ms vs 1ms -- diff = 1ms, ratio = 2. These results are pasted below, with some hypothetical scenarios of speeding up the slowest X% of frames by Y%. The slowest X% is given first by absolute slowest frames, then by the slowest compared to dav1d in time difference. Then it shows the slowest 1% frames compared to dav1d in time difference. Note these times are averaged over 10 runs. The interesting takeaway in the hypothetical speedups is improving the slowest 10% in time difference by 20% would reduce overall runtime by 4% points.

Frame time (output of `analyzeframetimes.py`)
--- rav1d ---
n=9831
mean   8215199.18ns
medi  10060897.00ns
p90   14742105.30ns
p95   15724595.45ns
p99   17899872.77ns
average std timing variation 261486.01

--- dav1d ---
n=9831
mean   7516056.93ns
medi   9230588.80ns
p90   13610381.10ns
p95   14609830.10ns
p99   16788979.33ns
average std timing variation 135148.00

--- ratio ---
n=9831
mean         1.09
medi         1.10
p90          1.13
p95          1.14
p99          1.16

-- Absolute speedup slowest frames are rav1d slowest frames --
improving the slowest  50% by   1% would save    627464976.13 (99.22%)
improving the slowest  50% by   5% would save   3137324880.65 (96.12%)
improving the slowest  50% by  10% would save   6274649761.30 (92.23%)
improving the slowest  50% by  20% would save  12549299522.60 (84.46%)
improving the slowest  25% by   1% would save    352693168.56 (99.56%)
improving the slowest  25% by   5% would save   1763465842.79 (97.82%)
improving the slowest  25% by  10% would save   3526931685.57 (95.63%)
improving the slowest  25% by  20% would save   7053863371.14 (91.27%)
improving the slowest  10% by   1% would save    159179283.66 (99.80%)
improving the slowest  10% by   5% would save    795896418.31 (99.01%)
improving the slowest  10% by  10% would save   1591792836.62 (98.03%)
improving the slowest  10% by  20% would save   3183585673.24 (96.06%)
improving the slowest   5% by   1% would save     84497686.67 (99.90%)
improving the slowest   5% by   5% would save    422488433.35 (99.48%)
improving the slowest   5% by  10% would save    844976866.69 (98.95%)
improving the slowest   5% by  20% would save   1689953733.38 (97.91%)
improving the slowest   1% by   1% would save     19710769.30 (99.98%)
improving the slowest   1% by   5% would save     98553846.48 (99.88%)
improving the slowest   1% by  10% would save    197107692.95 (99.76%)
improving the slowest   1% by  20% would save    394215385.90 (99.51%)

-- Difference speedup slowest frames are rav1d - dav1d --
improving the slowest  25% by   1% would save     30686228.72 (99.55%)
improving the slowest  25% by   5% would save    153431143.57 (97.77%)
improving the slowest  25% by  10% would save    306862287.15 (95.54%)
improving the slowest  25% by  20% would save    613724574.30 (91.07%)
improving the slowest  25% by  50% would save   1534311435.75 (77.68%)
improving the slowest  10% by   1% would save     14029421.62 (99.80%)
improving the slowest  10% by   5% would save     70147108.11 (98.98%)
improving the slowest  10% by  10% would save    140294216.22 (97.96%)
improving the slowest  10% by  20% would save    280588432.44 (95.92%)
improving the slowest  10% by  50% would save    701471081.10 (89.79%)
improving the slowest   5% by   1% would save      7609317.89 (99.89%)
improving the slowest   5% by   5% would save     38046589.43 (99.45%)
improving the slowest   5% by  10% would save     76093178.85 (98.89%)
improving the slowest   5% by  20% would save    152186357.70 (97.79%)
improving the slowest   5% by  50% would save    380465894.25 (94.46%)
improving the slowest   1% by   1% would save      1758226.73 (99.97%)
improving the slowest   1% by   5% would save      8791133.66 (99.87%)
improving the slowest   1% by  10% would save     17582267.33 (99.74%)
improving the slowest   1% by  20% would save     35164534.66 (99.49%)
improving the slowest   1% by  50% would save     87911336.65 (98.72%)

n frames in lowest 1.00% 98
slowest time diff frames rav1d
 9187: rav1d=  16381141.8 dav1d=  14249326.5 ratio= 1.15, diff=  2131815.30 n_samples@100khz=1638
 9177: rav1d=  16750423.1 dav1d=  14688030.8 ratio= 1.14, diff=  2062392.30 n_samples@100khz=1675
 9186: rav1d=  16635372.1 dav1d=  14628069.6 ratio= 1.14, diff=  2007302.50 n_samples@100khz=1663
 9176: rav1d=  16074932.2 dav1d=  14098569.2 ratio= 1.14, diff=  1976363.00 n_samples@100khz=1607
 9195: rav1d=  16519818.2 dav1d=  14557388.9 ratio= 1.13, diff=  1962429.30 n_samples@100khz=1651
 9183: rav1d=  16585604.4 dav1d=  14623517.8 ratio= 1.13, diff=  1962086.60 n_samples@100khz=1658
 9196: rav1d=  15704063.8 dav1d=  13743497.9 ratio= 1.14, diff=  1960565.90 n_samples@100khz=1570
 9178: rav1d=  16267450.5 dav1d=  14307055.7 ratio= 1.14, diff=  1960394.80 n_samples@100khz=1626
 6865: rav1d=  18500798.6 dav1d=  16552558.0 ratio= 1.12, diff=  1948240.60 n_samples@100khz=1850
 6285: rav1d=  15867283.1 dav1d=  13927933.1 ratio= 1.14, diff=  1939350.00 n_samples@100khz=1586
 9174: rav1d=  15939393.7 dav1d=  14011077.2 ratio= 1.14, diff=  1928316.50 n_samples@100khz=1593
 9193: rav1d=  16047787.1 dav1d=  14123608.6 ratio= 1.14, diff=  1924178.50 n_samples@100khz=1604
 9201: rav1d=  16220579.4 dav1d=  14309487.5 ratio= 1.13, diff=  1911091.90 n_samples@100khz=1622
 6866: rav1d=  15307081.8 dav1d=  13415563.3 ratio= 1.14, diff=  1891518.50 n_samples@100khz=1530
 9194: rav1d=  16024488.5 dav1d=  14140632.3 ratio= 1.13, diff=  1883856.20 n_samples@100khz=1602
 9185: rav1d=  16397133.0 dav1d=  14516647.0 ratio= 1.13, diff=  1880486.00 n_samples@100khz=1639
 9192: rav1d=  16220258.7 dav1d=  14347078.0 ratio= 1.13, diff=  1873180.70 n_samples@100khz=1622
 6484: rav1d=  17064982.4 dav1d=  15206699.6 ratio= 1.12, diff=  1858282.80 n_samples@100khz=1706
 5183: rav1d=  11342964.7 dav1d=   9485317.0 ratio= 1.20, diff=  1857647.70 n_samples@100khz=1134
 9184: rav1d=  15810618.0 dav1d=  13975471.3 ratio= 1.13, diff=  1835146.70 n_samples@100khz=1581
 6476: rav1d=  14224412.7 dav1d=  12389343.1 ratio= 1.15, diff=  1835069.60 n_samples@100khz=1422
 9204: rav1d=  16245757.7 dav1d=  14414383.3 ratio= 1.13, diff=  1831374.40 n_samples@100khz=1624
 6859: rav1d=  11903174.4 dav1d=  10071824.1 ratio= 1.18, diff=  1831350.30 n_samples@100khz=1190
 9169: rav1d=  15610712.9 dav1d=  13779669.9 ratio= 1.13, diff=  1831043.00 n_samples@100khz=1561
 6475: rav1d=  17105298.0 dav1d=  15277185.7 ratio= 1.12, diff=  1828112.30 n_samples@100khz=1710
 9264: rav1d=  16424634.7 dav1d=  14599886.3 ratio= 1.12, diff=  1824748.40 n_samples@100khz=1642
 5543: rav1d=  16431777.2 dav1d=  14610115.7 ratio= 1.12, diff=  1821661.50 n_samples@100khz=1643
 9168: rav1d=  16085764.4 dav1d=  14264471.9 ratio= 1.13, diff=  1821292.50 n_samples@100khz=1608
 6326: rav1d=  14173021.1 dav1d=  12351892.9 ratio= 1.15, diff=  1821128.20 n_samples@100khz=1417
 1738: rav1d=  15070580.8 dav1d=  13253411.8 ratio= 1.14, diff=  1817169.00 n_samples@100khz=1507
 9175: rav1d=  15720511.8 dav1d=  13904022.2 ratio= 1.13, diff=  1816489.60 n_samples@100khz=1572
 9237: rav1d=  16234274.3 dav1d=  14420144.3 ratio= 1.13, diff=  1814130.00 n_samples@100khz=1623
 6874: rav1d=  17035022.3 dav1d=  15224567.3 ratio= 1.12, diff=  1810455.00 n_samples@100khz=1703
 9200: rav1d=  15457874.8 dav1d=  13648365.4 ratio= 1.13, diff=  1809509.40 n_samples@100khz=1545
 1920: rav1d=  15225706.2 dav1d=  13423026.3 ratio= 1.13, diff=  1802679.90 n_samples@100khz=1522
 9181: rav1d=  18296001.4 dav1d=  16494613.2 ratio= 1.11, diff=  1801388.20 n_samples@100khz=1829
 6740: rav1d=  16754291.3 dav1d=  14969318.7 ratio= 1.12, diff=  1784972.60 n_samples@100khz=1675
 9308: rav1d=  16393914.0 dav1d=  14609544.5 ratio= 1.12, diff=  1784369.50 n_samples@100khz=1639
 6862: rav1d=  12294920.7 dav1d=  10513683.5 ratio= 1.17, diff=  1781237.20 n_samples@100khz=1229
 9167: rav1d=  15510178.1 dav1d=  13737766.3 ratio= 1.13, diff=  1772411.80 n_samples@100khz=1551
 6870: rav1d=  11900472.3 dav1d=  10136742.6 ratio= 1.17, diff=  1763729.70 n_samples@100khz=1190
 6480: rav1d=  12547323.6 dav1d=  10784544.8 ratio= 1.16, diff=  1762778.80 n_samples@100khz=1254
 9191: rav1d=  15629694.5 dav1d=  13867475.7 ratio= 1.13, diff=  1762218.80 n_samples@100khz=1562
 1751: rav1d=  14887199.4 dav1d=  13127492.6 ratio= 1.13, diff=  1759706.80 n_samples@100khz=1488
 9317: rav1d=  16231630.4 dav1d=  14473621.4 ratio= 1.12, diff=  1758009.00 n_samples@100khz=1623
 9199: rav1d=  17742668.4 dav1d=  15985142.7 ratio= 1.11, diff=  1757525.70 n_samples@100khz=1774
 9320: rav1d=  16519434.2 dav1d=  14762244.4 ratio= 1.12, diff=  1757189.80 n_samples@100khz=1651
 6284: rav1d=  18506325.2 dav1d=  16749223.2 ratio= 1.10, diff=  1757102.00 n_samples@100khz=1850
 6867: rav1d=  12062536.4 dav1d=  10306431.2 ratio= 1.17, diff=  1756105.20 n_samples@100khz=1206
 6861: rav1d=  12333730.6 dav1d=  10580966.6 ratio= 1.17, diff=  1752764.00 n_samples@100khz=1233
 1569: rav1d=  15048133.7 dav1d=  13298266.5 ratio= 1.13, diff=  1749867.20 n_samples@100khz=1504
 9321: rav1d=  15842388.7 dav1d=  14097614.8 ratio= 1.12, diff=  1744773.90 n_samples@100khz=1584
 9152: rav1d=  15164263.8 dav1d=  13420003.8 ratio= 1.13, diff=  1744260.00 n_samples@100khz=1516
 6561: rav1d=  15694681.4 dav1d=  13953587.1 ratio= 1.12, diff=  1741094.30 n_samples@100khz=1569
 6597: rav1d=  15810137.1 dav1d=  14069158.1 ratio= 1.12, diff=  1740979.00 n_samples@100khz=1581
 6736: rav1d=  12336682.9 dav1d=  10602805.7 ratio= 1.16, diff=  1733877.20 n_samples@100khz=1233
 6869: rav1d=  11539205.0 dav1d=   9810908.2 ratio= 1.18, diff=  1728296.80 n_samples@100khz=1153
 9202: rav1d=  15542952.5 dav1d=  13816646.3 ratio= 1.12, diff=  1726306.20 n_samples@100khz=1554
 9235: rav1d=  17664942.8 dav1d=  15939932.7 ratio= 1.11, diff=  1725010.10 n_samples@100khz=1766
 1660: rav1d=  15183188.3 dav1d=  13460438.7 ratio= 1.13, diff=  1722749.60 n_samples@100khz=1518
 9190: rav1d=  18024098.9 dav1d=  16302028.7 ratio= 1.11, diff=  1722070.20 n_samples@100khz=1802
 9173: rav1d=  15478870.3 dav1d=  13758195.7 ratio= 1.13, diff=  1720674.60 n_samples@100khz=1547
 9268: rav1d=  15699709.9 dav1d=  13983821.0 ratio= 1.12, diff=  1715888.90 n_samples@100khz=1569
 6601: rav1d=  13394198.3 dav1d=  11678414.6 ratio= 1.15, diff=  1715783.70 n_samples@100khz=1339
 9258: rav1d=  16777138.2 dav1d=  15062249.6 ratio= 1.11, diff=  1714888.60 n_samples@100khz=1677
 6773: rav1d=  12501258.9 dav1d=  10790214.3 ratio= 1.16, diff=  1711044.60 n_samples@100khz=1250
 9182: rav1d=  15618810.0 dav1d=  13910518.5 ratio= 1.12, diff=  1708291.50 n_samples@100khz=1561
 6875: rav1d=  13711879.3 dav1d=  12003729.1 ratio= 1.14, diff=  1708150.20 n_samples@100khz=1371
 1894: rav1d=  14820272.2 dav1d=  13112558.7 ratio= 1.13, diff=  1707713.50 n_samples@100khz=1482
 6382: rav1d=  21241665.4 dav1d=  19534148.6 ratio= 1.09, diff=  1707516.80 n_samples@100khz=2124
 9220: rav1d=  15694992.1 dav1d=  13988952.7 ratio= 1.12, diff=  1706039.40 n_samples@100khz=1569
 6741: rav1d=  13998722.6 dav1d=  12293335.9 ratio= 1.14, diff=  1705386.70 n_samples@100khz=1399
 1582: rav1d=  14790423.1 dav1d=  13085234.3 ratio= 1.13, diff=  1705188.80 n_samples@100khz=1479
 6863: rav1d=  11558143.7 dav1d=   9853388.8 ratio= 1.17, diff=  1704754.90 n_samples@100khz=1155
 6856: rav1d=  17887305.3 dav1d=  16185632.4 ratio= 1.11, diff=  1701672.90 n_samples@100khz=1788
 6478: rav1d=  12012756.0 dav1d=  10312156.7 ratio= 1.16, diff=  1700599.30 n_samples@100khz=1201
 6771: rav1d=  12950217.7 dav1d=  11256301.3 ratio= 1.15, diff=  1693916.40 n_samples@100khz=1295
 6682: rav1d=  13404503.6 dav1d=  11710972.6 ratio= 1.14, diff=  1693531.00 n_samples@100khz=1340
 1244: rav1d=  15337421.8 dav1d=  13645752.5 ratio= 1.12, diff=  1691669.30 n_samples@100khz=1533
 9221: rav1d=  15824731.5 dav1d=  14134069.1 ratio= 1.12, diff=  1690662.40 n_samples@100khz=1582
 9318: rav1d=  15718866.8 dav1d=  14032765.0 ratio= 1.12, diff=  1686101.80 n_samples@100khz=1571
 9205: rav1d=  15487603.5 dav1d=  13803014.7 ratio= 1.12, diff=  1684588.80 n_samples@100khz=1548
 6868: rav1d=  11611749.2 dav1d=   9928509.6 ratio= 1.17, diff=  1683239.60 n_samples@100khz=1161
 9228: rav1d=  15775589.1 dav1d=  14093263.5 ratio= 1.12, diff=  1682325.60 n_samples@100khz=1577
 9214: rav1d=  15145493.5 dav1d=  13464096.5 ratio= 1.12, diff=  1681397.00 n_samples@100khz=1514
 2128: rav1d=  15430232.0 dav1d=  13749780.9 ratio= 1.12, diff=  1680451.10 n_samples@100khz=1543
 9326: rav1d=  15879450.0 dav1d=  14199993.8 ratio= 1.12, diff=  1679456.20 n_samples@100khz=1587
 6745: rav1d=  12257767.0 dav1d=  10578696.4 ratio= 1.16, diff=  1679070.60 n_samples@100khz=1225
 6560: rav1d=  18572310.0 dav1d=  16896700.0 ratio= 1.10, diff=  1675610.00 n_samples@100khz=1857
 1699: rav1d=  15775983.9 dav1d=  14101787.6 ratio= 1.12, diff=  1674196.30 n_samples@100khz=1577
 9265: rav1d=  15795137.0 dav1d=  14121877.4 ratio= 1.12, diff=  1673259.60 n_samples@100khz=1579
 9347: rav1d=  16168011.8 dav1d=  14503087.7 ratio= 1.11, diff=  1664924.10 n_samples@100khz=1616
 1764: rav1d=  14548468.9 dav1d=  12883636.9 ratio= 1.13, diff=  1664832.00 n_samples@100khz=1454
 6602: rav1d=  12718329.3 dav1d=  11056896.8 ratio= 1.15, diff=  1661432.50 n_samples@100khz=1271
 9319: rav1d=  15850951.6 dav1d=  14192264.5 ratio= 1.12, diff=  1658687.10 n_samples@100khz=1585
 9236: rav1d=  15517922.5 dav1d=  13859411.6 ratio= 1.12, diff=  1658510.90 n_samples@100khz=1551
 9203: rav1d=  15475966.7 dav1d=  13818568.1 ratio= 1.12, diff=  1657398.60 n_samples@100khz=1547
 6738: rav1d=  11829454.9 dav1d=  10172663.3 ratio= 1.16, diff=  1656791.60 n_samples@100khz=1182
[1244,1569,1582,1660,1699,1738,1751,1764,1894,1920,2128,5183,5543,6284,6285,6326,6382,6475,6476,6478,6480,6484,6560,6561,6597,6601,6602,6682,6736,6738,6740,6741,6745,6771,6773,6856,6859,6861,6862,6863,6865,6866,6867,6868,6869,6870,6874,6875,9152,9167,9168,9169,9173,9174,9175,9176,9177,9178,9181,9182,9183,9184,9185,9186,9187,9190,9191,9192,9193,9194,9195,9196,9199,9200,9201,9202,9203,9204,9205,9214,9220,9221,9228,9235,9236,9237,9258,9264,9265,9268,9308,9317,9318,9319,9320,9321,9326,9347]
would get 150186 samples

And a graph of these times:

Image

I then also recorded some CPU counters. The counters were: instructions, branch instructions, branch mispredictions, L1 dcm, L2 dcm, L2 icm, TLB dm, TLB im, branch not taken, and l1 dca. dcm/icm is data/instruction cache miss, dm/im is data/instruction miss. I then tried to look at how well those counters correlate with the time differences. I have to record groups of 4 counters at a time because of hardware limits. The output is pasted below for each group, and also calculates the correlation for the slowest X% of frames.

Frame counter output of `analyzeframestats.py`
============== ins_vec_br_miss ==========
------- rav1d ---------
             frame       elapsed  PAPI_TOT_INS  PAPI_VEC_INS   PAPI_BR_INS   PAPI_BR_MSP
count  9831.000000  9.831000e+03  9.831000e+03  9.831000e+03  9.831000e+03  9.831000e+03
mean   4915.000000  7.815855e+06  8.211349e+07  4.891480e+04  6.525285e+06  3.097026e+05
std    2838.109582  5.000538e+06  5.426068e+07  1.139955e+05  4.116764e+06  2.094043e+05
min       0.000000  2.206650e+05  1.272693e+06  0.000000e+00  1.491260e+05  6.100000e+02
25%    2457.500000  2.937480e+06  2.695083e+07  0.000000e+00  2.396874e+06  8.983950e+04
50%    4915.000000  9.506489e+06  1.018283e+08  0.000000e+00  7.800958e+06  3.766300e+05
75%    7372.500000  1.155520e+07  1.215945e+08  4.488000e+04  9.497878e+06  4.648815e+05
max    9830.000000  4.207895e+07  4.768892e+08  1.713824e+06  3.753090e+07  1.962799e+06
elapsed total time 76.837667799s

------- dav1d ---------
             frame       elapsed  PAPI_TOT_INS  PAPI_VEC_INS   PAPI_BR_INS   PAPI_BR_MSP
count  9831.000000  9.831000e+03  9.831000e+03  9.831000e+03  9.831000e+03  9.831000e+03
mean   4915.000000  7.496412e+06  8.211351e+07  4.891480e+04  6.525283e+06  3.069317e+05
std    2838.109582  4.838445e+06  5.426066e+07  1.139955e+05  4.116763e+06  2.077236e+05
min       0.000000  2.165070e+05  1.272790e+06  0.000000e+00  1.491260e+05  6.190000e+02
25%    2457.500000  2.591232e+06  2.695084e+07  0.000000e+00  2.396878e+06  8.819150e+04
50%    4915.000000  9.183322e+06  1.018283e+08  0.000000e+00  7.800956e+06  3.731340e+05
75%    7372.500000  1.108462e+07  1.215945e+08  4.488000e+04  9.497868e+06  4.605740e+05
max    9830.000000  4.170451e+07  4.768893e+08  1.713824e+06  3.753090e+07  1.969427e+06

------- diff ---------
        frame       elapsed  PAPI_TOT_INS  PAPI_VEC_INS  PAPI_BR_INS  PAPI_BR_MSP
count  9831.0  9.831000e+03   9831.000000        9831.0  9831.000000   9831.00000
mean      0.0  3.194431e+05    -13.404232           0.0     1.628013   2770.94914
std       0.0  4.765613e+05     42.183340           0.0    10.234954   6877.72657
min       0.0 -3.618311e+06   -503.000000           0.0  -103.000000 -56295.00000
25%       0.0  1.286550e+04    -27.000000           0.0     0.000000   -104.50000
50%       0.0  1.625270e+05      0.000000           0.0     0.000000    711.00000
75%       0.0  5.367530e+05      1.000000           0.0     1.000000   3976.50000
max       0.0  4.064331e+06    407.000000           0.0   123.000000  80797.00000
---- corr of everything --- 
Correlation of elapsed to    PAPI_TOT_INS =  0.15
Correlation of elapsed to     PAPI_BR_INS =  0.21
Correlation of elapsed to     PAPI_BR_MSP =  0.76

---- corr of slowest 20% --- 
Correlation of elapsed to    PAPI_TOT_INS =  0.18
Correlation of elapsed to     PAPI_BR_INS =  0.22
Correlation of elapsed to     PAPI_BR_MSP =  0.75
---- corr of slowest 10% --- 
Correlation of elapsed to    PAPI_TOT_INS =  0.19
Correlation of elapsed to     PAPI_BR_INS =  0.22
Correlation of elapsed to     PAPI_BR_MSP =  0.76
---- corr of slowest 5% --- 
Correlation of elapsed to    PAPI_TOT_INS =  0.18
Correlation of elapsed to     PAPI_BR_INS =  0.22
Correlation of elapsed to     PAPI_BR_MSP =  0.76
---- corr of slowest 1% --- 
Correlation of elapsed to    PAPI_TOT_INS =  0.16
Correlation of elapsed to     PAPI_BR_INS =  0.21
Correlation of elapsed to     PAPI_BR_MSP =  0.76

============== l1d_l2d_l2i_tlbd ==========
------- rav1d ---------
             frame       elapsed   PAPI_L1_DCM    PAPI_L2_DCM    PAPI_L2_ICM    PAPI_TLB_DM
count  9831.000000  9.831000e+03  9.831000e+03    9831.000000    9831.000000    9831.000000
mean   4915.000000  8.192587e+06  1.198464e+06  261203.978639  239563.264266  376528.263656
std    2838.109582  5.265061e+06  5.204459e+05   82804.093406  119197.818100  198517.593662
min       0.000000  2.335580e+05  2.049120e+05   58358.000000   10761.000000   35975.000000
25%    2457.500000  2.869926e+06  8.012055e+05  193242.500000  146015.000000  199737.000000
50%    4915.000000  1.000468e+07  1.344840e+06  278218.000000  288323.000000  429628.000000
75%    7372.500000  1.212662e+07  1.577246e+06  327860.500000  324695.000000  517036.500000
max    9830.000000  4.294083e+07  2.501230e+06  448502.000000  637931.000000  882615.000000
elapsed total time 80.54131886s

------- dav1d ---------
             frame       elapsed   PAPI_L1_DCM    PAPI_L2_DCM    PAPI_L2_ICM    PAPI_TLB_DM
count  9831.000000  9.831000e+03  9.831000e+03    9831.000000    9831.000000    9831.000000
mean   4915.000000  7.522485e+06  1.073697e+06  244132.600651  220122.325806  334866.711016
std    2838.109582  4.862919e+06  4.498037e+05   73069.677012  107827.584164  174699.292661
min       0.000000  2.157950e+05  1.968110e+05   60735.000000   10441.000000   34220.000000
25%    2457.500000  2.613770e+06  7.526455e+05  185243.000000  141525.500000  181827.500000
50%    4915.000000  9.203498e+06  1.202949e+06  260521.000000  256773.000000  378850.000000
75%    7372.500000  1.112855e+07  1.396412e+06  302753.500000  295655.500000  458210.500000
max    9830.000000  4.187731e+07  2.196017e+06  410406.000000  599332.000000  790612.000000

------- diff ---------
        frame       elapsed    PAPI_L1_DCM   PAPI_L2_DCM   PAPI_L2_ICM    PAPI_TLB_DM
count  9831.0  9.831000e+03    9831.000000   9831.000000   9831.000000    9831.000000
mean      0.0  6.701012e+05  124766.617740  17071.377988  19440.938460   41661.552640
std       0.0  4.978306e+05   75619.540246  16177.914195  14917.998467   25547.405588
min       0.0 -3.063915e+06  -25812.000000 -56229.000000 -45866.000000   -3116.000000
25%       0.0  2.464290e+05   47928.500000   4041.000000   6144.000000   15841.000000
50%       0.0  6.814990e+05  146434.000000  15519.000000  21926.000000   48846.000000
75%       0.0  1.032891e+06  181846.000000  28336.500000  31236.000000   61612.000000
max       0.0  3.724840e+06  355201.000000  83870.000000  89851.000000  135583.000000
---- corr of everything --- 
Correlation of elapsed to     PAPI_L1_DCM =  0.85
Correlation of elapsed to     PAPI_L2_DCM =  0.58
Correlation of elapsed to     PAPI_L2_ICM =  0.72
Correlation of elapsed to     PAPI_TLB_DM =  0.85

---- corr of slowest 20% --- 
Correlation of elapsed to     PAPI_L1_DCM =  0.79
Correlation of elapsed to     PAPI_L2_DCM =  0.47
Correlation of elapsed to     PAPI_L2_ICM =  0.56
Correlation of elapsed to     PAPI_TLB_DM =  0.80
---- corr of slowest 10% --- 
Correlation of elapsed to     PAPI_L1_DCM =  0.84
Correlation of elapsed to     PAPI_L2_DCM =  0.54
Correlation of elapsed to     PAPI_L2_ICM =  0.66
Correlation of elapsed to     PAPI_TLB_DM =  0.85
---- corr of slowest 5% --- 
Correlation of elapsed to     PAPI_L1_DCM =  0.86
Correlation of elapsed to     PAPI_L2_DCM =  0.57
Correlation of elapsed to     PAPI_L2_ICM =  0.70
Correlation of elapsed to     PAPI_TLB_DM =  0.87
---- corr of slowest 1% --- 
Correlation of elapsed to     PAPI_L1_DCM =  0.88
Correlation of elapsed to     PAPI_L2_DCM =  0.58
Correlation of elapsed to     PAPI_L2_ICM =  0.72
Correlation of elapsed to     PAPI_TLB_DM =  0.88

============== tlbi_brntk_l1a ==========
------- rav1d ---------
             frame       elapsed   PAPI_TLB_IM   PAPI_BR_NTK   PAPI_L1_DCA
count  9831.000000  9.831000e+03   9831.000000  9.831000e+03  9.831000e+03
mean   4915.000000  8.169632e+06   4020.198250  4.579056e+06  3.970638e+07
std    2838.109582  5.193572e+06   2927.850272  2.892591e+06  2.632451e+07
min       0.000000  2.361530e+05     46.000000  7.088500e+04  7.944690e+05
25%    2457.500000  2.977706e+06    998.000000  1.656762e+06  1.306605e+07
50%    4915.000000  1.002562e+07   4964.000000  5.498826e+06  4.877599e+07
75%    7372.500000  1.208258e+07   6264.000000  6.725635e+06  5.924107e+07
max    9830.000000  4.114268e+07  31393.000000  2.329288e+07  2.209375e+08
elapsed total time 80.315647572s

------- dav1d ---------
             frame       elapsed  PAPI_TLB_IM   PAPI_BR_NTK   PAPI_L1_DCA
count  9831.000000  9.831000e+03  9831.000000  9.831000e+03  9.831000e+03
mean   4915.000000  7.550308e+06   686.844980  2.162614e+06  3.529227e+07
std    2838.109582  4.877032e+06   410.031248  1.283278e+06  2.396374e+07
min       0.000000  2.156850e+05    28.000000  3.507100e+04  6.919480e+05
25%    2457.500000  2.628189e+06   342.000000  8.737255e+05  1.085096e+07
50%    4915.000000  9.289040e+06   784.000000  2.532062e+06  4.338576e+07
75%    7372.500000  1.115961e+07   940.000000  3.071350e+06  5.282891e+07
max    9830.000000  4.153261e+07  4872.000000  1.165186e+07  2.152608e+08

------- diff ---------
        frame       elapsed   PAPI_TLB_IM   PAPI_BR_NTK   PAPI_L1_DCA
count  9831.0  9.831000e+03   9831.000000  9.831000e+03  9.831000e+03
mean      0.0  6.193239e+05   3333.353270  2.416442e+06  4.414108e+06
std       0.0  4.756888e+05   2600.994883  1.621798e+06  2.474237e+06
min       0.0 -6.300309e+06   -452.000000  3.581400e+04  9.934700e+04
25%       0.0  2.691650e+05    634.500000  7.863975e+05  2.236618e+06
50%       0.0  6.108070e+05   3971.000000  3.021118e+06  5.278153e+06
75%       0.0  9.334125e+05   5317.000000  3.656908e+06  6.333852e+06
max       0.0  6.330616e+06  30386.000000  1.164102e+07  1.324842e+07
---- corr of everything --- 
Correlation of elapsed to     PAPI_TLB_IM =  0.69
Correlation of elapsed to     PAPI_BR_NTK =  0.65
Correlation of elapsed to     PAPI_L1_DCA =  0.72

---- corr of slowest 20% --- 
Correlation of elapsed to     PAPI_TLB_IM =  0.62
Correlation of elapsed to     PAPI_BR_NTK =  0.57
Correlation of elapsed to     PAPI_L1_DCA =  0.67
---- corr of slowest 10% --- 
Correlation of elapsed to     PAPI_TLB_IM =  0.70
Correlation of elapsed to     PAPI_BR_NTK =  0.66
Correlation of elapsed to     PAPI_L1_DCA =  0.75
---- corr of slowest 5% --- 
Correlation of elapsed to     PAPI_TLB_IM =  0.73
Correlation of elapsed to     PAPI_BR_NTK =  0.70
Correlation of elapsed to     PAPI_L1_DCA =  0.78
---- corr of slowest 1% --- 
Correlation of elapsed to     PAPI_TLB_IM =  0.75
Correlation of elapsed to     PAPI_BR_NTK =  0.72
Correlation of elapsed to     PAPI_L1_DCA =  0.79

This counter work was done in this rav1d branch and this dav1d branch


So then I wanted to see why these frames were slower, so I hacked in a way to enable+disable perf record for each of the slow frames in this branch. It gets used like this:

mkfifo fifo
perf record -g --delay=-1 --control=fifo:fifo target/release-with-debug/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1

and the perf record can be adjusted to record whatever you want. This outputs a single perf.data with only samples from the slow frames (hardcoded into the code).


This still isn't granular enough though so I've been thinking of more targeted analysis that could better pinpoint exactly which points are slow. I have a few rough ideas and most revolve around being able to repeatedly run a specific frame (or call) with slightly different perf options (either via perf record out of process or directly with perf_event_open in process). Is there a (hopefully easy) way to hack in a flag that could repeatedly decode a single frame, or perhaps also the N previous frames to allow for realistic warmup of caches? Otherwise I can use the --skip flag which I just discovered.

Ok I think that's it, let me know if I forgot to include something somewhere

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions