The basic loop structure of Snabb is the following (simplified):

```lua
repeat
   for i = 1, #breathe_pull_order do
      breathe_pull_order[i]:pull()
   end
   for i = 1, #breathe_push_order do
      breathe_push_order[i]:push()
   end
until done and done()
```
The actual workload is processed in the `pull()` and `push()` methods, which, in general, contain loops of the kind:

```lua
for _ = 1, link.nreadable(link) do
   local p = link.receive(link)
   -- Process packet
end
```
Those are the loops that we would like to see compiled most efficiently, i.e. with an actual loop in the trace itself (a "looping" trace). Let's assume that the compiler actually does find those traces (but note that this is by no means guaranteed in a complex Snabb program). Due to the design of the tracing compiler, the outer loops cannot form looping traces themselves and end up being implemented by connecting distinct "straight" traces. Those traces are considerably less optimized, and their number and inner structure can have a significant impact on the overall performance of the program.
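As an aside, the trace structure discussed throughout this issue can be inspected with LuaJIT's bundled `jit.dump` module; here is a minimal sketch (the flag string and output file name are just examples):

```lua
-- Enable LuaJIT's trace dumper before running the workload.
-- Flags: t = trace events, b = bytecode, i = IR, m = machine code.
local dump = require("jit.dump")
dump.on("tbim", "trace-dump.txt")

-- ... run the workload to be analyzed ...

dump.off()
```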
I would like to argue that it is beneficial to replace the explicit second-level loops over the `breathe_pull_order` and `breathe_push_order` arrays by implicit loops using `goto`. The reason this makes a difference is that LuaJIT only treats the dedicated loop bytecodes (emitted for `for`, `while` and `repeat`) as hot-loop candidates, while a backward `goto` compiles to a plain jump, so the flattened iteration simply gets unrolled into the trace of the enclosing loop:
```lua
repeat
   local i = 1
   ::PULL_LOOP::
   do
      if i > #breathe_pull_order then goto PULL_EXIT end
      breathe_pull_order[i]:pull()
      i = i + 1
      goto PULL_LOOP
   end
   ::PULL_EXIT::
   i = 1
   ::PUSH_LOOP::
   do
      if i > #breathe_push_order then goto PUSH_EXIT end
      breathe_push_order[i]:push()
      i = i + 1
      goto PUSH_LOOP
   end
   ::PUSH_EXIT::
until done and done()
```
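One way to see why the two forms are treated differently (a minimal sketch for stock LuaJIT; the table argument `t` is a hypothetical stand-in for the app arrays): listing the bytecode with the bundled `jit.bc` module shows that the `for` version contains a `FORL` instruction, which carries a hot-loop counter, while the `goto` version ends in a plain backward `JMP`, which does not.

```lua
-- Compare the bytecode of the two loop styles with LuaJIT's bundled
-- jit.bc module.
local bc = require("jit.bc")

local function for_loop (t)
   for i = 1, #t do t[i]:pull() end
end

local function goto_loop (t)
   local i = 1
   ::LOOP::
   do
      if i > #t then goto DONE end
      t[i]:pull()
      i = i + 1
      goto LOOP
   end
   ::DONE::
end

bc.dump(for_loop)   -- contains FORI/FORL (a hot-loop candidate)
bc.dump(goto_loop)  -- only comparisons and a plain backward JMP
```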
To study the impact of this, consider the following simplified model of the `breathe()` loop:

```lua
local C = require("ffi").C

local napps = 5
local iterations = 1e8
local apps = {}

-- First parameter: loop style ("std" selects the traditional nested
-- loops, any other value the flattened goto loop).
local mode = "std"
if #main.parameters >= 1 then
   mode = main.parameters[1]
end

-- count and threshold are deliberately global so that the pull methods
-- compiled via loadstring() below can access them.
count = 0
threshold = 1000
-- Second parameter: number of main-loop iterations after which the
-- pull methods start doing actual work.
if #main.parameters >= 2 then
   threshold = tonumber(main.parameters[2])
end

-- Create each pull method with a separate loadstring() call so that
-- every method gets a distinct function prototype.
local function new_App ()
   local App = {}
   App.pull = loadstring([[
      return function (self)
         if count > threshold then
            for _ = 1, 100 do
               self.i = self.i + 1
            end
         end
      end]])()
   return App
end

for i = 1, napps do
   apps[i] = setmetatable({ i = 0 }, { __index = new_App() })
end

local start = C.get_time_ns()
if mode == "std" then
   repeat
      for i = 1, napps do
         apps[i]:pull()
      end
      count = count + 1
   until count > iterations
else
   repeat
      local i = 1
      ::LOOP::
      do
         if i > napps then goto DONE end
         apps[i]:pull()
         i = i + 1
         goto LOOP
      end
      ::DONE::
      count = count + 1
   until count > iterations
end
local now = C.get_time_ns()
print("elapsed "..(tonumber(now - start)/1e9).." seconds")
```
The program instantiates 5 pseudo-apps, each of which has a trivial `pull` method that simply increments a counter. Those methods are called from a nested loop in the style of a `breathe` loop. For simplicity, we omit the `push` loop, which does not change our findings in a fundamental way because the `pull` and `push` loops are sequential (i.e. the overall loop structure is the same in a "topological" sense).
The program takes two arguments. The first is a string that selects which style of loop should be used: `std` will use the traditional nested loops, while any other value will use the "flattened" loop using `goto`. The main `repeat` loop keeps track of its iterations in the global variable `count`, and the program terminates after 1e8 iterations, printing the time in seconds that has elapsed while the loop was being executed. The second parameter is a number that determines after how many iterations of the main loop the loops in the `pull` methods should start executing. This allows us to simulate a Snabb program which is idle for a while right after starting. I believe that this is actually the case in most real-world applications.
To simulate diversity in the apps being run, the actual functions that implement the `pull` methods are created via `loadstring()`. This ensures that each of them is backed by a distinct function prototype.
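As an illustration (a minimal standalone sketch, not part of the model above): closures created from the same source location share a single prototype, while each `loadstring()` call compiles a fresh chunk with its own prototype.

```lua
-- Closures from one source location share a single function prototype,
-- so a trace guard specialized on it can cover all of them.
local function make_shared ()
   return function (self) self.i = self.i + 1 end
end
local f1, f2 = make_shared(), make_shared()

-- Each loadstring() call compiles its own chunk, so the resulting
-- functions have distinct prototypes, mimicking genuinely different
-- app classes.
local src = "return function (self) self.i = self.i + 1 end"
local g1, g2 = loadstring(src)(), loadstring(src)()
```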
Let's start by studying a completely idle system with the standard loops. Using `perf top` we see 7 active traces (sim-empty-std.txt). The only loop that gets compiled is the `for i = 1, napps do` loop. Let's look at trace #2.
This looks like a good trace, but note the guard for the address of the `pull` function:

```
0019       int FLOAD  0018  tab.hmask
0020 >     int EQ     0019  +1
0021 r9    p32 FLOAD  0018  tab.node
0022 >     p32 HREFK  0021  "pull" @1
0023 >     fun HLOAD  0022
0024 >     fun EQ     0023  0x40ce9090:1
```
This address is different for each pseudo-app. What happens right after the compiler has finished recording this loop iteration? Since this was a successful trace, the resulting code is optimized and compiled to a proper looping root trace, and the system continues by executing the next iteration of the loop from that freshly compiled trace. However, in that very next iteration the mentioned guard fails, because the `pull` method of the next app is different from the previous one. As a consequence, the trace is left to the interpreter via exit #0 right away, before the actual loop has even been reached.
After this has happened 10 times (the default number of passes through a side exit after which it gets hot), the compiler starts recording a side trace.
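This threshold is one of LuaJIT's tunable optimization parameters (`hotexit`); a minimal sketch of raising it from Lua, where the value 20 is just an example:

```lua
-- Raise the side-exit hotness threshold from its default of 10.
require("jit.opt").start("hotexit=20")
```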
You can see that the address of the `pull` method is now `0x40ce8c18`, where it was `0x40ce9090` in trace #2. Normally, a side trace cannot form a loop, but this case is an exception: the exit happens so early that the code has not yet been specialized enough to prohibit the formation of a new loop. The loop being recorded in trace #3 looks exactly like the previous trace and, in fact, the exact same thing happens to it, i.e. the trace is left right away in the next iteration.
The same game is repeated in traces #4 and #5. Sometimes one of those traces is entered with the expected `pull` method and execution proceeds into the actual loop, but since the next iteration will always cause the same guard failure, it never reaches the `PHI` at the end of the loop. Eventually, such an exit becomes hot as well, which is the case for trace #6, which is started from exit #3 of trace #3. It executes the `pull` method and returns to the loop which has been compiled in trace #2 and therefore links directly to that trace.
Trace #7 is again like #2, #3, #4 and #5, and traces #8, #9 and #10 are of the same kind as #6.
The good news is that everything is running compiled. The bad news is that none of the optimized loops are actually executed.
Let's compare this with the "flattened" loop. We can already see that the elapsed time is smaller by a factor of almost 30, and the program compiles to a single trace. Essentially, the flattened loop is like an explicit unrolling which removes the guards for the individual function prototypes. The loop is super compact and contains only the actual function calls, one after the other.
In this extreme case, where none of the apps share the same `pull` function, this is actually the optimum. But any real-world Snabb program will probably also have a significant number of distinct apps in its network, which makes this example relevant.
Now let's look at what happens if the apps get busy after a short while, e.g. after 1000 iterations of the main loop. During the first 1000 iterations, the compiler will find the exact same traces as just discussed. But what happens once the `pull` methods start executing their own loops? There are 20 active traces (sim-std.txt).
As expected, we see the traces #2, #3, #4, #5 and #7 from the initial phase before the innermost loops become active. The five most active traces (#11-#15) all look like this
```
---- TRACE 11 start 0x41779eb0:3
0009  TGETS    5   0   2  ; "i"
0010  ADDVN    5   5   0  ; 1
0011  TSETS    5   0   2  ; "i"
0012  FORL     1 => 0009
---- TRACE 11 IR
....        SNAP   #0   [ ---- ]
0001 rbp      int SLOAD  #2    CI
0002 rdx   >  tab SLOAD  #1    T
0003          int FLOAD  0002  tab.hmask
0004       >  int EQ     0003  +1
0005 rcx      p32 FLOAD  0002  tab.node
0006 rax   >  p32 HREFK  0005  "i"  @0
0007 xmm7  >  num HLOAD  0006
0008 xmm7   + num ADD    0007  +1
0009          num HSTORE 0006  0008
0010 rbp    + int ADD    0001  +1
....        SNAP   #1   [ ---- ---- ]
0011       >  int LE     0010  +100
....        SNAP   #2   [ ---- ---- 0010 ---- ---- 0010 ]
0012 ------------ LOOP ------------
0013 xmm7   + num ADD    0008  +1
0014          num HSTORE 0006  0013
0015 rbp    + int ADD    0010  +1
....        SNAP   #3   [ ---- ---- ]
0016       >  int LE     0015  +100
0017 rbp      int PHI    0010  0015
0018 xmm7     num PHI    0008  0013
```
Clearly, those are the `pull` loops for each of the 5 apps, and they are all compiled nicely, which is good. All remaining traces are simple "straight" traces that connect these loops with each other and with the initial traces.
Compare this with the flattened loop (sim-flat.txt). We can see trace #2 from the "empty" phase, and traces #3, #4, #5, #6 and #7 contain the exact same looping traces for the `pull` methods. Also here, there are a handful of straight traces connecting all the loops.
The difference is that in the standard scenario there is a significantly larger amount of relatively inefficient code, which translates into a running time that is about 20% higher than in the flattened case.
As a final test, consider the case when none of the apps is idle when the program starts (sim-full-std.txt).
Here, the innermost loops are naturally compiled first, and none of the side traces observed in the "empty" scenario occur. In fact, the trace structure is essentially the same as in the "flat + empty" case. How does it compare to the modified loop (sim-full-flat.txt)? In the "full" scenario, both looping mechanisms are equivalent.
The conclusion is that we can significantly improve performance in the case where the system is idle for a while when it starts up, without any negative effects in the case where the system is loaded immediately. I believe the former case is actually the more common one in realistic scenarios.
Based on these findings, I propose to apply the "flattening" of the `breathe_pull_order` and `breathe_push_order` loops in `core.app`.
I have tested this modification with the l2vpn program in a production-like environment. The engine contained about 30 apps (20 different app classes) and about 60 links. In any realistic scenario, chances are very high that the system is idle after a regular start, and I confirmed that it exhibits exactly the behavior described in this issue.
The "flattening" of the loops has reduced the noise of the compiler significantly, in particular when confronted with changing workloads (e.g. direction of bulk traffic from encap-heavy to decap-heavy, from ipv4 only to ipv6 only, from unfragmented to fragmented traffic). The observed effect is that there are much fewer aborted traces and much fewer implicit blacklistings of side traces. Performance has become much more stable across all workloads and "catastrophic JIT failures" have practically vanished. Those failures would result in highly suboptimal traces or even substantial interpreter fallback (>50%) when the system was exposed to a wide range of different workloads.
I am confident enough to include this change in the next production release of l2vpn. I would be very much interested to see the effect this will have on other Snabb programs.
The goal of this change is to simplify the loop structure of the Snabb
engine to make it more likely that the compiler can find and optimize
the packet-processing loops. In particular, it can help in the
situation when the program starts in an idle state with no packets to
process. In that case, we want to prevent traces compiled for this
initial workload from having a negative impact on the creation of
traces for the actual (non-idle) workload.
See snabbco#1414 for an analysis with
a simplified engine.