Crash in SNAT during reference counter increase #621

Open
@PlagueCZ

Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7faf267026c0 (LWP 16))]
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007faf290dbe9f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007faf2908cfb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007faf29077472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00007faf29a09a4d in __rte_panic () from /usr/local/lib/x86_64-linux-gnu/librte_eal.so.24
#5  0x0000564aee2df5d3 in dp_ref_inc (ref=0x1249d94b08) at ../include/dp_refcount.h:36
#6  0x0000564aee2dfb0d in dp_process_ipv4_snat (snat_data=0x1249f49180, port=0x564afe4b4d40, cntrack=0x1249d94a40, df=0x11ea640a80, m=0x11ea640a00) at ../src/nodes/snat_node.c:74
#7  get_next_index (node=0x125a37f540, m=0x11ea640a00) at ../src/nodes/snat_node.c:175
#8  0x0000564aee2e0455 in dp_foreach_graph_packet (get_next_index=0x564aee2df60c <get_next_index>, speculated_node=1, nb_objs=1, objs=0x124833db80, node=0x125a37f540, graph=0x125a367700) at ../include/nodes/common_node.h:45
#9  snat_node_process (graph=0x125a367700, node=0x125a37f540, objs=0x124833db80, nb_objs=1) at ../src/nodes/snat_node.c:248
#10 0x0000564aee39b0d1 in __rte_node_process (node=0x125a37f540, graph=0x125a367700) at /usr/local/include/rte_graph_worker_common.h:186
#11 rte_graph_walk_rtc (graph=0x125a367700) at /usr/local/include/rte_graph_model_rtc.h:42
#12 0x0000564aee39b41d in rte_graph_walk (graph=0x125a367700) at /usr/local/include/rte_graph_worker.h:38
#13 0x0000564aee39b88a in graph_main_loop (arg=0x0) at ../src/dpdk_layer.c:117
#14 0x00007faf29a1e1b6 in eal_thread_loop () from /usr/local/lib/x86_64-linux-gnu/librte_eal.so.24
#15 0x00007faf29a2fe09 in eal_worker_thread_loop () from /usr/local/lib/x86_64-linux-gnu/librte_eal.so.24
#16 0x00007faf290da144 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#17 0x00007faf2915a7dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

This has happened multiple times in OSC. Looking at the code, this is caused by an improper order of operations at snat_node.c:74:

  • dp_delete_flow() calls dp_ref_dec(), which can bring the counter down to zero
  • a counter of zero causes the resources to be freed
  • a new flow is created to replace the (already deleted) one
  • dp_ref_inc() is then called on the already freed reference
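
The failing sequence above can be sketched with a small stand-alone model. This is only an illustration under assumptions: `struct model_ref`, `model_ref_dec()`, `model_ref_inc()` and `model_replace_flow_delete_first()` are hypothetical names, not the actual dp_refcount.h or snat_node.c code, and a `released` flag stands in for the real freeing of memory so that the sketch stays well-defined.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the real reference counter (not the dp_refcount.h API). */
struct model_ref {
	int  count;
	bool released;  /* set when the count reaches zero, instead of actually freeing memory */
};

/* Drop one reference; reaching zero marks the object as released. */
static void model_ref_dec(struct model_ref *ref)
{
	if (--ref->count == 0)
		ref->released = true;
}

/* Take one reference; returns false where the real code would panic,
 * i.e. when the object has already been released. */
static bool model_ref_inc(struct model_ref *ref)
{
	if (ref->released)
		return false;
	ref->count++;
	return true;
}

/* Failing order: drop the old flow's reference first, then take one for its replacement. */
static bool model_replace_flow_delete_first(struct model_ref *ref)
{
	model_ref_dec(ref);         /* stands in for dp_delete_flow() */
	return model_ref_inc(ref);  /* stands in for dp_ref_inc() on behalf of the new flow */
}
```

In this model the replacement only fails when the old flow held the last reference; with two or more references the order goes unnoticed, which would match the crash being intermittent.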

I have created a temporary fix for OSC that simply changes the order to:

  • dp_ref_inc()
  • dp_delete_flow()
  • only then create and replace the flow
  • if this creation fails, dp_ref_dec() is needed to revert the previous increase
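
The reordered sequence can be sketched the same way. Again this is a hypothetical model rather than the actual OSC patch: `struct sketch_ref`, the `sketch_*` helpers and the `create_ok` parameter (which simulates whether creating the replacement flow succeeds) are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the real reference counter (not the dp_refcount.h API). */
struct sketch_ref {
	int  count;
	bool released;  /* set when the count reaches zero, instead of actually freeing memory */
};

/* Drop one reference; reaching zero marks the object as released. */
static void sketch_ref_dec(struct sketch_ref *ref)
{
	if (--ref->count == 0)
		ref->released = true;
}

/* Take one reference; fails if the object has already been released. */
static bool sketch_ref_inc(struct sketch_ref *ref)
{
	if (ref->released)
		return false;
	ref->count++;
	return true;
}

/* Reordered replacement: take the new reference before dropping the old one. */
static bool sketch_replace_flow_inc_first(struct sketch_ref *ref, bool create_ok)
{
	if (!sketch_ref_inc(ref))   /* stands in for dp_ref_inc() */
		return false;
	sketch_ref_dec(ref);        /* stands in for dp_delete_flow(); count stays above zero */
	if (!create_ok) {
		sketch_ref_dec(ref);    /* creation failed: revert the earlier increase */
		return false;
	}
	return true;
}
```

With this order the counter cannot reach zero between deleting the old flow and creating the new one, even if the old flow held the only reference.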

I stand by this order of operations, but I am also aware that this situation should never happen, as there should always be at least 2 references for a flow. Even so, from a local code-review point of view, the operations should simply be done in this order to avoid confusion.

The next question is why the situation arises at all, because this fix only cures the symptom and not the cause. This investigation is still ongoing in OSC.

I have not yet created a PR because I think there may be better solutions, and some discussion is surely needed before making any big changes.


Projects

Status

OnHold
