Direct AMO mode still shows about 2x the overhead of SOS
Benchmark: osu_oshm_atomics_all2one (shmem_int_finc -> MPI FOP)
Platform      SOS      direct-amo
Theta/np=2    14.88    39.22
Cori/np=2     2.67     4.21
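
For reference, the overhead factor per platform can be worked out from the numbers above. This is only a small arithmetic sketch: the dictionary keys and the `overhead_ratio` helper are names made up for illustration, and the values are copied as reported, with no assumption about their units.

```python
# Measurements copied from the benchmark numbers above (units as reported).
results = {
    "Theta/np=2": {"sos": 14.88, "direct_amo": 39.22},
    "Cori/np=2": {"sos": 2.67, "direct_amo": 4.21},
}


def overhead_ratio(sos, direct_amo):
    """Return how many times larger the direct-amo number is than the SOS one."""
    return direct_amo / sos


# Ratio of direct-amo to SOS for each platform.
ratios = {name: overhead_ratio(**r) for name, r in results.items()}

for name, ratio in ratios.items():
    print(f"{name}: direct-amo / SOS = {ratio:.2f}x")
```

The gap is noticeably larger on Theta than on Cori, so "about 2x" is an average across the two platforms rather than a uniform factor.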
The current atomics check in OFI may be the cause. We need to analyze the atomics path and apply optimizations similar to those made earlier for RMA (IPO inlining, reducing the instruction count on the AMO path).