cpu: reorder: tentatively turn on ref direct copy code for gcc

Fomenko, Evarist M · tprimak · commit 567dfb523261 · 2018-11-28T12:25:57.000-08:00
Rationale: jitted code is typically faster than reference code compiled with an old GCC (4.8.3). However, jitted code takes significant time to create, so if someone always creates a reorder right before executing it, the jitted code might end up slower overall than the simple reference code. This commit is tentative: the Intel MKL-DNN team needs to find a way to make jitting less expensive, especially for auxiliary yet very common primitives such as direct copy and other reorders. (cherry picked from commit 44b09b8)
diff --git a/src/cpu/cpu_reorder.cpp b/src/cpu/cpu_reorder.cpp
@@ -50,9 +50,17 @@ static const rpd_create_f cpu_reorder_impl_list[] = {
     wino_reorder_t<f32, f32>::pd_t::create,
     wino_reorder_t<f32, s8>::pd_t::create,
 
+#if defined(__INTEL_COMPILER) || (defined(__GNUC__) && !defined(__clang__))
+    /* Direct copy for icc which is faster than jitted code;
+     * Direct copy for gcc which might or might not be faster than jitted
+     * code, but still worth it because doesn't require jitting, i.e. much
+     * faster creation time. This is tentative solution and should be removed
+     * later (when we will cache jitted code?...). */
+    REG_SR_DIRECT_COPY(f32, f32),
+#endif
+
 #ifdef __INTEL_COMPILER
     /* direct copy for icc, which is faster than jitted code */
-    REG_SR_DIRECT_COPY(f32, f32),
     REG_SR_DIRECT_COPY(f32, s32),
     REG_SR_DIRECT_COPY(f32, s8),
     REG_SR_DIRECT_COPY(f32, u8),