Introducing data-level parallelism in the calc_frenet_paths and calc_global_paths functions, and using vectorized instructions in calc, calcd, and __search_index, gives a major speedup. In my application, execution is up to 5 times faster.
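To illustrate the kind of vectorization meant here, below is a minimal sketch, not the actual code from the linked files. It assumes a cubic spline stored as a knot array plus per-segment coefficient arrays a, b, c, d. The function names (eval_cubic_scalar, eval_cubic_vectorized) are hypothetical, and NumPy is assumed to be available. The scalar version looks up the segment with bisect for one query at a time; the vectorized version does the same lookup for all queries at once with np.searchsorted.

```python
import bisect

import numpy as np


def eval_cubic_scalar(knots, a, b, c, d, t):
    """Evaluate the spline at a single point t (one bisect lookup per call)."""
    i = bisect.bisect(knots, t) - 1
    # Clamp so that t equal to the last knot falls in the last segment.
    i = min(max(i, 0), len(knots) - 2)
    dx = t - knots[i]
    return a[i] + b[i] * dx + c[i] * dx**2 + d[i] * dx**3


def eval_cubic_vectorized(knots, a, b, c, d, ts):
    """Evaluate the spline at every point in ts with array operations only."""
    knots = np.asarray(knots, dtype=float)
    a, b, c, d = (np.asarray(v, dtype=float) for v in (a, b, c, d))
    ts = np.asarray(ts, dtype=float)
    # side="right" matches bisect.bisect (bisect_right) in the scalar version.
    i = np.searchsorted(knots, ts, side="right") - 1
    i = np.clip(i, 0, len(knots) - 2)
    dx = ts - knots[i]
    return a[i] + b[i] * dx + c[i] * dx**2 + d[i] * dx**3


if __name__ == "__main__":
    # Illustrative coefficients only; they do not come from a real path.
    knots = [0.0, 1.0, 2.0, 3.0]
    a, b, c, d = [0.0, 1.0, 0.5], [1.0, 0.5, -1.0], [0.2, -0.3, 0.1], [0.0, 0.1, -0.2]
    ts = np.linspace(0.0, 3.0, 7)
    print(eval_cubic_vectorized(knots, a, b, c, d, ts))
```

Both versions return the same values. The vectorized one replaces a Python-level loop over sample points with a few array operations, which is usually where the gain comes from when many points are evaluated per path.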
Here are the changes: cubic_spline_planner.py, frenet_optimal_trajectory.py.
Here (see only the section "Preliminary Work: Parallelization") is a detailed explanation of the changes and their impact on execution time.
If @AtsushiSakai is interested in the changes, I am available to integrate them into this repository and do some further testing.