The 3-layer MLP that we use generally does not saturate the GPU, which means that we might be able to run multiple models on the same GPU in parallel and get some speedup. I'm not sure how feasible this is with TensorFlow and its Graphs / Sessions, but I'm guessing that each MLP instance would need its own TF Graph and TF Session so that their variables and ops stay isolated.
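A minimal sketch of what per-instance isolation might look like, using the TF1-style graph/session API (via `tf.compat.v1` on newer TensorFlow); the layer sizes and `make_mlp_instance` helper here are illustrative, not our actual model code:

```python
import tensorflow.compat.v1 as tf  # TF1-style Graph / Session API


def make_mlp_instance(n_in, n_hidden, n_out):
    """Build one 3-layer MLP in its own Graph, backed by its own Session.

    Each call returns a fully independent instance: variables, ops, and
    the session's GPU state are not shared with any other instance.
    """
    graph = tf.Graph()
    with graph.as_default():
        x = tf.placeholder(tf.float32, [None, n_in], name="x")
        h = tf.layers.dense(x, n_hidden, activation=tf.nn.relu)
        y = tf.layers.dense(h, n_out, name="logits")
        init = tf.global_variables_initializer()
    sess = tf.Session(graph=graph)
    sess.run(init)
    return graph, sess, x, y
```

Each returned session can then be driven from its own thread or process; whether the GPU actually overlaps their kernels is a separate question.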
Assuming that all works, in phase 1 we can add parallelism easily with the n_jobs parameter of cross_val_score(), and for phase 2 we'd probably have to do it ourselves with Python's multiprocessing module.
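The phase-2 idea could be sketched like this with a worker pool; `train_one` and its config tuple are hypothetical stand-ins for whatever our real per-model training routine ends up being:

```python
from multiprocessing import Pool


def train_one(config):
    """Hypothetical worker: each process would build its own Graph/Session,
    train one MLP for the given config, and return a validation score."""
    seed, hidden_units = config
    # Stand-in for real training; a real worker would fit a model here.
    return hidden_units * 2


def run_phase2(configs, n_workers=4):
    """Train one model per config across a pool of worker processes."""
    with Pool(processes=n_workers) as pool:
        return pool.map(train_one, configs)
```

One process per model (rather than threads) sidesteps the GIL and keeps each TF runtime fully separate, at the cost of some startup overhead per worker.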