I'm trying to run my problem with Float32 (because I'm interested in GPU execution), and the regularization is failing during the computation of the starting point. Rather than backing off and increasing the regularization parameters, the solver just fails. In particular, it always uses a fixed regD of 1e-6, which is small for Float32 (machine epsilon is about 1.2e-7). My suggestion is that the starting point computation back off and increase the regularization when this happens.
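
To make the suggestion concrete, here is a rough sketch of the kind of backoff I have in mind. It is not the solver's code: the function name `factor_with_backoff`, the `growth` factor, and the `max_tries` limit are all made up for illustration, and a plain NumPy Cholesky factorization stands in for whatever factorization the starting point computation actually uses. Only the initial value of 1e-6 comes from the current behavior.

```python
import numpy as np

def factor_with_backoff(K, reg_d=1e-6, growth=10.0, max_tries=8):
    """Factor K + reg*I, increasing reg whenever the factorization fails.

    Hypothetical helper: Cholesky is a stand-in for the solver's real
    factorization, and growth/max_tries are illustrative defaults.
    Returns the factor L and the regularization that finally worked.
    """
    eye = np.eye(K.shape[0], dtype=K.dtype)
    reg = reg_d
    for _ in range(max_tries):
        try:
            # Cast reg to the matrix precision so the sum stays in Float32.
            L = np.linalg.cholesky(K + K.dtype.type(reg) * eye)
            if np.all(np.isfinite(L)):
                return L, reg
        except np.linalg.LinAlgError:
            pass  # not positive definite at this reg; back off below
        reg *= growth  # increase the regularization and retry
    raise np.linalg.LinAlgError(
        f"factorization failed after {max_tries} regularization increases"
    )

# A Float32 matrix that cannot be factored with reg = 1e-6.
K = np.array([[1.0, 0.0], [0.0, -1e-3]], dtype=np.float32)
L, reg_used = factor_with_backoff(K)
print("regularization used:", reg_used)
```

Another option, also just an assumption on my part, would be to pick the initial value relative to the precision (for example, something proportional to the square root of machine epsilon) instead of hard-coding 1e-6, and keep the backoff as a fallback.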