In practice, this suggests using a "genetic algorithm" guided by the metrics (the fitness function).
One metric we could implement is execution time on an interpreter, and we could then try to optimise the code for that. The optimisation passes from Binaryen (and others) can be used.
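As a rough sketch of the idea, the search could treat a sequence of optimisation passes as the genome and the metric as the fitness function. Everything named here is an assumption made for illustration: the `evolve` helper, its parameters, the short list of pass names, and above all `toy_cost`, which is only a stand-in for a real measurement such as execution time on an interpreter.

```python
import random
from typing import Callable, List, Sequence

# Illustrative pass names only; the real set of available passes is an assumption here.
PASSES = ["dce", "vacuum", "simplify-locals", "inlining", "precompute"]

Genome = List[str]  # one candidate: an ordered sequence of passes


def evolve(
    fitness: Callable[[Genome], float],  # lower is better (e.g. execution time)
    passes: Sequence[str],
    genome_len: int = 6,
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.2,
    seed: int = 0,
) -> Genome:
    rng = random.Random(seed)
    # Start from random pass sequences.
    population = [
        [rng.choice(passes) for _ in range(genome_len)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Selection: keep the better half unchanged (elitism).
        population.sort(key=fitness)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            # Crossover: splice two surviving parents at a random cut point.
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]
            # Mutation: occasionally swap a pass for a random one.
            child = [
                rng.choice(passes) if rng.random() < mutation_rate else gene
                for gene in child
            ]
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)


def toy_cost(genome: Genome) -> float:
    # Hypothetical stand-in metric: penalise repeated passes.
    # A real fitness function would apply the passes and time the result.
    return float(len(genome) - len(set(genome)))


best = evolve(toy_cost, PASSES)
print(best, toy_cost(best))
```

To use a real metric, `toy_cost` would be replaced by a function that applies the pass sequence to the module and measures it on the interpreter. Since such measurements are noisy and slow, caching results per genome and averaging several runs would probably be worthwhile.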