Thanks for the wonderful write-up! We can cache each layer's output and intermediate values to make this implementation faster.
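A minimal sketch of the caching idea, assuming the implementation in question is a layer-by-layer forward/backward pass that would otherwise recompute activations. The `TanhLayer` class, its scalar weights, and the `forward_all`/`backward_all` helpers are hypothetical names made up for illustration, not taken from the write-up:

```python
import math


class TanhLayer:
    """Hypothetical scalar layer y = tanh(w * x + b), for illustration only."""

    def __init__(self, w, b):
        self.w = w
        self.b = b
        self.cache = None  # filled by forward(), read by backward()

    def forward(self, x):
        z = self.w * x + self.b  # intermediate (pre-activation) value
        y = math.tanh(z)         # layer output
        # Store what backward() needs so nothing is recomputed later.
        self.cache = (x, y)
        return y

    def backward(self, grad_y):
        x, y = self.cache
        # d tanh(z)/dz = 1 - tanh(z)^2, so the cached output y is enough.
        grad_z = grad_y * (1.0 - y * y)
        grad_w = grad_z * x
        grad_b = grad_z
        grad_x = grad_z * self.w
        return grad_x, grad_w, grad_b


def forward_all(layers, x):
    """Run the layers in order; each one caches its own input and output."""
    for layer in layers:
        x = layer.forward(x)
    return x


def backward_all(layers, grad_out):
    """Walk the layers in reverse, reusing the caches from the forward pass."""
    grads = []
    for layer in reversed(layers):
        grad_out, grad_w, grad_b = layer.backward(grad_out)
        grads.append((grad_w, grad_b))
    grads.reverse()  # return parameter gradients in layer order
    return grad_out, grads
```

With the cache in place, the backward pass touches each layer once and never calls `math.tanh` again. The trade-off is memory: every layer holds on to its input and output until the backward pass has run.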