A classic cache-oriented optimization is loop interchange (see lectures). We need a good example in which the resulting change in IPC is clearly visible in the corresponding trace.
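One possible candidate, given only as a minimal sketch: summing a row-major 2D array with the loops in the "wrong" order and then with the loops interchanged. The array size `N`, the fill pattern, and all function names are assumptions made for illustration; they are not taken from the lectures. `N` is chosen so that the array (8 MiB of doubles) is likely larger than the caches of the traced machine, which is itself an assumption to check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size: 1024 * 1024 doubles = 8 MiB. */
#define N 1024

static double a[N][N];

/* Fill the array with small integer values so that both
 * summation orders give exactly the same floating-point result. */
void fill(void) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = (double)((i + j) % 7);
}

/* Before interchange: the inner loop walks down a column.
 * C stores arrays row by row, so consecutive accesses are
 * N * sizeof(double) bytes apart and touch a new cache line each time. */
double sum_column_order(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After interchange: the inner loop walks along a row,
 * so consecutive accesses are adjacent in memory (unit stride). */
double sum_row_order(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* The interchange is legal here because the iterations are independent;
 * this check confirms that both versions compute the same sum. */
void check_interchange(void) {
    assert(sum_column_order() == sum_row_order());
}
```

To compare the two variants, call `fill()` once and trace each sum function separately. The column-order version should show more cache misses and therefore a lower IPC, but the size of the gap depends on the cache hierarchy and on `N`. An optimizing compiler may perform the interchange on its own, so it is worth inspecting the generated code or building with a low optimization level before trusting the trace.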