This function is a bottleneck for applications that require computing expectation values of Hamiltonians. The current implementation uses contract_2e from PySCF's FCI module. I wonder if we can write a faster version. We should also try the GPU version in gpu4pyscf.