Description
Not surprisingly, the bottleneck is the transformation from atomic to molecular orbitals, which is done via matrix multiplication.
In the case of 786 basis functions (dipeptide) and 1 million points, this is a multiplication of a 786 x 786 matrix with a 786 x 1,000,000 matrix. The 786 x 1,000,000 atomic-orbital array ends up being largely sparse, which motivates breaking the matrices up into sub-matrices.
In all cases this is more expensive (50-70-186 ms) than computing the atomic orbitals and their derivatives, which takes around 16-48 ms.
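As a rough check on why the dense product dominates, the operation count for the sizes quoted above can be worked out directly. This is only back-of-the-envelope arithmetic; the double-precision storage size is an assumption, since the issue does not state the precision used.

```python
M = 786          # basis functions (dipeptide example from the text)
N = 1_000_000    # grid points

# One multiply and one add for every (i, mu, p) triple in C[i, mu] * AO[mu, p].
flops = 2 * M * M * N

# Number of entries in the M x N atomic-orbital array.
ao_elements = M * N

# Assumption: 8 bytes per entry (double precision); not stated in the issue.
ao_bytes_fp64 = 8 * ao_elements

print(f"{flops / 1e12:.2f} TFLOP for the dense product")
print(f"{ao_elements:,} AO entries, {ao_bytes_fp64 / 1e9:.2f} GB if stored as fp64")
```

Any rows of the atomic-orbital array that are identically zero for a block of points contribute nothing to this count, which is the saving the screening idea is after.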
The following algorithm is probably the most efficient.
Let M be the number of basis functions, N the number of points, and K the number of atoms.
- It takes about 10 ms to compute the atomic orbitals, so while doing so, fill a boolean array of size NxK: if any atomic orbital centered on an atom is non-zero at a point, set that (point, atom) entry to one. This can be done with one instruction using a ternary operator, which maps to the PTX instruction [slct](https://docs.nvidia.com/cuda/parallel-thread-execution/#comparison-and-selection-instructions-slct). My guess is that it is a fixed-latency operation.
  You can decide which entries are non-zero by assuming that the orbitals of most atoms don't extend past a couple of bond lengths, 5-7 angstrom.
- Then split the points and the boolean array into N_1xK, N_2xK, ..., N_7xK. I chose seven because most atoms wouldn't bond past seven, but this should be optimized. For each subset N_1, N_2, ..., figure out the atoms whose atomic orbitals are non-zero. Allocate the smaller MO-coefficient and atomic-orbital arrays, and write a quick copy kernel that transfers the entries for those atoms into the smaller arrays (this should take around 20 ms). Then calculate the electron density as usual.
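The two steps above can be sketched on the CPU in plain Python to check the logic before writing any kernels. Everything here is a toy assumption rather than the project's code: atoms sit on a 1D line, each atom carries two Gaussians with a hard cutoff (`CUTOFF`), the coefficients are random, and the names `basis`, `mask`, `density`, and `n_chunks` are hypothetical. The point is only that gathering the active rows and columns reproduces the dense result.

```python
import math
import random

CUTOFF = 3.0  # assumed screening radius in toy units, not a tuned value

def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def density(coeffs, ao, occ, n_pts):
    """rho[p] = sum_i occ[i] * (sum_mu coeffs[i][mu] * ao[mu][p]) ** 2."""
    mo = matmul(coeffs, ao) if ao else [[0.0] * n_pts for _ in coeffs]
    return [sum(o * row[p] ** 2 for o, row in zip(occ, mo)) for p in range(n_pts)]

random.seed(0)
atoms = [0.0, 4.0, 8.0, 12.0]                    # K = 4 atom positions
exponents = [0.5, 1.5]                           # two Gaussians per atom
basis = [(k, a) for k in range(len(atoms)) for a in exponents]  # (atom, exponent)
points = [0.25 * i for i in range(-8, 57)]       # N = 65 grid points
M, N, K = len(basis), len(points), len(atoms)
coeffs = [[random.uniform(-1, 1) for _ in range(M)] for _ in range(M)]
occ = [2.0] * (M // 2) + [0.0] * (M - M // 2)    # doubly occupied lower half

# Step 1: evaluate the atomic orbitals and fill the N x K mask in the same pass.
ao = [[0.0] * N for _ in range(M)]
mask = [[0] * K for _ in range(N)]
for mu, (k, alpha) in enumerate(basis):
    for p, x in enumerate(points):
        r = abs(x - atoms[k])
        inside = r < CUTOFF
        ao[mu][p] = math.exp(-alpha * r * r) if inside else 0.0
        mask[p][k] = 1 if inside else mask[p][k]   # ternary select, as in the text

rho_full = density(coeffs, ao, occ, N)           # reference: dense M x M by M x N

# Step 2: split the points, gather only the active basis functions per chunk.
n_chunks = 4                                     # the text suggests 7; to be tuned
chunk = math.ceil(N / n_chunks)
rho_screened, max_active = [], 0
for start in range(0, N, chunk):
    pts = range(start, min(start + chunk, N))
    active_atoms = {k for p in pts for k in range(K) if mask[p][k]}
    active = [mu for mu, (k, _) in enumerate(basis) if k in active_atoms]
    ao_small = [[ao[mu][p] for p in pts] for mu in active]    # "copy kernel"
    c_small = [[row[mu] for mu in active] for row in coeffs]  # matching columns
    rho_screened += density(c_small, ao_small, occ, len(pts))
    max_active = max(max_active, len(active))

print(f"largest reduced basis: {max_active} of {M} functions")
```

The sketch assumes the chunks are contiguous in space, so that each one only sees a few atoms. On a real grid the points would need to be ordered or binned spatially before splitting, otherwise every chunk touches every atom and nothing is saved.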